Web Scraping
Extract data with the lightest reliable method first.
Choose the approach
Use web_fetch for simple public pages when the needed content is already in the HTML.
Use browser when the site is dynamic or needs clicking, infinite scroll, filters, tabs, or login/session state.
Use web_search only to discover candidate pages when the target URL is unknown.
Default workflow
Identify the target site and exact fields to collect.
Test one page first.
Decide the extraction method:
web_fetch for readable article/listing text
browser snapshot for dynamic DOM inspection
Normalize the output into a stable schema.
If scraping multiple pages, serialize requests and pause between them instead of running a tight loop.
Deduplicate by URL or stable item id.
Save results in the workspace when the task is larger than a quick one-off.
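The normalize, serialize, and deduplicate steps above can be sketched in Python as follows. This is a minimal illustration, not a prescribed implementation: the schema keys are borrowed from the JSON example in this guide, and `fetch` is a hypothetical callable standing in for whatever tool actually retrieves a page (for instance a wrapper around web_fetch). The one-second default delay is likewise an assumption.

```python
import time
from typing import Any, Callable

# Assumed schema, taken from the JSON example in this guide.
SCHEMA_KEYS = ("title", "url", "source", "date", "summary")


def normalize(raw: dict[str, Any]) -> dict[str, Any]:
    """Map a raw record onto the fixed schema; missing or blank fields become None."""
    record: dict[str, Any] = {}
    for key in SCHEMA_KEYS:
        value = raw.get(key)
        if isinstance(value, str):
            value = value.strip() or None
        record[key] = value
    return record


def dedupe(records: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep the first record per URL, preserving order; records without a URL are kept."""
    seen: set[str] = set()
    unique = []
    for record in records:
        url = record.get("url")
        if url is not None:
            if url in seen:
                continue
            seen.add(url)
        unique.append(record)
    return unique


def scrape_serially(
    urls: list[str],
    fetch: Callable[[str], dict[str, Any]],  # hypothetical page fetcher
    delay_seconds: float = 1.0,
) -> list[dict[str, Any]]:
    """Fetch pages one at a time with a pause between requests, then dedupe."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # avoid a tight request loop
        results.append(normalize(fetch(url)))
    return dedupe(results)
```

Missing fields are left as `None` rather than guessed, which keeps the output consistent with the rule against inventing data.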
Browser scraping pattern
Open the page.
Take a snapshot.
Interact only as needed: search, click filters, pagination, expand sections.
Re-snapshot after each meaningful state change.
Extract only the fields the user asked for.
Close tabs when finished.
Output guidance
Prefer one of these formats:
concise bullet summary
JSON array of objects
CSV/TSV when the user wants exportable rows
Use explicit keys, for example:
[
  {
    "title": "...",
    "url": "...",
    "source": "...",
    "date": "...",
    "summary": "..."
  }
]
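As a rough sketch of how the same records could be rendered in either structured format, the Python below uses only the standard library. The field list mirrors the example keys above; the function names `to_json` and `to_csv` are illustrative and not part of any tool described here.

```python
import csv
import io
import json

# Explicit keys, mirroring the example above.
FIELDS = ["title", "url", "source", "date", "summary"]


def to_json(records):
    """Render records as a pretty-printed JSON array of objects."""
    return json.dumps(records, indent=2, ensure_ascii=False)


def to_csv(records):
    """Render records as CSV with a header row; missing values become empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    for record in records:
        row = {}
        for key in FIELDS:
            value = record.get(key)
            row[key] = "" if value is None else value
        writer.writerow(row)
    return buffer.getvalue()
```

For TSV output, passing `delimiter="\t"` to `csv.DictWriter` should be enough.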
Reliability rules
Do not invent missing fields.
If a site blocks access, say so and switch sources when appropriate.
For news/results pages, capture source + title + link at minimum.
For large jobs, checkpoint partial results to a workspace file.
Prefer fewer larger writes over many tiny writes.
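One possible way to combine checkpointing with fewer, larger writes is to buffer records in memory and append them to a JSON Lines file in batches. The sketch below assumes a JSON Lines layout, a hypothetical `Checkpointer` class, and a default batch size of 50; none of these come from the guide itself.

```python
import json
from pathlib import Path


class Checkpointer:
    """Buffer records and append them to a JSON Lines file in batches."""

    def __init__(self, path, batch_size=50):
        self.path = Path(path)
        self.batch_size = batch_size  # assumed default; tune per job
        self._buffer = []

    def add(self, record):
        """Queue one record; write to disk only when the batch is full."""
        self._buffer.append(record)
        if len(self._buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Write all buffered records in a single append, then clear the buffer."""
        if not self._buffer:
            return
        self.path.parent.mkdir(parents=True, exist_ok=True)
        payload = "".join(
            json.dumps(record, ensure_ascii=False) + "\n" for record in self._buffer
        )
        with self.path.open("a", encoding="utf-8") as handle:
            handle.write(payload)
        self._buffer.clear()


def load_checkpoint(path):
    """Read back previously saved records; returns an empty list if no file exists."""
    path = Path(path)
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

Calling `flush()` once at the end of the job writes any remaining partial batch, and `load_checkpoint` lets an interrupted job resume by skipping URLs that were already saved.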
Cleanup
Close browser tabs opened for scraping.
If you create state/output files, store them under the workspace and name them clearly.