Pre-Scrape Compliance Checklist
Before writing any scraping code:
robots.txt — Fetch {domain}/robots.txt, check if the target path is disallowed. If yes, stop.
Terms of Service — Check /terms, /tos, /legal. Explicit scraping prohibition = need permission.
Data type — Public factual data (prices, listings) is safer. Personal data triggers GDPR/CCPA.
Authentication — Data behind login is off-limits without authorization. Never scrape protected content.
API available? — If the site offers an API, use it. Always. Scraping when an API exists often violates the ToS.
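The robots.txt step above can be sketched with Python's standard-library parser. This is a minimal sketch, not a full pre-flight tool; the sample rules, bot name, and URLs are illustrative assumptions:

```python
# Minimal sketch of the robots.txt pre-flight check, stdlib only.
# The rules string, bot name, and URLs below are assumptions.
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against already-fetched robots.txt rules."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)

rules = "User-agent: *\nDisallow: /private/"
is_allowed(rules, "examplebot", "https://example.com/listings")   # True
is_allowed(rules, "examplebot", "https://example.com/private/x")  # False
```

In practice you would fetch {domain}/robots.txt first and stop before writing any scraper if the target path comes back disallowed.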
Legal Boundaries
Public data, no login — Generally legal (hiQ v. LinkedIn 2022)
Bypassing barriers — CFAA violation risk (Van Buren v. US 2021)
Ignoring robots.txt — Gray area, often breaches ToS (Meta v. Bright Data 2024)
Personal data without consent — GDPR/CCPA violation
Republishing copyrighted content — Copyright infringement
Request Discipline
Rate limit: Minimum 2-3 seconds between requests. Faster = server strain = legal exposure.
User-Agent: Real browser string + contact email:
Mozilla/5.0 ... (contact: [email protected])
Respect 429: Exponential backoff. Ignoring 429s can be read as intent to harm.
Session reuse: Keep connections open to reduce server load.
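The request discipline above can be sketched as a small polite-fetch loop, stdlib only. The host, path, delay constants, retry cap, and bot name are illustrative assumptions, not a definitive implementation:

```python
import time
from http.client import HTTPSConnection

BASE_DELAY = 2.5   # seconds between requests (2-3 s minimum)
MAX_RETRIES = 5    # assumed retry cap
USER_AGENT = "Mozilla/5.0 (compatible; examplebot/1.0; contact: [email protected])"

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff schedule: 1, 2, 4, ... seconds, capped."""
    return min(cap, base * 2 ** attempt)

def polite_get(conn: HTTPSConnection, path: str) -> bytes:
    """GET `path` over a kept-open connection, honoring 429s with backoff."""
    for attempt in range(MAX_RETRIES):
        conn.request("GET", path, headers={"User-Agent": USER_AGENT})
        resp = conn.getresponse()
        body = resp.read()  # drain the response before reusing the connection
        if resp.status != 429:
            return body
        # Prefer the server's Retry-After hint when it sends one.
        retry_after = resp.getheader("Retry-After")
        time.sleep(float(retry_after) if retry_after else backoff_delay(attempt))
    raise RuntimeError("still rate-limited after retries")

# One connection object per host keeps the TCP/TLS session alive
# (HTTP keep-alive), reducing handshake load on the server:
#   conn = HTTPSConnection("example.com")
#   html = polite_get(conn, "/listings")
#   time.sleep(BASE_DELAY)  # pause before the next request
```

The design choice here is that the server's own Retry-After header, when present, overrides the client-side backoff schedule.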
Data Handling
Strip PII immediately — Don't collect names, emails, phones unless legally justified.
No fingerprinting — Don't combine data to identify individuals indirectly.
Minimize storage — Cache only what you need, delete what you don't.
Audit trail — Log what, when, where. Evidence of good faith if challenged.
For code patterns and robots.txt parser, see code.md