docs-scraper
CLI tool that scrapes documents from various sources into local PDF files using browser automation.
Installation
npm install -g docs-scraper
Quick start
Scrape any document URL to PDF:
docs-scraper scrape https://example.com/document
Returns local path: ~/.docs-scraper/output/1706123456-abc123.pdf
Basic scraping
Scrape with daemon (recommended, keeps browser warm):
docs-scraper scrape <url>
Scrape with named profile (for authenticated sites):
docs-scraper scrape <url> -p <profile-name>
Scrape with pre-filled data (e.g., email for DocSend):
docs-scraper scrape <url> -D email=user@example.com
Direct mode (single-shot, no daemon):
docs-scraper scrape <url> --no-daemon
Authentication workflow
When a document requires authentication (login, email verification, passcode):
Initial scrape returns a job ID:
```bash
docs-scraper scrape https://docsend.com/view/xxx
# Output: Scrape blocked
# Job ID: abc123
```
Retry with data:
```bash
docs-scraper update abc123 -D email=user@example.com
# or with password
docs-scraper update abc123 -D email=user@example.com -D password=1234
```
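The two steps above can be chained in a small wrapper. This is a minimal sketch, assuming a blocked run prints a line containing `Job ID: <id>` to standard output as in the example; the helper names `extract_job_id` and `scrape_or_retry` and the `DOCS_SCRAPER` override variable are hypothetical and not part of docs-scraper.

```bash
# Command to invoke; overridable for testing (hypothetical variable, not part of the tool).
DOCS_SCRAPER="${DOCS_SCRAPER:-docs-scraper}"

# Read scrape output on stdin and print the first job ID found, or nothing.
# Assumes the "Job ID: <id>" line format shown in the example output.
extract_job_id() {
  sed -n 's/.*Job ID:[[:space:]]*\([^[:space:]]*\).*/\1/p' | head -n 1
}

# Scrape a URL; if the run reports a job ID, retry once with the given -D arguments.
# Usage: scrape_or_retry <url> [-D key=value ...]
scrape_or_retry() {
  local url output job_id
  url="$1"; shift
  output="$("$DOCS_SCRAPER" scrape "$url")" || true
  job_id="$(printf '%s\n' "$output" | extract_job_id)"
  if [ -n "$job_id" ]; then
    "$DOCS_SCRAPER" update "$job_id" "$@"
  else
    printf '%s\n' "$output"
  fi
}
```

For example, `scrape_or_retry https://docsend.com/view/xxx -D email=user@example.com` would pass the email only if the first attempt is blocked.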
Profile management
Profiles store session cookies for authenticated sites.
docs-scraper profiles list # List saved profiles
docs-scraper profiles clear # Clear all profiles
docs-scraper scrape <url> -p myprofile # Use a profile
Daemon management
The daemon keeps browser instances warm for faster scraping.
docs-scraper daemon status # Check status
docs-scraper daemon start # Start manually
docs-scraper daemon stop # Stop daemon
Note: Daemon auto-starts when running scrape commands.
Cleanup
PDFs are stored in ~/.docs-scraper/output/. The daemon automatically cleans up files older than 1 hour.
Manual cleanup:
docs-scraper cleanup # Delete all PDFs
docs-scraper cleanup --older-than 1h # Delete PDFs older than 1 hour
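To preview which files an age-based cleanup might remove, standard `find` can list PDFs by age. This is a minimal sketch, assuming age is judged by file modification time (the tool's own criterion is not documented here); `list_old_pdfs` is a hypothetical helper.

```bash
# List PDFs directly inside a directory that were modified more than N minutes ago.
# Usage: list_old_pdfs [dir] [minutes]   (defaults: ~/.docs-scraper/output, 60)
list_old_pdfs() {
  local dir="${1:-$HOME/.docs-scraper/output}"
  local minutes="${2:-60}"
  find "$dir" -maxdepth 1 -type f -name '*.pdf' -mmin "+$minutes"
}
```

Running `list_old_pdfs` with no arguments checks the default output directory against a one-hour threshold.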
Job management
docs-scraper jobs list # List blocked jobs awaiting auth
Supported sources
Direct PDF links - Downloads PDF directly
Notion pages - Exports Notion page to PDF
DocSend documents - Handles DocSend viewer
LLM fallback - Uses Claude API for any other webpage
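To predict which scraper a URL will reach, the "Handles" patterns from the Scraper Reference below can be written as a shell `case` statement. This is only an illustration: the function name `pick_scraper` is hypothetical, and the matching order (PDF first, then DocSend, then Notion) is an assumption rather than the tool's documented behavior.

```bash
# Print the scraper expected to handle a URL, based on the documented URL patterns.
pick_scraper() {
  local rest="${1#*://}"   # drop the scheme (https://, http://)
  rest="${rest%%[?#]*}"    # drop any query string or fragment
  case "$rest" in
    *.pdf)
      echo "DirectPdfScraper" ;;
    docsend.com/view/*|docsend.com/v/*|*.docsend.com/view/*|*.docsend.com/v/*)
      echo "DocsendScraper" ;;
    notion.so/*|*.notion.site/*)
      echo "NotionScraper" ;;
    *)
      echo "LlmFallbackScraper" ;;   # anything unmatched falls through
  esac
}
```

For instance, `pick_scraper https://org-a.docsend.com/view/abc123` prints `DocsendScraper`.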
Scraper Reference
Each scraper accepts specific -D data fields. Use the appropriate fields based on the URL type.
DirectPdfScraper
Handles: URLs ending in .pdf
Data fields: None (downloads directly)
Example:
docs-scraper scrape https://example.com/document.pdf
DocsendScraper
Handles: docsend.com/view/*, docsend.com/v/*, and subdomains (e.g., org-a.docsend.com)
URL patterns:
Documents: https://docsend.com/view/{id} or https://docsend.com/v/{id}
Folders: https://docsend.com/view/s/{id}
Subdomains: https://{subdomain}.docsend.com/view/{id}
Data fields:
| Field | Type | Description |
|---|---|---|
| email | email | Email address for document access |
| password | password | Passcode/password for protected documents |
| name | text | Your name (required for NDA-gated documents) |
Examples:
# Pre-fill email for DocSend
docs-scraper scrape https://docsend.com/view/abc123 -D email=user@example.com
# With password protection
docs-scraper scrape https://docsend.com/view/abc123 -D email=user@example.com -D password=secret123
# With NDA name requirement
docs-scraper scrape https://docsend.com/view/abc123 -D email=user@example.com -D name="John Doe"
# Retry blocked job
docs-scraper update abc123 -D email=user@example.com -D password=secret123
Notes:
DocSend may require any combination of email, password, and name
Folders are scraped as a table of contents PDF with document links
The scraper auto-checks NDA checkboxes when name is provided
NotionScraper
Handles: notion.so/*, *.notion.site/*
Data fields:
| Field | Type | Description |
|---|---|---|
| email | email | Notion account email |
| password | password | Notion account password |
Examples:
# Public page (no auth needed)
docs-scraper scrape https://notion.so/Public-Page-abc123
# Private page with login
docs-scraper scrape https://notion.so/Private-Page-abc123 \
-D email=user@example.com -D password=mypassword
# notion.site subdomain
docs-scraper scrape https://docs.company.notion.site/Page-abc123
Notes:
Public Notion pages don't require authentication
Toggle blocks are automatically expanded before PDF generation
Uses session profiles to persist login across scrapes
LlmFallbackScraper
Handles: Any URL not matched by other scrapers (automatic fallback)
Data fields: Dynamic - determined by Claude analyzing the page
The LLM scraper uses Claude to analyze the page HTML and detect:
Login forms (extracts field names dynamically)
Cookie banners (auto-dismisses)
Expandable content (auto-expands)
CAPTCHAs (reports as blocked)
Paywalls (reports as blocked)
Common dynamic fields:
| Field | Type | Description |
|---|---|---|
| email | email | Login email (if detected) |
| password | password | Login password (if detected) |
| username | text | Username (if login uses username) |
Examples:
# Generic webpage (no auth)
docs-scraper scrape https://example.com/article
# Webpage requiring login
docs-scraper scrape https://members.example.com/article \
-D email=user@example.com -D password=secret
# When blocked, check the job for required fields
docs-scraper jobs list
# Then retry with the fields the scraper detected
docs-scraper update abc123 -D username=myuser -D password=secret
Notes:
Requires ANTHROPIC_API_KEY environment variable
Field names are extracted from the page's actual form fields
Limited to 2 login attempts before failing
CAPTCHAs require manual intervention
Data field summary
| Scraper | email | password | name | Other |
|---|---|---|---|---|
| DirectPdf | - | - | - | - |
| DocSend | ✓ | ✓ | ✓ | - |
| Notion | ✓ | ✓ | - | - |
| LLM Fallback | ✓* | ✓* | - | Dynamic* |
*Fields detected dynamically from page analysis
Environment setup (optional)
Only needed for LLM fallback scraper:
export ANTHROPIC_API_KEY=your_key
Optional browser settings:
export BROWSER_HEADLESS=true # Set false for debugging
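Because the LLM fallback scraper requires ANTHROPIC_API_KEY, a wrapper script can check for it before scraping arbitrary webpages. This is a minimal sketch; `check_llm_env` is a hypothetical helper, not a docs-scraper command.

```bash
# Return 0 if ANTHROPIC_API_KEY is set and non-empty; otherwise warn on stderr and return 1.
check_llm_env() {
  if [ -z "${ANTHROPIC_API_KEY:-}" ]; then
    echo "ANTHROPIC_API_KEY is not set; the LLM fallback scraper requires it" >&2
    return 1
  fi
}
```

A script might call `check_llm_env || exit 1` before scraping URLs that no dedicated scraper handles.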
Common patterns
Archive a Notion page:
docs-scraper scrape https://notion.so/My-Page-abc123
Download protected DocSend:
docs-scraper scrape https://docsend.com/view/xxx
# If blocked:
docs-scraper update <job-id> -D email=user@example.com -D password=1234
Batch scraping with profiles:
docs-scraper scrape https://site.com/doc1 -p mysite
docs-scraper scrape https://site.com/doc2 -p mysite
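For longer batches, the same pattern can be driven from a file of URLs. This is a minimal sketch using only the `scrape` and `-p` options shown above; the function name `scrape_list`, the URL-file format (one URL per line, `#` for comments), and the `DOCS_SCRAPER` override variable are assumptions made for illustration.

```bash
# Command to invoke; overridable for testing (hypothetical variable, not part of the tool).
DOCS_SCRAPER="${DOCS_SCRAPER:-docs-scraper}"

# Scrape every URL listed in a file with the same profile.
# Blank lines and lines starting with # are skipped.
# Usage: scrape_list <profile> <url-file>
scrape_list() {
  local profile="$1" file="$2" url
  while IFS= read -r url || [ -n "$url" ]; do
    case "$url" in ''|'#'*) continue ;; esac
    # Redirect stdin so the command cannot consume the remaining URL list.
    "$DOCS_SCRAPER" scrape "$url" -p "$profile" < /dev/null ||
      echo "failed: $url" >&2
  done < "$file"
}
```

For example, `scrape_list mysite urls.txt` scrapes each listed URL with the `mysite` profile and reports failures on stderr without stopping the batch.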
Output
Success: Local file path (e.g., ~/.docs-scraper/output/1706123456-abc123.pdf)
Blocked: Job ID + required credential types
Troubleshooting
Timeout: docs-scraper daemon stop && docs-scraper daemon start
Auth fails: docs-scraper jobs list to check pending jobs
Disk full: docs-scraper cleanup to remove old PDFs