Scrapy Cheatsheet
Type: Fast, high-level web crawling and scraping framework for Python — used to build custom spiders that extract links, emails, subdomains, and other data from target sites
Installation
# Via pip (recommended)
pip3 install scrapy
# Via apt
sudo apt install python3-scrapy
# Via conda
conda install -c conda-forge scrapy
# Verify
scrapy version
Basic Usage
# Open an interactive shell on a single page (no project needed)
scrapy shell <url>
scrapy shell https://example.com
# Fetch a single page and dump to stdout
scrapy fetch https://example.com
# View the page as Scrapy sees it (opens in browser)
scrapy view https://example.com
# Run a standalone spider file
scrapy runspider myspider.py
| Command | Description |
| --- | --- |
| `scrapy startproject <name>` | Create a new project skeleton |
| `scrapy genspider <name> <domain>` | Generate a new spider from a template |
| `scrapy crawl <spider>` | Run a spider inside a project |
| `scrapy runspider <file.py>` | Run a self-contained spider file |
| `scrapy shell <url>` | Interactive scraping shell (test selectors) |
| `scrapy fetch <url>` | Download a page using Scrapy's downloader |
| `scrapy view <url>` | Open the fetched page in a browser |
| `scrapy parse <url> --spider=<name>` | Parse a URL with a spider's callback |
| `scrapy list` | List available spiders in the project |
| `scrapy settings --get <KEY>` | Print a settings value |
| `scrapy bench` | Run a quick benchmark crawl |
Common crawl Flags
| Flag | Description |
| --- | --- |
| `-o <file>` | Append scraped items to a file; format inferred from the extension (.json, .jsonl, .csv, .xml) |
| `-O <file>` | Same as `-o` but overwrites the file instead of appending |
| `-a <name>=<value>` | Pass an argument to the spider (e.g. `-a domain=example.com`); values arrive as strings |
| `-s <KEY>=<value>` | Override a setting at runtime |
| `-L <level>` | Log level (DEBUG, INFO, WARNING, ERROR, CRITICAL) |
| `--logfile <file>` | Write logs to a file |
| `--nolog` | Disable logging |
| `-t <format>` | Output format when not inferred from the extension (deprecated in Scrapy 2.x in favour of `-o <file>:<format>`) |
Common Commands
# Create a project
scrapy startproject recon
# Generate a spider scoped to a domain
cd recon
scrapy genspider links example.com
# Run the spider and export results
scrapy crawl links -o results.json
# Run a standalone spider with output
scrapy runspider spider.py -o output.jsonl
# Pass arguments into a spider
scrapy crawl links -a domain=example.com -a depth=2
# Override settings on the fly (ignore robots.txt, wait 1 second between requests)
scrapy crawl links -s ROBOTSTXT_OBEY=False -s DOWNLOAD_DELAY=1
# Limit log noise
scrapy crawl links -L WARNING -o out.csv
# Test CSS / XPath selectors interactively
scrapy shell "https://example.com"
Inside the Scrapy Shell
# After: scrapy shell "https://example.com"
response.url # Current URL
response.status # HTTP status code
response.headers # Response headers
# CSS selectors
response.css('a::attr(href)').getall() # All link hrefs
response.css('title::text').get() # Page title
# XPath selectors
response.xpath('//a/@href').getall() # All link hrefs
response.xpath('//img/@src').getall() # All image sources
# Fetch another page (rebinds response to the new page)
fetch('https://example.com/about')
# Regex over the body (e.g. emails)
import re
re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', response.text)
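The email regex above tends to return duplicates and can pick up a trailing full stop when an address ends a sentence. A small sketch of one way to tidy the matches, using the same pattern; the helper name `extract_emails` and the sample text are assumptions for illustration only.

```python
import re

# Same pattern as the shell one-liner above.
EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')


def extract_emails(text):
    """Return unique email-like strings, lower-cased and sorted."""
    # Strip a trailing dot that the last character class can swallow.
    found = (m.rstrip('.').lower() for m in EMAIL_RE.findall(text))
    return sorted(set(found))


sample = "Contact Admin@Example.com or sales@example.co.uk. Also admin@example.com."
print(extract_emails(sample))
# ['admin@example.com', 'sales@example.co.uk']
```

In the shell, the same helper could be called as `extract_emails(response.text)`.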
Minimal Recon Spider
# Save as spider.py, run with: scrapy runspider spider.py -o out.json
import scrapy
from urllib.parse import urlparse


class ReconSpider(scrapy.Spider):
    name = "recon"
    start_urls = ["https://example.com"]

    def parse(self, response):
        # Emit every link; follow only those on the same host as the current page
        for href in response.css('a::attr(href)').getall():
            url = response.urljoin(href)
            yield {"link": url}
            if urlparse(url).netloc == urlparse(response.url).netloc:
                yield response.follow(url, callback=self.parse)
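The same-host test in the spider can be pulled out into a plain function and checked without running a crawl. This is a sketch under assumptions: the name `in_scope` is hypothetical, and it additionally rejects non-HTTP schemes, which the spider above only excludes implicitly because such links have an empty `netloc`.

```python
from urllib.parse import urljoin, urlparse


def in_scope(page_url, href):
    """Return the absolute URL if href stays on page_url's host, else None."""
    absolute = urljoin(page_url, href)
    target = urlparse(absolute)
    if target.scheme not in ("http", "https"):
        return None  # skips mailto:, javascript:, tel: and similar
    if target.netloc != urlparse(page_url).netloc:
        return None  # different host, including subdomains
    return absolute


print(in_scope("https://example.com/a/", "../about"))
print(in_scope("https://example.com/", "https://other.org/x"))
print(in_scope("https://example.com/", "mailto:admin@example.com"))
```

Because the comparison is an exact `netloc` match, links to subdomains such as `blog.example.com` count as out of scope; a recon crawl that wants subdomains would need a looser suffix check.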
Useful Settings (settings.py / -s overrides)
| Setting | Description |
| --- | --- |
| `ROBOTSTXT_OBEY` | Whether to honour robots.txt (built-in default False; the `startproject` template sets it to True) |
| `DOWNLOAD_DELAY` | Seconds between requests to the same site (politeness / rate limiting) |
| `CONCURRENT_REQUESTS` | Max simultaneous requests |
| `DEPTH_LIMIT` | Max crawl depth (0 = unlimited) |
| `USER_AGENT` | Custom User-Agent string |
| `RETRY_TIMES` | Number of retries on failed requests |
| `HTTPCACHE_ENABLED` | Cache responses locally to avoid re-fetching |
| `AUTOTHROTTLE_ENABLED` | Auto-adjust the delay based on observed latency and server load |
# Example: stealthier crawl
scrapy crawl recon \
-s DOWNLOAD_DELAY=2 \
-s CONCURRENT_REQUESTS=2 \
-s AUTOTHROTTLE_ENABLED=True \
-s USER_AGENT="Mozilla/5.0"
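The same overrides can live in the project's `settings.py` instead of being repeated with `-s` on every run. A minimal sketch of such a fragment: the first four values mirror the command above, while the `ROBOTSTXT_OBEY` and `DEPTH_LIMIT` values are illustrative assumptions rather than recommendations.

```python
# settings.py (fragment): plain module-level assignments read by Scrapy

DOWNLOAD_DELAY = 2            # seconds between requests to the same site
CONCURRENT_REQUESTS = 2       # keep parallelism low
AUTOTHROTTLE_ENABLED = True   # let Scrapy adapt the delay at runtime
USER_AGENT = "Mozilla/5.0"

ROBOTSTXT_OBEY = True         # assumption: stay within robots.txt
DEPTH_LIMIT = 3               # assumption: arbitrary cap, 0 means unlimited
```

A `-s KEY=value` flag on the command line still takes precedence over these file-level values.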
Notes
- Active — Scrapy makes real HTTP requests to the target; only crawl in-scope assets.
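- Set `ROBOTSTXT_OBEY=False` only when authorised; projects created with `scrapy startproject` respect robots.txt by default, while standalone `scrapy runspider` runs do not unless the setting is enabled.
- Use `DOWNLOAD_DELAY` or `AUTOTHROTTLE_ENABLED` to avoid hammering targets and tripping WAFs.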
- Great base for custom recon crawlers — [[reconspider]] is a Scrapy-based spider built exactly for this.
- Export to JSON/JSONL, then post-process with `jq` to extract emails, subdomains, and links.
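Where `jq` is not available, the JSONL export can also be post-processed in Python. This sketch swaps `jq` for the standard library and assumes items shaped like `{"link": "<url>"}`, as emitted by the minimal spider; the function name `hosts_from_jsonl` and the sample lines are hypothetical.

```python
import json
from urllib.parse import urlparse


def hosts_from_jsonl(lines):
    """Collect unique hostnames from JSON Lines items shaped like {"link": "<url>"}."""
    hosts = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        item = json.loads(line)
        host = urlparse(item.get("link", "")).hostname
        if host:  # mailto: and similar links have no hostname
            hosts.add(host)
    return sorted(hosts)


sample = [
    '{"link": "https://example.com/about"}',
    '{"link": "https://blog.example.com/post/1"}',
    '{"link": "mailto:admin@example.com"}',
    '{"link": "https://example.com/contact"}',
]
print(hosts_from_jsonl(sample))
# ['blog.example.com', 'example.com']
```

Against a real export, pass the open file object directly, for example `with open("out.jsonl") as f: hosts_from_jsonl(f)`.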