Scrapy Cheatsheet

Type: Fast, high-level web crawling and scraping framework for Python — used to build custom spiders that extract links, emails, subdomains, and other data from target sites

Installation

# Via pip (recommended)
pip3 install scrapy

# Via apt (Debian/Ubuntu; the packaged version may lag behind PyPI)
sudo apt install python3-scrapy

# Via conda
conda install -c conda-forge scrapy

# Verify
scrapy version

Basic Usage

# Open an interactive shell on a single page (no project needed)
scrapy shell <url>
scrapy shell https://example.com

# Fetch a single page and dump to stdout
scrapy fetch https://example.com

# View the page as Scrapy sees it (opens in browser)
scrapy view https://example.com

# Run a standalone spider file
scrapy runspider myspider.py

Command-Line Tools

Command	Description
`scrapy startproject <name>`	Create a new project skeleton
`scrapy genspider <name> <domain>`	Generate a new spider from template
`scrapy crawl <spider>`	Run a spider inside a project
`scrapy runspider <file.py>`	Run a self-contained spider file
`scrapy shell <url>`	Interactive scraping shell (test selectors)
`scrapy fetch <url>`	Download a page using Scrapy’s downloader
`scrapy view <url>`	Open the fetched page in a browser
`scrapy parse <url> --spider=<name>`	Parse a URL with a spider’s callback
`scrapy list`	List available spiders in the project
`scrapy settings --get <KEY>`	Print a settings value
`scrapy bench`	Run a quick benchmark crawl

Common crawl Flags

Flag	Description
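`-o <file>`	Append scraped items to a file; format is inferred from the extension (`.json`, `.jsonl`, `.csv`, `.xml`). Appending to an existing `.json` file produces invalid JSON, so prefer `.jsonl` or `-O` for repeated runs
`-O <file>`	Same as `-o`, but overwrites the file instead of appending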
`-a <name>=<value>`	Pass an argument to the spider (e.g. `-a domain=example.com`)
`-s <KEY>=<value>`	Override a setting at runtime
`-L <level>`	Log level (`DEBUG`, `INFO`, `WARNING`, `ERROR`, `CRITICAL`)
`--logfile <file>`	Write logs to a file
`--nolog`	Disable logging
`-t <format>`	Output format when not inferred from extension (deprecated in recent Scrapy releases; prefer `-o <file>:<format>`)
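
Values passed with `-a` reach the spider as strings (Scrapy hands them to the spider's constructor as keyword arguments), so numeric options such as `depth` need converting before use. The helper below is a minimal sketch of that conversion; `coerce_spider_args` and the keys it returns are hypothetical names chosen for illustration, not part of Scrapy.

```python
def coerce_spider_args(domain=None, depth="0"):
    """Turn raw `-a` values (always strings) into typed crawl options.

    Hypothetical helper: a spider's __init__ could call it with the
    keyword arguments Scrapy passes in from `-a name=value`.
    """
    if not domain:
        raise ValueError("pass a target, e.g. -a domain=example.com")
    host = domain.strip().lower()
    depth_limit = int(depth)  # "2" -> 2; raises ValueError on junk input
    if depth_limit < 0:
        raise ValueError("depth must not be negative")
    return {
        "allowed_domains": [host],
        "start_urls": [f"https://{host}"],
        "depth_limit": depth_limit,
    }


if __name__ == "__main__":
    print(coerce_spider_args(domain="Example.com", depth="2"))
```

Validating early means a typo such as `-a depth=two` fails at start-up instead of partway through a crawl.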

Common Commands

# Create a project
scrapy startproject recon

# Generate a spider scoped to a domain
cd recon
scrapy genspider links example.com

# Run the spider and export results
scrapy crawl links -o results.json

# Run a standalone spider with output
scrapy runspider spider.py -o output.jsonl

# Pass arguments into a spider
scrapy crawl links -a domain=example.com -a depth=2

# Override settings on the fly (ignore robots.txt, add a 1-second delay)
scrapy crawl links -s ROBOTSTXT_OBEY=False -s DOWNLOAD_DELAY=1

# Limit log noise
scrapy crawl links -L WARNING -o out.csv

# Test CSS / XPath selectors interactively
scrapy shell "https://example.com"

Inside the Scrapy Shell

# After: scrapy shell "https://example.com"

response.url                          # Current URL
response.status                       # HTTP status code
response.headers                      # Response headers

# CSS selectors
response.css('a::attr(href)').getall()        # All link hrefs
response.css('title::text').get()             # Page title

# XPath selectors
response.xpath('//a/@href').getall()          # All link hrefs
response.xpath('//img/@src').getall()         # All image sources

# Fetch another page in the same session (rebinds `response`)
fetch('https://example.com/about')

# Regex over the body (e.g. emails)
import re
re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', response.text)
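
The regex above returns duplicates, and its final character class can swallow sentence punctuation after an address. A minimal sketch of a wrapper that trims and de-duplicates the matches follows; `extract_emails` is a hypothetical helper name, and lower-casing the results is an assumption made here for de-duplication.

```python
import re

# Same pattern as the shell one-liner above
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def extract_emails(text):
    """Return the unique, lower-cased email-like strings found in text."""
    found = set()
    for match in EMAIL_RE.findall(text):
        # The final character class also matches sentence punctuation
        # such as a trailing "." or "-", so trim it before storing.
        found.add(match.rstrip(".-").lower())
    return sorted(found)


if __name__ == "__main__":
    sample = "Mail a@example.com, or B@Example.com. Again: a@example.com"
    print(extract_emails(sample))
```

Inside the Scrapy shell this could be called as `extract_emails(response.text)` once the function has been pasted in.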

Minimal Recon Spider

# save as spider.py, run with: scrapy runspider spider.py -o out.json
import scrapy
from urllib.parse import urlparse

class ReconSpider(scrapy.Spider):
    name = "recon"
    start_urls = ["https://example.com"]

    def parse(self, response):
        # Emit every link as an item, then follow only same-host links
        for href in response.css('a::attr(href)').getall():
            url = response.urljoin(href)  # resolve relative hrefs once
            yield {"link": url}
            if urlparse(url).netloc == urlparse(response.url).netloc:
                yield response.follow(url, callback=self.parse)
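
The `netloc` equality check in the spider above treats `www.example.com` and `example.com` as different hosts, and it also compares ports. Where subdomains are in scope, a suffix match on the hostname may fit better. The sketch below assumes a hypothetical helper `in_scope` and a hypothetical `root_domain` parameter; Scrapy's own `allowed_domains` spider attribute offers comparable built-in offsite filtering.

```python
from urllib.parse import urlparse


def in_scope(url, root_domain):
    """True if url's host is root_domain or one of its subdomains."""
    # .hostname drops any port and is None for schemes like mailto:
    host = (urlparse(url).hostname or "").lower()
    root = root_domain.lower().lstrip(".")
    # Require a dot boundary so "notexample.com" does not match
    return host == root or host.endswith("." + root)


if __name__ == "__main__":
    for candidate in (
        "https://www.example.com:8443/login",
        "https://notexample.com/",
        "mailto:admin@example.com",
    ):
        print(candidate, in_scope(candidate, "example.com"))
```

In the spider, the condition could become `if in_scope(url, "example.com"):` before following a link.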

Useful Settings (settings.py / -s overrides)

Setting	Description
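`ROBOTSTXT_OBEY`	Whether to honour robots.txt (`True` in the `settings.py` generated by `startproject`; the built-in default has historically been `False`, so set it explicitly when using `runspider`)
`DOWNLOAD_DELAY`	Seconds to wait between consecutive requests to the same site (politeness / rate limiting)
`CONCURRENT_REQUESTS`	Max simultaneous requests across all domains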
`DEPTH_LIMIT`	Max crawl depth (0 = unlimited)
`USER_AGENT`	Custom User-Agent string
`RETRY_TIMES`	Number of retries on failed requests
`HTTPCACHE_ENABLED`	Cache responses locally to avoid re-fetching
`AUTOTHROTTLE_ENABLED`	Auto-adjust the delay based on response latency and server load

# Example: stealthier crawl
scrapy crawl recon \
  -s DOWNLOAD_DELAY=2 \
  -s CONCURRENT_REQUESTS=2 \
  -s AUTOTHROTTLE_ENABLED=True \
  -s USER_AGENT="Mozilla/5.0"
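
The same overrides can live in a project's `settings.py` so they apply to every run without repeating `-s` flags. The fragment below is a sketch that uses only the settings from the table above; the specific values (in particular `DEPTH_LIMIT` and `RETRY_TIMES`) are illustrative assumptions, not recommendations.

```python
# settings.py (fragment): values are illustrative, not recommendations
ROBOTSTXT_OBEY = True          # honour robots.txt
DOWNLOAD_DELAY = 2             # seconds between requests to the same site
CONCURRENT_REQUESTS = 2        # cap on simultaneous requests
DEPTH_LIMIT = 3                # do not follow links deeper than this
USER_AGENT = "Mozilla/5.0"     # same string as the CLI example above
RETRY_TIMES = 2                # retries in addition to the first attempt
HTTPCACHE_ENABLED = True       # reuse cached responses while developing
AUTOTHROTTLE_ENABLED = True    # let Scrapy adapt the delay at runtime
```

A `-s KEY=value` flag on the command line still takes precedence over the value in `settings.py`.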

Notes

Active — Scrapy makes real HTTP requests to the target; only crawl in-scope assets.
Set ROBOTSTXT_OBEY=False only when authorised; projects created with startproject respect robots.txt out of the box.
Use DOWNLOAD_DELAY and AUTOTHROTTLE_ENABLED to avoid hammering targets and tripping WAFs.
Great base for custom recon crawlers — [[reconspider]] is a Scrapy-based spider built exactly for this.
Export to JSON/JSONL then post-process with jq to extract emails, subdomains, and links.
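
Where jq is not available, the same post-processing can be sketched in Python. The example assumes a JSONL export whose items carry a `link` field, as the recon spider above yields; `hosts_from_jsonl` and its `field` parameter are hypothetical names.

```python
import json
from urllib.parse import urlparse


def hosts_from_jsonl(lines, field="link"):
    """Collect the unique hostnames found in one field of JSONL items."""
    hosts = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        item = json.loads(line)
        # Non-HTTP values such as mailto: links have no hostname
        host = urlparse(item.get(field) or "").hostname
        if host:
            hosts.add(host.lower())
    return sorted(hosts)


if __name__ == "__main__":
    sample = [
        '{"link": "https://example.com/about"}',
        '{"link": "https://blog.example.com/post/1"}',
        '{"link": "mailto:admin@example.com"}',
    ]
    print(hosts_from_jsonl(sample))
```

Because the function accepts any iterable of lines, an open file handle works directly: `with open("out.jsonl") as fh: hosts_from_jsonl(fh)`.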
