Scrapy Cheatsheet

Type: Fast, high-level web crawling and scraping framework for Python — used to build custom spiders that extract links, emails, subdomains, and other data from target sites


Installation

# Via pip (recommended)
pip3 install scrapy

# Via apt
sudo apt install python3-scrapy

# Via conda
conda install -c conda-forge scrapy

# Verify
scrapy version

Basic Usage

# Open an interactive shell on a single page (no project needed)
scrapy shell <url>
scrapy shell https://example.com

# Fetch a single page and dump to stdout
scrapy fetch https://example.com

# View the page as Scrapy sees it (opens in browser)
scrapy view https://example.com

# Run a standalone spider file
scrapy runspider myspider.py

Command-Line Tools

Command                             Description
scrapy startproject <name>          Create a new project skeleton
scrapy genspider <name> <domain>    Generate a new spider from template
scrapy crawl <spider>               Run a spider inside a project
scrapy runspider <file.py>          Run a self-contained spider file
scrapy shell <url>                  Interactive scraping shell (test selectors)
scrapy fetch <url>                  Download a page using Scrapy’s downloader
scrapy view <url>                   Open the fetched page in a browser
scrapy parse <url> --spider=<name>  Parse a URL with a spider’s callback
scrapy list                         List available spiders in the project
scrapy settings --get <KEY>         Print a settings value
scrapy bench                        Run a quick benchmark crawl

Common crawl Flags

Flag               Description
-o <file>          Append scraped items to a file (.json, .jsonl, .csv, .xml)
-O <file>          Same as -o but overwrites the file instead of appending
-a <name>=<value>  Pass an argument to the spider (e.g. -a domain=example.com)
-s <KEY>=<value>   Override a setting at runtime
-L <level>         Log level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
--logfile <file>   Write logs to a file
--nolog            Disable logging
-t <format>        Output format when not inferred from extension (deprecated in recent releases; prefer -o <file>:<format>)
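
The -o / -O flags have a settings equivalent, the FEEDS dictionary, which can live in settings.py or in a spider's custom_settings. A minimal sketch, assuming a hypothetical output file named results.jsonl and a Scrapy release recent enough to support FEEDS and its overwrite key:

```python
# settings.py fragment (sketch). The file name is a hypothetical example.
FEEDS = {
    "results.jsonl": {
        "format": "jsonlines",  # one JSON object per line
        "overwrite": True,      # behave like -O; False appends like -o
        "encoding": "utf8",
    },
}
```

With this in place, a plain scrapy crawl <spider> should write the feed without any -o flag.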

Common Commands

# Create a project
scrapy startproject recon

# Generate a spider scoped to a domain
cd recon
scrapy genspider links example.com

# Run the spider and export results
scrapy crawl links -o results.json

# Run a standalone spider with output
scrapy runspider spider.py -o output.jsonl

# Pass arguments into a spider
scrapy crawl links -a domain=example.com -a depth=2

# Override settings on the fly (ignore robots.txt, wait 1 s between requests)
scrapy crawl links -s ROBOTSTXT_OBEY=False -s DOWNLOAD_DELAY=1

# Limit log noise
scrapy crawl links -L WARNING -o out.csv

# Test CSS / XPath selectors interactively
scrapy shell "https://example.com"

Inside the Scrapy Shell

# After: scrapy shell "https://example.com"

response.url                          # Current URL
response.status                       # HTTP status code
response.headers                      # Response headers

# CSS selectors
response.css('a::attr(href)').getall()        # All link hrefs
response.css('title::text').get()             # Page title

# XPath selectors
response.xpath('//a/@href').getall()          # All link hrefs
response.xpath('//img/@src').getall()         # All image sources

# Fetch another page in the same session (replaces `response`)
fetch('https://example.com/about')

# Regex over the body (e.g. emails)
import re
re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', response.text)
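
The regex above can pick up trailing punctuation (for example the full stop in "mail a@b.com.") and returns duplicates. A small helper, sketched here with the hypothetical name extract_emails, tidies the matches; it uses only the standard library, so it can be pasted into the shell as-is:

```python
import re

# Same pattern as in the shell example above.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def extract_emails(text):
    """Return unique, lower-cased addresses in order of first appearance."""
    seen = {}
    for match in EMAIL_RE.findall(text):
        # Drop trailing dots or hyphens swallowed from sentence punctuation.
        email = match.rstrip(".-").lower()
        seen.setdefault(email, None)
    return list(seen)
```

In the shell this would be called as extract_emails(response.text).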

Minimal Recon Spider

# save as spider.py, run with: scrapy runspider spider.py -o out.json
import scrapy
from urllib.parse import urlparse

class ReconSpider(scrapy.Spider):
    name = "recon"
    start_urls = ["https://example.com"]

    def parse(self, response):
        # Emit every link, and follow only those on the same host
        for href in response.css('a::attr(href)').getall():
            url = response.urljoin(href)
            yield {"link": url}
            # mailto: / javascript: links have an empty netloc, so they are never followed
            if urlparse(url).netloc == urlparse(response.url).netloc:
                yield response.follow(url, callback=self.parse)
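
The netloc comparison in the spider treats www.example.com and example.com as different hosts, so links to subdomains are emitted but never followed. If the authorised scope is a whole domain, a looser check may fit better. A minimal sketch using only the standard library; in_scope and root_domain are hypothetical names, not part of Scrapy:

```python
from urllib.parse import urlparse


def in_scope(url, root_domain):
    """True if url's host is root_domain or one of its subdomains."""
    host = urlparse(url).hostname  # None for mailto:, javascript:, etc.
    if host is None:
        return False
    root = root_domain.lower().lstrip(".")
    # The leading dot prevents notexample.com from matching example.com.
    return host == root or host.endswith("." + root)
```

Inside parse, the condition could then read in_scope(response.urljoin(href), "example.com"). Setting allowed_domains on the spider is the built-in way to get similar filtering.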

Useful Settings (settings.py / -s overrides)

Setting               Description
ROBOTSTXT_OBEY        Whether to honour robots.txt (True in settings.py generated by startproject; the built-in default is False)
DOWNLOAD_DELAY        Seconds to wait between requests to the same domain (politeness / rate limiting)
CONCURRENT_REQUESTS   Max simultaneous requests across all domains
DEPTH_LIMIT           Max crawl depth (0 = unlimited)
USER_AGENT            Custom User-Agent string
RETRY_TIMES           Number of retries on failed requests
HTTPCACHE_ENABLED     Cache responses locally to avoid re-fetching
AUTOTHROTTLE_ENABLED  Auto-adjust delay based on observed latency and server load

# Example: stealthier crawl
scrapy crawl recon \
  -s DOWNLOAD_DELAY=2 \
  -s CONCURRENT_REQUESTS=2 \
  -s AUTOTHROTTLE_ENABLED=True \
  -s USER_AGENT="Mozilla/5.0"
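
The same overrides can be made permanent in the project's settings.py instead of being repeated on the command line. A sketch that simply mirrors the values from the command above (the values are illustrative, not recommendations):

```python
# settings.py fragment (sketch) mirroring the -s overrides shown above.
DOWNLOAD_DELAY = 2            # seconds between requests to the same domain
CONCURRENT_REQUESTS = 2       # keep parallelism low
AUTOTHROTTLE_ENABLED = True   # let Scrapy adapt the delay to latency
USER_AGENT = "Mozilla/5.0"    # placeholder string from the example above
```

Any -s KEY=value given at runtime still takes precedence over these file-level values.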

Notes

  • Active — Scrapy makes real HTTP requests to the target; only crawl in-scope assets.
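  • Set ROBOTSTXT_OBEY=False only when authorised. Projects created with startproject enable it in settings.py; standalone runspider scripts fall back to the built-in default (False) unless it is set explicitly.
  • Use DOWNLOAD_DELAY / AUTOTHROTTLE_ENABLED to avoid hammering targets and tripping WAFs.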
  • Great base for custom recon crawlers — [[reconspider]] is a Scrapy-based spider built exactly for this.
  • Export to JSON/JSONL then post-process with jq to extract emails, subdomains, and links.
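
Where jq is not available, the post-processing step can be approximated in Python. The sketch below assumes a JSONL export whose items look like those of the Minimal Recon Spider ({"link": "..."}); hosts_from_jsonl is a hypothetical helper name:

```python
import json
from urllib.parse import urlparse


def hosts_from_jsonl(lines):
    """Return the sorted, unique hostnames found in the "link" fields."""
    hosts = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        item = json.loads(line)
        host = urlparse(item.get("link", "")).hostname
        if host:  # mailto: and similar links have no hostname
            hosts.add(host)
    return sorted(hosts)
```

A file object can be passed directly, for example: with open("out.jsonl") as fh: print("\n".join(hosts_from_jsonl(fh)))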