Web Crawling

Scrapy Cheatsheet

Type: Fast, high-level web crawling and scraping framework for Python, used to build custom spiders that extract links, emails, subdomains, and other data from target sites.

Installation

    # Via pip (recommended)
    pip3 install scrapy

    # Via apt
    sudo apt install python3-scrapy

    # Via conda
    conda install -c conda-forge scrapy

    # Verify
    scrapy version

Basic Usage

    # Open an interactive shell on a URL (no project needed)
    scrapy shell <url>
    scrapy shell https://example.com

    # Fetch a single page and dump to stdout
    scrapy fetch https://example.com

    # View the page as Scrapy sees it (opens in browser)
    scrapy view https://example.com

    # Run a standalone spider file
    scrapy runspider myspider.py

Command-Line Tools

| Command | Description |
|---|---|
| `scrapy startproject <name>` | Create a new project skeleton |
| `scrapy genspider <name> <domain>` | Generate a new spider from a template |
| `scrapy crawl <spider>` | Run a spider inside a project |
| `scrapy runspider <file.py>` | Run a self-contained spider file |
| `scrapy shell <url>` | Interactive scraping shell (test selectors) |
| `scrapy fetch <url>` | Download a page using Scrapy's downloader |
| `scrapy view <url>` | Open the fetched page in a browser |
| `scrapy parse <url> --spider=<name>` | Parse a URL with a spider's callback |
| `scrapy list` | List available spiders in the project |
| `scrapy settings --get <KEY>` | Print a settings value |
| `scrapy bench` | Run a quick benchmark crawl |

Common crawl Flags

| Flag | Description |
|---|---|
| `-o <file>` | Append scraped items to a file (.json, .jsonl, .csv, .xml) |
| `-O <file>` | Same as `-o` but overwrites instead of appending |
| `-a <name>=<value>` | Pass an argument to the spider (e.g. `-a domain=example.com`) |
| `-s <KEY>=<value>` | Override a setting at runtime |
| `-L <level>` | Log level (DEBUG, INFO, WARNING, ERROR) |
| `--logfile <file>` | Write logs to a file |
| `--nolog` | Disable logging |
| `-t <format>` | Output format when not inferred from the extension (deprecated in newer Scrapy releases; prefer `-o <file>:<format>`) |

Common Commands

    # Create a project
    scrapy startproject recon

    # Generate a spider scoped to a domain
    cd recon
    scrapy genspider links example.com

    # Run the spider and export results
    scrapy crawl links -o results.json

    # Run a standalone spider with output
    scrapy runspider spider.py -o output.jsonl

    # Pass arguments into a spider
    scrapy crawl links -a domain=example.com -a depth=2

    # Override settings on the fly (ignore robots.txt, set a delay)
    scrapy crawl links -s ROBOTSTXT_OBEY=False -s DOWNLOAD_DELAY=1

    # Limit log noise
    scrapy crawl links -L WARNING -o out.csv

    # Test CSS / XPath selectors interactively
    scrapy shell "https://example.com"

Inside the Scrapy Shell

    # After: scrapy shell "https://example.com"
    response.url        # Current URL
    response.status     # HTTP status code
    response.headers    # Response headers

    # CSS selectors
    response.css('a::attr(href)').getall()    # All link hrefs
    response.css('title::text').get()         # Page title

    # XPath selectors
    response.xpath('//a/@href').getall()      # All link hrefs
    response.xpath('//img/@src').getall()     # All image sources

    # Fetch another page (replaces the current response object)
    fetch('https://example.com/about')

    # Regex over the body (e.g. emails)
    import re
    re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', response.text)

Minimal Recon Spider

    # save as spider.py, run with: scrapy runspider spider.py -o out.json
    import scrapy
    from urllib.parse import urlparse


    class ReconSpider(scrapy.Spider):
        name = "recon"
        start_urls = ["https://example.com"]

        def parse(self, response):
            # Collect all links and follow same-domain ones
            for href in response.css('a::attr(href)').getall():
                url = response.urljoin(href)  # resolve relative hrefs
                yield {"link": url}
                # Only recurse into links on the same host as the current page
                if urlparse(url).netloc == urlparse(response.url).netloc:
                    yield response.follow(href, callback=self.parse)

Useful Settings (settings.py / -s overrides)

| Setting | Description |
|---|---|
| `ROBOTSTXT_OBEY` | Whether to honour robots.txt (True in the settings.py generated by `startproject`; the built-in fallback has historically been False, so check it for standalone `runspider` runs) |
| `DOWNLOAD_DELAY` | Seconds between requests (politeness / rate limiting) |
| `CONCURRENT_REQUESTS` | Max simultaneous requests |
| `DEPTH_LIMIT` | Max crawl depth (0 = unlimited) |
| `USER_AGENT` | Custom User-Agent string |
| `RETRY_TIMES` | Number of retries on failed requests |
| `HTTPCACHE_ENABLED` | Cache responses locally to avoid re-fetching |
| `AUTOTHROTTLE_ENABLED` | Auto-adjust delay based on server latency and load |

    # Example: stealthier crawl
    scrapy crawl recon \
        -s DOWNLOAD_DELAY=2 \
        -s CONCURRENT_REQUESTS=2 \
        -s AUTOTHROTTLE_ENABLED=True \
        -s USER_AGENT="Mozilla/5.0"

Notes

- Active: Scrapy makes real HTTP requests to the target; only crawl in-scope assets.
- Set ROBOTSTXT_OBEY=False only when authorised; projects created with `startproject` respect robots.txt by default.
- Use DOWNLOAD_DELAY / AUTOTHROTTLE to avoid hammering targets and tripping WAFs.
- Great base for custom recon crawlers: [[reconspider]] is a Scrapy-based spider built exactly for this.
- Export to JSON/JSONL, then post-process with jq to extract emails, subdomains, and links.
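
The post-processing step in the last note can also be done in Python rather than jq. The following is a minimal sketch, assuming the crawl was exported as JSON Lines (e.g. `-o out.jsonl`) and that every item has the `{"link": "<url>"}` shape yielded by the minimal spider. The function name `summarise_links` and the sample data are hypothetical, not part of Scrapy.

```python
import json
from urllib.parse import urlparse


def summarise_links(jsonl_lines, domain):
    """Return sorted unique hostnames at or under `domain`.

    Assumption: each non-blank line is one JSON object with a "link" key.
    """
    hosts = set()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        item = json.loads(line)
        # .hostname is lower-cased and is None for schemes like mailto:
        host = urlparse(item.get("link", "")).hostname
        # Keep the apex domain and anything ending in ".<domain>"
        if host and (host == domain or host.endswith("." + domain)):
            hosts.add(host)
    return sorted(hosts)


if __name__ == "__main__":
    # Hypothetical sample standing in for the contents of out.jsonl
    sample = [
        '{"link": "https://example.com/about"}',
        '{"link": "https://blog.example.com/post/1"}',
        '{"link": "https://cdn.other.net/lib.js"}',
        '{"link": "mailto:admin@example.com"}',
    ]
    print(summarise_links(sample, "example.com"))
```

Against a real export, an open file handle can be passed directly, since the function only iterates over lines: `summarise_links(open("out.jsonl"), "example.com")`. Matching on `"." + domain` rather than a bare suffix keeps look-alike hosts such as `evil-example.com` out of the result.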