ReconSpider Cheatsheet
Type: Custom Scrapy-based web crawler that maps a target site and harvests emails, links, external files, JavaScript files, form fields, media URLs, and HTML comments into a single JSON report
Installation
# Download the spider (HTB Academy distribution)
wget -O ReconSpider.zip \
https://academy.hackthebox.com/storage/modules/144/ReconSpider.v1.06.zip
unzip ReconSpider.zip
# Requires Scrapy (and Python 3)
pip3 install scrapy
ReconSpider is a single Python script (ReconSpider.py) built on top of [[scrapy]]. There is no system package; it runs directly with python3.
Basic Usage
python3 ReconSpider.py <target_url>
python3 ReconSpider.py https://example.com
Results are written to results.json in the current directory.
What It Collects
| Field | Description |
|---|---|
| emails | Email addresses found in page content |
| links | Internal links discovered while crawling |
| external_files | Links to documents (PDF, DOCX, XLSX, etc.) |
| js_files | JavaScript files referenced by the site |
| form_fields | Input field names from HTML forms |
| images | Image URLs |
| videos | Video URLs |
| audio | Audio file URLs |
| comments | HTML comments left in the source |
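As a quick sanity check after a crawl, the report can be compared against the field list above. The following is a minimal sketch, assuming each field in results.json holds a JSON array (as the jq examples in this cheatsheet imply). The helper names `load_report` and `summarise` are hypothetical and not part of ReconSpider.

```python
import json
from pathlib import Path

# Field names copied from the table above. That each one maps to a list
# is an assumption about the report layout.
EXPECTED_FIELDS = [
    "emails", "links", "external_files", "js_files", "form_fields",
    "images", "videos", "audio", "comments",
]


def load_report(path="results.json"):
    """Load a ReconSpider report from disk (hypothetical helper)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def summarise(report):
    """Return {field: entry count}, counting missing fields as 0."""
    return {field: len(report.get(field, [])) for field in EXPECTED_FIELDS}


if __name__ == "__main__":
    # Inline sample data so the sketch runs without a real crawl.
    sample = {
        "emails": ["info@example.com"],
        "links": ["https://example.com/", "https://example.com/about"],
        "comments": [],
    }
    for field, count in summarise(sample).items():
        print(f"{field:<15} {count}")
```

For a real run, pass `load_report()` to `summarise` instead of the sample dictionary. Fields that come back as 0 are either genuinely empty or absent from the report.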
Common Commands
# Crawl a target (writes results.json)
python3 ReconSpider.py https://example.com
# Pretty-print the full report
cat results.json | jq .
# Extract just the emails
cat results.json | jq '.emails'
# Extract all discovered internal links
cat results.json | jq '.links[]'
# Pull out referenced JavaScript files (good for further analysis)
cat results.json | jq '.js_files[]'
# List any external documents (PDFs, office files, etc.)
cat results.json | jq '.external_files[]'
# Show HTML comments (may leak dev notes / credentials)
cat results.json | jq '.comments[]'
# Grab form field names (useful for fuzzing later)
cat results.json | jq '.form_fields[]'
Parsing Output with jq
# Count results per category
cat results.json | jq 'to_entries | map({key, count: (.value | length)})'
# Unique hostnames (including subdomains) found in the link list
cat results.json | jq -r '.links[]' \
| sed -E 's#https?://([^/]+).*#\1#' | sort -u
# Build a list of referenced JS files to feed into other tools
cat results.json | jq -r '.js_files[]' > js_targets.txt
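The sed-based host extraction above is quick but treats every line as a URL. A sketch of the same idea using Python's standard URL parser is shown below; it skips relative links and strips ports. The function name `unique_hosts` is hypothetical, and the input is assumed to be the list stored under the links field.

```python
from urllib.parse import urlsplit


def unique_hosts(links):
    """Return the sorted, de-duplicated hostnames found in a list of URLs."""
    hosts = set()
    for link in links:
        try:
            host = urlsplit(link).hostname
        except ValueError:
            # Malformed URL (for example a broken IPv6 literal): skip it.
            continue
        if host:  # None for relative links, mailto:, and similar
            hosts.add(host)
    return sorted(hosts)


if __name__ == "__main__":
    # Inline sample data standing in for the links field of results.json.
    sample_links = [
        "https://example.com",
        "https://shop.example.com:8443/cart",
        "/about",
        "https://example.com/contact",
    ]
    for host in unique_hosts(sample_links):
        print(host)
```

The resulting list can be written to a file, one host per line, and passed to tools that read hostnames from standard input.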
Typical Workflow
# 1. Crawl the target
python3 ReconSpider.py https://inlanefreight.com
# 2. Review the harvested data
cat results.json | jq .
# 3. Pivot on findings:
# - emails -> phishing / OSINT / password spraying lists
# - js_files -> grep for API keys, endpoints, secrets
# - comments -> developer notes, hidden paths
# - form_fields -> input names for ffuf / parameter fuzzing
# - links -> extract subdomains, feed to httprobe / nmap
cat results.json | jq -r '.links[]' | sed -E 's#https?://([^/]+).*#\1#' \
| sort -u | httprobe
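For the comments pivot in step 3, a keyword filter can surface the entries worth reading first. This is a minimal sketch: the keyword list is an illustrative assumption, not something ReconSpider provides, and `flag_comments` is a hypothetical helper that expects the list stored under the comments field.

```python
import re

# Illustrative keywords only; tune these per engagement.
KEYWORDS = ("password", "passwd", "secret", "api_key", "apikey",
            "token", "todo", "fixme")
_PATTERN = re.compile("|".join(re.escape(k) for k in KEYWORDS), re.IGNORECASE)


def flag_comments(comments):
    """Return (comment, matched_keywords) pairs for comments containing a keyword."""
    flagged = []
    for comment in comments:
        hits = sorted({m.group(0).lower() for m in _PATTERN.finditer(comment)})
        if hits:
            flagged.append((comment.strip(), hits))
    return flagged


if __name__ == "__main__":
    # Inline sample data standing in for the comments field of results.json.
    sample_comments = [
        "<!-- TODO: remove test password before go-live -->",
        "<!-- main navigation -->",
    ]
    for comment, hits in flag_comments(sample_comments):
        print(", ".join(hits), "->", comment)
```

A keyword match only prioritises manual review; comments without a match can still contain hidden paths or developer notes.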
Notes
- Active — ReconSpider sends live requests and crawls the target; stay within authorised scope.
- Output is always results.json in the working directory; rename it between runs to avoid overwriting.
- Built on [[scrapy]]; for finer control (depth limits, delays, custom selectors) drive Scrapy directly.
- The comments, js_files, and form_fields outputs are the highest-value findings for follow-up testing.
- Commonly featured in the HTB Academy Information Gathering – Web Edition module for site crawling.
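
Because each run overwrites results.json, one option is to move the report to a per-target, timestamped name after every crawl. The sketch below assumes that naming scheme; `archive_name` and `archive_report` are hypothetical helpers, not ReconSpider features.

```python
import shutil
import time
from pathlib import Path
from urllib.parse import urlsplit


def archive_name(target_url, stamp):
    """Build a file name such as results-<host>-<stamp>.json."""
    host = urlsplit(target_url).hostname or "unknown"
    return f"results-{host}-{stamp}.json"


def archive_report(target_url, src="results.json", dest_dir="."):
    """Move the latest report to a timestamped name and return the new path."""
    src_path = Path(src)
    if not src_path.exists():
        raise FileNotFoundError(f"{src} not found; run the crawl first")
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = Path(dest_dir) / archive_name(target_url, stamp)
    shutil.move(str(src_path), str(dest))
    return dest


if __name__ == "__main__":
    # Only prints the name that would be used; no files are touched here.
    print(archive_name("https://inlanefreight.com", time.strftime("%Y%m%d-%H%M%S")))
```

Calling `archive_report("https://inlanefreight.com")` right after a crawl keeps earlier reports available for comparison between runs.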