ReconSpider Cheatsheet

Type: Custom Scrapy-based web crawler that crawls a target site and harvests emails, links, external files, JavaScript files, form fields, media URLs, and HTML comments into a single JSON report


Installation

# Download the spider (HTB Academy distribution)
wget -O ReconSpider.zip \
  https://academy.hackthebox.com/storage/modules/144/ReconSpider.v1.06.zip
unzip ReconSpider.zip

# Requires Scrapy (and Python 3)
pip3 install scrapy

ReconSpider is a single Python script (ReconSpider.py) built on top of [[scrapy]]. There is no system package — it runs directly with python3.


Basic Usage

python3 ReconSpider.py <target_url>
python3 ReconSpider.py https://example.com

Results are written to results.json in the current directory.
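Since every run writes to the same results.json, it can help to move the report aside once a crawl finishes. The following is a minimal sketch, not part of ReconSpider itself; the helper name archive_report and the naming scheme (label plus UTC timestamp) are assumptions chosen for illustration.

```python
import shutil
import time
from pathlib import Path


def archive_report(report: Path, label: str) -> Path:
    """Move `report` to '<stem>-<label>-<UTC timestamp><suffix>' in the same directory.

    Hypothetical helper: ReconSpider does not ship anything like this.
    """
    stamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime())
    target = report.with_name(f"{report.stem}-{label}-{stamp}{report.suffix}")
    shutil.move(str(report), str(target))
    return target
```

Calling archive_report(Path("results.json"), "example.com") after a crawl would leave a file such as results-example.com-<timestamp>.json and free the default name for the next run.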


What It Collects

Field           Description
emails          Email addresses found in page content
links           Internal links discovered while crawling
external_files  Links to documents (PDF, DOCX, XLSX, etc.)
js_files        JavaScript files referenced by the site
form_fields     Input field names from HTML forms
images          Image URLs
videos          Video URLs
audio           Audio file URLs
comments        HTML comments left in the source
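To get a quick feel for a report, the field list above can be turned into a per-field count. This is a rough sketch that assumes each field in results.json is a JSON array; the function names load_report and summarize are illustrative, not part of the tool.

```python
import json
from pathlib import Path

# Field names taken from the field list above.
FIELDS = [
    "emails", "links", "external_files", "js_files", "form_fields",
    "images", "videos", "audio", "comments",
]


def load_report(path: str = "results.json") -> dict:
    """Read a ReconSpider report from disk (assumes a single JSON object)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def summarize(report: dict) -> dict:
    """Return {field: item count}, treating a missing field as empty."""
    return {name: len(report.get(name, [])) for name in FIELDS}
```

A typical call would be summarize(load_report()), run from the directory that holds results.json.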

Common Commands

# Crawl a target (writes results.json)
python3 ReconSpider.py https://example.com

# Pretty-print the full report
cat results.json | jq .

# Extract just the emails
cat results.json | jq '.emails'

# Extract all discovered internal links
cat results.json | jq '.links[]'

# Pull out referenced JavaScript files (good for further analysis)
cat results.json | jq '.js_files[]'

# List any external documents (PDFs, office files, etc.)
cat results.json | jq '.external_files[]'

# Show HTML comments (may leak dev notes / credentials)
cat results.json | jq '.comments[]'

# Grab form field names (useful for fuzzing later)
cat results.json | jq '.form_fields[]'

Parsing Output with jq

# Count results per category
cat results.json | jq 'to_entries | map({key, count: (.value | length)})'

# Unique hostnames (including any subdomains) inside the link list
cat results.json | jq -r '.links[]' \
  | sed -E 's#https?://([^/]+).*#\1#' | sort -u

# Build a target list of referenced JS files to feed into other tools
cat results.json | jq -r '.js_files[]' > js_targets.txt
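The hostname extraction can also be done with a URL parser instead of a regular expression, which avoids edge cases such as ports, relative links, and mailto: entries. A minimal sketch using only the standard library; unique_hosts is a hypothetical helper name.

```python
from urllib.parse import urlsplit


def unique_hosts(links):
    """Return sorted, de-duplicated hostnames from absolute http(s) URLs.

    Relative links, other schemes, and unparsable entries are skipped.
    """
    hosts = set()
    for link in links:
        try:
            parts = urlsplit(link)
            host = parts.hostname  # lower-cased, with any port removed
        except ValueError:
            continue  # e.g. a malformed IPv6 literal
        if parts.scheme in ("http", "https") and host:
            hosts.add(host)
    return sorted(hosts)
```

Fed with the links array from results.json, the returned list can be written one host per line for tools that read hostnames from a file or stdin.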

Typical Workflow

# 1. Crawl the target
python3 ReconSpider.py https://inlanefreight.com

# 2. Review the harvested data
cat results.json | jq .

# 3. Pivot on findings:
#    - emails       -> phishing / OSINT / password spraying lists
#    - js_files     -> grep for API keys, endpoints, secrets
#    - comments     -> developer notes, hidden paths
#    - form_fields  -> input names for ffuf / parameter fuzzing
#    - links        -> extract subdomains, feed to httprobe / nmap
cat results.json | jq -r '.links[]' | sed -E 's#https?://([^/]+).*#\1#' \
  | sort -u | httprobe
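The "grep for API keys, endpoints, secrets" pivot can be sketched as a small pattern scan over JavaScript or comment text that has already been fetched. The patterns below are illustrative assumptions only (an AWS-style access key ID, a quoted value assigned to a secret-looking name, and a quoted /api/ or /vN/ path); they will miss plenty and produce false positives, and scan_text is a hypothetical helper, not part of ReconSpider.

```python
import re

# Illustrative patterns only; dedicated secret scanners use far larger rule sets.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "assigned_secret": re.compile(
        r"(?i)\b(api[_-]?key|secret|token|passw(?:or)?d)\b\s*[:=]\s*['\"]([^'\"]{8,})['\"]"
    ),
    "url_path": re.compile(r"['\"](/(?:api|v\d+)/[A-Za-z0-9_\-/.]*)['\"]"),
}


def scan_text(text):
    """Return a sorted list of (pattern_name, matched_text) pairs found in `text`."""
    hits = set()
    for name, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.add((name, match.group(0)))
    return sorted(hits)
```

Downloading each entry in js_files is left out on purpose; the function only inspects text handed to it, so it can be applied equally to the comments array.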

Notes

  • Active — ReconSpider sends live requests and crawls the target; stay within authorised scope.
  • Output is always results.json in the working directory — rename it between runs to avoid overwriting.
  • Built on [[scrapy]]; for finer control (depth limits, delays, custom selectors) drive Scrapy directly.
  • The comments, js_files, and form_fields outputs are often the most useful starting points for follow-up testing.
  • Featured in the HTB Academy Information Gathering – Web Edition module for site crawling.
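For the "drive Scrapy directly" note, depth limits and delays map onto standard Scrapy settings. The fragment below uses real Scrapy setting names, but the values are illustrative assumptions, and it is meant for a spider you write yourself; whether ReconSpider.py honours any of them is not something this cheatsheet confirms.

```python
# Real Scrapy setting names; the values are illustrative assumptions.
POLITE_SETTINGS = {
    "DEPTH_LIMIT": 2,                     # stop following links beyond two hops
    "DOWNLOAD_DELAY": 1.0,                # seconds to wait between requests
    "CONCURRENT_REQUESTS_PER_DOMAIN": 2,  # cap parallel requests per host
    "AUTOTHROTTLE_ENABLED": True,         # adapt the delay to server latency
    "ROBOTSTXT_OBEY": True,               # respect robots.txt rules
}
```

In a custom spider this dictionary would typically be assigned to the custom_settings class attribute of a scrapy.Spider subclass, or the same names set in the project's settings.py.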