The Surface That Already Exists
Before you run a port scan or send a single HTTP request to a target, a significant portion of the attack surface is already documented in two public archives: Certificate Transparency (CT) logs and the Wayback Machine.
Certificate Transparency logs record publicly trusted TLS certificates as they are
issued, including all the Subject Alternative Names. If api.internal.target.com ever
had a cert from a public CA, it’s almost certainly searchable on crt.sh; hosts covered
only by a wildcard cert or by a private CA will not appear by name. The Wayback
Machine’s CDX API exposes every URL it crawled for a domain: paths, query parameters,
and endpoints that may no longer be in the sitemap but still respond to requests.
corpus-builder queries both sources and stores the results in a SQLite database
you can query across engagements.
Two Sources, One Store
CT log mining. The ct module queries crt.sh’s JSON API for each domain in
your list:
https://crt.sh/?q=%.target.com&output=json
Each entry contains a name_value field, which may be a single hostname or a
newline-separated block of SANs. The miner splits those blocks, normalizes to
lowercase, strips wildcards, and only keeps entries that are proper subdomains
of the queried domain — not sibling domains that happened to share a cert.
corpus-builder mine-ct \
--domains targets.txt \
--db corpus.db \
--rate-limit-ms 500
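The name_value handling described above can be sketched in a few lines of Python. This is an illustration of the split, lowercase, strip-wildcard, and proper-subdomain steps, not the tool's actual code; the function name extract_subdomains is hypothetical.

```python
def extract_subdomains(name_value: str, domain: str) -> set[str]:
    """Split a crt.sh name_value block and keep proper subdomains of `domain`."""
    suffix = "." + domain.lower().strip(".")
    found = set()
    for raw in name_value.splitlines():
        name = raw.strip().lower().rstrip(".")
        # Strip a leading wildcard label: "*.dev.target.com" -> "dev.target.com"
        if name.startswith("*."):
            name = name[2:]
        # The leading dot in `suffix` rejects the apex itself and look-alike
        # siblings such as "nottarget.com" that merely share a cert.
        if name.endswith(suffix) and len(name) > len(suffix):
            found.add(name)
    return found


block = "API.target.com\n*.dev.target.com\nnottarget.com\ntarget.com"
print(sorted(extract_subdomains(block, "target.com")))
# ['api.target.com', 'dev.target.com']
```

Matching on the dotted suffix rather than a plain substring is what keeps sibling domains out of the corpus.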
Wayback CDX mining. The Wayback Machine’s CDX API returns every archived URL
matching a wildcard. The wayback module queries:
https://web.archive.org/cdx/search/cdx?url=*.target.com/*
&output=json&fl=original&collapse=urlkey
&filter=statuscode:200&limit=50000
The collapse=urlkey parameter collapses repeated captures of the same canonicalized
URL into a single row, so a page archived 10,000 times appears once. URLs that differ
in path or query string (pagination variants, for example) remain separate rows. The
filter=statuscode:200 restriction keeps the path list relevant: 404s from ten
years ago aren’t interesting.
Extracted paths go into the corpus store keyed by domain. A path like
/api/v1/users/export found in Wayback history for api.target.com will show
up when you query that host.
corpus-builder mine-wayback \
--domains targets.txt \
--db corpus.db \
--rate-limit-ms 2000
The Wayback CDX endpoint is slower and rate-limited more aggressively than crt.sh, hence the higher default interval.
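As a rough sketch of the Wayback step, the Python below builds the query shown above and groups the returned URLs by host. With output=json the CDX API returns an array of rows whose first row is the field-name header. The function names are hypothetical, and dropping the query string to store only the path is an assumption about how the corpus is keyed.

```python
import json
from urllib.parse import urlencode, urlsplit

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(domain: str, limit: int = 50000) -> str:
    """Build the CDX query from the article for one root domain."""
    params = [
        ("url", f"*.{domain}/*"),
        ("output", "json"),
        ("fl", "original"),
        ("collapse", "urlkey"),
        ("filter", "statuscode:200"),
        ("limit", str(limit)),
    ]
    # Keep '*', '/' and ':' literal so the URL matches the documented form.
    return f"{CDX_ENDPOINT}?{urlencode(params, safe='*/:')}"


def paths_by_host(cdx_json: str) -> dict[str, set[str]]:
    """Group archived paths by hostname from a CDX output=json response body."""
    rows = json.loads(cdx_json) if cdx_json.strip() else []
    grouped: dict[str, set[str]] = {}
    for row in rows[1:]:  # rows[0] is the header, e.g. ["original"]
        try:
            parts = urlsplit(row[0])
        except ValueError:
            continue  # skip malformed archived URLs
        host = (parts.hostname or "").lower()
        if host:
            # Assumption: store the path only, without the query string.
            grouped.setdefault(host, set()).add(parts.path or "/")
    return grouped
```

Fetching the URL is left out on purpose; any HTTP client works, as long as it sleeps between requests to respect the rate limit.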
The SQLite Store
Both sources write into the same Corpus struct backed by a local SQLite file.
The schema tracks three things: domain roots, subdomains with their discovery source
(ct_log or wayback), and paths keyed by domain.
corpus-builder stats --db corpus.db
# Corpus: 3 domains, 247 subdomains, 8,412 paths
corpus-builder query --domain target.com --db corpus.db
# Subdomains for target.com (247 total):
# api.target.com
# admin.target.com
# ...
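The real schema isn't shown here, so the following is only a guess at what a three-table layout of this kind could look like, written with Python's built-in sqlite3 module. Every table name, column name, and helper function below is an assumption made for illustration.

```python
import sqlite3

# Hypothetical schema: domain roots, subdomains tagged with their discovery
# source, and paths attached to the host they were seen on.
SCHEMA = """
CREATE TABLE IF NOT EXISTS domains (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE IF NOT EXISTS subdomains (
    id        INTEGER PRIMARY KEY,
    domain_id INTEGER NOT NULL REFERENCES domains(id),
    name      TEXT NOT NULL,
    source    TEXT NOT NULL CHECK (source IN ('ct_log', 'wayback')),
    UNIQUE (name, source)
);
CREATE TABLE IF NOT EXISTS paths (
    id        INTEGER PRIMARY KEY,
    domain_id INTEGER NOT NULL REFERENCES domains(id),
    host      TEXT NOT NULL,
    path      TEXT NOT NULL,
    UNIQUE (host, path)
);
"""


def open_corpus(db_path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(db_path)
    conn.executescript(SCHEMA)
    return conn


def _domain_id(conn: sqlite3.Connection, domain: str) -> int:
    conn.execute("INSERT OR IGNORE INTO domains (name) VALUES (?)", (domain,))
    row = conn.execute("SELECT id FROM domains WHERE name = ?", (domain,)).fetchone()
    return row[0]


def add_subdomain(conn, domain: str, name: str, source: str) -> None:
    # The UNIQUE constraint makes repeated mining runs idempotent.
    conn.execute(
        "INSERT OR IGNORE INTO subdomains (domain_id, name, source) VALUES (?, ?, ?)",
        (_domain_id(conn, domain), name, source),
    )


def add_path(conn, domain: str, host: str, path: str) -> None:
    conn.execute(
        "INSERT OR IGNORE INTO paths (domain_id, host, path) VALUES (?, ?, ?)",
        (_domain_id(conn, domain), host, path),
    )


def stats(conn) -> tuple[int, int, int]:
    """Return (domains, distinct subdomains, paths), like the stats command."""
    queries = (
        "SELECT COUNT(*) FROM domains",
        "SELECT COUNT(DISTINCT name) FROM subdomains",
        "SELECT COUNT(*) FROM paths",
    )
    return tuple(conn.execute(q).fetchone()[0] for q in queries)
```

Counting distinct names keeps a host found in both CT logs and Wayback from being reported twice.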
The database is separate from any individual engagement: it’s program-level intelligence that accumulates across sessions. When you start a new engagement against a target you’ve seen before, the corpus already has years of historical surface data. When subdomain enumeration finds something new, it gets compared against what the corpus already holds from CT logs and Wayback.
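Comparing fresh enumeration results against the corpus comes down to a set difference. The sketch below assumes a hypothetical subdomains table with a name column; the function name is also made up for illustration.

```python
import sqlite3


def split_new_and_known(conn: sqlite3.Connection, candidates):
    """Return (new, known) hostnames relative to what the corpus holds.

    Assumes a table `subdomains` with a `name` column (hypothetical schema).
    """
    known = {row[0] for row in conn.execute("SELECT DISTINCT name FROM subdomains")}
    cleaned = {c.strip().lower() for c in candidates if c.strip()}
    return sorted(cleaned - known), sorted(cleaned & known)


# Usage with a throwaway in-memory corpus:
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE subdomains (name TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO subdomains (name) VALUES (?)",
    [("api.target.com",), ("admin.target.com",)],
)
new, known = split_new_and_known(conn, ["API.target.com", "beta.target.com"])
print(new)    # ['beta.target.com']
print(known)  # ['api.target.com']
```

Names in the "new" list are the ones worth a closer look, since neither passive source has seen them before.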
What This Catches
Bug bounty programs have long-lived targets. An admin panel that was accessible
three years ago and then moved behind a VPN still shows up in Wayback. When an API
version path (/api/v2/) replaces an older one (/api/v1/), the old routes often stay
live even after the documentation drops them.
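CT logs catch infrastructure changes: a short-lived subdomain that has since been removed from DNS, a staging environment that had a cert issued for it, a deployment that used a different domain before the final name was registered. They record certificates, not IP addresses, so what they show is which names existed and when, not where those names pointed.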
Neither source requires sending a packet to the target. The corpus is built purely from passive public records — which matters both for program rules and for not alerting a WAF before you’ve decided what to test.
Part 8 covers session and payload-engine: how authenticated testing is set up
without storing credentials in plaintext, and how payload selection adapts to
what the fingerprinter found about the target stack.