GeistScope: OSINT Tools

Passive Before Active

Before sending any attack payloads or probing application endpoints, there’s a category of work that’s purely observational: reading what’s already public. Source code repositories, breach databases, document metadata, search engine indexes. None of this touches the target’s infrastructure in any meaningful way, and it frequently produces the highest-value findings in a program.

Six capabilities cover this space in GeistScope; document metadata extraction now lives under mg-artifact-audit metadata.

mg-github: Code Search for Secrets

MG_GITHUB_TOKEN=<token> mg-github target-bounty target.example.com

The tool uses GitHub’s code search API to find references to the target domain in public repositories. It searches for several categories: API keys and tokens, internal domain names appearing in configuration files, hardcoded credentials in deploy scripts, and connection strings.

Rate limits are read from the response headers rather than guessed. The tool reads X-RateLimit-Remaining and X-RateLimit-Reset from every response. When remaining hits zero, it sleeps until the reset timestamp rather than failing or retrying blindly. Without a token, the limit is 10 requests per minute. With a token, it’s 30.
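
The header-driven wait described above can be sketched roughly as follows. This is a minimal illustration, not mg-github's actual code: the function name seconds_until_reset is hypothetical, and it assumes the headers arrive as a plain dict with the exact casing shown (real HTTP clients usually expose case-insensitive lookups).

```python
import time


def seconds_until_reset(headers, now=None):
    """Return how many seconds to sleep before the next request.

    `headers` is assumed to be a plain dict of response headers.
    X-RateLimit-Reset is a Unix timestamp in seconds.
    """
    now = time.time() if now is None else now
    # If the header is missing, assume at least one request is left.
    remaining = int(headers.get("X-RateLimit-Remaining", "1"))
    if remaining > 0:
        return 0.0
    reset_at = int(headers.get("X-RateLimit-Reset", "0"))
    # Never return a negative sleep if the reset time has already passed.
    return max(0.0, reset_at - now)
```

A caller would run time.sleep(seconds_until_reset(response_headers)) before issuing the next search request.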

The search targets the organization’s repos specifically when the GitHub org name can be inferred from the target domain. Common patterns: an employee’s fork of an internal tool that includes a config file with real credentials, a public CI script that references MG_API_KEY, a build artifact accidentally committed with a secrets file still attached.

mg-breach: HIBP Domain Lookup

MG_HIBP_KEY=<key> mg-breach target-bounty target.example.com

The Have I Been Pwned (HIBP) v3 API requires an API key. mg-breach queries for all breaches containing email addresses from the target domain, then fetches breach details for any breach that included password data.

The rate limit is strict: 1 request per 1.5 seconds. The tool enforces this with a sleep between each request rather than hoping the API is forgiving.
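
One way to enforce a fixed gap between requests is a small throttle object. The sketch below is an assumption about how such spacing could be implemented, not mg-breach's own code; the Throttle class and its injectable clock and sleep parameters are hypothetical and exist mainly to keep the logic testable.

```python
import time


class Throttle:
    """Ensure at least `min_interval` seconds between calls to wait()."""

    def __init__(self, min_interval=1.5, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            delay = self.min_interval - (now - self._last)
            if delay > 0:
                self._sleep(delay)
                now = self._clock()
        self._last = now
```

Calling throttle.wait() immediately before every API request keeps the spacing at 1.5 seconds no matter how long the previous response took to process.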

The output tells you which credential dumps contain target-domain email addresses and passwords. This informs password spraying strategy, reveals credential reuse patterns, and is directly relevant to any login endpoint discovered during crawling. A company with employees in a major breach is a candidate for credential stuffing.

mg-social: Employee Enumeration

mg-social target-bounty target.example.com

mg-social enumerates GitHub organization members when a GitHub org can be identified for the target. From the member list, it generates email address candidates by combining usernames and real names against the target domain’s email format (first.last@domain, flast@domain, firstl@domain).
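
The three email formats above can be generated from a real name along these lines. This is a simplified sketch: the function name email_candidates is hypothetical, and it assumes the first and last whitespace-separated tokens are the given name and surname, which will not hold for every name.

```python
def email_candidates(full_name, domain):
    """Build first.last, flast, and firstl candidates for one person."""
    parts = full_name.lower().split()
    if len(parts) < 2:
        return []  # a single token is not enough to build these formats
    first, last = parts[0], parts[-1]
    return [
        f"{first}.{last}@{domain}",    # first.last@domain
        f"{first[0]}{last}@{domain}",  # flast@domain
        f"{first}{last[0]}@{domain}",  # firstl@domain
    ]
```

For example, email_candidates("Jane Smith", "target.example.com") yields jane.smith@, jsmith@, and janes@ addresses on that domain.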

For LinkedIn, the tool generates a dork URL and prints it to stdout. There’s no API for LinkedIn enumeration, so the tool does what it can: give you the query string and let you run it manually. The employee list feeds into mg-breach analysis and social engineering context.

A GitHub token is optional but raises the API rate limit. Without it, member enumeration on large organizations will hit rate limits and produce partial results.

mg-artifact-audit metadata: Document Metadata Extraction

mg-artifact-audit metadata target-bounty

mg-artifact-audit metadata reads the crawl corpus and downloads PDFs, Office documents (DOCX, XLSX, PPTX), and JPEG images. From each file, it extracts metadata that developers and organizations routinely forget to strip before publishing.

PDF and Office documents often embed the author’s name, the software version used to create them, internal file paths, and revision history. DOCX, XLSX, and PPTX files are ZIP containers: the tool unzips them in memory and reads docProps/core.xml, which contains the author, the last editor, and creation/modification timestamps. The company name, when set, is stored separately in docProps/app.xml.
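
Reading docProps/core.xml from an in-memory Office file needs only the standard library. The sketch below shows one plausible approach, not the tool's actual implementation; the function name read_core_properties and the output key names are hypothetical, while the XML namespaces are the standard Open Packaging and Dublin Core ones.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Output key -> namespaced element inside core.xml.
FIELDS = {
    "author": "dc:creator",
    "last_modified_by": "cp:lastModifiedBy",
    "created": "dcterms:created",
    "modified": "dcterms:modified",
}


def read_core_properties(data: bytes) -> dict:
    """Extract core metadata from DOCX/XLSX/PPTX bytes without touching disk."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        root = ET.fromstring(zf.read("docProps/core.xml"))
    out = {}
    for key, path in FIELDS.items():
        el = root.find(path, NS)
        if el is not None and el.text:
            out[key] = el.text
    return out
```

A file that lacks docProps/core.xml raises KeyError from zf.read, which a caller would likely want to catch and treat as "no metadata".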

JPEG EXIF data is found by scanning for magic bytes in the file stream. GPS coordinates appear in EXIF when photos are taken on smartphones with location enabled. Camera make and model are routinely present. Some organizations publish press photos taken at their offices with GPS intact.
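
A magic-byte scan for EXIF can be sketched as below. It assumes the standard JPEG layout, where the file starts with the bytes FF D8 and the EXIF payload begins with the ASCII marker "Exif" followed by two null bytes and then a TIFF header whose first two bytes give the byte order. The function name find_exif is hypothetical, and a real extractor would go on to parse the TIFF tag directories for GPS and camera fields.

```python
JPEG_SOI = b"\xff\xd8"            # start-of-image marker
EXIF_MARKER = b"Exif\x00\x00"     # identifier at the start of the EXIF payload


def find_exif(data: bytes):
    """Return (tiff_offset, byte_order) if an EXIF block is found, else None."""
    if not data.startswith(JPEG_SOI):
        return None
    idx = data.find(EXIF_MARKER)
    if idx == -1:
        return None
    tiff_start = idx + len(EXIF_MARKER)
    order = data[tiff_start:tiff_start + 2]
    if order == b"II":
        return tiff_start, "little"   # Intel byte order
    if order == b"MM":
        return tiff_start, "big"      # Motorola byte order
    return None
```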

Internal file paths from Office documents are particularly useful: C:\Users\jsmith\Documents\CompanyInternal\ProjectX\ tells you the username format, the operating system, and sometimes internal project names that don’t appear anywhere public-facing.
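
Pulling the username out of a leaked Windows path takes a few lines. This is an illustrative sketch only: the helper name username_from_path is hypothetical, and it assumes the common C:\Users\<name>\ profile layout.

```python
from pathlib import PureWindowsPath


def username_from_path(path):
    """Return the profile name following a 'Users' component, or None."""
    parts = PureWindowsPath(path).parts
    lowered = [p.lower() for p in parts]
    if "users" in lowered:
        i = lowered.index("users")
        if i + 1 < len(parts):
            return parts[i + 1]
    return None
```

Collected usernames can then be deduplicated and compared against the email formats generated during employee enumeration.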

mg-google-dork: Structured Search Engine Queries

mg-google-dork target-bounty target.example.com

The tool runs 14 built-in dork templates against the target domain. The templates cover login pages (inurl:login site:target.example.com), exposed configuration files (filetype:env site:target.example.com), API documentation (inurl:swagger site:target.example.com), directory listings, backup files, and several categories of juicy file types.

All 14 dork strings are printed to stdout every time, regardless of whether API execution is configured. This is by design: copy-paste the list and run them manually in a browser, or configure a Google Custom Search Engine key and CX ID to execute them programmatically.
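
Template expansion of this kind is simple string formatting. The sketch below uses only the three templates quoted above (the full list of 14 is not reproduced here), and the names TEMPLATES, build_dorks, and search_url are hypothetical. The browser URL helper assumes the ordinary Google search URL with a q parameter.

```python
from urllib.parse import quote_plus

# Only the templates quoted in the text; the real tool ships 14.
TEMPLATES = [
    "inurl:login site:{domain}",
    "filetype:env site:{domain}",
    "inurl:swagger site:{domain}",
]


def build_dorks(domain, templates=TEMPLATES):
    """Fill the target domain into each dork template."""
    return [t.format(domain=domain) for t in templates]


def search_url(dork):
    """Build a URL that runs the dork manually in a browser."""
    return "https://www.google.com/search?q=" + quote_plus(dork)


if __name__ == "__main__":
    for dork in build_dorks("target.example.com"):
        print(dork)
```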

MG_GOOGLE_CSE_KEY=<key> MG_GOOGLE_CX=<cx> mg-google-dork target-bounty target.example.com

With the API keys set, results are written to recon/dorks.json. Without them, the tool still produces the dork list, which is the primary output anyway.

mg-leak-monitor: Continuous GitHub Monitoring

mg-leak-monitor target-bounty target.example.com

mg-github runs once and searches existing code. mg-leak-monitor is a long-running process that polls the GitHub Search API for new commits from the target organization that mention the target domain. It’s designed to run in the background during an engagement.

State is persisted to recon/leak-monitor-state.json across restarts. When the tool starts, it loads the last-seen commit timestamp and only processes newer results. Findings are appended to findings/ as they arrive, following the standard engagement finding format.
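
The load, filter, and save cycle could look something like this. The JSON key last_seen and the result field committed_at are hypothetical, since the actual state file schema is not documented here. The sketch also assumes timestamps are ISO 8601 strings in UTC with a uniform format, so plain string comparison orders them correctly.

```python
import json
from pathlib import Path


def load_last_seen(path):
    """Return the stored timestamp, or None on first run."""
    p = Path(path)
    if not p.exists():
        return None
    return json.loads(p.read_text()).get("last_seen")


def save_last_seen(path, timestamp):
    """Write state via a temp file so a crash cannot leave a half-written file."""
    p = Path(path)
    p.parent.mkdir(parents=True, exist_ok=True)
    tmp = p.with_suffix(".tmp")
    tmp.write_text(json.dumps({"last_seen": timestamp}))
    tmp.replace(p)


def newer_than(results, last_seen):
    """Keep only results committed after the last-seen timestamp."""
    if last_seen is None:
        return list(results)
    return [r for r in results if r["committed_at"] > last_seen]
```

On each poll, the monitor would filter results with newer_than, record the findings, and then save the newest timestamp it processed.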

# Run in background, findings appear as they're detected
mg-leak-monitor target-bounty target.example.com &

The use case: a developer accidentally commits a secrets file to a public repo during your engagement window. Without monitoring, you’d only catch it if the commit landed before your one-time mg-github run. With monitoring running, the finding appears within the next poll interval.

Reading the Output

OSINT tools write to recon/ for structured data and findings/ for anything that meets a severity threshold. The metadata and GitHub outputs are worth reviewing manually before moving to active testing — internal usernames and path structures from documents inform the wordlists and account targets used by auth testing tools. Breach data informs what credentials to try against login endpoints discovered during crawling.

Passive intelligence shapes where active testing focuses. Running these before the first scanner request means the active phase is more targeted.