
Secrets scanning for a 200+ repo GitHub org, with zero developer setup

tl;dr summary

We built secrets scanning that developers never have to think about. Every push is scanned, scans are deduplicated by commit SHA, findings are stored without secret values, and alerts are routed to the right humans fast.

When you have 200+ repositories and hundreds of pushes per day, secrets will get committed. Not because developers are reckless, but because humans are busy, juniors are learning, and legacy repos have gravity.

At org scale, the most common failure mode is not detection. It is adoption.

If you need every repo owner to opt in, every team to retrofit CI, and every developer to install tooling locally, the scanner only reaches the repos that already care. The repos that need it most are often the oldest and least maintained.

So we built secrets scanning that developers never have to think about.


Goals and non-goals

Goals:

  • Scan every push, every branch, across every repo in the org
  • Require zero developer setup (no hooks, no CI integration)
  • Route findings quickly and predictably
  • Store no secret values

Non-goals:

  • Blocking pushes or merges
  • Scanning developer laptops or anything outside GitHub org repos
  • Solving full-history scanning as the default path

The architecture (one sentence)

GitHub push webhook -> verify authenticity -> dedupe by commit SHA -> AWS Lambda clones repo at that SHA -> TruffleHog scans filesystem with verification -> normalize findings -> store metadata + SHA-256 only -> Slack alert + read-only dashboard.

This is intentionally boring. Boring is how you keep security controls running for years.


Scope and definitions

In scope:

  • every repository in the GitHub org
  • every push event
  • every branch

The unit of work is a commit SHA. The scan target is the repo checked out at the pushed SHA, which makes results reproducible: “scan this snapshot”.

Out of scope:

  • anything not pushed to GitHub org repos (laptops, other git hosting, registries, build logs)

What counts as a secret:

  • API keys
  • cloud credentials (AWS, GCP)
  • SSH keys, private certs
  • vendor tokens (e.g., Slack, Stripe, Mail services)
  • high-entropy strings that are likely passwords

The safe stance is simple: if a secret lands in git history, assume it is compromised.


Current system flow

GitHub org push
  -> webhook receiver (HMAC authenticity check)
  -> idempotency check keyed on commit SHA
  -> Lambda scan job
  -> git clone --depth 1 and checkout of the pushed commit SHA
  -> TruffleHog filesystem scan with verification
  -> JSON findings
  -> hash-only findings store
  -> Slack alert + read-only dashboard


Webhook receiver: authenticity + idempotency

The webhook receiver does two things and must be correct:

  1. Verify authenticity using GitHub HMAC signatures.
  2. Enforce idempotency using the pushed commit SHA as the dedupe key.

GitHub can retry deliveries. Retries are normal. Deduping by commit SHA makes duplicates cheap.

Pseudocode:

function handlePushWebhook(req) {
  // Reject anything without a valid GitHub HMAC signature.
  if (!verifyGithubHmac(req)) return 401;

  // "after" is the branch tip once the push has been applied.
  const sha = req.payload.after;
  if (alreadyScanned(sha)) return 200;   // retried delivery: already handled

  markScanned(sha);
  invokeLambdaScan({ sha, repo: req.payload.repository.full_name, branch: req.payload.ref });

  return 202;   // accepted; the scan runs asynchronously
}
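
For reference, here is a minimal sketch of what verifyGithubHmac can look like in Node. It assumes an Express-style req that exposes the raw request body as req.rawBody and a shared secret in GITHUB_WEBHOOK_SECRET; both names are illustrative, not part of the system described above.

const crypto = require("crypto");

// GitHub signs the raw request body with the shared webhook secret and sends
// the hex digest in the X-Hub-Signature-256 header as "sha256=<digest>".
function verifyGithubHmac(req) {
  const received = req.headers["x-hub-signature-256"];
  if (!received) return false;

  const expected = "sha256=" + crypto
    .createHmac("sha256", process.env.GITHUB_WEBHOOK_SECRET)
    .update(req.rawBody)              // raw bytes, not re-serialized JSON
    .digest("hex");

  // timingSafeEqual throws on length mismatch, so guard first,
  // then compare in constant time.
  if (received.length !== expected.length) return false;
  return crypto.timingSafeEqual(Buffer.from(received), Buffer.from(expected));
}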

We store just enough state to remember which SHAs have been scanned. That state is not sensitive and does not include secret material.
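
As an illustration, alreadyScanned and markScanned can collapse into a single conditional write. The sketch below assumes a DynamoDB table keyed on the SHA; the backend choice, table name, and field names are assumptions for the example, not something the post prescribes.

const { DynamoDBClient, PutItemCommand } = require("@aws-sdk/client-dynamodb");
const db = new DynamoDBClient({});

// Returns true the first time a SHA is seen; false on any retried delivery.
async function markScannedOnce(sha) {
  try {
    await db.send(new PutItemCommand({
      TableName: "scanned-shas",
      Item: { sha: { S: sha }, scanned_at: { S: new Date().toISOString() } },
      // The conditional write makes check-and-set atomic,
      // so concurrent retries cannot both win.
      ConditionExpression: "attribute_not_exists(sha)",
    }));
    return true;
  } catch (err) {
    if (err.name === "ConditionalCheckFailedException") return false;
    throw err;
  }
}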


Scan job: clone, checkout, scan

Each scan job is stateless and repeatable:

  1. git clone --depth 1
  2. checkout the pushed commit SHA
  3. run TruffleHog in filesystem mode with verification enabled
  4. ingest JSON findings, normalize fields
  5. alert Slack (no secret values)

We scan the repository as a filesystem snapshot at that commit. This aligns with the operational question we care about: “did a secret just enter the repo contents?”
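
A sketch of the scan job body, assuming a Lambda environment where git and the trufflehog binary are on PATH (for example via a container image) and a read-only token in GITHUB_TOKEN. Shallow-fetching the pushed SHA directly is one way to land on exactly that snapshot; the names here are illustrative.

const { execFileSync } = require("child_process");
const fs = require("fs");

function scanCommit({ repo, sha }) {
  const dir = fs.mkdtempSync("/tmp/scan-");

  // Shallow-fetch just the pushed commit and check it out detached,
  // so the scan target is exactly the snapshot named in the webhook.
  execFileSync("git", ["init", dir]);
  execFileSync("git", ["-C", dir, "fetch", "--depth", "1",
    `https://x-access-token:${process.env.GITHUB_TOKEN}@github.com/${repo}.git`, sha]);
  execFileSync("git", ["-C", dir, "checkout", "--detach", sha]);

  // TruffleHog v3 filesystem scan: verification is on by default and
  // --json emits one finding object per line for structured ingestion.
  const out = execFileSync("trufflehog", ["filesystem", dir, "--json"],
    { encoding: "utf8" });

  return out.trim().split("\n").filter(Boolean).map(line => JSON.parse(line));
}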

If you want full-history scanning, you can do it. It is just a different runtime profile and a different cost model.
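
If you do go that route, TruffleHog's git source walks every commit instead of a single snapshot. A short sketch under the same binary-on-PATH assumption:

const { execFileSync } = require("child_process");

// Full-history scan: heavier and slower, which is the different cost model.
function scanFullHistory(repo) {
  const out = execFileSync("trufflehog",
    ["git", `https://github.com/${repo}.git`, "--json"], { encoding: "utf8" });
  return out.trim().split("\n").filter(Boolean).map(line => JSON.parse(line));
}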


TruffleHog config choices

We use TruffleHog as a plain binary, with configuration shaped around predictable operations:

  • filesystem mode (commit snapshot)
  • verification enabled (when supported) to reduce false positives
  • JSON output for structured ingestion
  • exclude .git and respect .gitignore to reduce noise

We also prefer contributing fixes upstream rather than carrying a fork, because the sovereignty we care about is in the integration and the workflow, not in maintaining a bespoke detector suite forever.


Findings storage: hash-only, current-state-only

The most important design constraint: we do not store secret values. Not in the database. Not in Slack. Not in the dashboard.

Instead we store:

  • repo, branch
  • file path
  • provider/detector label (+ verification status when available)
  • a SHA-256 hash of the secret value (the hash is computed in memory; the raw value is discarded immediately)
  • timestamps (first_seen_at, last_seen_at)

That hash is useful because it gives us a stable identifier for deduplication and “is this the same credential resurfacing?” without retaining the credential itself.
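
A sketch of how one finding becomes a stored record. The TruffleHog field names (Raw, DetectorName, Verified, SourceMetadata) follow its v3 JSON output and should be treated as assumptions rather than a contract.

const crypto = require("crypto");

function toStoredFinding(raw, { repo, branch }) {
  // Hash the secret in memory; only the digest leaves this function.
  const secretSha256 = crypto.createHash("sha256")
    .update(raw.Raw)
    .digest("hex");

  return {
    repo,
    branch,
    file_path: raw.SourceMetadata?.Data?.Filesystem?.file,
    detector: raw.DetectorName,
    verified: raw.Verified === true,
    secret_sha256: secretSha256,   // stable dedupe key, never the value
    last_seen_at: new Date().toISOString(),
  };
}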

We also keep the store focused on the operational present: it holds current open findings. When a finding is no longer detected, the row disappears.

This makes the dashboard useful for triage. If you later want analytics (trendlines, MTTR by repo, recurrence rates), add an append-only event log on the side.


Alerting, routing, remediation

Alerts land in Slack with enough context to act, and nothing sensitive:

  • repo + branch
  • commit SHA + author
  • file path
  • provider + verification status
  • dashboard link
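
A sketch of the alert payload, assuming a Slack incoming-webhook URL and the finding fields listed above (dashboard_url and author are illustrative names):

// Only metadata goes to Slack; the secret value is never part of the message.
// Assumes Node 18+ for the global fetch.
async function alertSlack(finding, webhookUrl) {
  const text = [
    `:rotating_light: Secret detected in ${finding.repo} (${finding.branch})`,
    `commit ${finding.sha} by ${finding.author}`,
    `file: ${finding.file_path}`,
    `detector: ${finding.detector} (verified: ${finding.verified})`,
    `<${finding.dashboard_url}|Open in dashboard>`,
  ].join("\n");

  await fetch(webhookUrl, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text }),
  });
}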

Routing stays boring: notify the project owner and include the commit author for context. The goal is fast remediation, not blame.

Remediation is intentionally simple:

  1. Remove the secret from the repo
  2. Rotate or revoke the credential
  3. Confirm the scanner no longer detects it (finding disappears)
  4. Prevent recurrence (move secrets to a manager or env injection)

We treat findings as critical by default. Severity games are rarely helpful when something has been committed to git.


Metrics

We track a handful of charts on the dashboard:

  • Findings per week: weekly new vs resolved findings
  • Open findings at week end: open findings trending down after the baseline cleanup
  • Mean time to remediate: remediation-time distribution (buckets)
  • Top leaked providers: provider categories from findings
  • Repo hygiene snapshot: clean repos vs repos with open findings
  • Clean repo rate: share of repos with no open findings

Security posture and ops

This system is not a perimeter fortress. It is a targeted control:

  • webhook authenticity via HMAC signature verification
  • idempotency via commit SHA dedupe
  • read-only GitHub token for cloning
  • hash-only storage (no secret values persisted)

Scans run in AWS Lambda (1 GB memory) with a concurrency cap to handle bursty push patterns. Typical push-to-alert latency is about 40 seconds.

In practice, the cost is close to zero for us because the workload fits within the AWS free tier.


Where this goes next

The obvious next step is conservative automation: opening a remediation PR that removes a leaked value and replaces it with an environment reference.

We shipped detection + routing first. It eliminates most of the risk quickly and makes remediation boring.
