Claude Code Security: a day with it on a live codebase.
It found things a recent independent pen test did not. It also found nothing critical. Both observations matter — and the more interesting question for security leaders isn't about scanners. It's about what this means for the pen-testing model.
I spent a day running Claude Code Security against a real, production codebase: not a sample app, not a deliberately broken benchmark. The point was to see what it actually does in the hands of a team that has already done its homework: mature development practices, a recent independent penetration test on the books, the usual secure-SDLC scaffolding. The result was more interesting than I expected, and worth writing up honestly rather than as a vendor anecdote.
Two observations sit side by side, and they point in different directions. It surfaced real issues that the recent independent pen test had missed. And it found nothing critical. Both of those statements are doing work, and a security leader has to hold them in tension.
What stood out, after a full day on it
1. Low noise. No wading through hundreds of findings to extract the handful that matter. The signal-to-noise ratio is the most immediate, tactile difference compared to traditional SAST.
2. It surfaced real issues a recent independent pen test had missed. Not theoretical noise, not cosmetic findings: actual exposures that a competent human team had not flagged.
3. No critical findings. That does not prove there are none. But combined with a mature development team and prior assurance work, it is a useful corroborating signal.
4. Many findings were conditional. Issues that only become exploitable if another control fails, such as a misconfigured security header (see the sketch after this list). Useful context, but it changes how you triage.
5. Cost is material. This is unlikely to sit in every CI/CD pipeline. I'd think of it more as a quarterly release gate, a major-change review, or a pre-audit checkpoint than as continuous scanning.
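To make "conditional" concrete, here's a minimal sketch. It is hypothetical illustration code, not from the codebase under review: a handler that reflects user input into HTML, which only becomes directly exploitable if the Content-Security-Policy control in front of it is removed or loosened.

```ts
// Hypothetical example of a conditional finding (not from the reviewed codebase).
import express from "express";

const app = express();

// Compensating control: a restrictive CSP blocks inline script execution.
// If this middleware is dropped or the policy is loosened, the reflection
// below is promoted from "conditional" to directly exploitable.
app.use((_req, res, next) => {
  res.setHeader("Content-Security-Policy", "default-src 'self'");
  next();
});

// The finding itself: user input reflected into HTML without output encoding.
app.get("/greet", (req, res) => {
  const name = String(req.query.name ?? "world");
  res.send(`<h1>Hello, ${name}</h1>`); // reflected, unencoded input
});

app.listen(3000);
```

A tool that reports the reflection as conditional on the CSP is giving you triage context that a flat severity score would erase.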
The disruption isn't where most people are looking
The first instinct of most security teams is to compare a tool like this against their existing SAST stack. That comparison is mostly the wrong frame. Legacy code scanners are cheap, run on every commit, and produce volume; this tool is expensive, runs occasionally, and produces signal. They sit in different parts of the workflow.
From a leadership perspective, the more interesting disruption is elsewhere: in parts of the penetration-testing model.
That model has always relied on human expertise being the hardest thing to replicate. For manual code review, finding generation, and first-pass security reasoning, the gap between human and machine is closing faster than many in the industry are comfortable admitting.
The high-value parts of a good pen test remain hard to replicate: adversarial creativity, business-logic intuition, the ability to chain weaknesses across systems, and, crucially, the human judgement on what actually matters at this organisation. But the foundational layer underneath them, the careful manual code review that produces the first-pass list of plausible exposures, is exactly the work this class of tool now does well.
Where it fits in a security programme
The honest answer for most organisations is: not everywhere, and not as a replacement for anything. It earns its place at specific decision points where the depth of review justifies the cost.
| Use case | Cadence | Notes |
|---|---|---|
| Quarterly release gate | Every 3 months | Catches drift between major audits; pairs well with internal review. |
| Major-change review | Per change | New auth model, payment flows, agentic tooling, third-party integrations. |
| Pre-audit checkpoint | 2–4 weeks pre-audit | Reduce surprises; cheaper than fixing under audit pressure. |
| Continuous CI/CD | Every commit | Cost-prohibitive at scale; existing SAST still the right shape here (see the sketch below). |
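The cost gap between those cadences is easy to underestimate, so here's a back-of-the-envelope sketch. Every figure in it is a placeholder I've made up for illustration, not a quoted price; substitute your own per-run cost and commit volume.

```ts
// Hypothetical cadence economics; all numbers are illustrative placeholders.
const costPerDeepReview = 200; // assumed cost of one AI-assisted deep review (USD)
const commitsPerYear = 5_000;  // assumed commit volume for a mid-sized team

const perCommitAnnual = costPerDeepReview * commitsPerYear; // 1,000,000 USD/yr
const quarterlyAnnual = costPerDeepReview * 4;              //       800 USD/yr

console.log(`per-commit: $${perCommitAnnual.toLocaleString()}/yr`);
console.log(`quarterly gate: $${quarterlyAnnual.toLocaleString()}/yr`);
```

Whatever the real per-run number turns out to be, a gap of three orders of magnitude between the cadences is why this reads as a decision-point instrument rather than a pipeline stage.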
What to take to the board
If you're a CISO or CTO, the conversation with your board this quarter probably shouldn't be “should we buy Claude Code Security?”; that's a procurement question. The more useful conversation sits a level up:
1. Do we believe our existing assurance stack is finding what matters? If a fresh tool can surface real issues a recent independent pen test missed, that's a data point about the assurance stack, not just about the tool.
2. How much of our pen-testing spend is buying first-pass code review vs. genuine adversarial expertise? If it's the former, the unit economics have changed. If it's the latter, the value is still very much there; you just want to be sure that's what you're paying for.
3. Where would AI-assisted review earn its keep at our maturity level? Not as a continuous scanner; as a decision-point instrument at release gates, major changes, and pre-audit windows.
A measured conclusion
Every organisation will need to decide whether the findings justify the price at its current security maturity. For a less mature programme, the cost is harder to justify because the cheaper tools and disciplines further upstream haven't been exhausted. For a more mature programme, the value is in the corner cases — the issues a careful team and a recent pen test still missed.
The broader signal, though, is the one worth carrying into the next board cycle: the capabilities that used to require expensive human time at the front end of security review can now be automated. That changes the economics of a whole layer of the industry. The question for security leaders isn't whether to be excited or worried about that; it's whether your assurance programme is shaped to take advantage of it.
Notes from a working day, not a vendor pitch. If you're weighing where AI-assisted code review fits in your assurance stack, I'm happy to talk it through.
