LLMs Don't Hack, They Guess

Everyone wants to know if AI can find bugs. Jeet spent the past year throwing LLMs at real production code to answer that. The short answer: it depends on what you mean by "find bugs".

That work, at Robinhood and on open source projects, gives a clearer picture of where the tools genuinely help and where they quietly burn money.

The Simple Stuff Works

If you hide a pickle.loads call in a 10,000-line codebase, Claude Code will find it in two or three minutes. Burp AI spots serialized objects in HTTP traffic and fuzzes them until something breaks. Easy VMs on Hack the Box, config-based checks, anything where the pattern is obvious and well documented in the training data: AI handles it. It's seen the blog posts. It's read the write-ups. It knows what pickle.loads looks like.
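To see why this class of bug is easy, here is a minimal sketch of a pattern matcher for it, using Python's standard ast module. The function name and the sample snippet are made up for illustration; this is not how Claude Code or Burp AI work internally, only a demonstration that the finding needs no reasoning about data flow.

```python
import ast


def find_pickle_loads(source: str) -> list[int]:
    """Return the line numbers of calls that look like pickle.loads(...)."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Match the literal shape `pickle.loads(...)`; aliased imports are not handled.
        if (
            isinstance(func, ast.Attribute)
            and func.attr == "loads"
            and isinstance(func.value, ast.Name)
            and func.value.id == "pickle"
        ):
            hits.append(node.lineno)
    return sorted(hits)


# Hypothetical sample standing in for one file of a larger codebase.
SAMPLE = '''import pickle

def restore(blob):
    return pickle.loads(blob)
'''

if __name__ == "__main__":
    print(find_pickle_loads(SAMPLE))
```

A dozen lines of syntax matching gets you the finding. Nothing here needs to know where blob came from, which is exactly what the harder bugs require.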

The problem starts when the bug isn't a pattern match.

The Helm Vulnerability

Jeet walked through a real CVE in Helm, the Kubernetes package manager. An attacker controls a Chart.yaml file, a dependency update runs, and the output lands on disk as Chart.lock. Symlink that file to .bashrc or another file the shell executes, and the next time the shell starts, it runs the attacker's code. The impact: RCE on macOS workstations, infected CI/CD pipelines, Kubernetes takeover.
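The underlying primitive is easy to reproduce in isolation. The sketch below is not Helm's code (Helm is written in Go); it is a small Python illustration, with made-up function and file names, of how a plain file write follows a symlink and lands on the link target. The second function shows one possible guard, offered as an assumption and not as Helm's actual fix.

```python
import os
import tempfile


def write_output(path: str, content: str) -> None:
    """Naive writer: open() follows symlinks, so data lands on the link target."""
    with open(path, "w") as fh:
        fh.write(content)


def write_output_safely(path: str, content: str) -> None:
    """Refuse to write through a symlink (illustrative guard only)."""
    if os.path.islink(path):
        raise ValueError(f"refusing to write through symlink: {path}")
    write_output(path, content)


def demo() -> tuple[str, bool]:
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "fake_bashrc")  # stands in for a shell startup file
        link = os.path.join(tmp, "output_file")    # the file the tool believes it owns
        open(target, "w").close()
        os.symlink(target, link)

        # The tool writes attacker-influenced content to "its" file.
        write_output(link, "echo attacker-controlled\n")
        with open(target) as fh:
            landed = fh.read()

        # The guarded writer rejects the same path.
        try:
            write_output_safely(link, "ignored")
            blocked = False
        except ValueError:
            blocked = True
        return landed, blocked


if __name__ == "__main__":
    print(demo())
```

The check-then-write guard still leaves a race window between the islink call and the open, so treat it as a teaching aid. The point is that the dangerous step is a single ordinary write, and nothing at the write site looks suspicious on its own.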

The catch: this bug spans about 20 layers of code. Attacker-influenced input enters at the update command, passes through those layers, and ends in a file write. When you point AI at it, you'd expect it to trace the whole chain. It doesn't. It looks at files individually instead of treating them as one connected flow.

The result: Cursor found false positives. Claude found false positives. Neither found the actual bug. About 65 cents, all wasted.

Why It Keeps Getting It Wrong

Pointing an LLM at a large repo and asking it to "find all security vulnerabilities" is one of the most reliable ways to waste money and erode trust in the tool. The model approaches the codebase like a breadth-first search that never goes deep. It fans out across every file, flags every pattern that resembles something dangerous, and never commits to tracing any single flow all the way through. What you get back is volume without signal: a long list of flagged patterns, most of them false positives, ranked by nothing.

It feels like coverage. It isn't. The model has no natural stopping point, no sense of what matters, and no way to distinguish a critical auth bypass from a theoretical edge case that will never be reached in production. Without a defined scope, it optimizes for breadth, touching everything, understanding nothing deeply enough to confirm exploitability.

Static reviews have their own failure mode. LLMs flag nearly every cross-service data flow as an IDOR because they see a user-supplied parameter crossing a microservice boundary. They don't model the five validation layers sitting in between. They see a potentially dangerous sink and treat it as exploitable, without checking whether the data was sanitized three hops upstream.
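What the missing step looks like can be sketched as a path check over a data-flow graph: a sink only counts as reachable if at least one path from the source skips every validation layer. Everything in this example (the graph, the node names, the sanitizer set) is hypothetical and exists only to show the shape of the check.

```python
def find_paths(graph, source, sink, path=None):
    """Enumerate all simple paths from source to sink (depth-first)."""
    path = (path or []) + [source]
    if source == sink:
        return [path]
    paths = []
    for nxt in graph.get(source, []):
        if nxt not in path:  # avoid cycles
            paths.extend(find_paths(graph, nxt, sink, path))
    return paths


def unsanitized_paths(graph, source, sink, sanitizers):
    """Keep only the paths that pass through no sanitizer node."""
    return [p for p in find_paths(graph, source, sink) if not sanitizers & set(p)]


# Hypothetical cross-service flow: a user-supplied ID travelling to a database query.
FLOW = {
    "order_api.user_id": ["gateway.forward"],
    "gateway.forward": ["authz.check_owner", "legacy.passthrough"],
    "authz.check_owner": ["orders_db.query"],
    "legacy.passthrough": ["orders_db.query"],
}
SANITIZERS = {"authz.check_owner"}

if __name__ == "__main__":
    for p in unsanitized_paths(FLOW, "order_api.user_id", "orders_db.query", SANITIZERS):
        print(" -> ".join(p))
```

In this toy graph only the legacy.passthrough route is worth reporting. Flagging the source and the sink without walking the paths between them reports both routes, which is the IDOR noise described above.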

Dynamic testing has a different failure mode: path explosion. Take a single endpoint, POST /api/order/buy. Switch the account type from regular to business and it routes through wholesale pricing, different tax logic, and a new credit limit check. Switch the item from physical to digital and you've bypassed the inventory service entirely. Switch the action to sell and the entire call chain changes.
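The arithmetic behind that explosion is simple to sketch. Assuming just the three switches named above, each with two values (the dictionary below is an illustration, not a real API schema), the number of distinct states to exercise is the product of the options:

```python
from itertools import product

# Hypothetical state dimensions for the single order endpoint.
DIMENSIONS = {
    "account_type": ["regular", "business"],
    "item_type": ["physical", "digital"],
    "action": ["buy", "sell"],
}


def enumerate_states(dimensions):
    """Return every combination of dimension values as a dict."""
    keys = list(dimensions)
    return [dict(zip(keys, combo)) for combo in product(*dimensions.values())]


if __name__ == "__main__":
    states = enumerate_states(DIMENSIONS)
    print(len(states))  # 2 * 2 * 2 = 8
    for state in states:
        print(state)
```

Three binary switches already give eight call chains for one endpoint, and every extra on/off feature flag doubles that count.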

The black-box gap: AI can't navigate these branches systematically. It can't log in as different account types, toggle feature flags, or set up the account state required to exercise specific flows. Black-box testing can't reach what depends on application state.

What Actually Works

Two things matter: narrow scope and multi-agent workflows.

A specialized agent pipeline (a source-to-sink tracer, a threat modeler, and a vulnerability finder, each doing one job and passing structured output to the next) gets materially further than a single agent with a broad mandate. Pointed at the Helm codebase with a tightly scoped entry point, the tracer correctly identified the file write sink. The threat modeler identified the right abuse case. The final exploit step was still missed. But a researcher reviewing those logs would connect the dots significantly faster than starting cold.
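The shape of that pipeline can be sketched as three stages passing typed records forward. This is a minimal sketch under stated assumptions: the dataclasses and function names are invented here, and each stage returns canned placeholder data where a real setup would call a model. It does not reproduce the tooling used in the talk.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    entry_point: str
    sink: str
    hops: list[str]


@dataclass
class AbuseCase:
    trace: Trace
    description: str


@dataclass
class Finding:
    abuse_case: AbuseCase
    confirmed: bool
    notes: str


def trace_source_to_sink(entry_point: str) -> Trace:
    # Placeholder for an agent that follows one input to where it ends up.
    return Trace(entry_point=entry_point, sink="file write", hops=["<intermediate layers>"])


def model_threat(trace: Trace) -> AbuseCase:
    # Placeholder for an agent that asks how the traced sink could be abused.
    return AbuseCase(trace=trace, description=f"attacker-influenced data reaches a {trace.sink}")


def find_vulnerability(abuse_case: AbuseCase) -> Finding:
    # Placeholder for an agent that tries to confirm exploitability.
    return Finding(abuse_case=abuse_case, confirmed=False, notes="exploit step not established")


def run_pipeline(entry_point: str) -> Finding:
    """Run the three stages in order, each consuming the previous stage's output."""
    return find_vulnerability(model_threat(trace_source_to_sink(entry_point)))


if __name__ == "__main__":
    print(run_pipeline("dependency update command"))
```

The value of the structure is that each hand-off is a small, inspectable record. Even when the last stage comes back unconfirmed, the trace and abuse case are sitting in the output for a human to pick up.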

Cost: about $5 for one entry point at full effort with Opus 4.6. When a single source branches to many sinks, full tracing runs $20 to $30 per input.
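Those figures make budgeting a quick back-of-the-envelope calculation. The helper below is a rough sketch that takes the per-input numbers above as defaults; the function name and the example counts are assumptions, and real costs will vary with model, codebase, and prompt.

```python
def estimate_cost(simple_inputs: int, branching_inputs: int,
                  simple_cost: float = 5.0,
                  branching_range: tuple[float, float] = (20.0, 30.0)) -> tuple[float, float]:
    """Return a (low, high) dollar estimate for tracing the given entry points."""
    low = simple_inputs * simple_cost + branching_inputs * branching_range[0]
    high = simple_inputs * simple_cost + branching_inputs * branching_range[1]
    return low, high


if __name__ == "__main__":
    # Hypothetical review: 10 single-path entry points, 3 that fan out to many sinks.
    print(estimate_cost(10, 3))
```

For that hypothetical review the range is $110 to $140, which is a useful number to have before deciding how many entry points to scope in.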

Three Things That Matter In Practice

Bad prompting makes things actively worse, not just less good. There's research showing that poorly written repository-level config files reduce agent performance and increase cost by ~20%. The agents get verbose, redundant, and spend tokens narrating how well they followed instructions instead of doing the work.

A useful threat model is a filter, not a dump. Don't hand the agent an exhaustive checklist. Highlight where data crosses sensitive boundaries, identify the most critical assets, and state a handful of hard rules: "payment systems are only accessible by authorized services." Conciseness makes the agent's reasoning better, not worse. Be explicit about which sinks you care about: outbound HTTP calls, code execution, database writes. Without this, the agent has no basis for prioritizing what it traces. With it, false positive volume drops sharply because the model is checking against a defined target instead of pattern-matching on vibes.
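One way to keep a threat model that small is to treat it as data: a few assets, an explicit set of sinks, a handful of rules, rendered into a short prompt and reused to filter what comes back. The class, field names, and sample findings below are hypothetical, a sketch of the idea and not a prescribed format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThreatModel:
    critical_assets: tuple
    sinks_in_scope: frozenset
    hard_rules: tuple

    def to_prompt(self) -> str:
        """Render the model as a few short lines for the agent's context."""
        lines = [
            "Critical assets: " + ", ".join(self.critical_assets),
            "Sinks in scope: " + ", ".join(sorted(self.sinks_in_scope)),
            "Hard rules:",
        ]
        lines += [f"- {rule}" for rule in self.hard_rules]
        return "\n".join(lines)


def filter_findings(findings, model):
    """Drop findings whose sink is not one the threat model cares about."""
    return [f for f in findings if f["sink_type"] in model.sinks_in_scope]


MODEL = ThreatModel(
    critical_assets=("payment systems",),
    sinks_in_scope=frozenset({"outbound_http", "code_execution", "database_write"}),
    hard_rules=("payment systems are only accessible by authorized services",),
)

# Hypothetical raw agent output.
RAW_FINDINGS = [
    {"id": 1, "sink_type": "database_write"},
    {"id": 2, "sink_type": "log_message"},
    {"id": 3, "sink_type": "code_execution"},
]

if __name__ == "__main__":
    print(MODEL.to_prompt())
    print(filter_findings(RAW_FINDINGS, MODEL))
```

The whole model fits in five lines of prompt, and the same sink list that steers the agent also trims its output.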

Start with one entry point. One endpoint, one file, one command. Let it trace outward as much as it can. The tighter the scope, the less room for the model to get confused and expensive.

#AI #LLM #AppSec #OffensiveSecurity #VulnDiscovery #DC416 #Cybersecurity #InfoSec #Toronto

Speaker: Jeet, Offensive Security Engineer @ Robinhood
Media: Thanks to Ali and Jason for the video production
Event: DC416 March Meetup, March 19, 2026
