The Agentic Software Stack Is the New Paradigm
Anthropic’s Claude Code leak mattered for one reason above all others: the incident did not expose model weights, and it was not framed by Anthropic as a breach of customer data. Public reporting said a source map shipped in the npm-distributed CLI exposed reconstructable TypeScript source for Claude Code, while Anthropic said the problem was a release packaging issue caused by human error and that no sensitive customer data or credentials were exposed. That distinction changes the technical conversation. It moves the center of gravity away from “the model” and toward the much larger system that surrounds it. (The Verge)
That larger system is what matters now. Claude Code’s own documentation describes it as an agentic coding tool that reads a codebase, edits files, runs commands, and integrates with development tools. Anthropic’s Agent SDK goes further and says developers can use the same tools, agent loop, and context management that power Claude Code programmatically from Python and TypeScript. Once a system can read state, modify state, call tools, persist memory, delegate to subagents, and connect to external services, the security boundary is no longer the model boundary. It is an execution boundary made of code, configuration, memory, authorization, orchestration, and observability. (anthropic.com)
That is why the phrase “white-box auditing” needs to be redefined. In older software security work, white-box auditing meant source review, data-flow analysis, build-chain inspection, and reasoning about trust boundaries from inside the application. In alignment research, the term often points in a different direction: access to weights, activations, or interpretability tools. Both uses remain valid. Neither is sufficient on its own for modern agent systems. The post-leak lesson is that a serious white-box audit for an AI agent must account for the full agentic software stack, from publish artifacts and provenance to hooks, memory files, MCP permissions, subagent inheritance, runtime traces, and post-incident replay. (arXiv)
Claude Code leaked more than source
Source maps are mundane until they are not. MDN defines a source map as a JSON format that maps transformed code back to the original unmodified source, allowing the original code to be reconstructed and used for debugging. In normal engineering work, that is useful. In a packaging mistake, it becomes an information exposure surface. Public reporting on March 31 said that a Claude Code release shipped with a source map file that let outsiders recover large portions of the TypeScript codebase, and Anthropic’s public statement emphasized that the issue was packaging-related rather than a conventional intrusion. (developer.mozilla.org)
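MDN's definition can be made concrete with a toy example. The snippet below is an illustrative, minimal source map, not the real Claude Code artifact; the point is that when the sourcesContent field is populated, recovering the original files requires nothing beyond parsing JSON.

```python
import json

# Toy source map for illustration only; a real map's "mappings" string is far
# longer, but "sources" and "sourcesContent" look exactly like this.
source_map_text = json.dumps({
    "version": 3,
    "file": "cli.js",
    "sources": ["src/cli.ts"],
    "sourcesContent": ["export function main(): void {\n  // original TypeScript\n}\n"],
    "mappings": "AAAA",
})

source_map = json.loads(source_map_text)

# Reconstructing "the original unmodified source" is a two-line loop.
recovered = dict(zip(source_map["sources"], source_map["sourcesContent"]))
for path, content in recovered.items():
    print(f"--- {path} ---\n{content}")
```

If a packaging step ships a file like this alongside the bundled JavaScript, the original tree is effectively published with it.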
That matters because the leak made visible what many teams still try to compress into the word “model.” The exposed surface was not a single prompt or a narrow API wrapper. Public reports and reverse-engineering discussion described slash commands, built-in tools, internal instructions, and memory-related behavior. Even without taking every community claim at face value, the broad picture is clear enough: a coding agent is not a chatbot with one shell command attached. It is a runtime with persistent context, permission mediation, tool schemas, files, and workflow logic. (The Verge)
The official documentation already says as much. Claude Code is described as reading code, editing files, running commands, and integrating with development tools. The Agent SDK says the same loop can be embedded in production applications. Memory documentation says CLAUDE.md and auto memory are both loaded at the start of every conversation and treated as context rather than enforced configuration. In other words, a significant part of the system’s behavior is not stored only in code or only in model weights. It is distributed across multiple stateful layers. (anthropic.com)
That distribution is exactly why white-box auditing has to expand. If a future security team says it has “audited the agent” because it inspected a system prompt or ran a few interpretability probes, that team has probably audited the least operationally predictive part of the system. The most load-bearing facts now live in questions like these: what files got shipped, what configuration scopes were active, what memory was loaded, what hooks mutated inputs or outputs, what MCP servers were reachable, what permissions were inherited by background subagents, and what runtime evidence was preserved for later reconstruction. (Claude API Docs)
White-box auditing used to mean something simpler
Traditional software white-box auditing starts from a stable intuition. There is source code. The code defines the trust boundary. Inputs enter at known points. Authorization logic is represented in functions, classes, or policy files. Reviewers reason about taint propagation, template expansion, sandbox boundaries, shell invocation, secret handling, and dependency integrity. Even when systems are large, there is still a comparatively clear separation between code, configuration, deployment artifact, and runtime behavior.
Agent systems collapse those separations. A modern agent can carry user-authored rules in CLAUDE.md, model-authored notes in auto memory, organization-authored managed policy, repository-authored .mcp.json, prompt-injected context from a file, tool output from an external service, and background work delegated to a subagent that inherits a subset of permissions. The resulting action path is not determined by code alone. It is determined by code plus accumulated state plus policy plus external tool behavior. A white-box audit that cannot reconstruct those layers is no longer white-box in the practical sense that matters to defenders. (Claude API Docs)
The old picture also assumed that code review was the main route to trustworthy inference. But the newer AI-safety literature already shows why that assumption breaks down. Anthropic’s March 2025 work on hidden objectives framed alignment audits as systematic investigations into whether models are pursuing hidden goals, and it explicitly combined training-data analysis, behavioral techniques, and white-box methods such as sparse autoencoders. That paper did not argue that internal visibility is useless. It argued that deeper investigation is necessary precisely because apparent behavior alone can be misleading. (anthropic.com)
By March 2026, Anthropic’s AuditBench results added an important nuance. Their benchmark write-up said scaffolded black-box tools were the most effective overall for auditing hidden behaviors, while white-box interpretability tools could help but mainly on easier targets. That finding should be read carefully. It does not demote white-box work. It means the center of gravity in real auditing is shifting toward system-level, tool-using investigator workflows rather than a single privileged lens into weights or activations. (alignment.anthropic.com)
Anthropic’s public risk reporting points in the same direction. In the February 2026 risk report, the company said it does not yet have a sufficient understanding of model internal states to make a definitive assessment by studying the model in isolation, and it described using converging sources of evidence instead. The same report also said that many automated behavioral audit transcripts incorporated the real Claude Code system prompt and tool set as used internally. That is the key architectural signal. Frontier evaluation is already moving from “test the model in a vacuum” to “test the deployed stack under realistic conditions.” (anthropic.com)
The agentic software stack has a wider audit surface
Claude Code’s own docs are unusually helpful because they expose, in concrete engineering terms, what the new audit surface looks like. The memory docs say each session begins with a fresh context window, but knowledge carries across sessions through user-authored CLAUDE.md files and model-authored auto memory. Both are loaded into each conversation, and Anthropic explicitly says Claude treats them as context, not enforced configuration. Auto memory is on by default, stored in machine-local markdown files, and auditable because those files can be read, edited, or deleted directly. That means memory is not a mystical emergent behavior. It is a stateful input plane with a file-backed representation. (Claude API Docs)
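Because memory is file-backed, the most basic audit control is also the most ordinary one: snapshot and diff the files. The sketch below assumes nothing about Claude Code's actual on-disk layout beyond "markdown files in a directory supplied by the caller."

```python
import hashlib
from pathlib import Path

def snapshot_memory(memory_dir: str) -> dict[str, str]:
    """Hash every markdown file under a memory directory so later sessions
    can be diffed against a known-good baseline. Illustrative sketch; the
    directory layout is an assumption, not a documented path."""
    root = Path(memory_dir)
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*.md"))
    }
```

A baseline captured at session start and compared at session end tells an auditor exactly which memory files the agent touched, before any question about why.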
The hooks system makes the execution plane even more explicit. Anthropic’s reference lists UserPromptSubmit, PreToolUse, PermissionRequest, PostToolUse, PostToolUseFailure, SubagentStart, SubagentStop, TaskCreated, TaskCompleted, and other lifecycle events. UserPromptSubmit can add context or block a prompt before processing. PreToolUse can allow, deny, or ask before a tool runs, and can modify tool input. PermissionRequest can automatically allow or deny on the user’s behalf and even rewrite the tool input that will be executed. For MCP tools, post-tool hooks can replace the tool’s output through updatedMCPToolOutput. These are not small details. They mean the runtime already contains first-class interception points that can mutate causality. (Claude)
The same docs show how tool use and MCP use merge. Anthropic says MCP server tools appear as regular tools in hook events such as PreToolUse, PostToolUse, PostToolUseFailure, and PermissionRequest. That makes MCP part of the same decision surface as built-in tools. If a white-box audit ignores MCP because it lives “outside” the app, it misses one of the most important trust transitions in the system. MCP is the protocol that connects agent applications to data sources, tools, and workflows. Once that bridge exists, external systems are not peripheral. They are part of the execution fabric. (Claude)
Subagents extend the surface again. Anthropic’s subagent docs say file-based subagents can carry their own mcpServers, hooks, memory, background, and isolation-related settings. A subagent can be given MCP servers that are not available in the main conversation. It can maintain persistent memory at user, project, or local scope. Background subagents run concurrently, receive permissions up front, inherit those permissions while running, and automatically deny anything not pre-approved. That combination means a real audit must reason about delegated execution and inherited authority, not just about one linear chat transcript. (Claude API Docs)
Configuration scope is equally important. Anthropic’s settings docs distinguish managed, user, project, and local scopes, and they explicitly position managed scope for organization-wide security policies and compliance requirements that cannot be overridden. The enterprise deployment guidance says a central team should configure MCP servers and check a .mcp.json file into the codebase for shared use. This is the clearest available signal that agent governance is becoming an enterprise configuration problem as much as a model problem. White-box auditing has to follow that shift. It must account for who controls policy, where policy is stored, how it is inherited, and how it changes across machines, repositories, and users. (Claude API Docs)
Anthropic’s security engineering posts add another missing piece: runtime containment. The company says Claude Code runs on a permission-based model and that sandboxing adds filesystem and network isolation. Anthropic also says that sandboxing safely reduced permission prompts by 84 percent in internal usage, while a later auto-mode post said Claude Code users approve 93 percent of permission prompts and that classifiers were introduced to reduce approval fatigue. Those two figures matter for auditors because they explain why “permission prompts” are not enough as a security control. If humans approve almost everything, then the meaningful control point is not the dialog box. It is the architecture around that dialog box: sandbox boundaries, policy hooks, domain allowlists, and traceable decision logic. (anthropic.com)
A compact way to think about the new surface is to stop asking where the model sits and start asking where authority changes hands. Authority moves when code is built, when a package is published, when a memory file is loaded, when a policy scope overrides another, when an MCP server receives a token, when a subagent inherits permissions, when a hook rewrites tool input, and when the system decides that a prompt, tool output, or prior memory is trustworthy enough to act on. Those are the joints where future white-box auditing has to live. (modelcontextprotocol.io)
| Legacy white-box auditing | Agentic software stack auditing |
|---|---|
| Source code is the primary behavior artifact | Behavior emerges from code, prompts, memory, policies, hooks, tool outputs, and delegation |
| Authorization lives mostly in application logic | Authorization is split across permission modes, hooks, MCP auth, sandbox rules, and config scopes |
| Build pipeline proves software origin | Build pipeline must also prove prompt bundles, policy bundles, tool manifests, and shipped artifacts |
| Runtime logs are usually enough for incident response | Runtime logs must include tool calls, memory reads and writes, hook decisions, subagent lineage, and egress |
| Post-fix validation focuses on patched routes | Post-fix validation must also test memory contamination, tool misuse, policy regressions, and delegated paths |
The table above synthesizes Anthropic’s Claude Code docs and engineering notes with the MCP authorization spec, OpenTelemetry’s GenAI conventions, and software supply-chain standards such as SLSA and in-toto. Taken together, those sources show that agent security has shifted from narrow app logic review to end-to-end system provenance and execution governance. (Claude API Docs)

Build and release provenance is now part of white-box security
The most direct lesson from the Claude Code incident is that build and release provenance can no longer be treated as secondary. SLSA describes provenance as verifiable information about software artifacts that records where, when, and how they were produced. In-toto goes further and describes the supply chain as a series of steps whose integrity and ordering can be attested, making transparent what happened, by whom, and in what order. Those ideas were already important for ordinary software. Agentic systems make them more important because the artifact now carries not just code, but also policy, tool wiring, and execution affordances. (SLSA)
The xz backdoor is still the clearest historical warning. NVD’s entry for CVE-2024-3094 says malicious code was discovered in upstream tarballs, and that the build process extracted a prebuilt object file from a disguised test file to modify specific functions during build. That incident showed, in painful detail, that what gets built and shipped can diverge materially from what reviewers think they saw in the source tree. Agentic software inherits that same risk, but with more consequences: if the shipped artifact also controls tool execution, network reach, memory handling, or approval logic, a compromised build affects not just code execution but decision execution. (nvd.nist.gov)
The npm documentation is blunt on this point. npm says .npmignore can be used to keep files out of a package, the files field in package.json can be used as an allowlist, and npm pack should be run locally to verify what will actually be published. It also warns that pretty much everything in the folder is exposed by default when published unless it is excluded. That guidance predates AI agents, but its relevance has increased. If the package includes source maps, fixture data, debug manifests, local credentials, or internal prompt resources, the published artifact may disclose far more than the code path developers believed they were shipping. (docs.npmjs.com)
For agent systems, the attestation target has to expand beyond ordinary binaries. A secure build record for an AI agent should include, at minimum, the source commit, the dependency graph, the bundle settings, the list of files actually published, the hashes of policy bundles, the hashes of shipped prompts or instruction files, the MCP manifest, and the sandbox or permission defaults compiled into the runtime. SLSA and in-toto already provide the language for describing and verifying artifact history. The missing step is to treat AI-native artifacts as first-class supply-chain subjects rather than as informal side files. (SLSA)
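What an expanded attestation might look like can be sketched directly. The statement layout below follows in-toto's published Statement format; treating prompts, policy bundles, and MCP manifests as first-class subjects is the proposal this section argues for, not an existing standard profile, and the file names and predicate contents are hypothetical.

```python
import hashlib
import json

def sha256_of(content: bytes) -> dict:
    return {"sha256": hashlib.sha256(content).hexdigest()}

def build_agent_release_statement(artifacts: dict[str, bytes]) -> dict:
    """Wrap an agent release in an in-toto-style attestation statement,
    with AI-native artifacts listed as subjects alongside the code bundle."""
    return {
        "_type": "https://in-toto.io/Statement/v1",
        "subject": [
            {"name": name, "digest": sha256_of(content)}
            for name, content in sorted(artifacts.items())
        ],
        "predicateType": "https://slsa.dev/provenance/v1",
        "predicate": {"buildType": "agent-cli-release"},  # illustrative stub
    }

statement = build_agent_release_statement({
    "dist/cli.js": b"bundled code",
    "policy/managed-settings.json": b'{"permissions": {}}',
    "prompts/system.md": b"You are a coding agent.",
    ".mcp.json": b'{"mcpServers": {}}',
})
print(json.dumps(statement, indent=2))
```

A verifier that checks these digests at install time would have caught a divergence between the reviewed tree and the shipped artifact, which is exactly the gap the xz incident exploited.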
A practical packaging guard for any CLI-based agent is surprisingly simple:
```shell
#!/usr/bin/env bash
set -euo pipefail

rm -f ./*.tgz

# Build the tarball without publishing and record what npm would ship.
npm pack --json > pack-manifest.json
TARBALL=$(jq -r '.[0].filename' pack-manifest.json)

echo "Contents of publish artifact:"
tar -tf "$TARBALL" | sort

# Fail if source maps or obvious debug artifacts are present.
if tar -tf "$TARBALL" | grep -E '\.map$|debug/|fixtures/|test-data/'; then
  echo "Unexpected publish artifact detected" >&2
  exit 1
fi

# Fail if the packaged file count changed unexpectedly.
EXPECTED_MAX=250
COUNT=$(tar -tf "$TARBALL" | wc -l | tr -d ' ')
test "$COUNT" -le "$EXPECTED_MAX"
```
The point of a guard like this is not elegance. It is forcing the real publish artifact to become reviewable before release. npm explicitly recommends npm pack as a way to verify what a package will include, and the Claude Code incident is a live reminder that release-time visibility matters as much as source review. (docs.npmjs.com)
Policy is code, prompt, memory, and configuration
A second major change is that policy no longer lives in one place. Anthropic’s memory documentation says CLAUDE.md files and auto memory are both loaded at the start of every conversation, and that Claude treats them as context rather than enforced configuration. That one sentence matters more than it first appears to. It means any audit that wants to explain why the agent behaved as it did must include not only the base prompt but also the exact instruction files, memory index, and scope resolution that were active for that session. (Claude API Docs)
The settings model compounds this. Managed scope, user scope, project scope, and local scope all have different trust semantics. Anthropic explicitly says managed scope is for security policies that must be enforced organization-wide and for compliance requirements that cannot be overridden, while project scope is shared via source control and local scope is repository-specific and gitignored. For a large enterprise, that means a single action may be shaped by organization-managed policy, repository-managed policy, local overrides, and machine-local memory. The audit question is no longer just “what did the code do?” It is “which policy stack was in effect when the code did it?” (Claude API Docs)
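The merge semantics can be made concrete with a small sketch. The precedence shown (managed over local over project over user) is an illustrative assumption chosen to preserve the documented invariant that managed scope cannot be overridden; the real resolution order may differ in detail.

```python
def effective_settings(user: dict, project: dict, local: dict, managed: dict) -> dict:
    """Merge configuration scopes into one effective policy. Illustrative:
    later scopes win, and managed is applied last so nothing below it can
    override an organization-wide rule."""
    merged: dict = {}
    for scope in (user, project, local, managed):  # lowest to highest priority
        merged.update(scope)
    return merged
```

The audit payoff is that "which policy stack was in effect" becomes a computable artifact: record the four input dicts and the merged output per session, and the question has a replayable answer.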
Hooks make this even more concrete. UserPromptSubmit can inject extra context before the model ever sees the user message. PreToolUse can deny, allow, or escalate a tool call and can rewrite its input before execution. PermissionRequest can automatically allow a request and change its input parameters. PostToolUse can block further progress or add context, and for MCP tools it can replace the tool output returned to the model. That means policy is not static metadata. It can be active runtime mediation. From an auditing standpoint, hooks are not just a customization feature. They are part of the execution logic and have to be captured as such. (Claude)
The most interesting part is that Anthropic’s own docs already hint at what good policy instrumentation looks like. The permissions docs recommend blocking curl and wget through deny rules, using WebFetch(domain:github.com) for allowlisted fetches, and using PreToolUse hooks to validate URLs in Bash commands. That is exactly the right design instinct for the next generation of white-box auditing. Do not try to divine safety from the model’s intention alone. Constrain the action surface, name the allowed domains and paths, and create explicit decision points whose outcomes can be logged and verified. (Claude)
An illustrative control script looks like this:
```python
# illustrative PreToolUse policy logic
# reads a proposed Bash command and denies external fetches
# unless the destination matches an approved allowlist
import json
import re
import sys

payload = json.load(sys.stdin)
tool = payload.get("tool_name")
cmd = (payload.get("tool_input") or {}).get("command", "")

ALLOWED = {"github.com", "docs.company.internal"}
BLOCK_PATTERNS = [r"\bcurl\b", r"\bwget\b", r"\bInvoke-WebRequest\b"]

def extract_domains(command: str) -> set[str]:
    return set(re.findall(r"https?://([A-Za-z0-9.-]+)", command))

if tool == "Bash" and any(re.search(p, cmd) for p in BLOCK_PATTERNS):
    domains = extract_domains(cmd)
    if not domains or not domains.issubset(ALLOWED):
        print(json.dumps({
            "hookSpecificOutput": {
                "hookEventName": "PreToolUse",
                "permissionDecision": "deny",
                "permissionDecisionReason": (
                    f"Outbound fetch blocked. Approved domains: {sorted(ALLOWED)}"
                ),
            }
        }))
        sys.exit(0)

sys.exit(0)
```
This kind of policy is useful even if the underlying model is excellent. Anthropic’s own engineering write-ups say prompt injection is a real concern once a coding agent can navigate a codebase and run commands, and their sandboxing design is built specifically to address that risk with filesystem and network isolation. That is the right frame: policy is part of the trusted computing base for the agent. Auditing must treat it that way. (anthropic.com)
Memory is a security surface, not a convenience feature
Memory deserves separate treatment because it breaks a habit that security teams still have from traditional application review. In classical software, persistence is usually explicit: a database table, a file, a cache entry, a durable job queue. In agent systems, persistence often looks like “just context.” That phrase is dangerous. Anthropic says auto memory is on by default, stored in plain markdown under a project-specific directory, and that Claude reads and writes memory files during a session. The first part of MEMORY.md is loaded into every conversation; other topic files can be read on demand. That is not fuzzy personalization. It is a persistent state layer with direct causal impact on future decisions. (Claude API Docs)
Once that is understood, two consequences follow. First, memory poisoning becomes a first-class audit concern. If an attacker can cause durable notes to be written into a project’s memory store, later sessions may inherit misleading build commands, altered operational assumptions, or unsafe workflow shortcuts. Second, memory provenance becomes necessary. A useful audit trail needs to answer who or what created a memory entry, under which task, after which tool outputs, and whether the entry was later edited or superseded. Otherwise investigators are left with a durable but contextless influence artifact. (Claude API Docs)
Subagent memory raises the stakes. Anthropic’s docs say subagents can maintain their own persistent memory and that their system prompt includes instructions for reading and writing to the memory directory. The docs also recommend asking a subagent to consult or update its memory as part of recurring work. That is operationally attractive, but it creates a second-order governance problem: future behavior may be influenced not only by direct human instructions and main-agent memory, but also by delegated agents recursively curating their own local knowledge bases. A future white-box audit has to represent that lineage explicitly. (Claude API Docs)
This is one place where the agentic stack really does demand new paradigms rather than new buzzwords. In ordinary application logging, it is enough to know that a row changed. In agent systems, you need something closer to “memory version control”: entry hash, author type, parent task ID, evidence links, supersedes relation, and load history. Without that, a responder cannot tell whether a harmful action came from current instructions, stale memory, poisoned memory, or legitimate accumulated experience. (Claude API Docs)
MCP turns external systems into part of the trusted runtime
MCP is where the stack stops pretending to be self-contained. Anthropic describes MCP as an open protocol for connecting AI applications to external systems, including data sources, tools, and workflows. The protocol is intentionally broad because agent usefulness depends on that breadth. The security consequence is equally broad: whatever the agent can reach through MCP becomes part of its effective attack surface and therefore part of the audit surface. (Claude API Docs)
The MCP authorization story is a good example of why this matters. The current specification says authorization is optional for MCP implementations, and when it is supported, HTTP-based transports should follow the authorization specification while stdio transports should instead retrieve credentials from the environment. A separate enterprise-managed authorization extension says organizations can centralize access through their identity provider rather than requiring each employee to authorize each server independently. This is a sensible evolution, but it also means auditors must reason about multiple deployment modes, some stronger than others, with different trust assumptions. (modelcontextprotocol.io)
The spec also names the confused deputy problem directly and requires OAuth 2.1-style protections, short-lived tokens, secure storage, and explicit token audience validation in the authorization guidance. Those are familiar security patterns. What is new is that they now sit inside an AI protocol used to wire models into external systems. The old IAM question was “who can call this API?” The new execution-governance question is “what can this agent convince a connected system to do, under what context, with what traceable justification, and through which delegated chain?” (modelcontextprotocol.io)
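Audience validation is the smallest of those protections to illustrate. The sketch below checks only the aud claim against the server's own resource identifier; signature and expiry checks are elided, and the claim names follow standard JWT usage rather than any MCP-specific schema.

```python
def token_audience_is_valid(claims: dict, this_resource: str) -> bool:
    """Reject any token whose aud claim does not explicitly name this MCP
    server. Accepting a token minted for some other resource is exactly
    the confused-deputy path the spec warns about."""
    aud = claims.get("aud")
    if aud is None:
        return False
    audiences = aud if isinstance(aud, list) else [aud]
    return this_resource in audiences
```

The same check, logged per request, also produces the "traceable justification" the execution-governance question asks for.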
Claude Code’s docs reinforce the operational reality. Anthropic recommends that one central team configure MCP servers and commit .mcp.json into the codebase for shared use. That is a production pattern, not a research toy. Once a repository ships a shared MCP manifest, the boundary between “source tree” and “runtime integration topology” is gone. The repo itself now defines some of the systems the agent can talk to. White-box auditing must therefore include repository-native integration manifests as part of normal review. (Claude API Docs)
This is also where tool-output integrity becomes critical. Anthropic’s hooks reference says MCP tools are just tools as far as lifecycle events are concerned, and post-tool hooks can replace MCP tool output before the model consumes it. That feature is useful for mediation and normalization. It also means the audit record must capture both original and transformed output if investigators want to preserve causal truth. A future auditor should be able to answer not only “what did the MCP server return?” but also “what final value did the model see after hook-time mutation?” (Claude)
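Preserving causal truth here needs only a small evidence record. Storing full output text may be impractical, so the sketch below keeps hashes plus a mutation flag; the schema is an assumption, not an existing log format.

```python
import hashlib

def tool_output_evidence(original: str, final: str) -> dict:
    """Record both what the MCP server returned and what the model actually
    saw after hook-time mutation, so replay can distinguish the two."""
    digest = lambda s: hashlib.sha256(s.encode()).hexdigest()
    return {
        "original_sha256": digest(original),
        "final_sha256": digest(final),
        "mutated_by_hook": original != final,
    }
```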
The CVEs already explain the new boundary
The CVE stream is one of the best ways to see that the new paradigm is not theoretical. The vulnerabilities that matter most for agent systems are not just “LLM jailbreaks.” They sit across build, gateway, IDE, codegen, and tool-handler layers. Each one maps cleanly to a part of the agentic software stack that defenders need to treat as auditable infrastructure.
Start with CVE-2024-3094 in xz. NVD says malicious code was inserted into upstream tarballs and activated during the liblzma build process by extracting a prebuilt object file from a disguised test file. This is relevant because every AI agent product still depends on conventional software supply-chain trust. If a maintainer can review source and still ship a malicious build artifact, then auditing the code alone is not enough. Build provenance, artifact comparison, and attestation become part of agent security whether teams like it or not. (nvd.nist.gov)
Next, CVE-2026-1868 in GitLab AI Gateway. NVD says the Duo Workflow Service component was vulnerable to insecure template expansion of user-supplied data via crafted Duo Agent Platform Flow definitions, with possible denial of service or code execution on the gateway. That is a nearly perfect example of why the workflow layer is part of the stack. The “AI” part of the system is not where the bug lived. The dangerous surface was a workflow gateway that mediated agent behavior. If a security program audits only model prompts and ignores the flow engine, it is auditing the branding layer, not the risk layer. (nvd.nist.gov)
CVE-2026-22785 in Orval is equally revealing. NVD says the MCP server generation logic in versions before 7.18.0 used string manipulation that incorporated the OpenAPI summary field without proper validation or escaping, allowing code injection. This is a strong signal that AI-native tool generation inherits old codegen mistakes in new places. The lesson is not just “sanitize strings.” The lesson is that tool manifests, schema ingestion, and agent-facing code generation are now part of the trusted path to execution. White-box auditing has to inspect them as rigorously as any templating engine or CI script. (nvd.nist.gov)
CVE-2026-33980 in the Azure Data Explorer MCP Server takes the point further into runtime. NVD says versions up to 0.1.1 contained KQL injection vulnerabilities in MCP tool handlers such as get_table_schema, sample_table_data, and get_table_details. That is exactly the kind of bug defenders should expect to see repeatedly in the next few years: established classes like query injection reappearing inside tool handlers that were introduced to make agents useful. The protocol is new; the failure mode is not. But because the handler sits behind an agent-facing abstraction, the impact can be amplified by autonomous retries, tool chaining, or confused-deputy effects. (nvd.nist.gov)
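The generic fix for this class is the same as it has always been: validate the identifier before it touches the query string. The handler name and the KQL-like query shape below are hypothetical; the pattern, not the specifics, is the point.

```python
import re

# Allowlist shape for a bare identifier: letters, digits, underscore only.
SAFE_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def build_schema_query(table_name: str) -> str:
    """Validate a caller-supplied table name before interpolating it into a
    query string. Rejecting anything with pipes, quotes, semicolons, or
    whitespace closes the injection path that raw string splicing opens."""
    if not SAFE_IDENTIFIER.match(table_name):
        raise ValueError(f"rejected table name: {table_name!r}")
    return f"{table_name} | getschema"
```

An agent-facing handler deserves this check even more than a human-facing one, because the "caller" may be a model acting on attacker-influenced context.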
CVE-2025-65715 in the VS Code Code Runner extension shows that the developer environment itself is still part of the boundary. NVD says the executorMap setting allowed arbitrary code execution when a crafted workspace was opened. That matters for agentic tooling because many coding agents now live inside IDEs, terminals, and workspace-aware extensions. A repository can carry not just source but also the local context that determines what an agent or extension will execute. White-box auditing has to include workspace and editor configuration review, especially when the agent is expected to act inside that environment. (nvd.nist.gov)
A closely related case, CVE-2025-65716 in Markdown Preview Enhanced, shows how content rendering can also become a stepping stone. NVD’s entry says a crafted Markdown file could trigger arbitrary code execution in the extension. That is relevant because agent workflows increasingly rely on generated artifacts, previews, and embedded documentation as part of development and review. When the agent environment renders untrusted content inside a privileged toolchain, preview logic becomes part of the execution boundary too. (nvd.nist.gov)
| CVE | Stack layer | Why it matters for agentic systems |
|---|---|---|
| CVE-2024-3094 | Build and release provenance | Shipped artifacts can diverge from reviewed source, so code review alone cannot prove trust |
| CVE-2026-1868 | Workflow and gateway layer | Agent orchestration services can become direct RCE and DoS surfaces |
| CVE-2026-22785 | Tool schema and codegen layer | Agent-facing tool generation can turn metadata ingestion into code execution |
| CVE-2026-33980 | MCP handler runtime | Tool handlers can recreate classic injection classes behind agent abstractions |
| CVE-2025-65715 | IDE and workspace layer | Repository-local settings can influence command execution in the developer environment |
| CVE-2025-65716 | Rendering and preview layer | Untrusted content viewed in agent-adjacent tools can become an execution path |
The pattern in the table is more important than any single entry. Risk is concentrating in orchestration code, gateway services, tool handlers, IDE glue, code generation, and packaging boundaries. Those are all parts of the agentic software stack. None of them are well described by phrases like “prompt safety” alone. (nvd.nist.gov)
Why pure model white-box techniques are not enough
None of this means model-internal white-box methods have stopped mattering. Anthropic’s hidden-objectives work explicitly studied white-box techniques that require model weights or internal activations, and the paper treats sparse autoencoders and related interpretability methods as part of a serious audit toolkit. If the question is whether a model has learned a latent objective or a brittle internal pattern, direct access to internals can reveal evidence that behavior-only probing may miss. (anthropic.com)
But the newer evidence makes a narrower claim more defensible than a grand one. Anthropic’s AuditBench write-up says white-box interpretability tools can help, but mostly on easier targets, while scaffolded black-box tools are most effective overall. Anthropic’s February 2026 risk report says the company still lacks enough understanding of internal states to make definitive judgments in isolation, and instead relies on converging evidence. The implication is not anti-interpretability. It is anti-monoculture. The future of white-box auditing is not “interpretability replaces system security.” It is “system security supplies the evidence graph within which interpretability can become decision-relevant.” (alignment.anthropic.com)
Anthropic’s own evaluation practice supports that reading. The risk report says many automated behavioral audit transcripts included the real Claude Code system prompt and tool set used internally. Petri, Anthropic’s open-source automated auditing framework, is described as crafting environments, running multi-turn audits with human-like messages and simulated tools, and scoring transcripts to surface concerning behavior. The Petri 2.0 update added realism mitigations specifically to counter eval-awareness. That is the methodological direction that matters: realistic, system-aware, multi-turn auditing with explicit defenses against the model recognizing it is under test. (anthropic.com)
For defenders, the operational takeaway is clear. If an agent system ships without artifact provenance, policy lineage, runtime traces, and replayable evidence, adding a mechanistic interpretability dashboard later will not fix the core governance gap. Conversely, if the system already records what memory was loaded, what tool was called, which hook changed the arguments, what MCP server responded, what network destinations were contacted, and how the final action was justified, then model-internal tools become much more useful because they can be aligned with concrete events rather than studied in the abstract. (Claude API Docs)

A future white-box architecture for the agentic software stack
A serious post-leak architecture should start from a simple premise: every consequential agent action should be reconstructable as a state transition. Not just the final tool call. The whole transition. That means the evidence model has to capture at least the build artifact, the loaded instruction set, the memory snapshot, the active policy stack, the tool and MCP topology, the runtime trace, and the resulting external effects. If any of those are missing, the audit loses explanatory power. (SLSA)
One workable design is a six-plane architecture.
The first plane is the attested build plane. This plane uses SLSA provenance and in-toto attestations to prove what was built, from what source, by which pipeline, and with which declared steps. In an agentic environment, the subject set has to be broader than a binary or a container. It should include shipped prompts, policy bundles, tool manifests, extension descriptors, and repository-checked integration files such as .mcp.json. If those artifacts affect behavior, they belong in the attestation graph. (SLSA)
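To make the broader subject set concrete, here is a minimal sketch of assembling an in-toto v1 Statement with a SLSA v1 provenance predicate whose subjects include behavior-affecting files such as prompts and .mcp.json. The statement and predicate type URIs are the public in-toto/SLSA identifiers; the builder ID, build type, and file names are illustrative placeholders, and a real pipeline would sign and publish this via its attestation tooling rather than emit a bare dict.

```python
import hashlib
import json
from pathlib import Path

def sha256_hex(path: str) -> str:
    """Hash a file so the attestation binds to its exact shipped bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def build_statement(artifacts: dict, builder_id: str, build_type: str) -> dict:
    """Assemble an in-toto v1 Statement with a SLSA v1 provenance predicate.

    `artifacts` maps logical subject names to file paths. The subject set
    deliberately covers prompts, policy bundles, and integration files like
    .mcp.json, not just the compiled package.
    """
    return {
        "_type": "https://in-toto.io/Statement/v1",
        "subject": [
            {"name": name, "digest": {"sha256": sha256_hex(path)}}
            for name, path in artifacts.items()
        ],
        "predicateType": "https://slsa.dev/provenance/v1",
        "predicate": {
            "buildDefinition": {"buildType": build_type},
            "runDetails": {"builder": {"id": builder_id}},
        },
    }
```

The design point is simply that anything capable of changing agent behavior gets a digest in the subject list, so a later audit can prove the deployed prompt or MCP manifest matches what was reviewed.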
The second plane is the policy mediation plane. This plane resolves scope and precedence across managed settings, project settings, local overrides, CLAUDE.md rules, hook logic, permission rules, and sandbox boundaries. Its job is not only to authorize. Its job is to make authorization explainable. When a tool call is allowed, denied, rewritten, or escalated, the plane should emit a policy decision record with the inputs, the matched rules, the decision, and any modified fields. Anthropic’s hook model already provides the control vocabulary needed for this: allow, deny, ask, updated input, additional context, and post-tool output replacement. (Claude API Docs)
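A policy decision record of that kind can be sketched in a few lines. The rule shape below (an id, a glob-style pattern, and a mode drawn from the allow/deny/ask vocabulary) is an illustrative stand-in for the actual Claude Code settings schema, and first-match-wins plus default-deny are design choices of this sketch, not documented behavior.

```python
import fnmatch
import json
import time
import uuid

def mediate_tool_call(tool_name: str, arguments: dict, rules: list):
    """Resolve a tool call against an ordered rule stack and emit a
    policy decision record alongside the decision itself.

    `rules` entries look like {"id": ..., "pattern": ..., "mode": ...},
    evaluated first-match-wins against "ToolName(serialized-args)".
    """
    probe = f"{tool_name}({json.dumps(arguments, sort_keys=True)})"
    for rule in rules:
        if fnmatch.fnmatchcase(probe, rule["pattern"]):
            return rule["mode"], {
                "decision_id": f"pol_{uuid.uuid4().hex[:8]}",
                "ts": time.time(),
                "tool": tool_name,
                "matched_rule": rule["id"],
                "mode": rule["mode"],
            }
    # Default-deny means no authorization can ever be unexplained:
    # every allowed action traces back to a named rule.
    return "deny", {
        "decision_id": f"pol_{uuid.uuid4().hex[:8]}",
        "ts": time.time(),
        "tool": tool_name,
        "matched_rule": None,
        "mode": "deny",
    }
```

The record, not the decision, is the point: when an incident reviewer asks why a network call was permitted, the answer is a matched rule id, not a reconstruction from memory.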
The third plane is the execution provenance plane. This is where runtime observability becomes a first-class security control. OpenTelemetry’s GenAI work already defines inference spans, agent spans, and opt-in events that may capture user inputs and model responses. Those conventions are still in development, but they are far enough along to provide a shared vocabulary for conversation IDs, model names, token usage, tool calls, and agent operations. The right design is not to treat observability as a dashboard feature. It is to treat it as the raw material for forensic reconstruction. (GitHub)
A minimal event model for agentic auditing might look like this:
```json
{
  "event_id": "evt_01J...",
  "ts": "2026-04-01T10:22:14.921Z",
  "conversation_id": "conv_92a",
  "span_id": "spn_204",
  "parent_span_id": "spn_199",
  "event_type": "tool_call",
  "actor": {
    "kind": "subagent",
    "name": "browser-tester",
    "agent_id": "agent-abc123"
  },
  "policy": {
    "decision_id": "pol_778",
    "matched_rule": "allow_webfetch_github_only",
    "mode": "allow"
  },
  "memory": {
    "loaded_memory_hash": "sha256:...",
    "loaded_claude_md_hashes": ["sha256:..."]
  },
  "tool": {
    "name": "mcp__github__search_repositories",
    "arguments_hash": "sha256:...",
    "raw_arguments_redacted": true,
    "result_hash": "sha256:..."
  },
  "effects": {
    "filesystem_writes": [],
    "network_destinations": ["api.github.com"]
  }
}
```
The fourth plane is the replay and counterfactual plane. Incident response for agent systems is not complete when the team can replay the same transcript. It becomes truly useful when the team can ask counterfactual questions. What happens if the memory entry is removed? What if the MCP output is replaced by the unmodified original? What if the task is run without a background subagent? What if the policy stack is rolled back to the previous managed version? A modern white-box audit should be able to replay not just the visible interaction, but the causal environment around it. (Claude)
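The counterfactual questions above reduce to one mechanical pattern: rerun the task against a snapshot, with and without a targeted intervention, and diff the outcomes. This is a minimal sketch of that harness; the environment-as-dict shape is illustrative, since a real system would restore memory, policy stacks, and recorded tool outputs from the evidence store.

```python
import copy

def replay(task_fn, environment: dict, interventions: dict = None):
    """Re-run a recorded agent task under an optionally modified environment.

    `environment` is a snapshot (memory entries, policy stack, recorded
    tool outputs); `interventions` overrides selected keys.
    """
    env = copy.deepcopy(environment)
    env.update(interventions or {})
    return task_fn(env)

def compare(task_fn, environment: dict, interventions: dict) -> dict:
    """Answer 'what would have happened if X changed' by diffing outcomes."""
    baseline = replay(task_fn, environment)
    counterfactual = replay(task_fn, environment, interventions)
    return {
        "baseline": baseline,
        "counterfactual": counterfactual,
        "diverged": baseline != counterfactual,
    }
```

For example, removing a single memory entry and observing that the harmful action disappears is direct causal evidence that the memory entry, not the model in the abstract, drove the behavior.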
The fifth plane is the continuous auditor plane. This is where Petri-style behavioral auditing belongs. Anthropic’s public work on Petri describes an automated auditor that crafts environments, runs multi-turn audits, and scores transcripts, while Petri 2.0 adds realism mitigations to reduce eval-awareness. In a production security program, that translates into regression seeds, honeypot tasks, realism filters, suspicious execution motifs, and scenario libraries tied to actual postmortems. The key is that these evaluations should run against the stack the organization actually deploys, not an abstracted model divorced from its tools. (anthropic.com)
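A regression seed in this sense can be very small. The sketch below pairs a seed scenario with a forbidden-motif check over transcript tool calls; plain substring matching is a deliberate simplification standing in for real suspicious-execution-motif detection, and the field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class SeedScenario:
    """One regression seed, ideally tied to an actual postmortem."""
    seed_id: str
    task_prompt: str
    forbidden_motifs: list  # execution patterns that should never appear

def score_transcript(scenario: SeedScenario, transcript_events: list) -> dict:
    """Flag a transcript if any forbidden motif appears in its tool calls.

    `transcript_events` is a list of (tool_name, argument_string) pairs.
    """
    hits = [
        motif
        for motif in scenario.forbidden_motifs
        for tool, args in transcript_events
        if motif in f"{tool}:{args}"
    ]
    return {"seed_id": scenario.seed_id, "violations": hits, "passed": not hits}
```

Run against the deployed stack on every release, a library of such seeds turns "does the agent still behave safely" from a periodic judgment call into a regression suite.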
The sixth plane is the evidence and adjudication plane. This plane turns traces and attestations into decisions people can use. It should support security triage, engineering debugging, compliance evidence, and external assurance. Anthropic’s own engineering posts on evals and long-running harnesses make a useful point here: teams without evaluations end up in reactive loops, fixing failures only in production and struggling to tell regressions from noise. The same is true for security. If a team cannot compare the current trace shape to a known-good baseline, it will end up arguing about anecdotes instead of proving state changes. (anthropic.com)
| Plane | Primary question | Required evidence | Typical failure if absent |
|---|---|---|---|
| Attested build plane | What exactly was shipped | Provenance, artifact hashes, publish manifest, signed attestations | Source review does not match deployed reality |
| Policy mediation plane | Why was this action allowed | Rule matches, scope resolution, hook decisions, sandbox config | Teams cannot explain authorization or containment failures |
| Execution provenance plane | What happened in order | Spans, events, tool calls, network egress, file effects, conversation IDs | Incidents cannot be reconstructed with confidence |
| Replay and counterfactual plane | What would have happened if X changed | Snapshotting of memory, configs, tool outputs, task graph | Fix validation becomes guesswork |
| Continuous auditor plane | Does the stack still behave safely | Seed scenarios, realism filters, scoring, regressions | Teams learn about drift only after user-facing failures |
| Evidence and adjudication plane | Can humans trust the conclusion | Reports, raw evidence links, diff views, control mappings | Findings do not survive engineering or compliance scrutiny |
This architecture is not academic theater. Every plane in the table is grounded in public supply-chain standards, public Claude Code docs, public MCP specs, or Anthropic’s public work on agent evaluation. What is new is the insistence that these pieces belong together as one auditable system rather than as separate disciplines. (SLSA)
How to instrument the runtime without creating a surveillance mess
A predictable objection is that full-stack observability will capture too much sensitive information. That risk is real. OpenTelemetry’s GenAI documentation explicitly warns that some tool-call fields may contain sensitive information, and its events guidance says user inputs and model responses may be captured as opt-in events rather than as mandatory defaults. That is the correct design instinct. The answer is not “log everything forever.” The answer is selective, structured capture designed around security-relevant causality. (OpenTelemetry)
In practice, that means hashing or tokenizing sensitive tool arguments, separating raw retention from normalized retention, and logging both original and transformed values only where policy mediation can change them. It also means distinguishing between what is required for replay and what is required only for aggregate monitoring. A replay system may need the exact MCP output for a short retention period inside a sealed evidence store. A monitoring pipeline may need only the output hash, tool type, latency, and control outcome. The white-box architecture should support both paths explicitly rather than conflating them. (OpenTelemetry)
An illustrative tracing wrapper might look like this:
```python
import hashlib
import json

from opentelemetry import trace

tracer = trace.get_tracer("agent.audit")

def sha256_json(value):
    """Stable digest of a JSON-serializable payload, so traces can bind
    to inputs and outputs without retaining the raw sensitive data."""
    blob = json.dumps(value, sort_keys=True, separators=(",", ":")).encode()
    return "sha256:" + hashlib.sha256(blob).hexdigest()

def traced_tool_call(conversation_id, tool_name, args, call_fn):
    with tracer.start_as_current_span(f"tool {tool_name}") as span:
        span.set_attribute("gen_ai.conversation.id", conversation_id)
        span.set_attribute("gen_ai.tool.name", tool_name)
        span.set_attribute("agent.audit.args_hash", sha256_json(args))
        span.set_attribute("agent.audit.raw_args_retained", False)
        try:
            result = call_fn(args)
            span.set_attribute("agent.audit.result_hash", sha256_json(result))
            span.set_attribute("agent.audit.status", "ok")
            return result
        except Exception as exc:
            span.set_attribute("error.type", exc.__class__.__name__)
            span.set_attribute("agent.audit.status", "error")
            raise
```
The important property of code like this is not the exact API shape. It is the modeling choice. The trace should bind agent decisions to conversation IDs, policy decisions, tool identities, and stable hashes of inputs and outputs. Once that exists, investigators can correlate a suspicious action to the precise chain that produced it, even if sensitive payloads have been minimized or redacted. That is a far stronger security posture than a pile of ad hoc JSON logs or screenshots of chat transcripts. (GitHub)
From white-box findings to black-box proof
One of the easiest mistakes in this space is to stop after architectural understanding. That is not enough. White-box auditing can tell a team that a risky call chain exists, that a memory boundary is too permissive, that a gateway expands templates unsafely, or that an MCP server reaches farther than intended. But the operational question after that is always the same: is the path reachable under realistic conditions, and does the proposed fix actually hold when the full stack is exercised? (penligent.ai)
That is why the best post-leak workflow is hybrid by design. Use white-box findings to generate a testable exploit or failure hypothesis. Then run black-box or gray-box validation against a realistic environment, with the right identity context, network boundaries, compensating controls, and subagent behavior. White-box review finds latent paths. Black-box proof establishes whether those paths survive contact with reality. The two methods are complements, not competitors. (penligent.ai)
This is one place where Penligent’s public material maps naturally onto the workflow. Its English write-up on Claude Code security frames the distinction cleanly: white-box auditing identifies missing checks across call chains, while black-box validation proves whether an attacker can actually reach a route through gateways and role transitions. On the product side, Penligent’s homepage emphasizes 200-plus supported tools, CVE-to-PoC flow, evidence-first reproducibility, and editable reports aligned with SOC 2 and ISO 27001. Used carefully, that combination fits the exact gap between architectural suspicion and validation evidence. (penligent.ai)
That does not mean every team needs the same product. It means the workflow itself is maturing. Future white-box programs will not end with a PDF that says “possible issue in the tool chain.” They will end with a proof bundle: attestation evidence, trace evidence, exploit reachability evidence, fix diff, replay output, and regression seed. That is the standard AI agent security should be moving toward. (penligent.ai)
The academic research agenda is finally concrete
For researchers, this transition opens a much better set of questions than the vague phrase “AI safety auditing” usually suggests. The first question is trace completeness. What is the minimal event set that makes a post hoc explanation sound enough for security decisions? Too little logging destroys causal clarity. Too much logging destroys privacy and operability. This is not just a systems problem. It is a formalization problem. A useful research program should define which agent-state transitions must be observable for a replay to preserve the security-relevant semantics of the original execution. (OpenTelemetry)
The second question is policy-grounded causality. Existing logging can tell a team that a rule existed. What is harder is proving that the rule changed the outcome. Did the action happen because of a project-scoped CLAUDE.md, because of auto memory, because a PermissionRequest hook rewrote the tool input, or because an MCP output was normalized post hoc? Research here should move beyond generic attribution and toward counterfactual causality over structured agent traces. Anthropic’s hook model, with pre-tool and post-tool decision points, gives a strong public substrate for this kind of work. (Claude)
The third question is selective-disclosure white-box auditing. Anthropic’s public sabotage-risk report says parts of the report were redacted because the unredacted text could raise misuse risk or disclose commercially sensitive information without enough public benefit. That is a realistic constraint for commercial systems. The future will not be universal open source. It will often be verifiable but selectively disclosed systems. That creates room for cryptographic audit mechanisms, attested summaries, third-party evaluator access patterns, and zero-knowledge-style assurance ideas that let outsiders verify specific properties without seeing the entire runtime. (anthropic.com)
The fourth question is memory governance as an auditing science. Anthropic’s docs make memory tangible enough that rigorous benchmarks are now possible. Researchers can compare poisoned and unpoisoned memory corpora, vary scope between project, user, and local, and measure how much a harmful memory entry shifts later tool-use behavior. They can also study whether memory-specific defenses should prioritize provenance, content filtering, decay, conflict resolution, or scope isolation. This is far more measurable than the older and fuzzier question of whether a model “remembers” something in a general sense. (Claude API Docs)
The fifth question is multi-agent provenance. Anthropic’s subagent docs already expose SubagentStart, SubagentStop, agent IDs, agent transcript paths, inherited permissions, and background execution behavior. That means researchers can now work on provenance models for delegated autonomy using real public semantics rather than toy abstractions. How should parent and child traces be merged? What is the right lineage model for shared memory, borrowed permissions, or cross-agent task creation? At what point does provenance become too expensive to preserve at scale? Those are excellent systems-and-security paper questions. (Claude)
The sixth question is tool-to-agent conversion loss. AuditBench’s discussion of a tool-to-agent gap is especially important here. Useful evidence in isolation does not automatically help an investigator agent form the right hypothesis. That same issue will show up in production auditing: a trace may contain the right facts, but the auditor—human or model—may fail to synthesize them into the right causal explanation. Research on better scaffolds, evidence summarization, and graph-based audit interfaces is likely to matter as much as research on the raw tools themselves. (alignment.anthropic.com)
The seventh question is predictive validity. Many controls look strong on paper and weak in operation. Anthropic’s work on evals, long-running harnesses, and realism filtering repeatedly makes the same point: environments, scaffolds, and realism matter. A publish-time provenance control is useful only if it predicts fewer deployment failures. A trace schema is useful only if it predicts faster and more accurate incident resolution. A memory defense is useful only if it predicts lower task-level compromise under contamination. Agentic white-box auditing needs benchmarks that measure those outcomes directly. (anthropic.com)
What teams should actually build next
For most engineering organizations, the next twelve months should not begin with a moonshot interpretability platform. They should begin with disciplined infrastructure. Inventory the actual artifacts that influence agent behavior. Attest the build and publish chain. Centralize policy scopes. Make .mcp.json, hooks, and memory settings reviewable. Add trace identifiers that bind prompts, tools, policy decisions, and effects together. Then build replay for the highest-risk tasks. Only after that foundation exists does it make sense to layer on richer auditor agents, formal causality analysis, or internal-feature tooling. (docs.npmjs.com)
The order matters because the failure modes are unevenly distributed. Teams already know how to debate model behavior in the abstract. They are much less good at answering simple concrete questions like these: which memory entries were loaded, what exact artifact was installed, what policy rule allowed this network call, which MCP server issued this data, and what changed after the patch. The fastest way to improve security is to make those questions easy to answer. (Claude API Docs)
Anthropic’s own public engineering material points in the same practical direction. Good evals make failures visible before they hit users. Harness design is key for long-running autonomous work. Code execution with MCP requires sandboxing, resource limits, and monitoring. Those are not niche observations. They are a blueprint for operational maturity in agent systems. White-box auditing in the post-leak era is simply the discipline of turning that maturity into evidence. (anthropic.com)
The new paradigm is not more source, but more proof
The Claude Code leak became news because it revealed how fragile old assumptions are. If a source map in a package can expose so much of a production coding agent, then the meaningful security boundary was never just the model and never just the repository. It was the full path from authored source to published artifact to loaded context to tool-mediated action. That path is the agentic software stack. (The Verge)
That is the paradigm shift. White-box auditing after the leak is not about reading more leaked code forever. It is about building systems where every consequential agent action can be verified, attributed, replayed, and tested against reality. Build provenance has to be signed. Policies have to be scoped and explainable. Memory has to be auditable. MCP access has to be governable. Runtime traces have to be structured. Findings have to become black-box proof. The teams that adopt that model will have something much better than an AI security slogan. They will have evidence. (SLSA)
Further reading and reference links
Anthropic’s Claude Code documentation is the best public source for understanding the actual runtime surface: the overview, memory system, hooks, subagents, settings scopes, and enterprise MCP deployment guidance all expose the control points that modern white-box auditing has to model. (anthropic.com)
For protocol and observability design, the MCP authorization specification and enterprise-managed authorization extension show how access control is evolving, while OpenTelemetry’s GenAI spans and events show where a shared tracing vocabulary for agent runtimes is starting to solidify. (modelcontextprotocol.io)
For supply-chain rigor, SLSA provenance, in-toto, and the xz backdoor record are essential reading. They make the case that trustworthy software is not just reviewed source; it is verifiable build history and verifiable execution of the supply chain. (SLSA)
For alignment and audit methodology, Anthropic’s work on hidden objectives, AuditBench, the February 2026 risk report, and Petri show how the field is moving toward realistic multi-turn auditing that combines tools, prompts, and runtime conditions rather than treating the model as an isolated object. (anthropic.com)
For topic-adjacent Penligent reading that fits naturally with this workflow, the most relevant internal links are its piece on moving from white-box findings to black-box proof, its article on the Agentic Security Initiative in the MCP era, its write-up on the Claude Code source map leak, and the main product page for evidence-first offensive validation workflows. (penligent.ai)

