Sandbox Series · Part 1 · 2026-06-12

Sandboxing AI Agents, Part 1: Foundations

A sandbox is not a magic box. It is a carefully stated security claim: this process may do these things, may not do those things, and there is evidence when it presses against the boundary.

By Koukyosyumei · Reading time: 15 min · Tags: Sandbox · Threat model · AI agents

An AI coding agent is a strange security subject. It is not exactly malware, because you invited it in. It is not exactly a normal developer tool, because it reads untrusted project text and then chooses commands. It is not exactly a compiler, because it can browse your repository, run package managers, inspect environment variables, invoke shells, and decide that the next step is a network request. Sandboxing is the discipline of making that power conditional.

Series map. Start here for the AI agent sandbox threat model, then read the parts on implementation, sandbox comparison, and h5i's env design.

The basic definition is simple: a sandbox is an enforcement boundary around a workload. The boundary gives the workload some capabilities and withholds others. A good sandbox answers four questions: what object is confined, which operations are allowed, who enforces the rule, and how violations are observed. If any of those answers is vague, the word "sandbox" is marketing rather than a security property.

If you are new to operating-system security, read "capability" as "permission to do one concrete thing." A process does not need a philosophical right to "use the machine." It needs specific handles: read this directory, write that checkout, open this network connection, create at most this many child processes, and run for this much time. A sandbox is the machinery that makes those handles smaller than the user's normal shell.

Figure: Layered AI agent sandbox architecture. A useful AI-agent sandbox is layered. File isolation without network control still permits exfiltration. Network control without provenance still leaves review blind.

A plain-language mental model

Imagine lending a contractor a room in your office. A weak arrangement says "please do not open the file cabinets." A sandboxed arrangement changes the room: the cabinets are not there, the phone can only dial approved numbers, the power strip has a breaker, and the door log records who entered and what they carried out. The contractor may still do useful work, but the room no longer has the full authority of the building.

For software, the "room" is the process environment. Filesystem rules decide which cabinets exist. Network rules decide which numbers can be dialed. Process rules decide whether the workload can inspect or signal other programs. Resource rules decide how much power it can draw. Audit rules decide what evidence remains after the work finishes.

AI agent sandbox threat model

Sandboxing starts with an adversary model, not with a tool. For AI agents, the most common adversary is not a person typing exploit code directly. It is a chain: repository content, dependency metadata, test output, package install scripts, web pages, generated code, and the model's own tool choices. Any of those can cause the agent to execute a command that the human never intended.

A practical threat model has three levels. First is accidental damage: a command writes outside the project, deletes state, consumes all memory, or sends data to the wrong host. Second is prompt-injected behavior: untrusted text instructs the agent to read secrets, bypass policy, or exfiltrate evidence. Third is hostile code execution: a dependency, test, binary, or generated program intentionally attacks the sandbox itself.

Those levels require different boundaries. Worktree isolation handles accidental file collisions between parallel agents. Kernel-level process isolation handles ordinary process abuse. Network egress policy handles exfiltration. MicroVM isolation handles the uncomfortable case where the workload may be carrying a kernel exploit. One tier cannot honestly claim all of them unless it actually implements all of them.
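The pairing of concern and boundary above can be written down as a lookup, which makes an overclaim easy to spot. This is a minimal sketch, not a real tool: the `Boundary` enum, the `REQUIRED` table, and `unsupported_claims` are hypothetical names, and the mapping is copied from the paragraph rather than from any product.

```python
from enum import Enum


class Boundary(Enum):
    WORKTREE = "worktree isolation"
    KERNEL = "kernel isolation"
    EGRESS = "network egress policy"
    MICROVM = "microVM isolation"


# Hypothetical table: each concern from the paragraph above, and the
# boundary the paragraph says handles it.
REQUIRED = {
    "parallel file collisions": Boundary.WORKTREE,
    "ordinary process abuse": Boundary.KERNEL,
    "exfiltration": Boundary.EGRESS,
    "kernel exploit": Boundary.MICROVM,
}


def unsupported_claims(claims, implemented):
    """Return the claimed concerns whose boundary is not implemented."""
    return sorted(c for c in claims if REQUIRED[c] not in implemented)


# A process sandbox with egress control that also claims kernel-exploit
# containment is making one claim it cannot back.
implemented = {Boundary.KERNEL, Boundary.EGRESS}
print(unsupported_claims(["exfiltration", "kernel exploit"], implemented))
```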

Permission prompts are not sandboxing. A prompt asks the user before a tool call. A sandbox enforces a rule even after the user, model, dependency, and shell have all made mistakes. Prompts are useful workflow controls; they are not a containment boundary.

Capabilities, not vibes

A sandbox policy is best understood as a capability table. The process may read these paths, write those paths, open these sockets, fork this many children, allocate this much memory, and run for this much time. Everything else is denied. This is the same mental model as object capabilities: possession of a handle authorizes a specific operation, and absence of the handle means no ambient permission.

Ambient authority is the default danger of a developer shell. Your shell can usually read your SSH keys, cloud credentials, browser profile, package-manager tokens, local databases, and the full repository. When an agent inherits that shell, the agent inherits that authority. A sandbox tries to replace ambient authority with named authority.

Capability	Bad default	Sandboxed form	Failure prevented
Filesystem read	home directory	project worktree plus selected read-only system paths	secret discovery
Filesystem write	any path user can write	worktree only	host damage, dotfile poisoning
Network	full internet	deny, or explicit egress allowlist	data exfiltration, callback channels
Process	host PID namespace	private PID namespace, signal limits	host process inspection or signaling
Resources	host user limits	cgroups, rlimits, timeout	runaway builds, fork bombs, disk fill
Privilege	user's normal privilege surface	no-new-privs, dropped caps, syscall policy	privilege escalation paths
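To make the table concrete, here is a small deny-by-default model of a capability policy. It is only a sketch for reasoning about policy: `Policy`, `allows`, and the example paths are hypothetical, and the check is lexical, so it is not an enforcement mechanism. Real enforcement belongs in the kernel or a supervisor.

```python
import posixpath
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    read_roots: tuple = ()
    write_roots: tuple = ()
    egress_hosts: frozenset = frozenset()


def _under(path, roots):
    # Lexical check only: collapses ".." but does not resolve symlinks.
    path = posixpath.normpath(path)
    for root in roots:
        root = posixpath.normpath(root)
        if path == root or path.startswith(root.rstrip("/") + "/"):
            return True
    return False


def allows(policy, op, target):
    """Named authority only: any operation not listed here is denied."""
    if op == "read":
        return _under(target, policy.read_roots)
    if op == "write":
        return _under(target, policy.write_roots)
    if op == "connect":
        return target in policy.egress_hosts
    return False


# Illustrative policy: read the worktree and system paths, write the worktree.
POLICY = Policy(
    read_roots=("/work/repo", "/usr", "/lib"),
    write_roots=("/work/repo",),
    egress_hosts=frozenset(),  # network denied
)

print(allows(POLICY, "write", "/work/repo/src/main.py"))
print(allows(POLICY, "read", "/home/user/.ssh/id_ed25519"))
```

The final `return False` is the important line: an operation the policy author never thought about is refused rather than inherited from the shell.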

Core sandbox primitives

The names in sandbox diagrams are easy to skim past, but each one protects a different axis of authority. Landlock is a Linux security module that lets an unprivileged process restrict its own filesystem access with allowlists: for example, read these system directories and write only this worktree. seccomp is a Linux syscall filter. It can deny or trap kernel calls such as mount, ptrace, bpf, or socket families the workload should not use.

Namespaces give a process a private view of some kernel resources. A PID namespace hides host processes, a mount namespace gives a private filesystem layout, a user namespace maps privilege inside the sandbox without giving host root, and a network namespace gives the workload its own network stack. cgroups are Linux resource controllers: they bound memory, process count, CPU, and sometimes I/O so a build, test, or generated program cannot consume the whole machine.

Egress means outbound communication from the sandbox to something else. In agent security, egress usually means network egress: can the workload contact the internet, package registries, model APIs, internal services, or raw IP addresses? A deny-network sandbox removes the channel. An egress allowlist permits only named destinations. This matters because filesystem isolation does not stop data theft if the process can still send the data somewhere.

primitive cheat sheet

Landlock    # file tree rules: this process may read/write only these paths
seccomp     # syscall rules: this process may not ask the kernel for dangerous operations
namespace   # private view: this process sees its own /proc, mounts, network, or users
cgroup      # resource budget: this process group gets bounded memory, CPU, PIDs, I/O
egress      # outbound channel: this process may contact only these destinations
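Most of these primitives need dedicated tooling, but per-process resource limits (rlimits, a simpler sibling of cgroups) can be tried from the Python standard library. The sketch below assumes a POSIX host; `run_limited` and its default values are hypothetical, and `preexec_fn` is documented as unsafe in multithreaded programs, so treat this as a demonstration rather than a production launcher.

```python
import resource
import subprocess
import sys


def _lower(kind, value):
    # Lower both soft and hard limits, never asking for more than the
    # hard limit already in force (raising it would fail).
    _, hard = resource.getrlimit(kind)
    if hard != resource.RLIM_INFINITY:
        value = min(value, hard)
    resource.setrlimit(kind, (value, value))


def run_limited(cmd, cpu_seconds=5, max_file_bytes=1 << 20, wall_seconds=10):
    """Run cmd with kernel-enforced rlimits plus a wall-clock timeout."""

    def apply_limits():
        # Runs in the child after fork and before exec.
        _lower(resource.RLIMIT_CPU, cpu_seconds)       # CPU time budget
        _lower(resource.RLIMIT_FSIZE, max_file_bytes)  # largest file it may write
        _lower(resource.RLIMIT_CORE, 0)                # no core dumps

    return subprocess.run(
        cmd,
        preexec_fn=apply_limits,
        capture_output=True,
        text=True,
        timeout=wall_seconds,  # enforced by the parent, not the kernel
    )


if __name__ == "__main__":
    result = run_limited([sys.executable, "-c", "print('ok')"])
    print(result.returncode, result.stdout.strip())
```

The rlimits here are enforced by the kernel, while the wall-clock timeout is enforced by the parent process. That distinction is the same one the evaluation checklist asks about: kernel-enforced versus best-effort.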

Isolation versus mediation

There are two broad enforcement styles. Isolation changes what the workload can see: a private filesystem view, a private PID namespace, a private network namespace, or a whole virtual machine. Mediation leaves the world conceptually visible but inserts a decision point: a seccomp-notify supervisor, an HTTP proxy, a broker for secrets, or a file-open policy engine.

Isolation is simple to reason about when it is complete. If there is no network device, the process cannot send packets. Mediation is more flexible when decisions depend on runtime state. A proxy can allow pypi.org but deny paste.example; a secrets broker can release a token only to one command and redact it from logs. Most real sandboxes combine both styles.

Question	Isolation answer	Mediation answer
Can it read `~/.ssh`?	Do not mount or allow that path.	Ask a file broker whether this open is allowed.
Can it reach the internet?	Give it a network namespace with no route out.	Route traffic through an allowlist proxy or socket gate.
Can it use a secret?	Do not put the secret in the environment.	Release a named secret through a broker and redact logs.
Can it fork forever?	Put it in a cgroup with a PID budget.	A supervisor kills the run when the budget is exceeded.
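The proxy decision mentioned above (allow pypi.org, deny paste.example) reduces to a host check. The following is a minimal sketch of such a check, assuming the proxy already knows the requested hostname; `egress_allowed` and the wildcard syntax are hypothetical, not the API of any particular proxy.

```python
import ipaddress


def egress_allowed(host, allowlist):
    """Return True only if host is a named destination on the allowlist."""
    host = host.strip().rstrip(".").lower()
    try:
        ipaddress.ip_address(host)
        return False  # raw IP literals sidestep name-based policy: deny
    except ValueError:
        pass  # not an IP literal, continue with name matching
    for entry in allowlist:
        entry = entry.lower()
        if entry.startswith("*."):
            # "*.example.org" matches subdomains, not the bare domain.
            if host.endswith(entry[1:]):
                return True
        elif host == entry:
            return True
    return False


# Illustrative allowlist; the entries are assumptions for the example.
ALLOW = ["pypi.org", "*.pythonhosted.org"]

for candidate in ["pypi.org", "files.pythonhosted.org", "paste.example", "203.0.113.7"]:
    print(candidate, egress_allowed(candidate, ALLOW))
```

Exact and dot-anchored suffix matching is deliberate: a plain substring or `endswith("pypi.org")` test would also admit `evilpypi.org` and `pypi.org.attacker.example`.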

The AI-specific problem

AI agents create a new composition problem. A normal sandbox for untrusted code asks: can this program harm the host? An agent sandbox must also ask: can this program influence the agent into using its tools against the host? That is why repository text matters. A README can tell the model to run a curl command. A test failure can print a "fix" that copies credentials. A dependency script can alter files that the agent later reads as trusted context.

The hard case is the "lethal trifecta": private data, untrusted content, and network egress in the same decision context. If a process can read secrets, read attacker-controlled instructions, and talk to the internet, sandbox policy has already given the attacker all three ingredients. Strong designs split stages: fetch untrusted content without secrets, process secrets without network, and publish only sanitized output.

safe stage separation

# Stage 1: fetch public/untrusted input, but no secrets are present.
fetch_untrusted:  network = allow, secrets = none

# Stage 2: process private data, but no network exists.
process_private:  network = deny,  secrets = only those needed

# Stage 3: publish a sanitized result to a narrow destination.
publish_result:   network = allowlist, secrets = publish-token only
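A stage plan like the one above can be checked mechanically for the trifecta. This is a rough sketch: the `Stage` fields and `trifecta_stages` are hypothetical names, and the flags assigned to each stage are my reading of the plan (for instance, that the publish stage sees only sanitized output).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    has_secrets: bool
    reads_untrusted: bool
    has_network: bool


def trifecta_stages(stages):
    """Names of stages that hold secrets, untrusted input, and network at once."""
    return [
        s.name
        for s in stages
        if s.has_secrets and s.reads_untrusted and s.has_network
    ]


# The three-stage plan, with process_private treated pessimistically as
# still handling untrusted content.
PIPELINE = [
    Stage("fetch_untrusted", has_secrets=False, reads_untrusted=True, has_network=True),
    Stage("process_private", has_secrets=True, reads_untrusted=True, has_network=False),
    Stage("publish_result", has_secrets=True, reads_untrusted=False, has_network=True),
]

# The same work collapsed into one stage, for contrast.
ALL_IN_ONE = [Stage("agent_run", has_secrets=True, reads_untrusted=True, has_network=True)]

print(trifecta_stages(PIPELINE))
print(trifecta_stages(ALL_IN_ONE))
```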

What sandboxes defend

A sandbox can defend against accidental writes, opportunistic exfiltration, many prompt-injection payloads, dependency scripts that expect a normal home directory, and a large class of process-level attacks. It can make risky work auditable and reversible. It can let several agents run at once without clobbering each other's checkouts. It can make an honest mistake cheap.

It cannot make arbitrary hostile code safe merely by naming itself a sandbox. Shared-kernel isolation still shares the host kernel. If the workload can reach a vulnerable kernel surface through an allowed syscall, a kernel exploit can become a host escape. Containers, Landlock, seccomp, and namespaces are important, but they are not the same category as a microVM with a separate guest kernel.

Security claims must name their ceiling. "Prevents accidental damage" is a different claim from "contains malicious code" and a different claim again from "safe for mutually hostile tenants." Honest sandbox documentation should say which one it means.

A simple taxonomy

Sandboxes for agents fall into five families. A worktree sandbox isolates edits but not execution. A process sandbox uses host-kernel primitives such as namespaces, seccomp, and Landlock on Linux or Seatbelt on macOS, often through a wrapper such as bubblewrap. A container sandbox packages process isolation with an image and runtime. A user-space kernel such as gVisor intercepts much of the Linux ABI before it reaches the host kernel. A microVM such as Firecracker runs the workload behind a guest kernel and hypervisor boundary.

The categories are not a strict ladder for every workload. Worktrees are excellent for merge workflow and terrible as a security boundary. Process sandboxes are fast and local but share the kernel. Containers are ergonomic but frequently misconfigured. gVisor and Kata Containers improve the kernel boundary at operational cost. MicroVMs raise the isolation ceiling but require VM images, KVM, boot plumbing, and careful network and file integration.

The audit requirement

AI-agent sandboxing is not only prevention. It is also evidence. Reviewers need to know which command ran, under which policy, with which denied operations, which secrets were released, which hosts were contacted, which files changed, and whether the policy was silently downgraded. A sandbox that blocks actions but leaves no reviewable record is useful for containment but weak for engineering governance.

This is where agent sandboxes differ from many classical sandboxes. The output is not just "program returned 0." The output is a proposed code change. The reviewer needs the diff and the story: why this environment existed, what the agent tried, what evidence it saw, and whether any boundary pressure appeared. That is why later parts of this series treat provenance as part of the sandbox rather than an afterthought.
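One way to picture that evidence is a per-run record that travels with the diff. The sketch below is an assumed shape, not the format of any existing tool: `RunRecord`, its field names, and `needs_attention` are hypothetical, and the fields simply mirror the reviewer questions listed above.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class RunRecord:
    command: list
    policy_id: str
    denied: list = field(default_factory=list)            # operations the sandbox refused
    secrets_released: list = field(default_factory=list)  # names only, never values
    hosts_contacted: list = field(default_factory=list)
    files_changed: list = field(default_factory=list)
    policy_downgraded: bool = False

    def to_jsonl(self):
        """Serialize as one JSON line, suitable for an append-only log."""
        return json.dumps(asdict(self), sort_keys=True)


def needs_attention(record):
    """Flag runs where the reviewer should look beyond the diff."""
    return bool(record.denied) or record.policy_downgraded


record = RunRecord(
    command=["pytest", "-q"],
    policy_id="example-policy-v1",  # placeholder identifier
    denied=["connect paste.example:443"],
    files_changed=["src/main.py"],
)
print(record.to_jsonl())
print(needs_attention(record))
```

Recording secret names rather than values keeps the audit log itself from becoming a second place where credentials leak.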

Checklist for evaluating a sandbox

State the adversary: accident, prompt injection, hostile code, or hostile tenant.
List the exact capabilities granted to the workload.
Identify the enforcement primitive for each capability.
Ask whether the workload can modify the policy after start.
Ask whether DNS, direct-to-IP connections, proxies, and Unix sockets are all covered by the network claim.
Ask whether secrets are absent, brokered, or simply passed as environment variables.
Check whether resource limits are kernel-enforced or best-effort.
Check whether denied actions are recorded in a form reviewers can inspect.
Check whether the tool fails closed when a requested boundary cannot be enforced.
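The last item, failing closed, is simple to state in code. In this minimal sketch, `resolve_boundaries` and `BoundaryUnavailable` are hypothetical names, and how a real launcher detects which primitives the host supports is left out as an assumption.

```python
class BoundaryUnavailable(RuntimeError):
    """Raised when a requested boundary cannot be enforced on this host."""


def resolve_boundaries(requested, available):
    """Return the boundaries to enforce, or raise if any is missing.

    Failing closed means the run does not start with a weaker policy
    than the one that was asked for.
    """
    missing = sorted(set(requested) - set(available))
    if missing:
        raise BoundaryUnavailable("cannot enforce: " + ", ".join(missing))
    return sorted(set(requested))


print(resolve_boundaries(["landlock", "seccomp"], {"landlock", "seccomp", "cgroup"}))

try:
    resolve_boundaries(["landlock", "netns"], {"landlock"})
except BoundaryUnavailable as exc:
    print("refused:", exc)
```

The fail-open alternative would log a warning and continue without the missing boundary, which is exactly the silent downgrade a reviewer needs to be able to rule out.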

Further reading

How to Implement a Sandbox

Sandboxing is an engineering claim