Sandboxing AI Agents, Part 2: How to Implement One
Implementation is where most sandbox claims become precise or collapse. This part walks through a rootless Linux design and the checks that keep it honest.
A sandbox implementation is not one mechanism. It is a launch protocol. You prepare a filesystem view, create namespaces, drop privileges, install syscall policy, install resource limits, configure networking, start the workload, observe the workload, and preserve enough evidence to review the result. The order matters because a process that runs before the boundary is complete is not confined by the boundary.
Think of implementation as building a small operating room for one command. The command should not inherit the whole developer laptop. It should receive a prepared directory, a small set of visible system files, a private process view, a controlled network path, a resource budget, and a recorder. The implementation work is making sure those pieces exist before the command gets its first instruction.
Start with a policy type
The first implementation mistake is treating sandbox options as informal flags. Instead, define a policy object with explicit fields and a resolved form. The requested policy is what the user asked for. The resolved policy is what the host proved it can enforce. If the requested claim cannot be resolved, the implementation should refuse rather than silently downgrade.
# Minimum isolation claim. If the host cannot provide it, refuse.
isolation = "process"
# Read access: the worktree plus read-only runtime files needed by tools.
fs.read = ["$WORK", "/usr", "/lib", "/nix"]
# Write access: only the disposable environment worktree.
fs.write = ["$WORK"]
# No outbound network in this profile.
net.mode = "deny"
net.egress = []
# Resource budget: memory, number of processes, and wall-clock time.
limits.memory_mb = 4096
limits.pids = 256
limits.seconds = 900
# No credentials are injected unless explicitly named.
secrets = []
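As a rough illustration, the profile above could be modeled as a typed, immutable object rather than loose flags. This is a sketch, not h5i's actual schema: the class names, the "allowlist" and "host" mode strings, and the digest() helper are assumptions made for the example.

```python
from dataclasses import dataclass, field, asdict
import hashlib
import json


@dataclass(frozen=True)
class Limits:
    memory_mb: int = 4096
    pids: int = 256
    seconds: int = 900


@dataclass(frozen=True)
class Policy:
    # Field names mirror the profile keys; the types are illustrative.
    isolation: str = "process"
    fs_read: tuple = ("$WORK", "/usr", "/lib", "/nix")
    fs_write: tuple = ("$WORK",)
    net_mode: str = "deny"
    net_egress: tuple = ()
    limits: Limits = field(default_factory=Limits)
    secrets: tuple = ()

    def __post_init__(self):
        # Reject contradictory requests instead of guessing what was meant.
        if self.net_mode not in ("deny", "allowlist", "host"):
            raise ValueError(f"unknown net.mode: {self.net_mode!r}")
        if self.net_mode == "deny" and self.net_egress:
            raise ValueError("net.egress must be empty when net.mode is 'deny'")

    def digest(self) -> str:
        # Stable hash of the policy so a run can record what it ran under.
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode()).hexdigest()
```

Freezing the object means the policy that was validated is the policy that gets enforced, and the digest gives the audit record something compact to store.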
A good resolver returns both policy and evidence: Landlock ABI version, whether unprivileged user namespaces are enabled, whether seccomp can install the filter, whether cgroup v2 delegation exists, whether rootless Podman is available, and whether network allowlists can actually be enforced. That evidence should be stored with every run.
# requested: what the user or profile asked for
# host: what the current machine can actually enforce
resolved = {}
if requested.isolation == "process":
    require(host.has_user_namespaces)
    require(host.has_mount_namespaces)
    require(host.has_pid_namespaces)
    require(host.has_seccomp)
    require(host.has_landlock)
    resolved.isolation = "process"
if requested.net.egress is not empty:
    require(host.can_enforce_egress_allowlist)
    resolved.net.egress = pin_dns_and_addresses(requested.net.egress)
if any require(...) failed:
    refuse("requested sandbox claim cannot be enforced")
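A minimal runnable version of that resolver might look like the following. The Host fields stand in for real capability probes, which are not shown; Refused, Resolved, and the evidence keys are hypothetical names, and treating any tier other than "process" as a refusal is an assumption of this sketch.

```python
from dataclasses import dataclass, field


class Refused(Exception):
    """Raised when the host cannot enforce the requested claim."""


@dataclass
class Host:
    # Hypothetical probe results; a real resolver would detect these.
    user_namespaces: bool = False
    mount_namespaces: bool = False
    pid_namespaces: bool = False
    seccomp: bool = False
    landlock_abi: int = 0  # 0 means Landlock is unavailable
    egress_allowlist: bool = False


@dataclass
class Resolved:
    isolation: str
    egress: list
    evidence: dict = field(default_factory=dict)


def resolve_or_refuse(isolation: str, egress: list, host: Host) -> Resolved:
    # Evidence is recorded whether or not resolution succeeds.
    evidence = {
        "user_namespaces": host.user_namespaces,
        "mount_namespaces": host.mount_namespaces,
        "pid_namespaces": host.pid_namespaces,
        "seccomp": host.seccomp,
        "landlock_abi": host.landlock_abi,
        "egress_allowlist": host.egress_allowlist,
    }
    if isolation != "process":
        raise Refused(f"unsupported isolation tier: {isolation!r}")
    required = ["user_namespaces", "mount_namespaces", "pid_namespaces", "seccomp"]
    missing = [name for name in required if not evidence[name]]
    if host.landlock_abi < 1:
        missing.append("landlock")
    if egress and not host.egress_allowlist:
        missing.append("egress_allowlist")
    if missing:
        # Fail closed and name what is missing; never downgrade silently.
        raise Refused("requested sandbox claim cannot be enforced; missing: "
                      + ", ".join(missing))
    return Resolved(isolation, list(egress), evidence)
```

Collecting every missing capability before refusing gives the user one complete error instead of a series of single failures.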
Workspace isolation
AI coding agents need a place to make changes. The lowest-cost answer is a separate checkout, usually a Git worktree. This prevents accidental edits to the developer's active tree and gives each agent its own branch, index, and working directory. It is not a security boundary by itself: the process still has the user's host permissions unless a stronger tier is applied.
The major worktree trap is the shared Git object store. A worktree's .git entry points back into the repository's common Git directory. If a confined process can follow that pointer, it may reach refs, hooks, config, and object storage outside the intended workspace. A process-tier sandbox should hide or replace .git, then have the host-side supervisor compute diffs and commit through a path-checked staging step.
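To make the pointer concrete: in a linked worktree, .git is a small file containing a "gitdir:" line rather than a directory. The sketch below reads that file and reports whether the target lies outside the worktree. The function names are hypothetical, and the check covers only this one pointer, not hooks or config reachable by other routes.

```python
from pathlib import Path
from typing import Optional


def gitdir_pointer(worktree: Path) -> Optional[Path]:
    """Return the directory a linked worktree's .git file points to, if any."""
    dot_git = worktree / ".git"
    if not dot_git.is_file():
        # Missing, or a directory (an ordinary clone rather than a linked worktree).
        return None
    text = dot_git.read_text().strip()
    if not text.startswith("gitdir:"):
        return None
    target = Path(text[len("gitdir:"):].strip())
    if not target.is_absolute():
        target = worktree / target
    return target.resolve()


def pointer_escapes(worktree: Path) -> bool:
    """True when the .git pointer leads outside the worktree directory."""
    target = gitdir_pointer(worktree)
    if target is None:
        return False
    root = worktree.resolve()
    return target != root and root not in target.parents
```

A supervisor could run a check like this before launch and refuse to expose the worktree's .git entry whenever it reports an escape.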
A worktree alone does not stop a command from reading ~/.ssh, opening the network, or inspecting host processes. Treat it as the file workspace layer, then add execution confinement around commands.
Filesystem confinement
On Linux, the modern unprivileged primitive for per-process filesystem access control is Landlock. Landlock is allowlist-oriented. You grant read and write rights to specific trees. You cannot grant a parent directory and then subtract one child. That single detail shapes the whole design: do not grant the repository root and hope to deny .git; grant the worktree and selected system paths.
A practical file policy usually grants write access to $WORK, read-only access to runtime paths such as /usr, /lib, /lib64, /bin, /etc/ssl, and maybe language-store paths such as /nix. Everything sensitive in the home directory is absent by default. If an agent runtime needs its own state, grant only that runtime's directory and only in a profile that also controls egress.
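Because an allowlist cannot subtract a child from a granted parent, it can help to lint the grant list before anything is applied. The following sketch flags any grant that equals or contains a protected path. The function name and the example paths are hypothetical, and the comparison is purely lexical, so symlinks would need to be resolved first.

```python
from pathlib import PurePosixPath


def overbroad_grants(grants, protected):
    """Return (grant, protected_path) pairs where a grant covers a protected path."""
    hits = []
    for grant in map(PurePosixPath, grants):
        for secret in map(PurePosixPath, protected):
            # A grant on a directory covers everything beneath it.
            if secret == grant or grant in secret.parents:
                hits.append((str(grant), str(secret)))
    return hits


# Granting the repository root would expose its .git directory.
bad = overbroad_grants(["/home/dev/repo"], ["/home/dev/repo/.git"])

# Granting only a worktree and system paths leaves the protected paths uncovered.
good = overbroad_grants(
    ["/home/dev/worktrees/agent-1", "/usr"],
    ["/home/dev/repo/.git", "/home/dev/.ssh"],
)
```

A resolver could treat a non-empty result as a policy error and refuse to launch.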
Filesystem policy must also handle path escape during review. Symlinks, hardlinks, nested Git repositories, submodule gitdirs, and .. traversal all matter. If the host commits changes after the workload exits, every staged path should be canonicalized and rejected when it escapes $WORK. Treat this as part of the sandbox boundary, not as cleanup.
for changed_path in diff_from_worktree():
    real_path = canonicalize($WORK / changed_path)
    if not real_path.starts_with(canonicalize($WORK)):
        reject("path escapes sandbox worktree")
    if path_is_nested_gitdir_or_submodule(real_path):
        reject("nested repository boundary needs explicit handling")
    stage_for_commit(real_path)
Process and syscall confinement
Seccomp limits the Linux syscall surface. The strongest model is an allowlist: only syscalls needed by the workload are permitted. Many developer sandboxes start with a denylist because language toolchains need a broad surface and an allowlist takes time to tune. A denylist is still useful when it blocks obvious escalation tools: mount, ptrace, bpf, module loading, keyrings, and dangerous namespace operations.
Seccomp should be installed after PR_SET_NO_NEW_PRIVS. No-new-privs prevents later exec transitions from gaining privilege through setuid binaries or file capabilities. Drop Linux capabilities where possible. Use a private PID namespace so the workload cannot inspect or signal host processes, and mount a private /proc for that namespace rather than exposing the host process table.
| Primitive | Beginner translation | Example denial |
|---|---|---|
| seccomp | Filter the questions a process may ask the kernel. | deny mount() or bpf() |
| no-new-privs | Do not let future execs gain more privilege than this process has now. | setuid helper cannot elevate |
| PID namespace | Show the workload its own small process table. | cannot inspect host /proc |
| capabilities | Split root-like power into smaller switches and turn them off. | no CAP_SYS_ADMIN |
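As a small illustration of the no-new-privs step, the sketch below sets the bit in a child process through prctl(2) before exec and then asks the kernel whether it is set. It is Linux-only, calls libc through ctypes, and does not install a seccomp filter, which would need a BPF program or a libseccomp binding. The helper names and the PROBE string are made up for this example.

```python
import ctypes
import subprocess
import sys

PR_SET_NO_NEW_PRIVS = 38  # values from <linux/prctl.h>
PR_GET_NO_NEW_PRIVS = 39

_libc = ctypes.CDLL(None, use_errno=True)


def set_no_new_privs() -> None:
    """Set the no-new-privs bit on the calling thread. It cannot be cleared."""
    zero = ctypes.c_ulong(0)
    rc = _libc.prctl(PR_SET_NO_NEW_PRIVS, ctypes.c_ulong(1), zero, zero, zero)
    if rc != 0:
        raise OSError(ctypes.get_errno(), "prctl(PR_SET_NO_NEW_PRIVS) failed")


def run_with_no_new_privs(argv):
    # preexec_fn runs in the child after fork and before exec, so the bit
    # applies to the workload and is inherited by anything it execs later.
    return subprocess.run(argv, preexec_fn=set_no_new_privs,
                          capture_output=True, text=True)


# Probe executed inside the child: prints the kernel's view of the bit.
PROBE = ("import ctypes; z = ctypes.c_ulong(0); "
         "print(ctypes.CDLL(None).prctl(39, z, z, z, z))")

if __name__ == "__main__":
    result = run_with_no_new_privs([sys.executable, "-c", PROBE])
    print(result.stdout.strip())
```

Python documents preexec_fn as unsafe in multithreaded parents, so a production launcher would more likely do this in a dedicated helper binary.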
AI agent sandbox network confinement
Network policy has three common modes. Deny creates an empty or loopback-only network namespace. Host leaves networking unrestricted and should be labeled as such. Allowlist permits specific destinations. Allowlist is the most useful mode for agents, but it is also the easiest to implement incorrectly.
A proxy-only allowlist constrains only programs that route their traffic through the proxy. It does not stop a program from opening a direct socket unless the runtime also prevents direct network access. DNS filtering alone is also insufficient: a program can connect to an IP literal, reuse a cached address, or encode data into DNS queries if port 53 is open. A serious egress design needs packet-level enforcement plus name resolution that cannot become a side channel.
One rootless Linux pattern is: create a network namespace, attach a user-space NAT such as slirp4netns, install default-drop nftables rules inside the namespace, resolve allowlisted hostnames at startup, bind a private /etc/hosts, and do not open general DNS. Then block AF_NETLINK after setup so the workload cannot rewrite routes or firewall rules.
The detail to watch is bypass shape. If the policy says "only github.com," then a direct connection to 140.82.112.4 must not work unless that address was pinned for the allowed host. If the policy says "no DNS except pinned names," then a hostname such as secret.attacker.example should not even resolve. If the workload can edit firewall rules after launch, the allowlist is only advisory.
allowlisted host -> resolve once -> pin address -> nftables allow -> connect
other hostname -> absent from private hosts file -> no DNS path -> fail
raw IP literal -> packet hits default drop -> fail
firewall rewrite -> AF_NETLINK denied by socket gate -> fail
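The first three rows of that flow can be modeled in a few lines, which is useful for generating the private hosts file and for stating expected outcomes in tests. This is a model of the intended verdicts, not enforcement: the packet filter still has to be installed separately. The function names are hypothetical, and the resolved addresses are assumed to have been looked up once at startup by the supervisor.

```python
import ipaddress


def pin(resolved):
    """resolved maps hostname -> list of IP strings obtained once at startup."""
    pinned = {}
    for host, addrs in resolved.items():
        # ip_address() validates each entry and normalizes its text form.
        pinned[host] = sorted({str(ipaddress.ip_address(a)) for a in addrs})
    return pinned


def hosts_file(pinned):
    """Render the private /etc/hosts content for the sandbox."""
    lines = ["127.0.0.1 localhost"]
    for host in sorted(pinned):
        for addr in pinned[host]:
            lines.append(f"{addr} {host}")
    return "\n".join(lines) + "\n"


def verdict(target, pinned):
    """Expected outcome of a connection attempt under the pinned policy."""
    allowed_ips = {a for addrs in pinned.values() for a in addrs}
    try:
        ip = str(ipaddress.ip_address(target))
    except ValueError:
        # A hostname: only names in the private hosts file resolve at all.
        return "connect" if target in pinned else "fail: no DNS path"
    return "connect" if ip in allowed_ips else "fail: default drop"
```

Integration tests can then compare what the sandbox actually does against verdict() for each probe target.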
Resources and time
Resource limits are part of security because denial of service is a security failure. Use cgroup v2 when available: memory.max, pids.max, CPU weight or quota, and I/O limits if needed. Use RLIMIT_FSIZE to prevent huge files, RLIMIT_CPU as a backstop, and a wall-clock supervisor timer because CPU limits do not catch every hang.
Limits should be visible in the audit log. If a test run failed because the sandbox killed it at 900 seconds or memory pressure terminated a process, the reviewer needs to distinguish that from a semantic test failure.
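One way to keep that distinction visible is to have the supervisor classify each outcome as it happens. The sketch below applies RLIMIT_FSIZE in the child, enforces a wall-clock timeout from the parent, and returns a record that separates sandbox kills from ordinary failures. It is Unix-only, the record keys are invented for the example, and cgroup limits for memory and PIDs are not shown.

```python
import resource
import signal
import subprocess
import sys


def run_with_budget(argv, seconds, max_file_bytes):
    def limit_child():
        # Runs in the child between fork and exec.
        resource.setrlimit(resource.RLIMIT_FSIZE, (max_file_bytes, max_file_bytes))

    record = {"argv": list(argv), "limit_seconds": seconds,
              "limit_file_bytes": max_file_bytes}
    try:
        done = subprocess.run(argv, preexec_fn=limit_child, timeout=seconds,
                              capture_output=True)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising.
        record.update(outcome="killed_by_sandbox", reason="wall_clock_timeout")
        return record
    if done.returncode < 0:
        # A negative return code means the child died from a signal.
        record.update(outcome="killed_by_signal",
                      reason=signal.Signals(-done.returncode).name)
    elif done.returncode == 0:
        record.update(outcome="ok", reason="exit 0")
    else:
        record.update(outcome="workload_failure", reason=f"exit {done.returncode}")
    return record


if __name__ == "__main__":
    sleeper = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_budget(sleeper, seconds=1, max_file_bytes=1 << 20))
```

A reviewer reading the record can tell a timeout from a failing test without opening the raw output.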
Secrets
Passing secrets as environment variables is convenient and dangerous. Child processes inherit them. Debug output prints them. /proc can reveal them if namespaces and procfs are wrong. A stronger pattern is a secrets broker: the workload asks for a named secret, the broker checks policy, releases the value only for the intended scope, and redacts matching fingerprints from captured output.
The broker should log the secret name, a fingerprint, the command scope, and whether release was allowed. It should not log the secret value. If the sandbox permits network egress, the policy should be stricter about which secrets are present; a stage with secrets and broad network is the dangerous combination.
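A toy in-process broker shows the shape of that contract. Everything here is an assumption made for illustration: the class and field names, the truncated SHA-256 fingerprint, and redaction by literal value match, which is simpler than the fingerprint matching described above. A real broker would live outside the sandbox and hand values over a pipe or file descriptor.

```python
import hashlib
from dataclasses import dataclass, field


def fingerprint(value: str) -> str:
    # Short, non-reversible identifier that is safe to write to logs.
    return hashlib.sha256(value.encode()).hexdigest()[:12]


@dataclass
class SecretsBroker:
    values: dict   # secret name -> secret value
    allowed: dict  # secret name -> set of command scopes that may read it
    log: list = field(default_factory=list)

    def request(self, name: str, scope: str) -> str:
        known = name in self.values
        released = known and scope in self.allowed.get(name, set())
        # Log the name, fingerprint, scope, and decision, never the value.
        self.log.append({
            "secret": name,
            "fingerprint": fingerprint(self.values[name]) if known else None,
            "scope": scope,
            "released": released,
        })
        if not released:
            raise PermissionError(f"secret {name!r} not released to scope {scope!r}")
        return self.values[name]

    def redact(self, text: str) -> str:
        # Replace any literal occurrence of a known secret in captured output.
        for name, value in self.values.items():
            if value:
                text = text.replace(value, f"[redacted:{name}:{fingerprint(value)}]")
        return text
```

Denied requests are logged as well, since an unexpected request for a secret is itself evidence worth reviewing.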
Observation and evidence
Every run should produce a capture: command, arguments, working directory, start time, exit status, stdout and stderr pointers, policy digest, resource summary, egress summary, denied actions, and redactions. Large output should be content-addressed outside the model's context window, but recoverable. The summary should be small enough for an agent or reviewer to scan.
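A compact way to get both properties is to store output by content hash and keep only the hash in the summary. The sketch below uses an in-memory store to stay self-contained. BlobStore, Capture, and capture_run are hypothetical names, and only a subset of the fields listed above is included.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


class BlobStore:
    """In-memory content-addressed store; a real one would write to disk."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        key = "sha256:" + hashlib.sha256(data).hexdigest()
        self._blobs[key] = data
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]


@dataclass
class Capture:
    command: list
    cwd: str
    exit_status: int
    stdout_ref: str
    stderr_ref: str
    policy_digest: str
    denied: list

    def summary(self) -> str:
        # Small JSON document: pointers to output, not the output itself.
        return json.dumps(asdict(self), sort_keys=True)


def capture_run(store, command, cwd, exit_status, stdout, stderr,
                policy_digest, denied):
    return Capture(command, cwd, exit_status,
                   store.put(stdout), store.put(stderr),
                   policy_digest, denied)
```

The summary stays the same size whether the command printed ten bytes or ten megabytes, and the full output remains recoverable through the stored reference.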
This evidence is not only for debugging. It is how the sandbox becomes reviewable. A diff produced under a weak policy should not be reviewed the same way as a diff produced under deny-network confinement. A run with blocked raw-IP egress deserves more scrutiny than a run that only compiled.
Launch order
The launch sequence is easiest to understand as "host prepares, child enters, child loses power, workload starts." Anything that requires broad authority must happen before the untrusted command runs. Anything the command could use to escape must be removed or filtered before exec.
- Create or select the isolated workspace and freeze the base revision.
- Resolve policy against host capabilities; refuse if the claim cannot be enforced.
- Prepare filesystem view, hiding shared Git state and sensitive host paths.
- Create user, mount, PID, and network namespaces as required.
- Set up network policy before the workload starts.
- Install resource limits and cgroup membership.
- Set no-new-privs, drop capabilities, and install seccomp or supervisor gates.
- Start the workload and capture output, denials, resources, and egress decisions.
- After exit, compute diff from the workspace filesystem using escape-checked paths.
- Store policy, evidence, and diff together so review can reconstruct the run.
manifest = create_env_manifest(base_commit, requested_policy)
resolved = resolve_or_refuse(requested_policy, host_capabilities())
work = create_git_worktree(base_commit)
fs_view = prepare_mount_view(work, resolved.fs)
network = prepare_network_namespace(resolved.net)
resources = create_cgroup(resolved.limits)
child = fork()
if child == 0:
    # Child: enter the boundary, lose power, then exec the workload.
    enter_user_mount_pid_network_namespaces()
    mount_private_proc()
    join_cgroup(resources)
    apply_landlock(resolved.fs)
    set_no_new_privs()
    install_seccomp(resolved.syscalls)
    exec(command)
# Parent: observe the child, then review what it changed.
capture = capture_exit_status_output_denials(child)
diff = compute_escape_checked_diff(work)
store_evidence(manifest, resolved, capture, diff)
Common implementation bugs
| Bug | Why it matters | Fix |
|---|---|---|
| Silent downgrade | User asks for network allowlist; host runs unrestricted | fail closed and print missing capability |
| Grant repo root to Landlock | .git and hooks become reachable | grant worktree only; host mediates commits |
| Proxy-only network policy | Raw sockets bypass host allowlist | combine proxy with netns packet filtering or deny direct sockets |
| Host /proc mounted | Secrets and host process metadata leak | private PID namespace and private procfs |
| Secrets in argv | Process list and logs expose values | broker or file descriptor handoff; redact captures |
| No path canonicalization on apply | Symlink escape can write outside worktree | canonicalize and reject escaped paths |
Testing a sandbox
Unit tests should exercise policy parsing, path normalization, and resolver fail-closed cases. Integration tests should attempt real denied operations: read ~/.ssh, write outside $WORK, connect to a raw IP, resolve an off-allowlist hostname, open AF_NETLINK, inspect host /proc/1/environ, fork past the PID limit, and fill a file past the file-size limit. Good tests prove both the refusal and the evidence record.
The most useful negative tests are boring. They assert that a command fails in the way the security model predicts. If a sandbox cannot test its own denials, reviewers have to trust prose.
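As one example of a boring negative test, the path-escape check can be exercised against a temporary directory tree. The sketch below compares whole path components with os.path.commonpath instead of string prefixes, so a sibling directory such as work-evil is not mistaken for part of work. The helper names and directory layout are invented for the test, and hardlinks and nested Git directories are not covered.

```python
import os
import tempfile


def escapes(work: str, changed_path: str) -> bool:
    """True when changed_path, resolved against work, lands outside work."""
    root = os.path.realpath(work)
    real = os.path.realpath(os.path.join(root, changed_path))
    # commonpath compares path components, so "/x/work-evil" is not under "/x/work".
    return os.path.commonpath([root, real]) != root


def demo():
    with tempfile.TemporaryDirectory() as base:
        work = os.path.join(base, "work")
        os.makedirs(os.path.join(work, "src"))
        # A sibling directory whose name shares the worktree's prefix.
        evil = os.path.join(base, "work-evil")
        os.makedirs(evil)
        # A symlink inside the worktree that points outside it.
        os.symlink(evil, os.path.join(work, "link"))
        return {
            "src/main.py": escapes(work, "src/main.py"),
            "../work-evil/x": escapes(work, "../work-evil/x"),
            "link/x": escapes(work, "link/x"),
            "/etc/passwd": escapes(work, "/etc/passwd"),
        }


if __name__ == "__main__":
    for path, escaped in demo().items():
        print(f"{path}: {'rejected' if escaped else 'accepted'}")
```

Each case asserts one specific way a path can leave the worktree, so a regression in the check shows up as a named failure instead of a silent write outside $WORK.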
Implementation decides the claim
h5i's sandbox work is designed around fail-closed resolution, layered confinement, and reviewable captures.