Sandbox Series · Part 2 · 2026-06-12

Sandboxing AI Agents, Part 2: How to Implement One

Implementation is where most sandbox claims become precise or collapse. This part walks through a rootless Linux design and the checks that keep it honest.

A sandbox implementation is not one mechanism. It is a launch protocol. You prepare a filesystem view, create namespaces, drop privileges, install syscall policy, install resource limits, configure networking, start the workload, observe the workload, and preserve enough evidence to review the result. The order matters because a process that runs before the boundary is complete is not confined by the boundary.

Think of implementation as building a small operating room for one command. The command should not inherit the whole developer laptop. It should receive a prepared directory, a small set of visible system files, a private process view, a controlled network path, a resource budget, and a recorder. The implementation work is making sure those pieces exist before the command gets its first instruction.

Series map. Review the foundations, then continue to sandbox comparison and h5i's env design.

Start with a policy type

The first implementation mistake is treating sandbox options as informal flags. Instead, define a policy object with explicit fields and a resolved form. The requested policy is what the user asked for. The resolved policy is what the host proved it can enforce. If the requested claim cannot be resolved, the implementation should refuse rather than silently downgrade.

policy.toml
# Minimum isolation claim. If the host cannot provide it, refuse.
isolation = "process"

# Read access: the worktree plus read-only runtime files needed by tools.
fs.read   = ["$WORK", "/usr", "/lib", "/nix"]

# Write access: only the disposable environment worktree.
fs.write  = ["$WORK"]

# No outbound network in this profile.
net.mode  = "deny"
net.egress = []

# Resource budget: memory, number of processes, and wall-clock time.
limits.memory_mb = 4096
limits.pids = 256
limits.seconds = 900

# No credentials are injected unless explicitly named.
secrets = []

A good resolver returns both policy and evidence: Landlock ABI version, whether unprivileged user namespaces are enabled, whether seccomp can install the filter, whether cgroup v2 delegation exists, whether rootless Podman is available, and whether network allowlists can actually be enforced. That evidence should be stored with every run.

policy resolver pseudo-code
# requested: what the user or profile asked for
# host: what the current machine can actually enforce

resolved = {}

if requested.isolation == "process":
    require(host.has_user_namespaces)
    require(host.has_mount_namespaces)
    require(host.has_pid_namespaces)
    require(host.has_seccomp)
    require(host.has_landlock)
    resolved.isolation = "process"

if requested.net.egress is not empty:
    require(host.can_enforce_egress_allowlist)
    resolved.net.egress = pin_dns_and_addresses(requested.net.egress)

if any require(...) failed:
    refuse("requested sandbox claim cannot be enforced")
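
The resolver above can be sketched in Python. This is a minimal illustration, not h5i's actual resolver: the `Requested` and `Resolved` types, the capability names in `PROCESS_TIER`, and the shape of the `host` dictionary are all assumptions made for the example. It shows the two properties the text asks for: the result carries evidence, and a claim that cannot be enforced is refused instead of downgraded.

```python
from dataclasses import dataclass, field


class SandboxRefused(Exception):
    """Raised when the requested claim cannot be enforced on this host."""


@dataclass(frozen=True)
class Requested:
    isolation: str = "process"
    net_mode: str = "deny"
    net_egress: tuple = ()


@dataclass(frozen=True)
class Resolved:
    isolation: str
    net_mode: str
    net_egress: tuple
    evidence: dict = field(default_factory=dict)


# Capabilities the "process" tier depends on (illustrative names).
PROCESS_TIER = ("user_namespaces", "mount_namespaces", "pid_namespaces",
                "seccomp", "landlock")


def resolve_or_refuse(requested: Requested, host: dict) -> Resolved:
    """Return a resolved policy plus evidence, or refuse. Never downgrade."""
    if requested.isolation != "process":
        raise SandboxRefused(f"unknown isolation tier: {requested.isolation!r}")
    needed = list(PROCESS_TIER)
    if requested.net_mode == "allowlist" or requested.net_egress:
        needed.append("egress_allowlist")

    missing = [cap for cap in needed if not host.get(cap)]
    if missing:
        raise SandboxRefused("cannot enforce: missing " + ", ".join(missing))

    # Keep what the host reported for each capability the claim relied on.
    evidence = {cap: host[cap] for cap in needed}
    return Resolved(requested.isolation, requested.net_mode,
                    tuple(requested.net_egress), evidence)
```

Storing `Resolved.evidence` with the run is what lets a later reviewer see which host checks the claim rested on.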

Workspace isolation

AI coding agents need a place to make changes. The lowest-cost answer is a separate checkout, usually a Git worktree. This prevents accidental edits to the developer's active tree and gives each agent its own branch, index, and working directory. It is not a security boundary by itself: the process still has the user's host permissions unless a stronger tier is applied.

The major worktree trap is the shared Git object store. A worktree's .git entry points back into the repository's common Git directory. If a confined process can follow that pointer, it may reach refs, hooks, config, and object storage outside the intended workspace. A process-tier sandbox should hide or replace .git, then have the host-side supervisor compute diffs and commit through a path-checked staging step.
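
To make the pointer concrete: in a linked worktree, .git is a small file whose first line starts with `gitdir: ` and names a directory inside the main repository. The sketch below reads that pointer and reports whether it leads outside the worktree. The function names are hypothetical, and a host-side supervisor would run a check like this, not the confined workload.

```python
from pathlib import Path
from typing import Optional


def linked_gitdir(worktree: Path) -> Optional[Path]:
    """Return the directory a worktree's .git file points at, else None."""
    dot_git = worktree / ".git"
    if not dot_git.is_file():
        return None  # a directory (full clone) or nothing at all
    lines = dot_git.read_text(encoding="utf-8").splitlines()
    prefix = "gitdir: "
    if not lines or not lines[0].startswith(prefix):
        return None
    target = Path(lines[0][len(prefix):].strip())
    # Joining handles both cases: an absolute target replaces the left side,
    # a relative target is resolved against the worktree.
    return (worktree / target).resolve()


def pointer_escapes(worktree: Path) -> bool:
    """True when the .git pointer leads outside the worktree."""
    target = linked_gitdir(worktree)
    if target is None:
        return False
    root = worktree.resolve()
    return target != root and root not in target.parents
```

For an ordinary linked worktree `pointer_escapes` returns True, which is the reason the .git entry has to be hidden or replaced before the workload starts.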

Worktree is not sandbox. A Git worktree protects your active checkout from messy edits. It does not stop a process from reading ~/.ssh, opening the network, or inspecting host processes. Treat it as the file workspace layer, then add execution confinement around commands.

Filesystem confinement

On Linux, the modern unprivileged primitive for per-process filesystem access control is Landlock. Landlock is allowlist-oriented. You grant read and write rights to specific trees. You cannot grant a parent directory and then subtract one child. That single detail shapes the whole design: do not grant the repository root and hope to deny .git; grant the worktree and selected system paths.

A practical file policy usually grants write access to $WORK, read-only access to runtime paths such as /usr, /lib, /lib64, /bin, /etc/ssl, and maybe language-store paths such as /nix. Everything sensitive in the home directory is absent by default. If an agent runtime needs its own state, grant only that runtime's directory and only in a profile that also controls egress.
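
The allowlist shape can be modeled in a few lines. This is a pure-Python model of the semantics described above, not a call into Landlock itself; the `covered` function, the `READ`/`WRITE` labels, and the example paths are assumptions for illustration. Paths are assumed to be canonical already, so anything containing `..` is rejected.

```python
from pathlib import PurePosixPath

READ, WRITE = "read", "write"


def covered(path: str, grants: dict, access: str) -> bool:
    """True if `path` sits at or beneath a tree granted `access`.

    A grant covers its whole subtree. There is no way to express
    "this tree except that child", so the only safe move is to grant less.
    """
    target = PurePosixPath(path)
    if ".." in target.parts:
        return False  # refuse non-canonical input instead of guessing
    for root, rights in grants.items():
        root_path = PurePosixPath(root)
        if access in rights and (target == root_path or root_path in target.parents):
            return True
    return False


# Granting the worktree and selected system paths.
narrow = {"/srv/env/work": {READ, WRITE}, "/usr": {READ}}
# Granting the repository root: .git comes along with it.
broad = {"/srv/repo": {READ, WRITE}}
```

With `narrow`, a write to `/usr/bin/git` or a read of `/home/dev/.ssh/id_ed25519` is not covered. With `broad`, `covered("/srv/repo/.git/hooks/pre-commit", broad, WRITE)` is True, which is the failure the text warns about.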

Filesystem policy must also handle path escape during review. Symlinks, hardlinks, nested Git repositories, submodule gitdirs, and .. traversal all matter. If the host commits changes after the workload exits, every staged path should be canonicalized and rejected when it escapes $WORK. Treat this as part of the sandbox boundary, not as cleanup.

path escape check
for changed_path in diff_from_worktree():
    real_path = canonicalize($WORK / changed_path)

    if not real_path.starts_with(canonicalize($WORK)):
        reject("path escapes sandbox worktree")

    if path_is_nested_gitdir_or_submodule(real_path):
        reject("nested repository boundary needs explicit handling")

    stage_for_commit(real_path)
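
A runnable version of the canonicalization step, as a sketch under stated assumptions: `checked_path`, `PathEscape`, and `has_extra_hardlinks` are hypothetical names, and nested-repository detection is left out. Symlinks are followed by `Path.resolve()`. Hardlinks cannot be detected by canonicalization, so the second helper flags regular files with more than one link for separate handling.

```python
import os
import stat
from pathlib import Path


class PathEscape(Exception):
    """Raised when a changed path resolves outside the worktree."""


def checked_path(work: Path, changed: str) -> Path:
    """Resolve `changed` under `work`, following symlinks, and reject escapes."""
    root = work.resolve()
    real = (root / changed).resolve()
    # Compare whole path components. A string prefix test would accept
    # /srv/work-evil as if it were inside /srv/work.
    if real != root and root not in real.parents:
        raise PathEscape(f"{changed!r} resolves outside the worktree")
    return real


def has_extra_hardlinks(path: Path) -> bool:
    """Regular files with more than one link may alias a file elsewhere."""
    info = os.lstat(path)
    return stat.S_ISREG(info.st_mode) and info.st_nlink > 1
```

An absolute `changed` value replaces the root when joined, so it is rejected by the same containment test as a `..` traversal.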

Process and syscall confinement

Seccomp limits the Linux syscall surface. The strongest model is an allowlist: only syscalls needed by the workload are permitted. Many developer sandboxes start with a denylist because language toolchains need a broad surface and an allowlist takes time to tune. A denylist is still useful when it blocks obvious escalation tools: mount, ptrace, bpf, module loading, keyrings, and dangerous namespace operations.

Seccomp should be installed after PR_SET_NO_NEW_PRIVS. No-new-privs prevents later exec transitions from gaining privilege through setuid binaries or file capabilities. Drop Linux capabilities where possible. Use a private PID namespace so the workload cannot inspect or signal host processes, and mount a private /proc for that namespace rather than exposing the host process table.

| Primitive | Beginner translation | Example denial |
| --- | --- | --- |
| seccomp | Filter the questions a process may ask the kernel. | deny mount() or bpf() |
| no-new-privs | Do not let future execs gain more privilege than this process has now. | setuid helper cannot elevate |
| PID namespace | Show the workload its own small process table. | cannot inspect host /proc |
| capabilities | Split root-like power into smaller switches and turn them off. | no CAP_SYS_ADMIN |

Denylist seccomp is not a hostile-code proof. It reduces known-dangerous surface. It does not prove that every permitted syscall is safe against a future kernel bug. For malicious code with a kernel exploit budget, use a separate-kernel boundary such as a microVM.
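
One cheap way to check that these primitives were actually applied is to read the workload's own /proc/<pid>/status, which on recent Linux kernels reports `NoNewPrivs`, `Seccomp` (0 disabled, 1 strict, 2 filter), and `CapEff` (effective capabilities as a hex mask). The sketch below parses that text. The function names are hypothetical, and how the status text is obtained from inside the sandbox is left to the caller.

```python
def confinement_flags(status_text: str) -> dict:
    """Extract confinement-related fields from /proc/<pid>/status text."""
    wanted = {"NoNewPrivs", "Seccomp", "CapEff"}
    found = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key in wanted:
            found[key] = value.strip()
    return found


def looks_confined(flags: dict) -> bool:
    """True when no-new-privs is set, a seccomp filter is active,
    and no effective capabilities remain. Missing fields count as failure."""
    return (
        flags.get("NoNewPrivs") == "1"
        and flags.get("Seccomp") == "2"              # 2 means filter mode
        and int(flags.get("CapEff", "1"), 16) == 0   # all capability bits clear
    )
```

A launcher could run this as a self-check in the child just before exec and record the flags in the run's evidence. It confirms that a filter is installed, not that the filter is a good one.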

AI agent sandbox network confinement

Network policy has three common modes. Deny creates an empty or loopback-only network namespace. Host leaves networking unrestricted and should be labeled as such. Allowlist permits specific destinations. Allowlist is the most useful mode for agents, but it is also the easiest to implement incorrectly.

A proxy-only allowlist constrains only the programs that choose to use the proxy. It does not stop a program from opening a direct socket unless the runtime also prevents direct network access. DNS filtering alone is also insufficient: a program can connect to an IP literal, reuse a cached address, or encode data into DNS queries if port 53 is open. A serious egress design needs packet-level enforcement plus name resolution that cannot become a side channel.

One rootless Linux pattern is: create a network namespace, attach a user-space NAT such as slirp4netns, install default-drop nftables rules inside the namespace, resolve allowlisted hostnames once at startup, bind-mount a private /etc/hosts that contains only those pinned names, and do not open general DNS. Then block AF_NETLINK after setup so the workload cannot rewrite routes or firewall rules.

The detail to watch is bypass shape. If the policy says "only github.com," then a direct connection to 140.82.112.4 must not work unless that address was pinned for the allowed host. If the policy says "no DNS except pinned names," then a hostname such as secret.attacker.example should not even resolve. If the workload can edit firewall rules after launch, the allowlist is only advisory.

egress decision model
allowlisted host  -> resolve once -> pin address -> nftables allow -> connect
other hostname    -> absent from private hosts file -> no DNS path -> fail
raw IP literal    -> packet hits default drop -> fail
firewall rewrite  -> AF_NETLINK denied by socket gate -> fail
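
The first three rows of the decision model can be expressed as a small pure function, which is handy for unit-testing policy logic before any packet filter exists. This is only a model: real enforcement lives in the nftables rules and the private hosts file. The names `egress_verdict` and `pinned_addresses` are hypothetical, and the hostnames and addresses are documentation placeholders.

```python
import ipaddress
from typing import Dict, List


def pinned_addresses(resolved_at_launch: Dict[str, List[str]]) -> Dict[str, str]:
    """Map each pinned address back to the allowlisted host it was resolved for."""
    pins = {}
    for host, addresses in resolved_at_launch.items():
        for address in addresses:
            pins[str(ipaddress.ip_address(address))] = host
    return pins


def egress_verdict(destination: str, resolved_at_launch: Dict[str, List[str]]) -> str:
    """Return 'allow' or a 'deny: reason' string for one connection attempt."""
    try:
        address = str(ipaddress.ip_address(destination))
    except ValueError:
        # A hostname: only names in the private hosts file resolve at all.
        if destination.lower() in resolved_at_launch:
            return "allow"
        return "deny: name absent from private hosts file"
    # An IP literal: allowed only if it was pinned for an allowlisted host.
    if address in pinned_addresses(resolved_at_launch):
        return "allow"
    return "deny: address not pinned, default drop"
```

With `{"git.example": ["192.0.2.10"]}` as the launch-time resolution, `git.example` and `192.0.2.10` are allowed, while `203.0.113.5` and `secret.attacker.example` are denied for different reasons, and each reason can go into the egress summary.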

Resources and time

Resource limits are part of security because denial of service is a security failure. Use cgroup v2 when available: memory.max, pids.max, CPU weight or quota, and I/O limits if needed. Use RLIMIT_FSIZE to prevent huge files, RLIMIT_CPU as a backstop, and a wall-clock supervisor timer because CPU limits do not catch every hang.

Limits should be visible in the audit log. If a test run failed because the sandbox killed it at 900 seconds or memory pressure terminated a process, the reviewer needs to distinguish that from a semantic test failure.
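
A sketch of the rlimit and wall-clock parts, using only the Python standard library on a Unix host. It does not cover cgroup v2 (memory.max, pids.max), which needs delegation set up by the launcher. `run_limited` and `classify` are hypothetical names. The point of `classify` is the audit requirement above: a limit kill gets its own label.

```python
import resource
import subprocess
import sys


def _apply_rlimits(cpu_seconds: int, max_file_bytes: int):
    """Build a function that runs in the child, after fork and before exec."""
    def apply():
        for which, wanted in ((resource.RLIMIT_CPU, cpu_seconds),
                              (resource.RLIMIT_FSIZE, max_file_bytes)):
            _, hard = resource.getrlimit(which)
            if hard != resource.RLIM_INFINITY:
                wanted = min(wanted, hard)  # a limit cannot be raised past hard
            resource.setrlimit(which, (wanted, wanted))
    return apply


def classify(returncode, timed_out: bool) -> str:
    """Label the outcome so a limit kill is not mistaken for a test failure."""
    if timed_out:
        return "killed: wall-clock limit"
    if returncode < 0:
        return f"killed: signal {-returncode}"
    return "ok" if returncode == 0 else f"failed: exit {returncode}"


def run_limited(cmd, wall_seconds: float, cpu_seconds: int, max_file_bytes: int) -> dict:
    """Run `cmd` under rlimits and a supervisor timer, then return an audit record."""
    limits = {"wall_seconds": wall_seconds, "cpu_seconds": cpu_seconds,
              "max_file_bytes": max_file_bytes}
    try:
        done = subprocess.run(cmd, capture_output=True, timeout=wall_seconds,
                              preexec_fn=_apply_rlimits(cpu_seconds, max_file_bytes))
        outcome = classify(done.returncode, timed_out=False)
    except subprocess.TimeoutExpired:
        outcome = classify(None, timed_out=True)
    return {"cmd": list(cmd), "limits": limits, "outcome": outcome}


if __name__ == "__main__":
    # A sleeping process uses almost no CPU, so only the wall-clock timer catches it.
    print(run_limited([sys.executable, "-c", "import time; time.sleep(5)"],
                      wall_seconds=0.5, cpu_seconds=5, max_file_bytes=1 << 20))
```

The sleeping child in the usage example is the case the text describes: RLIMIT_CPU never fires, and the supervisor timer ends the run.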

Secrets

Passing secrets as environment variables is convenient and dangerous. Child processes inherit them. Debug output prints them. /proc can reveal them if namespaces and procfs are misconfigured. A stronger pattern is a secrets broker: the workload asks for a named secret, the broker checks policy, releases the value only for the intended scope, and redacts matching fingerprints from captured output.

The broker should log the secret name, a fingerprint, the command scope, and whether release was allowed. It should not log the secret value. If the sandbox permits network egress, the policy should be stricter about which secrets are present; a stage with secrets and broad network is the dangerous combination.
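
A minimal in-memory sketch of such a broker. The class, its method names, and the policy shape (secret name mapped to allowed scopes) are assumptions for illustration. Two simplifications are worth naming: redaction here matches the literal released values and substitutes their fingerprints, and the truncated SHA-256 fingerprint is only appropriate for high-entropy secrets, since a short guessable value could be recovered from its hash.

```python
import hashlib


def fingerprint(value: str) -> str:
    """Short identifier for logs. Assumes the secret is high-entropy."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


class SecretsBroker:
    def __init__(self, store: dict, policy: dict):
        self._store = store      # secret name -> value
        self._policy = policy    # secret name -> set of scopes allowed to read it
        self._released = set()   # values handed out, used for redaction
        self.log = []            # audit entries; values are never written here

    def request(self, name: str, scope: str):
        """Release a secret only if policy allows this scope. Log either way."""
        known = name in self._store
        allowed = known and scope in self._policy.get(name, set())
        self.log.append({
            "secret": name,
            "fingerprint": fingerprint(self._store[name]) if known else None,
            "scope": scope,
            "released": allowed,
        })
        if not allowed:
            return None
        self._released.add(self._store[name])
        return self._store[name]

    def redact(self, captured: str) -> str:
        """Replace any released value in captured output with its fingerprint."""
        for value in self._released:
            if value:  # replacing an empty string would corrupt the output
                captured = captured.replace(value, f"[redacted:{fingerprint(value)}]")
        return captured
```

A denied request still produces a log entry, so a reviewer can see that a stage asked for a secret it was not entitled to.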

Observation and evidence

Every run should produce a capture: command, arguments, working directory, start time, exit status, stdout and stderr pointers, policy digest, resource summary, egress summary, denied actions, and redactions. Large output should be stored content-addressed outside the model's context window but remain recoverable. The summary should be small enough for an agent or reviewer to scan.

This evidence is not only for debugging. It is how the sandbox becomes reviewable. A diff produced under a weak policy should not be reviewed the same way as a diff produced under deny-network confinement. A run with blocked raw-IP egress deserves more scrutiny than a run that only compiled.
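
A sketch of two pieces of that capture: a policy digest and content-addressed output. The record layout, the function names, and the in-memory `blobs` dictionary (a stand-in for an on-disk blob store) are assumptions for illustration. The digest is computed over a canonical JSON form so that key order does not change it.

```python
import hashlib
import json


def digest(data: bytes) -> str:
    """Content address for a blob."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def policy_digest(resolved_policy: dict) -> str:
    """Hash a canonical JSON form so key order does not change the digest."""
    canonical = json.dumps(resolved_policy, sort_keys=True, separators=(",", ":"))
    return digest(canonical.encode("utf-8"))


def make_capture(command, exit_status: int, stdout: bytes,
                 resolved_policy: dict, blobs: dict) -> dict:
    """Store output by content address and return a small, scannable summary."""
    out_ref = digest(stdout)
    blobs[out_ref] = stdout  # full output stays recoverable by reference
    return {
        "command": list(command),
        "exit_status": exit_status,
        "stdout": {"ref": out_ref, "bytes": len(stdout)},
        "policy_digest": policy_digest(resolved_policy),
    }
```

Two runs can then be compared by `policy_digest` alone: if the digests differ, the diffs were produced under different confinement and should not be reviewed as equivalents.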

Launch order

The launch sequence is easiest to understand as "host prepares, child enters, child loses power, workload starts." Anything that requires broad authority must happen before the untrusted command runs. Anything the command could use to escape must be removed or filtered before exec.

  1. Create or select the isolated workspace and freeze the base revision.
  2. Resolve policy against host capabilities; refuse if the claim cannot be enforced.
  3. Prepare filesystem view, hiding shared Git state and sensitive host paths.
  4. Create user, mount, PID, and network namespaces as required.
  5. Set up network policy before the workload starts.
  6. Install resource limits and cgroup membership.
  7. Set no-new-privs, drop capabilities, and install seccomp or supervisor gates.
  8. Start the workload and capture output, denials, resources, and egress decisions.
  9. After exit, compute diff from the workspace filesystem using escape-checked paths.
  10. Store policy, evidence, and diff together so review can reconstruct the run.

launch pseudo-code
manifest = create_env_manifest(base_commit, requested_policy)
resolved = resolve_or_refuse(requested_policy, host_capabilities())

work = create_git_worktree(base_commit)
fs_view = prepare_mount_view(work, resolved.fs)
network = prepare_network_namespace(resolved.net)
resources = create_cgroup(resolved.limits)

child = fork()
if child == 0:
    enter_user_mount_pid_network_namespaces()
    mount_private_proc()
    join_cgroup(resources)
    # no-new-privs comes first: an unprivileged process must set it
    # before it may apply a Landlock ruleset or install a seccomp filter.
    set_no_new_privs()
    drop_capabilities()
    apply_landlock(resolved.fs)
    install_seccomp(resolved.syscalls)
    exec(command)

# Parent (supervisor) side: observe the child, then review its changes.
capture = capture_exit_status_output_denials(child)
diff = compute_escape_checked_diff(work)
store_evidence(manifest, resolved, capture, diff)

Common implementation bugs

| Bug | Why it matters | Fix |
| --- | --- | --- |
| Silent downgrade | User asks for network allowlist; host runs unrestricted | fail closed and print missing capability |
| Grant repo root to Landlock | .git and hooks become reachable | grant worktree only; host mediates commits |
| Proxy-only network policy | Raw sockets bypass host allowlist | combine proxy with netns packet filtering or deny direct sockets |
| Host /proc mounted | Secrets and host process metadata leak | private PID namespace and private procfs |
| Secrets in argv | Process list and logs expose values | broker or file descriptor handoff; redact captures |
| No path canonicalization on apply | Symlink escape can write outside worktree | canonicalize and reject escaped paths |

Testing a sandbox

Unit tests should exercise policy parsing, path normalization, and resolver fail-closed cases. Integration tests should attempt real denied operations: read ~/.ssh, write outside $WORK, connect to a raw IP, resolve an off-allowlist hostname, open AF_NETLINK, inspect host /proc/1/environ, fork past the PID limit, and fill a file past the file-size limit. Good tests prove both the refusal and the evidence record.

The most useful negative tests are boring. They assert that a command fails in the way the security model predicts. If a sandbox cannot test its own denials, reviewers have to trust prose.
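
A table-driven shape for such negative tests, as a sketch. The sandbox runner is injected as a callable, so the harness itself makes no claim about how commands are launched. `Result`, `DenialCase`, `failed_cases`, and the denial kind strings are hypothetical. Each case must show both the refusal (non-zero exit) and the matching evidence record.

```python
from typing import Callable, List, NamedTuple


class Result(NamedTuple):
    exit_status: int
    denials: List[str]   # denial kinds recorded in the evidence for this run


class DenialCase(NamedTuple):
    name: str
    command: List[str]
    expected_denial: str


CASES = [
    DenialCase("read ssh keys", ["cat", "/home/dev/.ssh/id_ed25519"], "fs.read"),
    DenialCase("host environ", ["cat", "/proc/1/environ"], "fs.read"),
    DenialCase("raw ip egress", ["curl", "http://203.0.113.5/"], "net.egress"),
]


def failed_cases(run: Callable[[List[str]], Result],
                 cases: List[DenialCase]) -> List[str]:
    """Return names of cases where the refusal or its evidence is missing."""
    failures = []
    for case in cases:
        result = run(case.command)
        refused = result.exit_status != 0
        recorded = case.expected_denial in result.denials
        if not (refused and recorded):
            failures.append(case.name)
    return failures
```

An empty list from `failed_cases` means every denial happened and was recorded. A command that fails without a matching evidence entry is still reported, because a refusal the audit log cannot explain is not reviewable.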

Implementation decides the claim

h5i's sandbox work is designed around fail-closed resolution, layered confinement, and reviewable captures.
