Engineering

AI Agent Code Execution Sandboxes on GPU Cloud: E2B, Daytona, and Firecracker Setup Guide (2026)

Written by Mitrasish, Co-founder · Apr 27, 2026
AI Agent Code Execution Sandbox · E2B Self-Hosted · Firecracker microVM · Firecracker GPU Passthrough · Firecracker Snapshot Restore · VFIO GPU Passthrough · Daytona AI Agent Sandbox · gVisor vs Firecracker · Sandbox Pooling · MIG Sandbox Isolation

An agent that can generate Python is only useful if it can actually run it safely. Most agent stacks handle LLM calls well, but isolating arbitrary agent-generated code at scale with GPU access is where teams hit a wall. This post covers the full isolation stack: Firecracker microVM self-hosting, E2B OSS deployment, Daytona's gVisor approach, GPU passthrough patterns, and per-execution cost math at production scale.

For the foundational GPU sizing and VRAM budgeting for agent workloads, the GPU infrastructure requirements for AI agents guide is the right starting point. This post focuses specifically on the isolated execution layer that sits below your LLM calls.

Why Agents Need Code Execution Sandboxes

Security: agents run untrusted code

Agent-generated code can exfiltrate environment variables, write to disk, open outbound connections, and escape to the host if not properly contained. This is not theoretical. SWE-bench agent harnesses, OpenDevin, and Claude's computer use mode all document this as the core threat model: the code the LLM emits may look benign, but its runtime behavior must be treated as untrusted.

The two kernel-level primitives that matter are syscall filtering (seccomp) and filesystem access restrictions (Landlock LSM or similar). Container isolation adds a third layer but is not sufficient on its own for multi-tenant deployments where tenants cannot trust each other. For agents that need policy-based syscall governance on the GPU side, see the NVIDIA OpenShell deployment guide, which covers NemoClaw's four-layer security stack. For teams who need to go deeper on the execution layer itself, this post covers what sits below that policy layer.
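
To make the first primitive concrete, here is a minimal, illustrative syscall allowlist, assuming the libseccomp Python bindings (the seccomp module shipped as python3-seccomp) are installed. This is a sketch of the technique, not the filter any of these platforms actually ships; production filters are far larger and carefully audited.

python
import errno
import seccomp  # libseccomp Python bindings (python3-seccomp)

# Deny every syscall by default (EPERM), then allowlist the handful the
# untrusted snippet is expected to need. Load the filter in the worker
# process immediately before exec'ing agent-generated code.
flt = seccomp.SyscallFilter(defaction=seccomp.ERRNO(errno.EPERM))
for name in ("read", "write", "brk", "mmap", "munmap",
             "futex", "rt_sigreturn", "exit_group"):
    flt.add_rule(seccomp.ALLOW, name)
flt.load()  # applies to this process and is inherited across fork/exec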

GPU access: ML agents need accelerated compute inside sandboxes

Code interpreters running PyTorch, JAX, or CUDA kernels inside a sandbox need real GPU access, not CPU-only environments. This separates production AI agent platforms from web-only sandbox use cases. A data analysis agent that calls torch.cuda.is_available() and gets False is broken. An SWE-bench harness that runs model training steps inside the agent's execution environment needs full GPU passthrough, not a CPU fallback.

This GPU requirement is what makes the choice of isolation layer non-trivial. gVisor's user-space kernel intercepts GPU calls at a point that blocks direct PCIe passthrough. Firecracker's hardware virtualization path supports VFIO device passthrough to the microVM, giving the sandbox real GPU access with near-native performance.

Persistent state: workspaces survive multi-turn agent sessions

Multi-turn agent sessions need filesystem state that persists across turns. An agent working on a Python project across 10 turns has installed packages, written files, and accumulated intermediate outputs. Full sandbox re-initialization on every turn wastes 200-500ms on environment setup. Firecracker's snapshot-restore mechanism lets you pause a sandbox, preserve its memory and filesystem state, and resume it in 5-30ms. This is the operational foundation for SWE-bench-style agent harnesses where context accumulates over dozens of tool calls.

Sandbox Architecture Comparison

| Sandbox | Isolation layer | Cold start | GPU support | Self-hostable | Starting price |
| --- | --- | --- | --- | --- | --- |
| E2B (managed) | Firecracker microVM | 5-30ms (snapshot) | No (managed tier) | Yes (OSS) | $0.000014/sec |
| E2B OSS (self-hosted) | Firecracker microVM | 5-30ms | Yes (bare metal) | Yes | Host cost only |
| Daytona | Docker + gVisor | 200-500ms | Limited (no MIG) | Yes | OSS / enterprise |
| Modal Sandboxes | gVisor (runsc) | 100-300ms | Yes (T4/A10G) | No | $0.0001/sec compute |
| Replit Agent Runtime | Container (proprietary) | ~1s | No | No | Closed |

When to use each:

E2B managed is the right default for early-stage agent platforms. Zero infra ops, sub-30ms cold starts from snapshots, Python SDK, and REST API. The limitation is GPU: managed E2B runs on CPU-only sandbox hosts. If your agents run PyTorch inside the sandbox, you need OSS.
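
As a sketch of what the managed path looks like from an agent loop, assuming the e2b-code-interpreter Python package (method names follow the v1 SDK and may differ slightly in other versions):

python
from e2b_code_interpreter import Sandbox

# Create a managed sandbox, run a snippet of agent-generated code,
# read its stdout, then tear the sandbox down.
sandbox = Sandbox()
try:
    execution = sandbox.run_code("import sys; print(sys.version)")
    print(execution.logs.stdout)
finally:
    sandbox.kill()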

E2B OSS on bare metal is the right choice when you need GPU access inside sandboxes, have GPU-enabled sandbox volume above roughly 500 sandbox-hours/month, and have the ops capacity to run a small cluster. (For CPU-only workloads, managed E2B is cheaper until you hit roughly 11,000 sandbox-hours/month.) The architecture is identical to managed; you just host the Firecracker nodes yourself.

Daytona targets developer workspace use cases. Its gVisor layer provides strong isolation but blocks GPU passthrough. Good for code-gen agents that only need CPU, not for ML-heavy sandboxes.

Modal Sandboxes offer GPU access (T4, A10G) without the ops overhead. Per-second cost is higher than self-hosted bare metal, but it is the only managed option in the table with GPU support (E2B managed is CPU-only). No H100 or B200 option, and no self-hosting path.

Full GPU passthrough with per-execution cost control at volume: Firecracker on bare metal wins. If you need H100 SXM5 inside the sandbox and care about per-execution cost above 10K executions/month, this is the path.

Self-Hosting Firecracker microVMs on Spheron Bare Metal

Why bare metal is required

Firecracker needs KVM access. Cloud VMs with nested virtualization add latency (5-15ms overhead per VM operation), reduce isolation guarantees, and critically block PCIe passthrough. Nested virtualization exposes a software-emulated KVM interface, not the real hardware VMX/SVM instructions. This is sufficient for running Firecracker as a development environment but disqualifies it for production GPU passthrough.

Serverless GPU platforms (Modal, RunPod serverless) run on shared host infrastructure where you cannot bind VFIO devices. The platform's host kernel owns the GPU and exposes it through the NVIDIA container toolkit. VFIO-PCI binding, which is required for passing a PCIe device through to a microVM, requires root access to the host.

Spheron bare metal H100 instances give you unmediated access to the KVM subsystem and PCIe bus, a requirement for running Firecracker with real GPU passthrough. Managed serverless platforms abstract this away.

Verify KVM availability after provisioning:

bash
ls /dev/kvm
# Should return: /dev/kvm

If /dev/kvm is absent, confirm the instance type is bare metal and that kvm and kvm_intel (or kvm_amd) kernel modules are loaded.

Network setup for sandboxes

Each Firecracker microVM needs a TAP network interface for communication. On a multi-sandbox host, you bridge all TAP devices and assign each sandbox a /30 subnet for isolation.

bash
# Create a bridge for sandbox networking
ip link add br0 type bridge
ip link set br0 up
ip addr add 10.100.0.1/16 dev br0

# Create a TAP device per sandbox (repeat for each)
ip tuntap add tap0 mode tap
ip link set tap0 master br0
ip link set tap0 up

# Assign /30 subnet to sandbox (example for sandbox 0)
# Host side: 10.100.0.1, Sandbox: 10.100.0.2, Broadcast: 10.100.0.3
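
A small, hypothetical helper for carving those /30s out of the bridge's /16; the host-side/guest-side convention matches the sandbox-0 example above.

python
import ipaddress

SANDBOX_NET = ipaddress.ip_network("10.100.0.0/16")

def sandbox_subnet(index: int) -> ipaddress.IPv4Network:
    # Sandbox N gets the Nth /30: .0 network, .1 host side, .2 guest, .3 broadcast
    base = int(SANDBOX_NET.network_address) + index * 4
    return ipaddress.ip_network((base, 30))

host_ip, guest_ip = sandbox_subnet(0).hosts()  # 10.100.0.1, 10.100.0.2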

For egress, configure masquerade NAT on the host's external interface and set iptables rules per sandbox to allow or block outbound destinations:

bash
# Allow outbound from all sandboxes through host
iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
iptables -A FORWARD -i br0 -o eth0 -j ACCEPT

# Per-sandbox egress restriction (block all non-allowlisted destinations)
# Use physdev module because tap0 is enslaved to br0; the FORWARD chain sees br0 as ingress, not tap0
iptables -I FORWARD -i br0 -m physdev --physdev-in tap0 -o eth0 -j DROP
iptables -I FORWARD -i br0 -m physdev --physdev-in tap0 -d 8.8.8.8 -o eth0 -j ACCEPT

GPU passthrough with VFIO-PCI

VFIO-PCI passthrough lets you bind a physical GPU to a microVM with near-native performance. The host must have IOMMU enabled (VT-d on Intel, AMD-Vi on AMD), configured in BIOS before boot. The exact commands vary by kernel version and NVIDIA driver; this procedure targets Linux kernel 5.15+ and NVIDIA driver 535+. The A100 80GB GPU instances are a common starting point before moving to H100s at scale.

Step 1: Identify the GPU's PCI address

bash
lspci | grep NVIDIA
# Example output: 01:00.0 3D controller: NVIDIA Corporation A100 80GB PCIe [GA100]

Step 2: Bind the GPU to VFIO-PCI

bash
# Load VFIO modules
modprobe vfio
modprobe vfio-pci

# Get GPU vendor:device ID
lspci -n -s 01:00.0
# Example: 01:00.0 0302: 10de:20b5 (rev a1)

# Unbind from nvidia driver
echo "01:00.0" > /sys/bus/pci/devices/0000:01:00.0/driver/unbind

# Bind to vfio-pci
echo "10de 20b5" > /sys/bus/pci/drivers/vfio-pci/new_id

Step 3: Configure the Firecracker microVM with the PCIe device

Firecracker uses a JSON configuration file. Add the GPU as a vfio device:

json
{
  "machine-config": {
    "vcpu_count": 4,
    "mem_size_mib": 16384
  },
  "vfio": [
    {
      "host_dev_path": "/dev/vfio/1"
    }
  ],
  "boot-source": {
    "kernel_image_path": "/path/to/vmlinux",
    "boot_args": "console=ttyS0 reboot=k panic=1"
  },
  "drives": [
    {
      "drive_id": "rootfs",
      "path_on_host": "/path/to/rootfs.ext4",
      "is_root_device": true,
      "is_read_only": false
    }
  ]
}

Step 4: Verify CUDA inside the microVM

After boot, install NVIDIA drivers in the rootfs and confirm:

bash
# Inside the microVM
nvidia-smi
python3 -c "import torch; print(torch.cuda.is_available())"
# Should print: True

Note: MIG (Multi-Instance GPU) partitioning works with VFIO passthrough. If you create a MIG partition on the host before binding, you can pass a MIG slice (rather than the full GPU) to each microVM. This requires binding the MIG device node (/dev/nvidia0/gi0/ci0/nvidia-caps/...) rather than the physical function.

Deploying E2B Open Source on Spheron

Cluster setup

E2B OSS requires an orchestration node and one or more host nodes running Firecracker. The minimum recommended configuration is 1 orchestrator (CPU-only is sufficient) and 2 host nodes with H100 SXM5 instances for GPU-enabled sandboxes.

The E2B CLI handles cluster initialization. The core sequence:

bash
# Install E2B CLI
npm install -g @e2b/cli

# Initialize cluster config
e2b infra init --provider custom

# Register a host node (run from orchestrator)
e2b infra node add --host <SPHERON_INSTANCE_IP> --ssh-key ~/.ssh/id_rsa

# Verify cluster health
e2b infra status

The orchestrator runs the E2B API server, sandbox lifecycle manager, and template registry. Host nodes run the Firecracker VMM and expose a gRPC endpoint to the orchestrator. Each host node needs Docker (for template building), Firecracker binary, and the jailer companion process.

Building sandbox templates for Python agent environments

Templates are the pre-built rootfs snapshots your sandboxes start from. A well-built template cuts cold start time by 50-100ms versus pulling packages at runtime.

dockerfile
FROM ubuntu:22.04

# Add the NVIDIA CUDA apt repository; cuda-toolkit and cuDNN packages are not
# in the default Ubuntu repos
RUN apt-get update && apt-get install -y wget ca-certificates && \
    wget -q https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb && \
    dpkg -i cuda-keyring_1.1-1_all.deb && rm cuda-keyring_1.1-1_all.deb

# GPU support
RUN apt-get update && apt-get install -y \
    python3 python3-pip \
    cuda-toolkit-12-4 \
    libcudnn8 libcudnn8-dev

# Common ML dependencies baked in
RUN pip3 install torch==2.5.1 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
RUN pip3 install \
    transformers datasets accelerate \
    numpy pandas matplotlib jupyter

# Agent environment
RUN pip3 install \
    langchain anthropic openai \
    requests httpx aiohttp

Build and register the template:

bash
# Build from Dockerfile
e2b template build --name python-ml-agent --dockerfile ./Dockerfile

# The build process creates a rootfs snapshot and registers it in the template registry
# First build: 5-15 minutes (downloading layers)
# Incremental rebuilds: 1-3 minutes

# Verify template
e2b template list

Once registered, sandboxes created from this template start from the pre-built snapshot, skipping all package installation.
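
Once the template exists, agent sessions reference it by name at creation time. A sketch using the E2B Python SDK (the exact constructor parameter for the template may vary by SDK version):

python
from e2b_code_interpreter import Sandbox

# Start a sandbox from the custom template so torch/CUDA are already baked in,
# then confirm the GPU is visible from inside the sandbox.
sandbox = Sandbox(template="python-ml-agent")
try:
    execution = sandbox.run_code(
        "import torch; print(torch.__version__, torch.cuda.is_available())"
    )
    print(execution.logs.stdout)
finally:
    sandbox.kill()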

Persistent volumes for agent workspaces

Multi-turn agent sessions need workspace persistence across sandbox pause/resume cycles. Mount a shared NFS volume or a block volume into the microVM filesystem at /workspace:

bash
# On the orchestrator, create an NFS export per tenant
mkdir -p /exports/agent-workspaces/tenant-123
echo "/exports/agent-workspaces/tenant-123 10.100.0.0/16(rw,sync,no_subtree_check)" >> /etc/exports
exportfs -ra

# In the Firecracker VM config, add the NFS mount via init
# Or use virtiofs to mount a host directory directly

For block volumes (faster than NFS for random I/O), create a per-workspace ext4 volume and attach it via Firecracker's drives config. This avoids NFS overhead for agents doing heavy file I/O, at the cost of per-workspace storage allocation.
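
A sketch of wiring a per-workspace block volume into the microVM config before launch. The helper and file paths are hypothetical; the drives entry follows the Firecracker config format shown earlier.

python
import json

def add_workspace_drive(config_path: str, workspace_image: str) -> None:
    # Append a secondary (non-root) drive backed by a per-workspace ext4 image.
    # Inside the guest it appears as an additional virtio block device that
    # init can mount at /workspace.
    with open(config_path) as f:
        config = json.load(f)
    config.setdefault("drives", []).append({
        "drive_id": "workspace",
        "path_on_host": workspace_image,
        "is_root_device": False,
        "is_read_only": False,
    })
    with open(config_path, "w") as f:
        json.dump(config, f, indent=2)

add_workspace_drive("/etc/firecracker/sandbox-0.json",
                    "/volumes/tenant-123/workspace.ext4")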

GPU Access Patterns for Sandboxes

MIG partitioning for isolated GPU slices

H100 SXM5 supports up to 7 MIG instances using the 1g.10gb profile, giving each sandbox 10GB dedicated VRAM and a dedicated SM partition. This is strong isolation: one sandbox's VRAM usage does not affect another's.

bash
# Enable MIG mode on H100
nvidia-smi -i 0 -mig 1

# Create 7x 1g.10gb instances
nvidia-smi mig -cgi 1g.10gb -C
# Repeat 7 times, or use: nvidia-smi mig -cgi 1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb -C

# List instances
nvidia-smi mig -lgi

# Bind MIG instance to VFIO for passthrough
# Each instance appears as /dev/nvidia0/gi<N>/ci0/

MIG requires H100, H200, or A100 (MIG is not available on L40S and consumer GPUs; for those, use time-slicing only).

With 7 MIG slices per H100, a cluster of 5 H100 instances supports 35 fully isolated GPU-enabled sandboxes running simultaneously, each with 10GB dedicated VRAM.

Time-slicing for bursty agent workloads

When sandboxes run short GPU bursts (a few seconds of inference, then idle for 10-30 seconds), time-slicing outperforms MIG on GPU utilization. Time-slicing context-switches between sandboxes at the GPU scheduler level, effectively oversubscribing the GPU.

Configure NVIDIA device plugin time-slicing via a Kubernetes ConfigMap (used if your E2B OSS cluster deploys on Kubernetes):

yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: kube-system
data:
  config.json: |
    {
      "version": "v1",
      "flags": {
        "migStrategy": "none"
      },
      "sharing": {
        "timeSlicing": {
          "replicas": 10
        }
      }
    }

With 10 replicas, one physical H100 appears as 10 allocatable GPUs in the cluster. Each sandbox gets one slot; the hardware time-slices between them.

Trade-off: no VRAM isolation. A runaway sandbox that allocates all 80GB of H100 VRAM will OOM-kill other sandboxes on the same GPU. Time-slicing is the right choice for short, predictable executions. MIG is better for concurrent sandboxes running longer jobs where VRAM isolation matters.

Dedicated GPU per sandbox for large-model workloads

For SWE-bench harnesses or agent pipelines where the sandbox itself runs a 70B model (not calling an external LLM endpoint), dedicate one full H100 per sandbox. One H100 SXM5 at 80GB can hold a 70B FP8 model (~70GB) plus minimal KV cache for the agent's execution context.

| Pattern | Isolation | Concurrent sandboxes per H100 | VRAM per sandbox | Best use case |
| --- | --- | --- | --- | --- |
| MIG (1g.10gb) | Strong (dedicated VRAM) | 7 | 10 GB | Multi-tenant sandboxes, moderate compute |
| Time-slicing (10x) | Weak (shared VRAM) | 10-50 | Variable (shared 80 GB) | Bursty, short GPU bursts, dev/test sandboxes |
| Dedicated | Full isolation | 1 | 80 GB | Full-model inference inside sandbox, SWE-bench |

Lifecycle Management: Cold Starts, Pooling, and Eviction

Snapshot-restore and sandbox pooling

Firecracker's snapshot-restore API is the core mechanism for sub-30ms sandbox creation. The process: boot a microVM to a ready state, snapshot its memory and block device state to local NVMe, then restore subsequent sandboxes from that snapshot instead of booting from scratch.

bash
# Create a snapshot of a running microVM via Firecracker API
# Firecracker's API server listens on a Unix socket (set via --api-sock at startup)
curl --unix-socket /run/firecracker.socket -X PATCH http://localhost/vm \
  -H "Content-Type: application/json" \
  -d '{"state": "Paused"}'

curl --unix-socket /run/firecracker.socket -X PUT http://localhost/snapshot/create \
  -H "Content-Type: application/json" \
  -d '{
    "snapshot_type": "Full",
    "snapshot_path": "/snapshots/python-ml-agent-base.snap",
    "mem_file_path": "/snapshots/python-ml-agent-base.mem"
  }'

# Restore a sandbox from snapshot (5-30ms instead of 125-200ms cold boot)
curl --unix-socket /run/firecracker.socket -X PUT http://localhost/snapshot/load \
  -H "Content-Type: application/json" \
  -d '{
    "snapshot_path": "/snapshots/python-ml-agent-base.snap",
    "mem_backend": {
      "backend_path": "/snapshots/python-ml-agent-base.mem",
      "backend_type": "File"
    }
  }'

Store snapshots on local NVMe for fast I/O (target: 2-5ms snapshot load time). The pool sizing formula:

pool_size = ceil(peak_rps * p99_restore_latency_seconds * 1.2)

For a system handling 20 sandbox requests per second at P99 restore latency of 20ms: ceil(20 * 0.020 * 1.2) = ceil(0.48) = 1. At this rate, a pool of 1 pre-restored sandbox is enough. For spikier workloads with 10x burst (200 RPS), the pool size depends on how p99 latency behaves under load. If p99 stays at 20ms: ceil(200 * 0.020 * 1.2) = ceil(4.8) = 5. If p99 degrades to 50ms under burst (common when NVMe is saturated): ceil(200 * 0.050 * 1.2) = ceil(12) = 12. Measure your actual p99 under burst before setting pool size. In practice, keep the pool size at 10-20 for bursty agent workloads where p99 latency tends to increase under sustained load.
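
The same formula as a helper, reproducing the three worked examples above:

python
import math

def pool_size(peak_rps: float, p99_restore_s: float, headroom: float = 1.2) -> int:
    # pool_size = ceil(peak_rps * p99_restore_latency_seconds * headroom)
    return max(1, math.ceil(peak_rps * p99_restore_s * headroom))

print(pool_size(20, 0.020))    # 1  (steady 20 RPS, 20ms p99 restore)
print(pool_size(200, 0.020))   # 5  (10x burst, p99 holds at 20ms)
print(pool_size(200, 0.050))   # 12 (10x burst, p99 degrades to 50ms)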

Idle eviction policies

Sandboxes that stay idle consume VRAM and CPU even when not running code. Recommended thresholds:

  • Pause after 2 minutes idle: preserves sandbox state, frees CPU, but keeps VRAM allocated. Use for sessions where users are expected to return.
  • Destroy after 10 minutes paused: reclaims VRAM and adds the slot back to the pool. Use for all sandboxes; 10 minutes is long enough to cover normal user think time.

Eviction loop pseudocode:

python
async def eviction_loop():
    while True:
        for sandbox in await list_sandboxes():
            idle_seconds = now() - sandbox.last_activity_at
            if sandbox.state == "running" and idle_seconds > 120:
                await pause_sandbox(sandbox.id)
            # Use paused_at, not last_activity_at, so destroy fires 10 minutes after
            # pausing rather than 10 minutes after the last activity (which would be
            # only ~8 minutes after pause since 2 minutes of idle elapsed before pause).
            elif sandbox.state == "paused" and (now() - sandbox.paused_at) > 600:
                if sandbox.workspace_path:
                    await archive_workspace(sandbox.workspace_path)
                await destroy_sandbox(sandbox.id)
        await asyncio.sleep(30)

Destroyed sandboxes' workspaces should be archived to object storage (S3, GCS, or similar) before destruction if the session may resume. Archive must complete before destroy_sandbox runs — destroying the sandbox first risks deallocating the workspace storage (unmounting block devices or NFS mounts) while the archive task is still reading it, causing I/O errors or a truncated archive. Archive adds 1-3 seconds to the destroy path for typical workspaces; this is acceptable since the sandbox is already paused and its slot is not blocking new allocations.
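
A sketch of that archive step, assuming S3-compatible object storage via boto3; the bucket name, key layout, and helper names are placeholders, and the blocking tar/upload runs in a thread so the eviction loop above stays responsive.

python
import asyncio
import os
import tarfile

import boto3

s3 = boto3.client("s3")
ARCHIVE_BUCKET = "agent-workspace-archive"  # placeholder bucket name

def _tar_and_upload(workspace_path: str) -> str:
    # Tar the workspace and upload it before the sandbox (and its storage)
    # is torn down.
    name = os.path.basename(workspace_path.rstrip("/"))
    archive = f"/tmp/{name}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(workspace_path, arcname=".")
    key = f"archived-workspaces/{name}.tar.gz"
    s3.upload_file(archive, ARCHIVE_BUCKET, key)
    os.remove(archive)
    return key

async def archive_workspace(workspace_path: str) -> str:
    return await asyncio.to_thread(_tar_and_upload, workspace_path)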

Multi-Tenancy and Security

gVisor vs Firecracker isolation

gVisor intercepts syscalls in a user-space kernel (Sentry). The Sentry reimplements the Linux syscall interface in Go, inspecting every syscall before forwarding to the host. This gives strong auditability and a reduced attack surface on the host kernel, but at a cost: 10-15% CPU overhead, and no direct path to GPU hardware.

Firecracker uses hardware virtualization (KVM). The microVM has its own kernel, memory space, and virtual hardware. The attack surface is the hypervisor (Firecracker's VMM), not the host kernel. Firecracker's VMM is ~50K lines of Rust with a deliberately small device model. For multi-tenant workloads where tenants cannot trust each other, Firecracker provides stronger isolation guarantees than gVisor because compromising the sandbox requires exploiting the hypervisor, not a syscall implementation.

For regulated or high-security workloads requiring encrypted VRAM, see confidential GPU computing with NVIDIA TEE, which covers H100/H200 Confidential Computing with encrypted memory attestation.

Network egress controls and secret injection

Configure per-sandbox iptables rules on the host bridge interface before each sandbox starts. The orchestrator owns the ruleset and applies it per-sandbox at creation time:

bash
# Block all outbound from sandbox tap interface by default
# Use physdev module: tap-${SANDBOX_ID} is enslaved to br0, so the FORWARD chain
# sees br0 as ingress. Plain -i tap-${SANDBOX_ID} never matches; -m physdev --physdev-in
# matches the physical port inside the bridge and correctly restricts per-sandbox traffic.
iptables -I FORWARD -i br0 -m physdev --physdev-in tap-${SANDBOX_ID} -o eth0 -j DROP

# Allow only specific destinations (e.g., PyPI, GitHub)
iptables -I FORWARD -i br0 -m physdev --physdev-in tap-${SANDBOX_ID} -d 151.101.0.0/17 -o eth0 -j ACCEPT  # PyPI CDN
iptables -I FORWARD -i br0 -m physdev --physdev-in tap-${SANDBOX_ID} -d 140.82.112.0/20 -o eth0 -j ACCEPT  # GitHub

# Flush rules on sandbox destroy (remove ACCEPT rules first, then DROP last to avoid a window where traffic bypasses the allowlist)
iptables -D FORWARD -i br0 -m physdev --physdev-in tap-${SANDBOX_ID} -d 151.101.0.0/17 -o eth0 -j ACCEPT
iptables -D FORWARD -i br0 -m physdev --physdev-in tap-${SANDBOX_ID} -d 140.82.112.0/20 -o eth0 -j ACCEPT
iptables -D FORWARD -i br0 -m physdev --physdev-in tap-${SANDBOX_ID} -o eth0 -j DROP

For secret injection, never bake API keys into sandbox templates. Instead, fetch them from a secrets store at sandbox creation time and inject via environment variables:

python
# In the orchestrator, at sandbox creation
async def create_sandbox(tenant_id: str, session_id: str):
    secrets = await vault_client.get_secrets(
        path=f"agents/{tenant_id}/credentials"
    )
    env = {
        "ANTHROPIC_API_KEY": secrets["anthropic_key"],
        "OPENAI_API_KEY": secrets["openai_key"],
    }
    sandbox = await firecracker_client.boot_from_snapshot(
        snapshot_path="/snapshots/python-ml-agent-base.snap",
        env_vars=env,
    )
    return sandbox

Inject only the specific keys the agent needs for that session. Do not pass a full credential bundle; apply least-privilege at the secrets injection layer.

Observability: Tracing and Cost Attribution

Tracing agent tool calls through sandbox executions

Each sandbox execution triggered by an agent tool call should emit an OpenTelemetry span with enough context to attribute cost and debug failures:

python
from opentelemetry import trace
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator

tracer = trace.get_tracer("sandbox-orchestrator")

async def execute_code(
    agent_id: str,
    session_id: str,
    tenant_id: str,
    code: str,
    parent_context: dict,
    sandbox: FirecrackerSandbox,
):
    ctx = TraceContextTextMapPropagator().extract(carrier=parent_context)

    with tracer.start_as_current_span(
        "sandbox.execute",
        context=ctx,
        attributes={
            "agent.id": agent_id,
            "session.id": session_id,
            "tenant.id": tenant_id,
        },
    ) as span:
        result = await sandbox.run_code(code)
        span.set_attribute("execution.duration_ms", result.duration_ms)
        span.set_attribute("gpu.utilization_pct", result.gpu_util_pct)
        span.set_attribute("execution.exit_code", result.exit_code)
        return result

Pass traceparent headers from the LLM call layer through to the sandbox orchestrator so execution spans appear as children of the agent's LLM spans in Jaeger or Grafana Tempo. This gives you end-to-end traces showing: LLM call time, tool routing overhead, and code execution duration per agent run.
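
On the calling side, the parent_context carrier is just the injected W3C traceparent headers. A sketch of what the LLM/tool-routing layer does before handing off to execute_code above; the wrapper name is illustrative.

python
from opentelemetry import trace
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator

tracer = trace.get_tracer("agent-llm-layer")

async def run_tool_call(agent_id, session_id, tenant_id, code, sandbox):
    # Wrap the tool call in its own span, serialize the active trace context
    # into a carrier dict, and pass it to the orchestrator so the
    # sandbox.execute span becomes a child of this span.
    with tracer.start_as_current_span("agent.tool_call"):
        carrier: dict = {}
        TraceContextTextMapPropagator().inject(carrier)
        return await execute_code(
            agent_id=agent_id,
            session_id=session_id,
            tenant_id=tenant_id,
            code=code,
            parent_context=carrier,
            sandbox=sandbox,
        )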

Sandbox metrics and cost attribution

Expose Prometheus metrics from the orchestrator for cluster health and per-tenant cost attribution:

# Active sandboxes
sandbox_active_count{tenant="tenant-123"} 12

# Cold starts (high values indicate pool too small)
sandbox_cold_starts_total{template="python-ml-agent"} 847

# Execution duration histogram
sandbox_execution_duration_seconds_bucket{tenant="tenant-123", le="1.0"} 2341

# GPU compute consumed per tenant (for billing)
sandbox_gpu_seconds_used_total{tenant="tenant-123"} 84600
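
On the orchestrator side, these can be defined with prometheus_client. Metric names match the listing above; the port and the usage comments are illustrative.

python
from prometheus_client import Counter, Gauge, Histogram, start_http_server

SANDBOX_ACTIVE = Gauge("sandbox_active_count", "Currently active sandboxes", ["tenant"])
COLD_STARTS = Counter("sandbox_cold_starts", "Sandbox cold starts", ["template"])  # exported with _total suffix
EXEC_DURATION = Histogram(
    "sandbox_execution_duration_seconds", "Per-execution duration", ["tenant"],
    buckets=(0.1, 0.5, 1.0, 5.0, 30.0, 120.0),
)
GPU_SECONDS = Counter("sandbox_gpu_seconds_used", "GPU seconds consumed", ["tenant"])  # exported with _total suffix

start_http_server(9109)  # illustrative metrics port for Prometheus to scrape

# After each execution, for example:
# EXEC_DURATION.labels(tenant=tenant_id).observe(result.duration_ms / 1000)
# GPU_SECONDS.labels(tenant=tenant_id).inc(result.duration_ms / 1000)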

Cost attribution PromQL: multiply GPU seconds by the per-second GPU cost and convert to a readable metric:

# GPU cost per tenant per hour (at $0.80/hr H100 SXM5 spot)
sum by (tenant) (
  rate(sandbox_gpu_seconds_used_total[1h])
) * 0.80

This gives you a per-tenant GPU cost rate in $/hr, updated every scrape interval. Use this to power usage-based billing or to flag tenants over budget thresholds.

Cost Model: Self-Hosted vs Managed Pricing

The H100 SXM5 spot price on Spheron is currently $0.80/hr per GPU (spot pricing is well-suited for sandbox host nodes, which can be drained and rescheduled on interruption). Monthly cost for one H100 instance at full utilization: $0.80 * 720 = $576/month.

E2B managed costs:

| Volume | Avg duration | E2B managed $/month | Notes |
| --- | --- | --- | --- |
| 1K sandboxes/day | 60s | ~$25 | $0.000014/sec * 60s * 1K * 30 days |
| 10K sandboxes/day | 60s | ~$252 | Small team, active usage |
| 100K sandboxes/day | 60s | ~$2,520 | Production scale, CPU-only |
| Any volume | Requires GPU | Not available (managed tier) | Must self-host for GPU |

Self-hosted Firecracker on Spheron H100 SXM5:

A single H100 instance running E2B OSS can handle approximately 200 concurrent sandboxes with MIG (7 slices) and time-slicing combined. At 10K sandboxes/day with 60s average duration:

| Volume | Duration | Spheron H100 $/month (spot) | Per-execution cost |
| --- | --- | --- | --- |
| 10K/day | 60s | $576 | $0.0019 |
| 100K/day | 60s | $576 | $0.00019 |
| 1M/day | 60s | Need 2-3 H100s (~$1,728) | $0.000058 |

Self-hosted break-even vs managed E2B for CPU-only workloads is at roughly 11,000 sandbox-hours/month (about 15 sandboxes running continuously). Below that, managed E2B is cheaper after accounting for ops overhead. Above that, self-hosting wins on unit economics.
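
The break-even arithmetic, using the per-second and hourly prices quoted above:

python
E2B_MANAGED_PER_SEC = 0.000014            # $/sandbox-second, managed tier
H100_SPOT_MONTHLY = 0.80 * 720            # $576/month for one spot host

managed_per_sandbox_hour = E2B_MANAGED_PER_SEC * 3600       # ~$0.0504/sandbox-hour
breakeven_hours = H100_SPOT_MONTHLY / managed_per_sandbox_hour
print(round(breakeven_hours))              # ~11,400 sandbox-hours/month
print(round(breakeven_hours / 720, 1))     # ~15-16 always-on sandboxes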

For GPU-enabled sandboxes, there is no managed comparison (E2B managed does not offer GPU). The self-hosted cost is the only option.

For a 5-H100 reference cluster at spot pricing: 5 * $0.80 * 720 = $2,880/month. At 1,000 concurrent GPU-enabled sandboxes running 2-minute tasks at 10% duty cycle, active sandbox-hours/day = 1000 * (2/60) * 0.10 * 24 = 80 sandbox-hours/day, giving a cost of $2,880 / (80 * 30) = $1.20/sandbox-GPU-hour. This is the baseline at 10% duty cycle. At higher duty cycles (50%), GPU utilization rises and cost per sandbox-hour drops proportionally, as shown in the Reference Architecture block below.

Pricing fluctuates based on GPU availability. The prices above were captured on Apr 27, 2026 and may have changed. Check current GPU pricing → for live rates.

A Spheron H100 SXM5 bare metal instance, used as a Firecracker sandbox host, gives you the VRAM headroom and PCIe passthrough access that sandbox density at this scale requires. See H100 SXM5 on Spheron for current availability and pricing.

Reference Architecture: 1,000 Concurrent Agent Sandboxes

A production SWE-bench-scale agent harness at 1,000 concurrent GPU-enabled sandboxes:

Orchestration layer: E2B OSS API server running on a dedicated CPU-only coordinator node. Handles sandbox create/pause/resume/destroy, template registry, and scheduling. No GPU required on the coordinator.

Host nodes: 5 x Spheron H100 SXM5 bare metal instances. Each runs Firecracker and VFIO, with the H100 split into 7 MIG 1g.10gb slices. That gives 7 dedicated GPU slices per host, and 35 total across the cluster. Adding time-slicing on top of MIG slices for short-duration sandboxes pushes concurrent capacity to 350-700 sandboxes, depending on execution pattern.

Snapshot store: Local NVMe on each host for fast restore (target: 2-5ms). Cross-host snapshot replication via rsync to a shared object store for resilience.

Networking: Per-sandbox /30 subnets, host-level iptables enforcing egress policy, a NAT gateway for allowed outbound traffic to PyPI, GitHub, and configured API endpoints.

Observability: Prometheus scraping each Firecracker host for sandbox metrics. Grafana for dashboards. OTEL traces piped to Grafana Tempo. Cost attribution query running on 5-minute intervals.

Cost at this scale:

5 hosts * $0.80/hr (spot) * 720 hr/month = $2,880/month

At 1,000 concurrent sandboxes, 2-minute avg task duration, 10% duty cycle:
  Active sandbox-hours/day = 1000 * (2/60) * 0.10 * 24 = 80 sandbox-hours/day
  GPU-hours/day = 80 / 7 (MIG slices per GPU) = ~11.4 GPU-hours/day
  GPU utilization: 11.4 / (5 * 24) = ~9.5% of total GPU capacity

At higher duty cycles (50%): GPU utilization rises to ~47%, and cost per sandbox-hour drops proportionally.

The key insight: GPU infrastructure costs are largely fixed (you pay for provisioned capacity, not per execution). Efficiency comes from keeping GPUs busy, which means right-sizing pool size and task duration.


Spheron bare metal GPU instances, with direct KVM access and PCIe passthrough, are the right foundation for Firecracker and E2B OSS clusters where managed serverless platforms fall short. If you're building an agent platform that runs code at scale, start with an H100 on bare metal and measure your real per-execution cost before committing to a managed sandbox vendor.

Rent H100 on Spheron → | View GPU pricing → | Get started →
