Tutorial

Run Karpathy's autoresearch on a GPU VM: 100 Experiments Overnight

Written by Mitrasish, Co-founder · Mar 14, 2026
Tags: GPU Cloud · LLM Training · AI Research · H100 · Autonomous ML · GPU VM · Andrej Karpathy · ML Experimentation
Andrej Karpathy dropped autoresearch on March 6th, 2026. It hit over 33,000 GitHub stars in the first week. The premise: give an AI agent a GPU and a training script, let it modify the script, run experiments, and keep the changes that improve performance. Repeat overnight. Wake up to 80-100 completed experiments and a model architecture that's measurably better than the one you started with.

The critical requirement is a machine that runs bare Linux and CUDA continuously. No container overhead. No idle timeouts. No notebook session that dies at 2 AM. That's exactly what a Spheron GPU VM or bare metal instance provides.

What autoresearch actually does

autoresearch is built around three files: prepare.py, train.py, and program.md.

prepare.py is fixed and untouchable. It downloads the ClimbMix dataset (Karpathy's own climbmix-400b-shuffle on HuggingFace), trains a BPE tokenizer, and writes sharded data files to disk. You run it once. The agent never modifies it.

train.py is the target. It contains a GPT model definition, optimizer setup, and training loop. The agent reads it, proposes a change, writes the modified version, runs it, and evaluates the result. This is the only file the agent is allowed to touch.

program.md is your instruction file. You write it. It tells the agent what direction to explore: "try different attention mechanisms," "optimize for throughput on 32GB VRAM," "focus on learning rate schedules." The agent reads this at the start of each session and uses it to guide its hypotheses.

The metric is val_bpb (validation bits-per-byte). Lower is better. It measures cross-entropy loss normalized by the byte length of the target tokens, so results are comparable across architectures with different vocabulary sizes or tokenizers.
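In code, the conversion from summed cross-entropy to bits-per-byte looks roughly like this (a sketch of the normalization described above; autoresearch's actual evaluation code may differ in details):

```python
import math

def bits_per_byte(total_ce_loss_nats: float, total_target_bytes: int) -> float:
    """Cross-entropy summed over the validation targets (in nats),
    normalized by the targets' byte length.

    Dividing by ln(2) converts nats to bits; dividing by bytes (rather
    than tokens) makes the score comparable across tokenizers with
    different vocabulary sizes.
    """
    return total_ce_loss_nats / (math.log(2) * total_target_bytes)
```

Because the denominator is bytes rather than tokens, a model with a larger vocabulary can't lower its score simply by packing more bytes into each token.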

The loop runs like this:

  1. Agent reads program.md and the current train.py
  2. Agent proposes a change based on a hypothesis
  3. Agent writes the modified train.py
  4. uv run train.py executes for exactly 300 seconds, then stops
  5. Agent reads the final val_bpb from output
  6. If val_bpb improved: git commit with the score in the message
  7. If val_bpb got worse: git revert, hypothesis discarded
  8. Repeat from step 1
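The accept/revert mechanics of steps 4-7 can be sketched as a small harness (hypothetical: the real agent drives this through tool calls, and the val_bpb output format here is an assumption):

```python
import re
import subprocess
from typing import Optional

def parse_val_bpb(output: str) -> Optional[float]:
    """Extract the final val_bpb score from training output (format assumed)."""
    matches = re.findall(r"val_bpb[=:\s]+([0-9]*\.[0-9]+)", output)
    return float(matches[-1]) if matches else None

def run_once(best_bpb: float, description: str) -> float:
    """One loop iteration: run train.py, keep the edit only if val_bpb improved."""
    result = subprocess.run(["uv", "run", "train.py"],
                            capture_output=True, text=True)
    score = parse_val_bpb(result.stdout)
    if score is not None and score < best_bpb:  # lower is better
        subprocess.run(["git", "commit", "-am",
                        f"experiment: {description} val_bpb={score}"])
        return score
    subprocess.run(["git", "checkout", "--", "train.py"])  # discard the change
    return best_bpb
```

The key property is that train.py in the working tree only diverges from git HEAD for the duration of one experiment: every run ends in either a commit or a checkout.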

Because each experiment is capped at exactly 300 seconds, you get roughly 12 experiments per hour regardless of hardware. What differs between GPUs is how many training steps fit inside those 300 seconds, not how many experiments run per hour; an H100 SXM5 80GB simply extracts more signal per experiment. An overnight run of about 8 hours produces roughly 100 experiments. Each accepted experiment is committed to git with the val_bpb score in the message, giving you a complete audit trail of every architectural change the agent tried and kept.
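The throughput arithmetic is worth making explicit:

```python
SECONDS_PER_EXPERIMENT = 300                              # fixed wall-clock cap
EXPERIMENTS_PER_HOUR = 3600 // SECONDS_PER_EXPERIMENT     # 12, on any GPU
OVERNIGHT_HOURS = 8
OVERNIGHT_EXPERIMENTS = OVERNIGHT_HOURS * EXPERIMENTS_PER_HOUR  # 96, i.e. "roughly 100"
```

In practice agent thinking time between runs eats into this a little, which is why real overnight totals land in the 80-100 range rather than exactly 96.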

For background on LLM training experimentation and hyperparameter choices, see our guide on how to fine-tune LLMs in 2026.

Why it needs a real GPU VM, not a container

autoresearch uses uv, a Python package manager, directly on the host system. You run uv sync, uv run prepare.py, uv run train.py. No Docker. No container runtime. No virtualization layer between your code and the CUDA drivers.

This matters for three reasons.

The 5-minute fixed time budget is your throughput ceiling. Each training experiment gets exactly 300 seconds of wall-clock time. On an H100, those 5 minutes go entirely into training. On slower hardware, the same 5 minutes produces fewer training steps, which means less signal per experiment. Any layer between your GPU and your code (a container runtime, a hypervisor, a shared CUDA context) steals from that 300-second budget. Bare metal and GPU VMs give you full hardware throughput with nothing in between.

Cloud notebook environments kill idle sessions. Colab will interrupt your session after 90 minutes if the browser tab isn't active. Kaggle kernels have similar constraints. An autonomous agent loop running for 10 hours on a Colab instance is not a realistic setup. It will die. Spheron GPU VMs give you a persistent machine with SSH access. The process runs whether or not you're watching.

Full root access and no shared GPU context. On shared GPU environments, CUDA memory isn't exclusively yours. Other jobs can consume VRAM, affecting your training memory budget unpredictably. On a dedicated GPU VM or bare metal instance, the entire GPU is yours. Your VRAM budget is deterministic. You know exactly what model configs will fit.

For the broader case for bare metal over containers in ML workflows, see how to build GPU infrastructure for AI agents in 2026.

Choosing your GPU on Spheron

The original autoresearch was developed and benchmarked on an H100 80GB, but there's no stated minimum VRAM in the README, and the project explicitly supports smaller hardware. Consumer GPUs work if you reduce DEPTH in train.py and MAX_SEQ_LEN/VOCAB_SIZE in prepare.py (rerun prepare.py after changing those values). For non-NVIDIA hardware such as Apple Silicon MacBooks, community forks exist; the main repo requires a single NVIDIA GPU.

| GPU | VRAM | Best for | On-Demand (as of Mar 2026) | Spheron Page |
| --- | --- | --- | --- | --- |
| H100 SXM5 80GB | 80GB | Full default config, max experiments/hr | from ~$2.50/hr | /gpu-rental/h100/ |
| H200 SXM5 141GB | 141GB | Larger models, even more headroom | from ~$4.54/hr | /gpu-rental/h200/ |
| A100 SXM4 80GB | 80GB | Proven alternative to H100, slightly slower | from ~$1.65/hr | /gpu-rental/a100/ |
| L40S 48GB | 48GB | Mid-range, good throughput per dollar | from ~$0.72/hr | /gpu-rental/l40s/ |
| RTX 5090 32GB | 32GB | Consumer budget option, reduce model config | ~$0.76/hr | /gpu-rental/rtx-5090/ |
| RTX 4090 24GB | 24GB | Minimum viable, requires tuned-down config | ~$0.58/hr | /gpu-rental/rtx-4090/ |

See current GPU pricing for up-to-date rates. For a detailed breakdown of which GPU fits which VRAM budget, check the GPU requirements cheat sheet for 2026.

Step-by-step setup on Spheron

1. Provision and SSH in

Log into Spheron, select an H100 SXM5 80GB instance from the GPU catalog, choose your region, and provision. You'll get SSH credentials in the dashboard. Connect:

bash
ssh root@<your-instance-ip>

Verify CUDA is available:

bash
nvidia-smi

You should see your GPU and driver version.

2. Install uv and clone autoresearch

bash
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
source ~/.bashrc  # or open a new terminal to reload PATH

# Clone the repo
git clone https://github.com/karpathy/autoresearch
cd autoresearch

# Install all Python dependencies
uv sync

uv sync reads the pyproject.toml and installs everything into a local virtual environment. This takes about 60 seconds on a fresh instance.

3. One-time data preparation

bash
uv run prepare.py

This downloads the ClimbMix dataset (karpathy/climbmix-400b-shuffle) from Hugging Face, trains a BPE tokenizer, and writes sharded binary data files. Takes about 2 minutes on a fast VM with good network. The output is cached locally. You only run this once per instance.

4. Baseline training run

Before starting the agent loop, establish your baseline val_bpb:

bash
uv run train.py

Training runs for exactly 300 seconds and stops. The final output line shows your val_bpb score. Write it down. This is the benchmark the agent needs to beat on every subsequent experiment.

5. Set up tmux for overnight runs

Before starting the agent loop, open a tmux session so your work survives a dropped SSH connection:

bash
tmux new -s autoresearch

If your connection drops overnight, reconnect and run tmux attach -t autoresearch to pick up where you left off. Without this, a network hiccup at 3 AM kills your entire overnight run.

6. Start the autonomous agent loop

Open Claude Code (or your AI coding agent of choice) and point it at the repository. Give it these instructions:

  1. Read program.md
  2. Read the current train.py
  3. Propose one hypothesis for improving val_bpb
  4. Modify train.py to implement the hypothesis
  5. Run uv run train.py and record the val_bpb result
  6. If val_bpb improved: git commit -m "experiment: <description> val_bpb=<score>"
  7. If val_bpb got worse: revert train.py to the committed version
  8. Go to step 2 and repeat indefinitely

Let it run. Go to sleep.

Customizing program.md for your research direction

program.md is the human-editable file that steers the agent's hypotheses. The default file in the repo gives the agent general guidance. You can replace it with something more specific to your goals.

Examples of what to put in program.md:

  • "Explore learning rate warmup schedules, specifically cosine vs linear. Compare each against the baseline."
  • "Try different attention mechanisms. Start with grouped query attention as a replacement for standard multi-head attention."
  • "We are on a 32GB GPU. Focus on optimizing for VRAM efficiency while maintaining or improving val_bpb."
  • "The model is already achieving good val_bpb. Focus on reducing training time per step to fit more experiments in the 300-second window."

For RTX 4090 (24GB) or RTX 5090 (32GB) users, add to program.md:

Hardware constraint: 24GB VRAM. Any change that causes OOM must be reverted immediately.
Start with: reduce DEPTH to 6 in train.py.

Then set DEPTH = 6 in train.py before starting the agent loop. If you also want to reduce sequence length or vocabulary size, edit MAX_SEQ_LEN and VOCAB_SIZE in prepare.py and rerun uv run prepare.py to regenerate the tokenizer and data shards. Those constants live in prepare.py, not train.py.
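As a concrete sketch, the reduced configuration might look like this (the constant names come from the repo's two files as described above; the MAX_SEQ_LEN and VOCAB_SIZE values are illustrative, not recommendations from the README):

```python
# --- train.py ---
DEPTH = 6            # down from the H100 default, to fit 24GB VRAM

# --- prepare.py ---
# After editing these, rerun `uv run prepare.py` to regenerate the
# tokenizer and the sharded data files.
MAX_SEQ_LEN = 1024   # illustrative value
VOCAB_SIZE = 32768   # illustrative value
```

Shrinking VOCAB_SIZE changes what the tokenizer learns, which is exactly why the data shards must be rebuilt before the next training run.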

What the agent actually changes

The agent can modify anything inside train.py:

  • GPT model architecture (number of layers, heads, embedding dimension)
  • Attention mechanism (standard multi-head, grouped query, sliding window)
  • Optimizer choice and hyperparameters (AdamW, Lion, Muon, learning rate, weight decay)
  • Learning rate schedule (cosine, linear decay, warmup steps)
  • Regularization (dropout, weight tying, gradient clipping)
  • Batch size and gradient accumulation steps
  • Sequence length and positional encoding
  • Activation functions (GELU, SiLU, ReLU variants)

The agent cannot modify prepare.py. The 300-second wall-clock limit is enforced externally and cannot be changed. The val_bpb evaluation logic is fixed.

Every accepted experiment is committed to git with the score in the commit message. You get a complete audit trail of what the agent tried, what worked, and what didn't. This is also the agent's memory across sessions. If you stop and restart, the agent can read the git log to understand what's already been explored.

Reading the overnight results

Check the git log in the morning:

bash
git log --oneline

You'll see every committed experiment with its val_bpb score. Reverted experiments don't appear because they were never committed.
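If you want the score trajectory at a glance rather than reading commits one by one, the scores can be pulled straight out of the log. This sketch assumes the commit-message format suggested earlier ("experiment: <description> val_bpb=<score>"):

```python
import re
import subprocess

def scores_from_log(log_text: str) -> list:
    """Parse val_bpb scores out of oneline commit messages, newest first."""
    return [float(m) for m in re.findall(r"val_bpb=([0-9]*\.[0-9]+)", log_text)]

def overnight_summary() -> None:
    """Print how many scored commits landed and the best score achieved."""
    log = subprocess.run(["git", "log", "--oneline"],
                         capture_output=True, text=True).stdout
    scores = scores_from_log(log)
    if scores:
        print(f"{len(scores)} scored commits, best val_bpb={min(scores):.4f}")
```

Since only improvements were committed, the minimum score in the log is also the score of the current HEAD.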

To inspect a specific experiment:

bash
git show <commit-hash>

This shows exactly what the agent changed in train.py and what the resulting score was.

Typical overnight results on an H100: 80-100 total experiments, 15-30 kept as genuine improvements. The first few commits tend to be large structural changes (different optimizer, different attention mechanism). Later commits get smaller as the low-hanging fruit gets picked and the agent explores more targeted tuning.

Compare your final val_bpb against your baseline from step 4. If you see flat results or regressions, check program.md to make sure it's actually steering the agent toward productive territory.

For more context on how autonomous AI agents interact with GPU compute, see GPU infrastructure for AI agents in 2026.


If you want 100 GPU experiments while you sleep, you need a machine that stays on, doesn't idle out, and gives you full CUDA access. A Spheron GPU VM or bare metal instance checks all three: persistent SSH, dedicated GPU, direct CUDA with no container overhead.

Provision an H100 instance if you want to run the default config out of the box. Use an RTX 4090 or RTX 5090 with a reduced model config if you want to keep costs down while still running overnight experiments. Either way, the setup takes under 10 minutes and the agent does the rest.

Explore GPU options on Spheron
