Run Karpathy's autoresearch continuously to optimize a model
Research Log
Autonomous GPT pretraining research using Karpathy's autoresearch framework. We fork belindamo/autoresearch on branch autoresearch/sundial, modify train.py hyperparameters and architecture, and run 5-minute training experiments on Modal A100-80GB GPUs. The goal is to minimize val_bpb (validation bits per byte). Current best: 1.0387 (down from 1.1186 baseline).
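For context, val_bpb follows directly from mean cross-entropy loss. A minimal sketch of the conversion, assuming the evaluator tracks nats per token and the total raw byte count (the function and argument names are illustrative, not taken from train.py):

```python
import math

def bits_per_byte(mean_ce_nats: float, tokens: int, num_bytes: int) -> float:
    """Convert mean cross-entropy (nats/token) to bits per byte.

    Total bits = mean_ce_nats * tokens / ln(2); divide by raw bytes.
    """
    return mean_ce_nats * tokens / math.log(2) / num_bytes

# At ~1 token per byte, 0.72 nats/token corresponds to ~1.0387 bits/byte,
# i.e. the best result in this log.
print(bits_per_byte(0.72, 1_000_000, 1_000_000))
```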
Sessions
- Session 1 (2026-03-12): Set up fork, Modal runner, and ran 4 experiments — baseline (1.1186), higher matrix LR (1.0776 ✓), deeper model (1.1471 ✗), smaller batch size (1.0591 ✓). Full log
- Session 2 (2026-03-12): Halved batch size again (2^18→2^17, device batch size 64) — 1357 steps, val_bpb=1.0472 ✓ new best. Full log
- Session 3 (2026-03-12): Halved batch further (2^17→2^16, device batch size 32) — 2557 steps but val_bpb=1.0584 ✗ regressed, gradient noise too high. Full log
- Session 4 (2026-03-13): Increased embedding LR (0.6→1.0) — val_bpb=1.0458 ✓ small improvement, zero complexity cost. Full log
- Session 5 (2026-03-13): Compound: added 2% warmup + matrix LR 0.06→0.08 — val_bpb=1.0437 ✓ new best, improved throughput. Full log
- Session 6 (2026-03-13): Reduced warmdown ratio 0.67→0.5 — val_bpb=1.0557 ✗ regressed, long warmdown is important for convergence. Full log
- Session 7 (2026-03-13): Reduced weight decay 0.2→0.1 — val_bpb=1.0420 ✓ new best, less regularization helps in short training runs. Full log
- Session 8 (2026-03-13): Reduced weight decay 0.1→0.05 — val_bpb=1.0395 ✓ new best, continuing the trend of less regularization for short runs. Full log
- Session 9 (2026-03-13): Reduced weight decay 0.05→0.025 — val_bpb=1.0437 ✗ regressed, 0.05 is the sweet spot; further reduction is counterproductive. Full log
- Session 10 (2026-03-13): Increased logit softcap 15→30 — val_bpb=1.0440 ✗ regressed, wider logit range hurts; softcap=15 provides beneficial regularization. Full log
- Session 11 (2026-03-13): Increased warmdown ratio 0.67→0.75 — val_bpb=1.0401 ✗ regressed, 0.67 is optimal (both higher and lower hurt). Full log
- Session 12 (2026-03-13): Increased scalar LR 0.5→0.8 — val_bpb=1.0401 ✗ regressed, scalar LR=0.5 is already well-tuned for per-layer gating. Full log
- Session 13 (2026-03-13): Increased matrix LR 0.08→0.1 — val_bpb=1.0413 ✗ regressed, Muon optimizer LR=0.08 is optimal; higher causes overshooting. Full log
- Session 14 (2026-03-14): Increased unembedding LR 0.004→0.008 — val_bpb=1.0468 ✗ regressed, lm_head is sensitive to LR; 0.004 is well-calibrated. Full log
- Session 15 (2026-03-14): Increased warmup ratio 0.02→0.05 — val_bpb=1.0387 ✓ new best, longer warmup stabilizes early training for Muon optimizer. Full log
- Session 16 (2026-03-14): Increased warmup ratio 0.05→0.08 — val_bpb=1.0500 ✗ regressed, too much warmup eats into effective training time; 0.05 is the sweet spot. Full log
- Session 17 (2026-03-14): Lowered softcap 15→10 — val_bpb=1.0408 ✗ regressed, tighter logit clamping too restrictive; softcap=15 is optimal (both 10 and 30 are worse). Full log
- Session 18 (2026-03-14): Increased embedding LR 1.0→1.5 — val_bpb=1.0475 ✗ regressed, embeddings overshoot at higher LR; 1.0 is optimal (0.6 < 1.0 > 1.5). Full log
- Session 19 (2026-03-14): Lowered matrix LR 0.08→0.07 — val_bpb=1.0473 ✗ regressed, Muon LR=0.08 confirmed optimal from both sides (0.07 < 0.08 > 0.1). Full log
- Session 20 (2026-03-14): Increased FINAL_LR_FRAC 0.0→0.01 — val_bpb=1.0472 ✗ regressed, full annealing to zero is important for warmdown convergence. Full log
- Session 21 (2026-03-14): Increased Adam beta1 0.8→0.9 — val_bpb=1.0517 ✗ regressed, higher momentum too sluggish for short training; beta1=0.8 is well-calibrated. Full log
- Session 22 (2026-03-14): Increased x0_lambda init 0.1→0.15 — val_bpb=1.0484 ✗ regressed, stronger skip connections dilute residual stream; init=0.1 is well-calibrated. Full log
- Session 23 (2026-03-14): Reduced Muon momentum target 0.95→0.90 — val_bpb=1.0470 ✗ regressed vs 1.0387 best, though at a same-throughput comparison it's neutral. Also reverted unlogged beta2/x0_lambda commits. Notable GPU variance: session 15's best (1.0387) was at 1560 steps/16.35% MFU vs typical ~1340 steps/14% MFU. Full log
- Session 24 (2026-03-14): Reduced warmdown ratio 0.67→0.60 — val_bpb=1.0491 ✗ regressed, completing the warmdown map: 0.50→1.0557, 0.60→1.0491, 0.67→best, 0.75→1.0401. Full log
- Session 25 (2026-03-15): Increased depth 8→10 (85.9M params, 811 steps) — val_bpb=1.0610 ✗ regressed, larger model gets too few training steps in 5 min to converge; depth 8 is optimal for this time budget. Full log
- Session 26 (2026-03-15): Reduced x0_lambda init 0.1→0.05 — val_bpb=1.0487 ✗ regressed, completing the map: 0.05→1.0487, 0.10→best, 0.15→1.0484; x0_lambda=0.1 confirmed optimal. Full log
- Session 27 (2026-03-15): Increased RoPE base frequency 10000→50000 — val_bpb=1.0405 ✗ marginal regression (0.0018), near-neutral; model insensitive to RoPE base at seqlen=2048. Full log
- Session 28 (2026-03-15): Muon ns_steps 5→6 (no-op: only 5 coefficients exist) — val_bpb=1.0492, effectively a baseline re-run confirming ~0.01 GPU variance between allocations. Full log
- Session 29 (2026-03-15): Window pattern SSSL→SSL (more full-context layers, 25%→37.5%) — val_bpb=1.0492 ✗ regressed, neutral throughput but no quality gain; SSSL is optimal at seqlen=2048. Full log
- Session 30 (2026-03-15): Increased ASPECT_RATIO 64→72 (wider model 512→640 dim, 70.8M params) — val_bpb=1.0521 ✗ regressed, 25% fewer steps (999 vs 1340); model too large for 5-min budget. Full log
- Session 31 (2026-03-15): Six experiments: Adam beta1=0.7, HEAD_DIM=64, scalar_lr=0.3, cosine warmdown, SiLU activation, Muon beta2=0.99 — all regressed. Model deeply optimized for this time budget. Full log
- Session 32 (2026-03-16): Depth 9 at same width 512 (1.054 ✗), weight tying (4.68 ✗ catastrophic — dual LR incompatible), z-loss regularization (1.058 ✗ redundant with softcap), WD=0.04 (infra timeout). Best val_bpb unchanged at 1.0387. Full log
- Session 33 (2026-03-16): Six experiments: SwiGLU MLP (1.068 ✗), GQA n_kv_head=2 (1.044 ✗ but 20% more steps), WD=0.04 (1.047 ✗), EMA=0.98 (1.050 ✗), torch.compile reduce-overhead (1.045 ✗ but 12% throughput gain), Muon momentum start=0.90 (1.052 ✗). Architecture deeply optimized; throughput gains don't compensate for capacity loss. Full log
- Session 34 (2026-03-17): Five experiments: label smoothing 0.05 (1.226 ✗ catastrophic with softcap), gradient clipping max_norm=1.0 (1.047 ✗ redundant with Muon), Muon momentum warmup 300→100 (1.047 ✗), scheduled EMA 0.95→0.999 (1.044 ✗ on fast GPU), VE gate channels 32→64 (1.047 ✗). All dimensions well-explored; GPU variance accounts for most variation. Full log
- Session 35 (2026-03-17): Six experiments: EMA decay 0.995 (1.064 ✗ massive regression, averaging window too large), Muon momentum target 0.975 (1.051 ✗ overshoots), MLP expansion 3x (1.051 ✗ same steps, less capacity), Muon beta2 0.90 (1.052 ✗), random seed 137 (1.054 ✗ confirms GPU variance dominates), all-long attention (1.049 ✗ despite fastest GPU at 17.3% MFU). Model deeply optimized at 1.0387. Full log
- Session 36 (2026-03-18): Six experiments: constant WD (1.046 ✗ interferes with warmdown convergence), DEVICE_BATCH_SIZE=32 (1.107 ✗ catastrophic, grad_accum overhead kills throughput), parallel attention+MLP (1.065 ✗ MLP needs attention output), Adam beta2=0.90 (1.055 ✗ too noisy), QK rescaling q*sqrt(d) (1.087 ✗ breaks distributed attention), standard WD no cautious mask (1.054 ✗ over-regularizes). 30+ consecutive regressions; model at compute-optimal frontier. Full log
- Session 37 (2026-03-18): Six experiments: WARMUP_RATIO=0.04 (1.044 ✗ completing warmup map: 0.04 < 0.05 > 0.06 > 0.08), WARMUP_RATIO=0.06 (1.048 ✗), RoPE base 5000 (1.054 ✗ completing map: 5000 < 10000 > 50000), per-layer x0_lambda decay (1.045 ✗ neutral/complex), torch.compile max-autotune (TIMEOUT ✗), VE last-2-layers only (1.049 ✗ early VE matters). 36+ consecutive regressions. Full log
- Session 38 (2026-03-18): Six experiments: depth=7 (1.052 ✗ less capacity outweighs more steps; depth map: 7<8>9>10), Muon momentum warmup 500 steps (1.046 ✗ neutral; map: 100<300>500), EMA disabled (1.042 ✗ but PROMISING — beat expected baseline at same GPU speed, raw weights after full warmdown may be better than EMA), Muon momentum start 0.80 (1.054 ✗; map: 0.80<0.85>0.90), matrix LR 0.09 (1.054 ✗ slow GPU, ambiguous), sqrt warmdown (1.051 ✗ LR stays too high too long). 42+ consecutive regressions. Full log
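Across sessions 5, 6, 10, 15, 16, 19, 20, and 24 the schedule converged on warmup ratio 0.05, warmdown ratio 0.67, and full annealing to zero. A minimal sketch of that trapezoidal LR multiplier, under those tuned settings (function and argument names are illustrative, not taken from train.py):

```python
def lr_multiplier(step: int, total_steps: int,
                  warmup_ratio: float = 0.05,    # session 15's best
                  warmdown_ratio: float = 0.67,  # optimal from both sides
                  final_lr_frac: float = 0.0):   # session 20: anneal fully to zero
    """Trapezoid: linear warmup, constant plateau, linear warmdown."""
    warmup_steps = int(warmup_ratio * total_steps)
    warmdown_steps = int(warmdown_ratio * total_steps)
    if step < warmup_steps:
        # ramp from 1/warmup_steps up to 1.0
        return (step + 1) / warmup_steps
    if step > total_steps - warmdown_steps:
        # linear decay toward final_lr_frac at the last step
        frac = (total_steps - step) / warmdown_steps
        return final_lr_frac + (1.0 - final_lr_frac) * frac
    return 1.0
```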
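Sessions 10, 17, and 34 all interact with logit softcapping. The standard mechanism is a tanh squash of the logits, sketched below with the tuned cap of 15 (the helper name is illustrative; the training code may apply this tensor-wide rather than per-scalar):

```python
import math

def softcap(logit: float, cap: float = 15.0) -> float:
    """Smoothly bound a logit to (-cap, cap) via tanh.

    Near-identity for small logits, saturating at +/-cap for large ones.
    The log above found cap=15 beats both 10 (too restrictive) and 30
    (too loose), and helps explain why label smoothing stacked badly
    on top of it in session 34.
    """
    return cap * math.tanh(logit / cap)
```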
Project totals
- Runtime: 38h 28m
- Cost: $173.15
- Files: 84
- Sessions: 0