karpathy/nanochat

This section covers configuring model size and training horizon when training base models from scratch. It is aimed at users running training jobs on GPU clusters (such as 8x H100 nodes) who want to reach targeted capabilities, such as matching GPT-2 performance on the CORE metric. These settings control the Transformer’s layer count, the total training tokens relative to model parameters, and memory-efficient batch sizing, enabling efficient scaling within hardware limits. For the full base training workflow, see Training Base Models. For progress tracking, see Monitoring and Checkpoints. All available options are detailed in Configuration Reference.

Overview

Model size is set by the number of Transformer layers (depth), which automatically scales related settings such as width, attention heads, and learning rates for balanced performance. The training horizon defines how many tokens to process, calculated as non-embedding parameters multiplied by target-param-data-ratio, a compute-optimal scaling-law target (default around 10.5). Batch size adapts to memory constraints via device-batch-size, with gradient accumulation keeping the effective total batch size constant (ideally 524,288 tokens). Together, these settings let you tune speedruns to reach GPT-2 equivalence (a CORE score of roughly 0.256 or higher) in under 4 hours.

Model Size (Depth)

The depth setting determines the number of Transformer layers, directly impacting model capacity (parameters). Higher values increase parameters for better capabilities but require more memory and compute.

  • Even depths (e.g., 24, 26) are recommended for clean scaling of dimensions like head count.
  • Parameters exclude embeddings; actual count shown in training logs.
  • Auto-scales: width, heads, initial learning rate, and other hyperparameters proportionally.

| Depth | Approx. Parameters | Notes |
| --- | --- | --- |
| 18 | ~900M | Smaller, faster training; baseline for testing. |
| 24 | ~1.4B | Targets GPT-2 with higher ratio (e.g., 12). |
| 26 | ~1.6B | GPT-2 equivalent; use lower ratio (e.g., 8.25) for undertraining efficiency. |

[!NOTE]
Start with depth 24 for most speedruns to GPT-2. Odd depths (e.g., 25) may lead to suboptimal head dimensions.
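
The preference for even depths can be illustrated with a minimal sketch. It assumes that width is derived as depth × 64 and that each attention head has dimension 128; both constants are assumptions for illustration (they are not stated on this page), and the function name `derived_dims` is hypothetical. Check the training logs or the source for the values your version actually uses.

```python
import math


def derived_dims(depth: int, aspect_ratio: int = 64, head_dim: int = 128) -> dict:
    """Illustrative only: aspect_ratio and head_dim are assumed constants."""
    width = depth * aspect_ratio  # model width grows linearly with depth
    return {
        "width": width,
        # Round up so every channel belongs to some head.
        "num_heads": math.ceil(width / head_dim),
        # True when the width splits into whole heads with no remainder.
        "divides_evenly": width % head_dim == 0,
    }


for depth in (24, 25, 26):
    print(depth, derived_dims(depth))
```

Under these assumptions, an odd depth such as 25 yields a width that is not a whole multiple of the head dimension, which is one way the "suboptimal head dimensions" mentioned in the note could arise.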

Training Horizon (Target Param-Data Ratio)

The target-param-data-ratio sets the tokens-to-parameters ratio, which fixes the total token budget (tokens = parameters × ratio) and therefore the number of training steps (steps = tokens ÷ total batch size in tokens). Lower values undertrain for faster runs; 10.5 is the compute-optimal default per scaling laws.

  • Training ends automatically when the horizon is reached.
  • Adjust lower for larger depths to hit exact capability targets like GPT-2 CORE score.

| Ratio | Training Steps (at ideal batch) | Use Case |
| --- | --- | --- |
| 8.25 | ~16,000 | Undertrain depth 26 for GPT-2 equiv. |
| 10.5 | ~20,000 | Compute-optimal default. |
| 12 | ~24,000 | Overtrain depth 24 for GPT-2 equiv. |
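
The horizon arithmetic can be sketched in a few lines. This is a rough illustration, not the training script's own code: the function name `training_horizon` and the 1B parameter count are hypothetical, and the real non-embedding parameter count should be taken from the training logs.

```python
def training_horizon(num_params: int, ratio: float,
                     total_batch_tokens: int = 524_288) -> tuple[int, int]:
    """Return (target tokens, optimizer steps) for a given param-data ratio."""
    # Token budget: non-embedding parameters times the tokens-per-parameter ratio.
    target_tokens = int(num_params * ratio)
    # Each optimizer step consumes one full batch of tokens.
    steps = target_tokens // total_batch_tokens
    return target_tokens, steps


# Hypothetical model with 1B non-embedding parameters at the default ratio.
tokens, steps = training_horizon(1_000_000_000, 10.5)
print(f"{tokens:,} tokens over {steps:,} steps")
```

With these illustrative inputs the result is close to 20,000 steps. Because the step count scales with both the parameter count and the ratio, the same ratio gives a longer run at a larger depth.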

Batch Size Adjustments

device-batch-size sets the number of sequences each GPU processes per forward/backward pass. The system targets a total batch size of 524,288 tokens (32 sequences × 2,048 tokens per sequence × 8 GPUs) via automatic gradient accumulation.

  1. Set to 32 (ideal) if memory allows.
  2. Halve to 16, 8, etc., if out-of-memory occurs; accumulation steps double accordingly (e.g., 2 steps for 16).
  3. Keep powers of 2 for clean math.

| device-batch-size | Accumulation Steps | Total Batch Size | Memory Impact |
| --- | --- | --- | --- |
| 32 | 1 | 524,288 | Highest throughput. |
| 16 | 2 | 524,288 | For larger depths (e.g., 26). |
| 8 | 4 | 524,288 | Fallback for tight memory. |
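
The relationship between device-batch-size and accumulation steps can be checked with a short sketch. The function name `grad_accum_steps` is hypothetical; the defaults (sequence length 2,048, 8 GPUs, 524,288 total tokens) come from the figures in this section.

```python
def grad_accum_steps(device_batch_size: int, seq_len: int = 2048,
                     world_size: int = 8,
                     total_batch_tokens: int = 524_288) -> int:
    """Number of micro-batches accumulated before each optimizer step."""
    # Tokens processed by all GPUs in a single forward/backward pass.
    tokens_per_pass = device_batch_size * seq_len * world_size
    if total_batch_tokens % tokens_per_pass != 0:
        raise ValueError(
            f"total batch {total_batch_tokens} is not a multiple of "
            f"{tokens_per_pass} tokens per pass"
        )
    return total_batch_tokens // tokens_per_pass


for dbs in (32, 16, 8):
    print(dbs, grad_accum_steps(dbs))
```

A value such as 24 fails the divisibility check in this sketch, which is why powers of 2 are the safe choice.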

[!WARNING]
A device-batch-size that is too high causes out-of-memory crashes. Monitor GPU memory during the initial steps and reduce the value if needed.

Step-by-Step: Launching a Configured Training Run

Follow these steps to start training with custom size and horizon:

  1. Prepare environment per Getting Started > Installation and Environment Setup.
  2. Train tokenizer if needed (see Tokenizer Training and Evaluation).
  3. Open terminal on your GPU node.
  4. Run the base training command, adding your settings:
    • Include --depth (e.g., 24).
    • Set --target-param-data-ratio (e.g., 12).
    • Adjust --device-batch-size (e.g., 16).
    • Disable extras for speedruns: --sample-every=-1, --save-every=-1, --core-metric-every=999999 (final eval only).
  5. Example for GPT-2 speedrun (depth 26):
    OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- --depth=26 --target-param-data-ratio=8.25 --device-batch-size=16 --run="my-run" --model-tag="my-model"
    
  6. Monitor via WandB (logs core_metric, total_training_time); stop manually if needed.
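
When launching several variants, it can help to assemble the command programmatically. The following is a minimal sketch: the helper name `build_base_train_cmd` is hypothetical, the flags are the ones listed in the steps above, and the use of `-m` plus a `--` separator before the script arguments is an assumption about how the module is invoked.

```python
import shlex


def build_base_train_cmd(depth: int, ratio: float, device_batch_size: int,
                         nproc: int = 8, run: str = "my-run",
                         model_tag: str = "my-model",
                         speedrun: bool = False) -> list[str]:
    """Assemble the torchrun argument list for a base training run."""
    cmd = [
        "torchrun", "--standalone", f"--nproc_per_node={nproc}",
        "-m", "scripts.base_train", "--",  # assumed module-style invocation
        f"--depth={depth}",
        f"--target-param-data-ratio={ratio}",
        f"--device-batch-size={device_batch_size}",
        f"--run={run}",
        f"--model-tag={model_tag}",
    ]
    if speedrun:
        # Skip sampling and checkpoints; evaluate CORE only at the end.
        cmd += ["--sample-every=-1", "--save-every=-1",
                "--core-metric-every=999999"]
    return cmd


cmd = build_base_train_cmd(depth=26, ratio=8.25, device_batch_size=16)
print("OMP_NUM_THREADS=1 " + shlex.join(cmd))
```

The printed line can be pasted into a terminal; the sketch only builds the string and does not start training.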
graph TB
    subgraph "Planning"
        A["Choose depth<br/>(e.g., 24 or 26)"] --> B["Pick ratio<br/>(e.g., 12 or 8.25)"]
        B --> C["Set device-batch-size<br/>(32 → 16 if OOM)"]
    end
    subgraph "Launch"
        C -->|"Run base_train<br/>with flags"| D["Training starts<br/>Auto-scales params"]
        D --> E{"Memory OK?"}
        E -->|No| F["Halve batch size<br/>Relaunch"]
        E -->|Yes| G["Train to horizon<br/>~3-4 hours"]
    end
    G --> H["Final CORE eval<br/>Check vs GPT-2"]

Troubleshooting

Common issues during configuration:

| Message | Severity | Meaning |
| --- | --- | --- |
| Out of memory during forward pass | Error | Batch too large for model depth/hardware. Reduce device-batch-size by half and relaunch. |
| CORE metric below 0.256 after run | Info | Undertrained; increase target-param-data-ratio or use smaller depth. Rerun with full eval (--core-metric-max-per-task=-1). |
| Uneven gradient accumulation steps | Warning | device-batch-size not power of 2; adjust to 16, 32, etc., for stability. |
| Training time exceeds 4 hours | Warning | High ratio or low batch; verify total_training_time in logs excludes evals. |
