karpathy/nanochat

This section covers monitoring training progress, saving and loading checkpoints, generating samples, and running CORE evaluations during base model training. It is for users actively training Transformer models on their own hardware, and it builds on the model size and horizon configuration described in Configuring Model Size and Training Horizon. Use these features to track metrics such as training steps, time, FLOPs, and token throughput in real time; to periodically save model states for resumption or evaluation; and to benchmark against GPT-2 capability using the final CORE metric. For full post-training evaluation workflows, see Base Model Evaluation. For leaderboard participation, see Leaderboard and Optimization.

Overview

During base model training, the system continuously logs key metrics to a dashboard for real-time monitoring, saves checkpoints at configurable intervals, generates text samples to inspect model quality, and runs periodic or final evaluations using the CORE metric (an ensemble over tasks like ARC and MMLU). These tools help you assess training efficiency, detect issues early, and ensure your model reaches target capability. All monitoring integrates with external logging services, while checkpoints enable pausing/resuming runs and deploying models for chatting or further finetuning in Training Chat Models.

Real-Time Monitoring

Training progress appears in your logging dashboard (enabled via the run setting). Charts and tables update live with the following metrics:

  • step: Current optimization step.
  • total_training_time: Wall-clock time of training iterations (seconds, excluding evals/logging).
  • total_training_flops: Cumulative FLOPs consumed.
  • core_metric: Final or periodic CORE score (must exceed 0.256525 for GPT-2 equivalence).
  • Bits-per-byte (bpb) on validation data.
  • Token throughput (tok/sec) and utilization metrics.
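
As a rough illustration of how two of the metrics above are usually derived, the sketch below converts a summed cross-entropy loss into bits per byte and computes token throughput for one step. The function names and arguments are hypothetical and are not taken from the nanochat code; the bpb formula assumes the loss is summed in nats over a known number of raw text bytes.

```python
import math


def bits_per_byte(total_loss_nats: float, total_bytes: int) -> float:
    """Convert a summed cross-entropy loss (in nats) to bits per byte.

    Assumption: the loss is summed over all evaluated tokens and
    total_bytes counts the raw bytes those tokens decode to.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    # Dividing by ln(2) converts nats to bits.
    return total_loss_nats / (math.log(2) * total_bytes)


def tokens_per_second(tokens_in_step: int, step_seconds: float) -> float:
    """Token throughput for a single optimization step."""
    if step_seconds <= 0:
        raise ValueError("step_seconds must be positive")
    return tokens_in_step / step_seconds
```

Normalizing by bytes rather than tokens keeps the number comparable across tokenizers, which is presumably why bpb is reported on validation data.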

At run end, a summary table shows final values, e.g.:

| Metric | Example Value |
|--------|---------------|
| core_metric | 0.25851 |
| step | 16704 |
| total_training_flops | 4.33e+19 |
| total_training_time | 10949s (~3 hours) |
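
The example values in this summary can be sanity-checked with a little arithmetic. The sketch below is only an illustration: the dictionary layout and the `summarize` helper are assumptions, while the numbers and the 0.256525 threshold come from this page.

```python
GPT2_CORE_THRESHOLD = 0.256525  # threshold quoted in the metric list

# Example end-of-run values copied from the summary table.
summary = {
    "core_metric": 0.25851,
    "step": 16704,
    "total_training_flops": 4.33e19,
    "total_training_time": 10949,  # seconds
}


def summarize(values: dict) -> dict:
    """Derive a few convenience numbers from an end-of-run summary."""
    seconds = values["total_training_time"]
    return {
        "hours": seconds / 3600,
        "avg_flops_per_sec": values["total_training_flops"] / seconds,
        "beats_gpt2": values["core_metric"] > GPT2_CORE_THRESHOLD,
    }


report = summarize(summary)
print(f"{report['hours']:.2f} h, {report['avg_flops_per_sec']:.2e} FLOP/s, "
      f"beats GPT-2: {report['beats_gpt2']}")
```

For the example run this works out to roughly 3.04 hours and a CORE score just above the GPT-2 threshold.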

[!NOTE]
Report total_training_time (in seconds) for leaderboard submissions, along with validation bpb, which gives context for run-to-run noise.

Generating Samples

Every N steps (or never, if sampling is disabled), the model generates text samples using the current weights. These appear in the logging dashboard as examples of output quality, letting you qualitatively check coherence and capability mid-training. Samples use the configured sequence length and batch settings.
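
The "every N steps, or disabled" behavior can be pictured as a simple interval check. This is a minimal sketch with a hypothetical helper name. It assumes that a non-positive interval means sampling is disabled and that step 0 is skipped; the actual setting name and conventions may differ.

```python
def should_sample(step: int, sample_every: int) -> bool:
    """Return True when samples should be generated at this step.

    Assumptions: sample_every <= 0 disables sampling entirely, and
    no samples are generated at step 0.
    """
    if sample_every <= 0:
        return False
    return step > 0 and step % sample_every == 0


# Steps at which a run with an interval of 3 would sample, out of the first 10.
sampling_steps = [s for s in range(10) if should_sample(s, 3)]
```

The same pattern applies to any periodic action in the training loop, such as evaluations or checkpoint saves.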

Saving and Loading Checkpoints

Checkpoints capture the full model state, the optimizer state (sharded across GPUs), and metadata at specified steps. Files are saved to a directory named after your model-tag (e.g., d26_feb2_fp8_ratio8.25).
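
To make the directory layout concrete, here is a small sketch of a per-tag checkpoint directory holding step metadata. Everything in it is hypothetical: the helper names, the `meta_<step>.json` file name, and the JSON layout are illustrative assumptions, not the file patterns nanochat actually writes.

```python
import json
from pathlib import Path


def checkpoint_dir(base_dir: str, model_tag: str) -> Path:
    """Directory for one run, named after its model-tag."""
    return Path(base_dir) / model_tag


def save_metadata(base_dir: str, model_tag: str, step: int, metadata: dict) -> Path:
    """Write step metadata as JSON (the file name here is hypothetical)."""
    directory = checkpoint_dir(base_dir, model_tag)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"meta_{step:06d}.json"
    path.write_text(json.dumps({"step": step, **metadata}, indent=2))
    return path


def load_metadata(path: Path) -> dict:
    """Read metadata back, e.g. before resuming a run."""
    return json.loads(Path(path).read_text())
```

Keeping each run in a directory named after its tag means several configurations can be trained side by side without overwriting one another's files.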

Checkpoint Files

| File Pattern | Contents | Usage |
|--------------|----------|-------|
