Base Model Evaluation — nanochat Guide

This section covers Base Model Evaluation, the process of assessing pretrained language models using key metrics like the CORE score, validation bits per byte (val_bpb), and perplexity. It’s designed for users training base models who want to benchmark performance against the GPT-2 baseline (target CORE > 0.2565) and monitor training quality. This fits into the broader model development workflow after Training Base Models and before Training Chat Models or Chatting with Models. For full evaluation pipelines, see Model Evaluation; for leaderboard comparisons, see Leaderboard and Optimization.

Overview

Base model evaluation measures general capabilities through the CORE composite score (a mean of centered accuracies across 22 tasks in world knowledge, language understanding, commonsense reasoning, symbolic problem solving, and reading comprehension) and efficiency metrics like val_bpb (bits needed to encode validation data, lower is better) and perplexity (exponential of val_bpb times log base 2, lower is better). Users trigger evaluations during training to track progress toward GPT-2 parity, with outputs visible in logs and summaries.

Key Metrics

These metrics appear in training logs, summaries, and monitoring tools after evaluation runs.

Metric	Description	GPT-2 Target / Example
CORE	Single score from 22 few-shot tasks; accounts for random baselines via centering. Higher is better.	> 0.2565
val_bpb	Validation loss in bits per byte; indicates data compression efficiency. Lower is better.	~ 0.748 (e.g., 0.74833)
Perplexity	Prediction uncertainty measure; derived from val_bpb. Lower is better.	Derived (e.g., ~ 1.68)

[!NOTE]
CORE can be noisy due to task variance; cross-check with val_bpb for stable progress signals.

Configuring Evaluation

Control evaluation frequency and scope via training settings to balance compute cost and monitoring needs. Full CORE runs all 22 tasks; partial limits tasks for speed.

Setting	Default	Options	What It Controls
CORE Metric Every	999999	Positive integer (steps) or -1 (off)	Steps between full CORE evaluations; high values run only at end.
CORE Metric Max Per Task	-1	Positive integer or -1 (full)	Max examples per task; limits for faster partial evals.
Sample Every	Varies	Positive integer (steps) or -1 (off)	Frequency of generation samples for qualitative checks.
Save Every	Varies	Positive integer (steps) or -1 (off)	Checkpoint frequency; aligns with evals for recovery.

Running Base Model Evaluation

Evaluations integrate into base model training workflows. Trigger them periodically or once at the end for final benchmarking.

Start a training run with evaluation enabled (e.g., set CORE Metric Every to a step count like 3000 for periodic checks or 999999 for end-only).
Monitor live logs for intermediate val_bpb and perplexity after each validation batch.
At evaluation steps, watch for CORE computation progress (task-by-task if partial).

Review final summary for key metrics:

core_metric *0.25851*
step *16704*
total_training_flops *4.33e+19*
total_training_time *10949s*

Compare CORE against 0.2565; note val_bpb for consistency.
Save or load checkpoints post-eval for resumption.

graph TB
  subgraph "Training Run"
    Start["Start Training<br/>with Eval Config"] --> Monitor["Monitor Logs<br/>val_bpb / Perplexity"]
    Monitor --> EvalTrigger{"CORE Metric Every<br/>reached?"}
    EvalTrigger -->|Yes| ComputeCORE["Run CORE<br/>22 Tasks or Partial"]
    ComputeCORE --> Summary["Log Summary<br/>CORE, val_bpb, Time"]
    EvalTrigger -->|No| Continue["Next Step"]
    Continue --> Monitor
  end
  Summary --> Decide["Compare to GPT-2<br/>CORE > 0.2565?"]
  Decide -->|Yes| Success["Report / Leaderboard"]
  Decide -->|No| Adjust["Tune Config<br/>e.g., Ratio, Depth"]

[!WARNING]
Full CORE evals are compute-intensive; use high CORE Metric Every values or partial (CORE Metric Max Per Task >0) during early training to avoid slowdowns.

Troubleshooting

Common issues and messages from logs or summaries.

Message	Severity	Meaning
“CORE metric computation started”	Info	Full/partial eval running; expect delay proportional to tasks. Wait or check progress.
CORE score variance across runs (e.g., 0.25-0.26)	Warning	Normal noise; rerun or average multiple evals. Verify with stable val_bpb.
OOM during CORE eval	Error	Insufficient GPU memory for tasks; reduce CORE Metric Max Per Task, batch size, or use CPU fallback.
val_bpb not improving	Warning	Stalled training; check data quality, learning rate, or increase training horizon. See Training Base Models.

Summary

Achieve GPT-2 parity with CORE > 0.2565, validated by low val_bpb (~ 0.748) and perplexity.
Configure CORE Metric Every and CORE Metric Max Per Task for efficient monitoring during Training Base Models.
Use final summaries for leaderboard checks: Leaderboard and Optimization.
For chat-tuned evals, proceed to 6.2. Chat Model Evaluation; for usage, see Chatting with Models.

Generated by ESX Wiki