This section covers Model Evaluation, the step where you benchmark a trained base or chat model objectively. It is aimed at end users assessing model quality after training and comparing results against baselines such as GPT-2 or leaderboard entries. Evaluations produce standardized scores, such as the CORE metric for base models or pass@1 accuracy for chat tasks, which you can use to validate training results or prepare a submission (see Leaderboard and Optimization). For preparing models to evaluate, see Training Base Models and Training Chat Models. After evaluation, try interacting via Chatting with Models.
Overview
Model evaluation lets you measure your base models on next-token prediction efficiency (bits per byte) and in-context learning (CORE benchmark), or your chat models on reasoning benchmarks like math, coding, and multiple-choice QA. Run evaluations from the command line using multi-GPU setups for speed or single-device for quick checks. Results appear directly in the console as accuracies, centered scores, and summaries, with options to limit scope for faster runs.
Base Model Evaluation
Use base model evaluation to test raw language modeling capabilities without chat-specific finetuning. It supports three modes: CORE (in-context learning accuracy across 22 diverse tasks), BPB (bits per byte on validation data), and sample (text generation samples). Launch with a distributed command for GPUs or single-device for testing.
Running Base Model Evaluation
- Open your terminal in the project root.
- For a trained nanochat model: Run `torchrun --nproc_per_node=* -m scripts.base_eval --model-tag <your-model-tag>` (replace `*` with your GPU count; the tag is, e.g., `d24`).
- For a Hugging Face model: Add `--hf-path <path>` (e.g., `openai-community/gpt2`).
- Watch console output: It shows per-task progress like “Evaluating: task (0-shot, type: multiple_choice)… accuracy: 0.XXXX centered: 0.XXXX time: X.XXs”, followed by the final CORE metric, BPB values, and samples.
- Results aggregate automatically: CORE score (average centered accuracy), train/val BPB, and generated text snippets.
> [!NOTE]
> First run downloads the eval bundle automatically if missing; expect a one-time ~GB download.
CORE Evaluation Details
The CORE metric tests in-context learning by prompting the model with few-shot examples and measuring accuracy on held-out items across categories like multiple-choice QA, schema matching, and language modeling. Scores are centered using (accuracy - 0.01 * random_baseline) / (1.0 - 0.01 * random_baseline) to normalize against chance, then averaged.
| Task Category | Example Tasks | Few-Shot Levels | Scoring |
|---|---|---|---|
| Multiple Choice | PIQA, WinoGrande, ARC-Easy | 0-shot to 32-shot | Argmax over choices at continuation position |
| Schema | BoolQ, COPA, RTE | 0-shot to 32-shot | Prefix match on correct context |
| Language Modeling | LAMBADA, PIQA continuation | 0-shot to 32-shot | Next-token prediction after prompt |
Full results table prints per task with raw accuracy and centered value.
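The centering step described in this section can be sketched in a few lines of Python. This is a minimal illustration of the formula as stated here, not the code in `scripts.base_eval`; the function names and the example task entries are hypothetical. Note that the random baseline is expressed as a percentage, which is why the formula scales it by 0.01.

```python
def centered_accuracy(accuracy: float, random_baseline_pct: float) -> float:
    """Rescale accuracy so that chance maps to 0.0 and a perfect score to 1.0.

    random_baseline_pct is a percentage (e.g. 25 for 4-way multiple choice).
    """
    baseline = 0.01 * random_baseline_pct
    return (accuracy - baseline) / (1.0 - baseline)


def core_metric(results: dict) -> float:
    """Average the centered accuracies over all tasks."""
    centered = [centered_accuracy(acc, base) for acc, base in results.values()]
    return sum(centered) / len(centered)


# Hypothetical per-task results: (raw accuracy, random baseline in percent).
example = {
    "four_way_choice": (0.40, 25.0),
    "binary_choice": (0.75, 50.0),
    "open_completion": (0.30, 0.0),
}
print(core_metric(example))  # roughly 0.333
```

A model that scores exactly at chance on a task contributes 0.0 to the average, so tasks with different numbers of choices become comparable.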
BPB and Sample Modes
- BPB: Measures compression efficiency as bits per byte on train/validation splits (lower is better, e.g., GPT-2 ~1.4).
- Sample: Generates short text completions from prompts, printed to console for qualitative review.
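As a rough illustration of what bits per byte measures, the sketch below converts summed per-token cross-entropy (in nats) into bits and divides by the number of bytes those tokens cover. It follows the standard definition; how `scripts.base_eval` treats special tokens or masking is not covered in this section, and the function name and inputs are assumptions for illustration.

```python
import math


def bits_per_byte(token_losses_nats: list, token_byte_lengths: list) -> float:
    """Total loss in bits divided by total bytes of the evaluated text."""
    if len(token_losses_nats) != len(token_byte_lengths):
        raise ValueError("need one byte length per token loss")
    total_bytes = sum(token_byte_lengths)
    if total_bytes == 0:
        raise ValueError("no bytes to normalize by")
    total_nats = sum(token_losses_nats)
    # Dividing by ln(2) converts nats to bits.
    return total_nats / (math.log(2) * total_bytes)


# Three hypothetical tokens of 4 bytes each, with a loss of 4 bits per token.
losses = [4 * math.log(2)] * 3
print(bits_per_byte(losses, [4, 4, 4]))  # about 1.0
```

Normalizing by bytes rather than tokens keeps the number comparable across models with different tokenizers.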
Chat Model Evaluation
Chat model evaluation benchmarks instruction-following and reasoning on standard tasks like coding (HumanEval), math (GSM8K), and knowledge (MMLU). Tasks are either generative (sample completions, check pass criteria) or categorical (predict best choice from logits). Results show pass rates like “num_passed/total (XX.XX%)”.
Running Chat Model Evaluation
- Open your terminal.
- List tasks with `-h`, or run directly: `torchrun --nproc_per_node=* -m scripts.chat_eval -- -a <task-name>` (e.g., `ARC-Easy`).
- Progress shows live: “Rank X | passed/total (XX.XX%)”.
- Final summary prints aggregated accuracy across devices.
| Task | Type | Description | Input | Output | Metric |
|---|---|---|---|---|---|
| ARC-Easy / ARC-Challenge | Categorical | Grade-school science questions (easy/hard subsets) | Prompt with question + choices A-D | Predicted letter (A/B/C/D) | % correct choices |
| GSM8K | Generative | Grade-school math word problems | “Solve: problem” | Step-by-step solution + final numeric answer | % exact-match answers (multiple samples) |
| MMLU | Categorical | Multi-task knowledge (57 subjects) | Question + choices A-D | Predicted letter | % correct (test split) |
| HumanEval | Generative | Python coding problems | “Write function: docstring” | Executable code | % pass@1 (functional tests) |
| SpellingBee | Generative | Spell pangrams from letters | “Letters: set, make words” | List of valid words | % valid spellings (size=256) |
Generative vs. Categorical Workflows
graph TB
    subgraph "Load Phase"
        Load["Load Chat Model<br/>Checkpoint via --model-tag"]
    end
    subgraph "Eval Phase"
        Prompt["Render Prompt<br/>for Task (e.g., ARC question + A/B/C/D)"]
        GenCat{Type?}
        Gen["Sample *N* Completions<br/>(temp=*t*, top_k=*k*, max_tokens=*X*)"]
        Cat["Batch Prompts<br/>Predict Logits @ Answer Pos"]
        EvalG["Check Criteria<br/>(e.g., exact answer match)"]
        EvalC["Argmax over Choices<br/>(e.g., best letter)"]
    end
    Load --> Prompt
    Prompt --> GenCat
    GenCat -->|Generative| Gen
    GenCat -->|Categorical| Cat
    Gen --> EvalG
    Cat --> EvalC
    EvalG --> Score["Aggregate Pass Rate<br/>Across Problems & Devices"]
    EvalC --> Score
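The two scoring paths in the workflow above can be sketched as plain Python. This is an illustration under stated assumptions, not the code in `scripts.chat_eval`: the function names are hypothetical, the choice logits are passed in as a simple dictionary, and a generative problem is assumed to pass if any of its sampled completions meets the criterion.

```python
def categorical_correct(choice_logits: dict, answer: str) -> bool:
    """Pick the choice letter with the highest logit and compare to the answer."""
    predicted = max(choice_logits, key=choice_logits.get)
    return predicted == answer


def generative_passed(completions: list, check) -> bool:
    """Assumption: a problem passes if any sampled completion satisfies check."""
    return any(check(text) for text in completions)


def aggregate_across_ranks(per_rank: list) -> float:
    """Combine (passed, total) counts from every device into one pass rate."""
    passed = sum(p for p, _ in per_rank)
    total = sum(t for _, t in per_rank)
    return passed / total


# Categorical: logits at the answer position, restricted to the letters.
print(categorical_correct({"A": 0.1, "B": 2.3, "C": -1.0, "D": 0.4}, "B"))  # True

# Generative: exact match on a final numeric answer (hypothetical check).
samples = ["The answer is 41", "The answer is 42"]
print(generative_passed(samples, lambda text: text.endswith("42")))  # True

# Two devices reporting 3/4 and 1/4 problems passed.
print(aggregate_across_ranks([(3, 4), (1, 4)]))  # 0.5
```

Summing raw counts before dividing, rather than averaging per-device percentages, keeps the result correct when devices process different numbers of problems.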
Configuration Options
Customize evaluations with these command-line flags (shared where applicable).
Base Model Flags
| Setting | Default | Accepted Values | What It Controls |
|---|---|---|---|
| `--eval` | core,bpb,sample | Comma-separated: core, bpb, sample | Modes to run (mix/match) |
| `--hf-path` | None | HuggingFace path (e.g., openai-community/gpt2) | Load external model instead of local |
| `--model-tag` | None | Tag (e.g., d24) | Local checkpoint directory |
| `--step` | Last | Integer step | Specific checkpoint version |
| `--max-per-task` | All (-1) | Positive integer | Limit examples per CORE task (for speed) |
| `--device-batch-size` | 32 | Positive integer | Batch size per device (BPB) |
| `--split-tokens` | 20M | Positive integer | Tokens per train/val split (BPB) |
| `--device-type` | Autodetect | cuda, cpu, mps | Target hardware |
Chat Model Flags
| Setting | Default | Accepted Values | What It Controls |
|---|---|---|---|
| `-a` / `--task` | None (required) | ARC-Easy, ARC-Challenge, GSM8K, MMLU, HumanEval, SpellingBee | Benchmark to run |
| `--num-samples` | 1 | Positive integer | Completions per generative problem |
| `--max-new-tokens` | 512 | Positive integer | Generation length limit |
| `--temperature` | 0.0 | Float [0.0-1.0+] | Sampling randomness |
| `--top_k` | 50 | Positive integer | Top-K sampling |
| `--batch-size` | 1 | Positive integer | Categorical batch size |
| `--max-problems` | All | Positive integer | Limit problems (for speed) |
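To show how the flags in this section combine, here is a small Python sketch that assembles the command lines as argument lists. The helper names are hypothetical; the flag spellings are copied from this section and are not checked against the scripts. The `--` separator in the chat command mirrors the example under Running Chat Model Evaluation.

```python
import shlex
from typing import Optional

ALLOWED_MODES = {"core", "bpb", "sample"}


def base_eval_command(num_gpus: int, model_tag: str, eval_modes: list,
                      max_per_task: Optional[int] = None) -> list:
    """Build a base-model evaluation command from the flags listed above."""
    unknown = set(eval_modes) - ALLOWED_MODES
    if unknown:
        raise ValueError(f"unknown eval modes: {sorted(unknown)}")
    cmd = ["torchrun", f"--nproc_per_node={num_gpus}", "-m", "scripts.base_eval",
           "--model-tag", model_tag, "--eval", ",".join(eval_modes)]
    if max_per_task is not None:
        cmd += ["--max-per-task", str(max_per_task)]
    return cmd


def chat_eval_command(num_gpus: int, task: str,
                      max_problems: Optional[int] = None) -> list:
    """Build a chat-model evaluation command for a single task."""
    cmd = ["torchrun", f"--nproc_per_node={num_gpus}", "-m", "scripts.chat_eval",
           "--", "-a", task]
    if max_problems is not None:
        cmd += ["--max-problems", str(max_problems)]
    return cmd


# Print the commands instead of running them, so they can be reviewed first.
print(shlex.join(base_eval_command(8, "d24", ["core", "bpb"], max_per_task=100)))
print(shlex.join(chat_eval_command(8, "ARC-Easy", max_problems=50)))
```

Building the command as a list rather than a single string avoids shell quoting problems if you later pass it to `subprocess.run`.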
Troubleshooting
Common issues and console messages:
| Message | Severity | Meaning |
|---|---|---|
| “Eval bundle not found—downloading…” | Info | Automatic download starting; ensure internet access. |
| “accuracy: X.XXXX \| centered: X.XXXX” | Info | Per-task result; a low centered score indicates weak in-context learning. See Monitoring and Checkpoints to review training. |
| “Final: X/Y (Z.ZZ%)” | Info | Aggregated score; compare to baselines (e.g., GPT-2 CORE ~0.2). |
| “Invalid eval mode” | Error | Unknown `--eval` value; use only core, bpb, sample. |
| “Rank X \| passed/total” stuck | Warning | Slow progress; reduce `--max-problems` or use more GPUs via torchrun. |
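If you capture the console output to a file, progress lines can be checked programmatically. The sketch below assumes the progress format quoted in this section (“Rank X | passed/total (XX.XX%)”); the real output may differ in spacing or wording, so treat the pattern and the helper names as hypothetical.

```python
import re
from typing import Optional

# Pattern based on the progress message quoted in this section (an assumption).
PROGRESS_RE = re.compile(r"Rank (\d+) \| (\d+)/(\d+) \((\d+(?:\.\d+)?)%\)")


def parse_progress(line: str) -> Optional[tuple]:
    """Return (rank, passed, total) for a progress line, or None otherwise."""
    match = PROGRESS_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2)), int(match.group(3))


def latest_per_rank(lines: list) -> dict:
    """Keep the most recent (passed, total) seen for each rank."""
    latest = {}
    for line in lines:
        parsed = parse_progress(line)
        if parsed is not None:
            rank, passed, total = parsed
            latest[rank] = (passed, total)
    return latest


log = [
    "Rank 0 | 3/10 (30.00%)",
    "Rank 1 | 4/10 (40.00%)",
    "Rank 0 | 7/20 (35.00%)",
    "some unrelated line",
]
print(latest_per_rank(log))  # {0: (7, 20), 1: (4, 10)}
```

A rank whose total stops increasing between two snapshots of the log is a reasonable signal that the run has stalled rather than merely slowed.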
Summary
- Evaluate base models with CORE (centered ICL accuracy), BPB (efficiency), and samples via `scripts.base_eval`.
- Test chat models on tasks like ARC, GSM8K, MMLU via `scripts.chat_eval -a <task>`, with generative/categorical handling.
- Use flags like `--max-per-task` for quick runs; results print to console for easy comparison.
- Relates to Training Base Models (checkpoints), Training Chat Models (SFT), and Leaderboard and Optimization (submissions). For interaction post-eval, see Chatting with Models.