Chat Model Evaluation — nanochat Guide

This section covers evaluating chat-tuned models on instruction-following benchmarks (math problems, coding tasks, and science question answering), run either as task mixtures or as sequential prompts. It is intended for users who have completed supervised finetuning (SFT) in Training Chat Models and want objective metrics on chat capabilities beyond the base model evaluations in 6.1. Base Model Evaluation. The results let you compare against leaderboards or prior runs and decide what to iterate on before testing in Chatting with Models. For full leaderboard submission, see 10. Leaderboard and Optimization.

Overview

Chat model evaluation runs your finetuned model through standardized benchmarks to measure performance on chat-style tasks. It supports task mixtures (batched prompts across multiple benchmarks) and sequences (chained interactions per task), and reports centered accuracy scores, that is, accuracies rescaled against each task's random baseline. You’ll see progress updates in the terminal, final aggregated scores, and optional logging to external tools. This step fits after SFT and gives quick feedback on improvements from training data mixtures that include sources such as MMLU and GSM8K.

Supported Benchmarks

Evaluations cover key chat domains with few-shot or zero-shot prompting.

Benchmark	Category	Description	Prompt Style	Typical Output
GSM8K	Math	Grade-school math word problems requiring step-by-step reasoning.	Few-shot chain-of-thought	Pass@1 accuracy (e.g., 87.2%)
HumanEval	Coding	Generate functional Python code from docstring descriptions.	Zero-shot	Pass@1 accuracy (e.g., 65.4%)
MMLU (subset)	Science QA	Multiple-choice questions across STEM subjects such as biology and physics.	Few-shot	Accuracy (e.g., 52.1%)
GPQA (diamond)	Science QA	Graduate-level questions in biology, chemistry, physics.	Zero-shot	Accuracy (e.g., 28.9%)

[!NOTE]
Benchmarks use centered accuracy: (model accuracy - random baseline) / (1 - random baseline). A centered score of 0 means chance-level performance and 1 means a perfect score. The composite metric is the average of the centered accuracies across tasks.
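
As a minimal sketch of the centered accuracy formula in the note above, the following Python computes the centered score per task and averages the results into a composite. The function names and the baseline values (0.25 for four-option multiple choice, 0.0 for free-form generation) are illustrative assumptions, not values taken from the evaluation code.

```python
def centered_accuracy(accuracy: float, baseline: float) -> float:
    """Rescale accuracy so the random baseline maps to 0 and a perfect score to 1."""
    if not 0.0 <= baseline < 1.0:
        raise ValueError("baseline must be in [0, 1)")
    return (accuracy - baseline) / (1.0 - baseline)


def composite_score(results: dict[str, tuple[float, float]]) -> float:
    """Unweighted mean of centered accuracies; results maps task -> (accuracy, baseline)."""
    centered = [centered_accuracy(acc, base) for acc, base in results.values()]
    return sum(centered) / len(centered)


# Assumed baselines: 0.25 for 4-way multiple choice, 0.0 for open-ended generation.
example = {
    "MMLU": (0.521, 0.25),
    "GSM8K": (0.872, 0.0),
}
for task, (acc, base) in example.items():
    print(f"{task}: raw={acc:.3f} centered={centered_accuracy(acc, base):.3f}")
print(f"composite: {composite_score(example):.3f}")
```

Note that with a non-negative baseline the centered score can never exceed the raw accuracy, which is a quick sanity check when reading reported numbers.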

Running an Evaluation

Ensure your chat model checkpoint is saved from SFT (e.g., in the model-tag directory).
Launch the evaluation from the command line, specifying your checkpoint and options.
Monitor terminal output for per-task progress (e.g., “Evaluating GSM8K: 50/100 samples complete”).
Review final scores printed at the end, such as “Composite Chat Score: 0.623”.

graph TB
    subgraph "Preparation"
        A["Locate SFT checkpoint<br/>from [Training Chat Models](training-chat-models.md)"] --> B["Select benchmarks<br/>(e.g., math, coding)"]
    end
    subgraph "Evaluation Run"
        B -->|"starts process"| C["Load model<br/>Warm up cache"]
        C -->|"runs task mixture"| D{"Process samples<br/>per benchmark"}
        D -->|"few-shot prompts"| E["Generate responses<br/>Score automatically"]
    end
    E --> F["Display scores<br/>Terminal + logs"]
    F -->|"optional: save JSON"| G["Compare vs baseline<br/>e.g., GPT-2 chat equiv."]

Configuration Options

Customize runs via settings passed at launch.

Setting	Default	Options	What It Controls
max-samples-per-task	100	Positive integer (e.g., 50, 200), or -1 for all samples	Limits samples evaluated per benchmark to control runtime.
task-mixture	balanced	balanced, math-heavy, coding-heavy, sequence	Mix of benchmarks (balanced weights all benchmarks equally; sequence chains multi-turn interactions).
few-shot-k	5	Integer 0-32	Number of examples in prompts (higher values often help on reasoning tasks).
output-format	terminal	terminal, json, wandb	Where scores appear (wandb logs to Weights & Biases for tracking).
device-batch-size	auto	Power of 2 (e.g., 8, 16)	Samples processed simultaneously (lower if memory errors occur).
model-path	latest-sft	Full path to checkpoint	Selects the finetuned model to load; the default resolves to the latest SFT checkpoint.

[!WARNING]
High few-shot-k or max-samples-per-task increases runtime and memory use. Start small for testing.
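
To make the allowed values in the settings table concrete, here is a small validation sketch. `EvalSettings` is a hypothetical container written for this guide, not the tool's actual API; it only mirrors the names, defaults, and ranges listed in the table, and treats -1 as "all samples" as mentioned in the troubleshooting notes.

```python
from dataclasses import dataclass
from typing import Union

TASK_MIXTURES = {"balanced", "math-heavy", "coding-heavy", "sequence"}
OUTPUT_FORMATS = {"terminal", "json", "wandb"}


@dataclass
class EvalSettings:
    """Hypothetical mirror of the settings table; not the tool's real API."""
    model_path: str = "latest-sft"
    max_samples_per_task: int = 100  # -1 is assumed to mean "use every sample"
    task_mixture: str = "balanced"
    few_shot_k: int = 5
    output_format: str = "terminal"
    device_batch_size: Union[int, str] = "auto"

    def validate(self) -> None:
        """Raise ValueError if any setting falls outside the documented options."""
        if self.max_samples_per_task != -1 and self.max_samples_per_task < 1:
            raise ValueError("max-samples-per-task must be positive or -1")
        if self.task_mixture not in TASK_MIXTURES:
            raise ValueError(f"unknown task-mixture: {self.task_mixture}")
        if not 0 <= self.few_shot_k <= 32:
            raise ValueError("few-shot-k must be between 0 and 32")
        if self.output_format not in OUTPUT_FORMATS:
            raise ValueError(f"unknown output-format: {self.output_format}")
        bs = self.device_batch_size
        if bs != "auto":
            # A positive integer is a power of 2 exactly when bs & (bs - 1) == 0.
            if not isinstance(bs, int) or bs < 1 or bs & (bs - 1):
                raise ValueError("device-batch-size must be 'auto' or a power of 2")


# A small, fast configuration for a first test run.
quick = EvalSettings(max_samples_per_task=20, few_shot_k=2, device_batch_size=8)
quick.validate()
print(quick)
```

Starting from a configuration like `quick` keeps runtime and memory low, in line with the warning above, before scaling up to larger sample counts.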

Viewing Results

Terminal output: Per-task accuracies (e.g., “GSM8K: 89.5% (centered: 0.895)”, where the centered value equals the raw accuracy if the random baseline is 0), composite score, and total time.
JSON export: Detailed per-sample predictions and ground truths for manual review.
Comparisons: Scores are automatically compared against baselines such as GPT-2 equivalents (e.g., “Beats GPT-2 chat by +12% on math”).

Example composite: “Composite Chat Score: 0.623” (the average centered accuracy across the 4 tasks).

Troubleshooting

Message	Severity	Meaning
“Model checkpoint not found at model-path”	Error	Check path from SFT run; reload latest from model-tag.
“Out of memory on device-batch-size=16”	Error	Reduce to 8 or 4; clear GPU cache or use CPU fallback.
“Benchmark data unavailable (e.g., GSM8K)”	Warning	The dataset is missing locally; let it download automatically or check your internet connection.
“High score variance across runs (>5%)”	Info	Normal for small samples; increase max-samples-per-task for stability.
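
The out-of-memory advice in the table (drop from 16 to 8, then 4) can be automated with a simple halving loop. This is only a sketch: `run_eval` is a hypothetical callable standing in for whatever launches the evaluation, and Python's built-in `MemoryError` is used as a placeholder for the framework-specific out-of-memory exception.

```python
from typing import Callable


def run_with_backoff(
    run_eval: Callable[[int], float],
    batch_size: int = 16,
    min_batch_size: int = 1,
    oom_error: type = MemoryError,
) -> tuple[float, int]:
    """Retry run_eval, halving the batch size after each out-of-memory failure.

    Returns the score together with the batch size that finally worked.
    """
    while True:
        try:
            return run_eval(batch_size), batch_size
        except oom_error:
            if batch_size <= min_batch_size:
                raise  # nothing smaller left to try
            batch_size //= 2


def fake_eval(batch_size: int) -> float:
    # Stand-in for a real run: pretends any batch size above 4 exhausts memory.
    if batch_size > 4:
        raise MemoryError(f"out of memory at batch size {batch_size}")
    return 0.623  # placeholder score, not a real result


score, used_batch_size = run_with_backoff(fake_eval, batch_size=16)
print(f"score={score} with device-batch-size={used_batch_size}")
```

Halving keeps the batch size a power of 2, which matches the documented options for device-batch-size.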

[!NOTE]
Noisy scores? Run the evaluation several times and average the results. For leaderboard submissions, evaluate on all samples (max-samples-per-task=-1).
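
To see why small sample counts give noisy scores, it helps to estimate the expected spread. The sketch below uses the standard binomial standard error for an accuracy measured on n independent samples, plus a helper that averages repeated runs. The run scores are made-up placeholder values and the function names are illustrative.

```python
import math
import statistics


def binomial_standard_error(accuracy: float, n_samples: int) -> float:
    """Standard error of an accuracy estimated from n independent samples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)


def summarize_runs(scores: list[float]) -> tuple[float, float]:
    """Return the mean and sample standard deviation of repeated run scores."""
    mean = statistics.mean(scores)
    spread = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, spread


# Expected noise shrinks with the square root of the sample count.
for n in (50, 100, 400):
    print(f"n={n}: standard error = {binomial_standard_error(0.5, n):.3f}")

# Placeholder composite scores from three hypothetical runs.
mean, spread = summarize_runs([0.61, 0.64, 0.62])
print(f"mean={mean:.3f} stdev={spread:.3f}")
```

Because the error scales with 1/sqrt(n), quadrupling max-samples-per-task roughly halves the run-to-run noise for a single benchmark.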

Summary

Evaluate chat models on math (GSM8K), coding (HumanEval), and science QA (MMLU, GPQA) via task mixtures or sequences for composite scores.
Configure with max-samples-per-task, task-mixture, and few-shot-k for flexible testing.
View centered accuracies in terminal/JSON/W&B; compare to baselines for progress.
Fits after Training Chat Models and before Chatting with Models and 10. Leaderboard and Optimization.
For base evals, see 6.1. Base Model Evaluation; track optimizations in 9. Configuration Reference.
