Tokenizer Training and Evaluation

This section covers Tokenizer Training and Evaluation, tools for end users to create and assess custom Byte Pair Encoding (BPE) tokenizers optimized for your pretraining data. It’s designed for users preparing base models or chat models, ensuring efficient text compression before model training. The trained tokenizer integrates directly into data loading during training workflows. For overall training setup, see Training Base Models. For using tokenizers in chat or evaluation, see Chatting with Models and Model Evaluation. Configuration details appear in Configuration Reference.

Overview

Tokenizer training builds a vocabulary from your pretraining dataset, learning merges that compress text into fewer, longer tokens for better model efficiency. Evaluation measures compression quality (e.g., tokens per byte) across English, non-English, code, math, and science texts, comparing against standard tokenizers like GPT-2 and GPT-4. Once trained and validated, the tokenizer is automatically used in data pipelines for model training, providing consistent tokenization with special tokens for document boundaries and chat formats.

Training a Tokenizer

Training processes a stream of text documents from the pretraining dataset (train split), cropping each to a maximum length and stopping after a total character limit. It produces a BPE model in GPT-4 style, saves it for reuse, and computes byte lengths per token for evaluation metrics.

Steps to Train

Prepare your pretraining data as described in Getting Started.
Run the training command with optional parameters to customize dataset usage and vocabulary.
Monitor console output for progress: it shows processed limits, training duration, and save location.
The system automatically saves the tokenizer and supporting files (including per-token byte counts) to the standard output directory.

[!NOTE]
Training uses all available data shards until limits are hit, flattening batches into a continuous text stream.

Training Configuration

Special Tokens

The tokenizer reserves fixed special tokens for structure, added automatically during training. These do not consume training merges.

Token	Purpose
<\|bos\|>	Marks start of every document (required for training data).
<\|user_start\|>	Begins user messages (chat finetuning).
<\|user_end\|>	Ends user messages.
<\|assistant_start\|>	Begins assistant responses.
<\|assistant_end\|>	Ends assistant responses.
<\|python_start\|>	Starts Python tool invocation.
<\|python_end\|>	Ends Python tool invocation.
<\|output_start\|>	Starts tool output.
<\|output_end\|>	Ends tool output.

Evaluating a Tokenizer

Evaluation tests compression on held-out or sample texts without command-line options—it runs predefined benchmarks automatically. It encodes texts like news articles, Korean, code snippets, math documents, science passages, and validation data, reporting token counts and bits-per-byte ratios. Comparisons highlight improvements over GPT-2 (smaller vocab) and GPT-4 (similar style).

Steps to Evaluate

Ensure a tokenizer exists from training (or use pretrained like GPT-2/GPT-4).
Run the evaluation command.
Review console output: tables of tokens used per text sample, ratios, and vocab sizes for each tokenizer.

[!NOTE]
Lower tokens-per-character or bits-per-byte indicates better compression. Validation data uses your train/valid splits for realism.

Tokenizer Workflow

graph TB
    subgraph "Training Phase"
        Config["Set **max-chars**, **doc-cap**,<br/>**vocab-size**"]
        Train["Process dataset stream<br/>Crop docs, train BPE merges"]
    end
    Config -->|"Run training"| Train
    Train -->|"Save tokenizer + token bytes"| Eval
    subgraph "Evaluation Phase"
        Eval["Encode sample texts<br/>(news, code, non-English, etc.)<br/>Compare GPT-2 / GPT-4 / yours"]
    end
    Eval -->|"Validate compression"| Use
    subgraph "Integration Phase"
        Use["Load in data loader<br/>for model training<br/>(auto-prepends **<\|bos\|>**)"]
    end

Troubleshooting

Common messages appear in the console during training or evaluation.

Message	Severity	Meaning
max-chars: X,XXX,XXX,XXX doc-cap: X,XXX vocab-size: X,XXX,XXX	Info	Confirms your configuration settings before starting.
Training time: XX.XXs	Info	Total duration completed; longer runs yield better tokenizers. If unexpectedly slow, check dataset access in Getting Started.
Saved tokenizer to [path] or Saved token_bytes to [path]	Info	Files ready for use; verify directory for manual inspection if needed.
token_bytes_min: X token_bytes_max: X token_bytes_mean: X.X token_bytes_std: X.X	Info	Statistics on non-special token lengths (in bytes); aim for mean ~3-4 for good compression.
num_special_tokens: X	Info	Confirms special token count (should be 9); mismatch indicates config issue.

[!WARNING]
If training stops early without “Training time” message, dataset may lack enough characters—increase shards via Getting Started.

Summary

Train custom BPE tokenizers on your data with configurable limits for optimal compression matching your domain.
Evaluate via automated benchmarks comparing token efficiency to GPT-2/GPT-4 baselines.
Special tokens enable seamless use in Training Base Models, Training Chat Models, and Chatting with Models.
Follow the workflow: train → evaluate → integrate into data loaders for Model Evaluation.
Adjust via Configuration Reference for hardware tweaks during long runs.

Generated by ESX Wiki