Running on CPU or Single GPU

This section covers running the full workflow—tokenizer training, base model training, supervised finetuning (SFT), evaluation, and chatting—on CPU or single GPU hardware, such as MacBooks with MPS support. It’s ideal for users with limited resources who want an educational demo to explore the product without high-end GPUs or significant costs. Expect slower performance and smaller-scale results compared to GPU-accelerated runs; training a small base model takes about 30 minutes on a high-end MacBook Pro. This builds directly on 2.1. Installation and Environment Setup and contrasts with the full-scale reproduction in 2.2. Reproducing GPT-2 Capability Model. For broader hardware tuning, see 9.1. Hardware and Precision Options; for production training, refer to 3. Training Base Models and 5. Training Chat Models.

Overview

The CPU/single GPU workflow uses reduced model sizes, sequence lengths, and batch sizes to fit within memory constraints and avoid out of memory (OOM) errors. It demonstrates the end-to-end process: preparing a tokenizer on ~2 billion characters, training a compact 6-layer base model over 5000 iterations, evaluating it, applying SFT for 1500 iterations, and interacting via CLI or web chat. All steps activate a CPU-optimized environment automatically.

Key capabilities:

Automatic environment setup with CPU extras.
Tuned parameters for completion in under an hour on capable laptops.
Direct progression to chatting, where the model can answer simple questions like capitals or colors.

graph TB
  subgraph "Environment Setup"
    Start["Launch CPU workflow"] --> Env["Install dependencies<br/>Activate CPU environment<br/>Set dummy run ID if needed"]
  end
  subgraph "Tokenizer Phase"
    Env --> TokTrain["Train on *~2B characters*"]
    TokTrain --> TokEval["Evaluate tokenizer"]
  end
  subgraph "Base Model Phase"
    TokEval --> BaseTrain["Train 6-layer model<br/>5000 iterations<br/>Batch sizes: device=*32*, total=*16384*"]
    BaseTrain --> BaseEval["Evaluate base model<br/>Device batch=*1*"]
  end
  subgraph "Chat Model Phase"
    BaseEval --> SFT["SFT on identity conversations<br/>1500 iterations<br/>Batch sizes: device=*32*, total=*16384*"]
  end
  subgraph "Interaction"
    SFT --> CLI["CLI chat<br/>(e.g., 'What is the capital of France?')"] 
    SFT --> Web["Web chat UI"]
  end

Environment Setup

Begin by executing the CPU demo workflow, which handles installation of CPU-optimized dependencies, creates a virtual environment, and activates it. It also ensures a run identifier is set (using dummy if none provided) for tracking.

[!NOTE]
This step syncs CPU extras, enabling MPS on Apple silicon for faster performance than pure CPU.

Tokenizer Training and Evaluation

Train a tokenizer on approximately 2 billion characters (~34 seconds on MacBook Pro M3 Max), then evaluate it.

Field	Required	Accepted Values	Description
max-chars	Yes	Number (e.g., 2000000000)	Limits training data to this many characters for quick demo.

Cross-reference: Full details in 4. Tokenizer Training and Evaluation.

Base Model Training and Evaluation

Train a small 6-layer base model tuned for ~30 minutes on MacBook Pro M3 Max, with window pattern L, max sequence length 512, evaluations every 100 iterations, and sampling every 100 iterations. Follow with base evaluation using reduced batching.

Setting	Default (CPU)	Options	What It Controls
depth	6	Positive integer (e.g., 4-12)	Number of model layers; smaller fits low memory.
head-dim	64	Positive integer (e.g., 64)	Attention head dimension; reduces compute.
window-pattern	L	L, LQ, etc.	Sequence handling pattern; L for linear efficiency.
max-seq-len	512	Positive integer (e.g., 512)	Maximum tokens per sequence; lower avoids OOM.
device-batch-size	32 (train), 1 (eval)	Positive integer	Samples per device; tune down for memory.
total-batch-size	16384	Positive integer (multiple of device batch)	Global batch across devices; scales with hardware.
eval-every	100	Positive integer or -1	Iterations between evaluations.
eval-tokens	524288	Positive integer	Tokens used in each evaluation.
core-metric-every	-1 (disabled)	Positive integer or -1	Frequency of core metric computation.
sample-every	100	Positive integer	Frequency of model sampling outputs.
num-iterations	5000	Positive integer	Total training steps.

[!WARNING]
Increasing iterations or sizes may cause OOM; monitor memory usage.

Cross-reference: See 3. Training Base Models and 6.1. Base Model Evaluation.

Supervised Finetuning (SFT)

Download identity_conversations.jsonl dataset automatically, then apply SFT (~10 minutes on MacBook Pro M3 Max) using similar batch sizes and max sequence 512, with evaluations every 200 iterations.

Cross-reference: Full SFT details in 5.1. Supervised Finetuning (SFT).

Chatting with the Model

After SFT:

CLI Chat: Start a command-line interface; type prompts like “What is the capital of France?” or “Hi, what color is the sky?”. The model should respond with basics like Paris or blue.
Web Chat UI: Launch a ChatGPT-style web interface for interactive conversations.

Cross-reference: 7.1. Web Chat UI and 7.2. CLI Chat.

Batch Size Tuning Guidelines

Adjust these to prevent OOM errors on your hardware. Start conservative and increase gradually.

Hardware	Recommended device-batch-size (Train)	Recommended total-batch-size	Expected Time (Base Train)	Notes
MacBook Pro M3 Max (MPS)	32	16384	~30 minutes	Tuned baseline.
Standard Laptop CPU	8-16	4096-8192	1-2 hours	Halve if OOM occurs.
Single Low-End GPU	16	8192	~45 minutes	Monitor VRAM.
Very Low RAM (<16GB)	4	2048	2+ hours	Disable evals if needed.

Troubleshooting

Common issues during low-hardware runs:

Message	Severity	Meaning
Out of memory (OOM) during training or eval	Error	Hardware can’t fit the batch/sequence. Reduce device-batch-size by half and retry; see tuning table above.
MPS not available on Mac	Warning	Falling back to CPU; ensure Apple silicon and latest OS. Performance slower.
No improvement in metrics after many iterations	Info	Demo-scale limits; for better results, use GPU and increase num-iterations (see 3. Training Base Models).
Dataset download failed for SFT	Error	Network issue; manually download identity_conversations.jsonl to cache dir and retry.

Summary

Run the CPU/single GPU workflow for a quick end-to-end demo fitting laptops, with auto-setup and tuned params to avoid OOM.
Key phases: tokenizer (2B chars), base model (6 layers, 5000 iterations), SFT (1500 iterations), and chat testing.
Tune device-batch-size and total-batch-size per hardware; use the guidelines table.
For scaling up, see 2.2. Reproducing GPT-2 Capability Model, 3. Training Base Models, and 9.1. Hardware and Precision Options.
Interact via 7. Chatting with Models once complete.

Generated by ESX Wiki