This section covers hardware and precision configuration options for training and inference in nanochat, aimed at users optimizing runs on their available hardware—from single GPUs or CPUs to multi-GPU setups. These settings control memory efficiency, training speed, and model quality by balancing precision, batch sizes, and resource utilization. They integrate with model sizing in 3.1 Configuring Model Size and Training Horizon and performance tuning in 10 Leaderboard and Optimization. For full training workflows, see 3 Training Base Models; for evaluation impacts, see 6 Model Evaluation.
Overview
Hardware and precision options let you adapt nanochat to your setup, such as enabling low-precision training to fit larger models or tuning batch sizes to avoid memory errors. Key capabilities include:
- Automatic device detection (GPU, Apple Silicon, or CPU).
- Precision modes for faster training with minimal quality loss.
- Dynamic batch size adjustments with gradient accumulation to maintain effective training scale.
- Multi-GPU scaling via standard launchers.
These ensure efficient use of resources while targeting metrics like the CORE score.
Device Selection
nanochat automatically detects and uses the best available hardware:
- Primary GPU (NVIDIA CUDA) if present.
- Apple Silicon (MPS) as fallback.
- CPU for lightweight testing or unsupported setups.
You see a startup message like “Autodetected device type: cuda”. Override via environment variables if needed (e.g., for testing CPU-only runs).
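The detection order above can be sketched as follows. This is a minimal illustration, not nanochat's actual code: `pick_device_type` and `autodetect_device_type` are hypothetical names, and the PyTorch query is guarded so the sketch also runs where PyTorch is not installed.

```python
def pick_device_type(cuda_available: bool, mps_available: bool) -> str:
    """Return the preferred device type: CUDA first, then MPS, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def autodetect_device_type() -> str:
    """Ask PyTorch which backends are usable; report CPU if PyTorch is missing."""
    try:
        import torch
    except ImportError:
        return "cpu"
    mps = getattr(torch.backends, "mps", None)
    mps_ok = bool(mps is not None and mps.is_available())
    return pick_device_type(torch.cuda.is_available(), mps_ok)


print(f"Autodetected device type: {autodetect_device_type()}")
```

Keeping the priority logic in a pure function makes the fallback order easy to test without any particular hardware.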
For multi-GPU runs, launch with a distributed command such as torchrun --standalone --nproc_per_node=N, where N is your GPU count (e.g., 8). nanochat distributes data across the GPUs automatically.
[!NOTE]
Set OMP_NUM_THREADS=1 before multi-GPU launches to avoid threading conflicts and maximize throughput.
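As a rough sketch, the launch command and environment can be assembled programmatically. The helper name `build_torchrun_command` and the `train.py` target are placeholders for illustration, not real nanochat entry points; only the `torchrun` flags and `OMP_NUM_THREADS` come from the text above.

```python
import os
import shlex
from typing import Dict, List, Tuple


def build_torchrun_command(n_gpus: int, target_args: List[str]) -> Tuple[Dict[str, str], List[str]]:
    """Return (env, argv) for a single-node multi-GPU launch."""
    if n_gpus < 1:
        raise ValueError("n_gpus must be at least 1")
    env = dict(os.environ)
    env["OMP_NUM_THREADS"] = "1"  # one OpenMP thread per worker process
    argv = ["torchrun", "--standalone", f"--nproc_per_node={n_gpus}", *target_args]
    return env, argv


# "train.py" is a placeholder script name, not a real nanochat path.
env, argv = build_torchrun_command(8, ["train.py"])
print("OMP_NUM_THREADS=" + env["OMP_NUM_THREADS"], shlex.join(argv))
# To actually launch: subprocess.run(argv, env=env, check=True)
```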
Precision Options
Choose between high-speed low-precision (FP8) and stable higher-precision (bfloat16) modes:
- FP8: Enabled on supported GPUs (e.g., H100); reduces memory use and speeds up training, allowing larger models or batches. Quality is slightly lower, which calls for a minor training-horizon adjustment (see target-param-data-ratio).
- bfloat16 (default): Used automatically when FP8 is unsupported; more stable, and the right choice for older GPUs.
Toggle with the --fp8 flag. With FP8 active you should see faster step times and a log line like “Using FP8 precision” during training.
| Setting | Default | Options | What It Controls |
|---|---|---|---|
| --fp8 | off | on/off | Activates FP8 training for memory/speed gains; falls back to bfloat16 automatically if unsupported. Reduces effective training horizon needed for same quality. |
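The fallback behavior can be pictured with a small sketch. It assumes FP8 is available from CUDA compute capability 8.9 (Ada) and 9.0 (Hopper) onward; `choose_precision` is a hypothetical helper, and nanochat's real support check may differ.

```python
from typing import Optional, Tuple

# Assumption: FP8 tensor cores start at compute capability 8.9 (Ada) / 9.0 (Hopper).
FP8_MIN_CAPABILITY = (8, 9)


def choose_precision(fp8_requested: bool,
                     compute_capability: Optional[Tuple[int, int]]) -> str:
    """Pick "fp8" only when requested and supported; otherwise "bfloat16".

    compute_capability is (major, minor) for a CUDA GPU, or None for MPS/CPU.
    """
    supported = compute_capability is not None and compute_capability >= FP8_MIN_CAPABILITY
    if fp8_requested and supported:
        return "fp8"
    return "bfloat16"


print(choose_precision(True, (9, 0)))   # H100-class GPU with the flag on
print(choose_precision(True, (8, 0)))   # A100-class GPU: falls back
print(choose_precision(False, (9, 0)))  # flag off: default precision
```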
Batch Size and Gradient Accumulation
device-batch-size sets the per-device micro-batch for each forward/backward pass, counted in sequences rather than tokens. nanochat targets a global batch size in tokens (e.g., ~500K tokens for GPT-2 scale) and automatically computes the gradient accumulation steps needed to reach it:
- Ideal: 32 per device (powers of 2 for clean math).
- If memory-limited, reduce to 16, 8, etc.; the system accumulates gradients over more micro-steps to match the target.
Example: device-batch-size=16 on 8 GPUs with 2 accumulation steps yields an effective batch of 512K tokens (16 × 2,048 × 8 × 2 = 524,288, assuming 2,048-token sequences).
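The accumulation arithmetic can be sketched as below. The 2,048-token sequence length and the 524,288-token global target are assumptions chosen to reproduce the 512K example; `grad_accum_steps` is a hypothetical helper, not nanochat's function.

```python
def grad_accum_steps(total_batch_tokens: int, device_batch_size: int,
                     seq_len: int, world_size: int) -> int:
    """Micro-steps needed so all devices together reach the global token target."""
    tokens_per_micro_step = device_batch_size * seq_len * world_size
    if total_batch_tokens % tokens_per_micro_step != 0:
        raise ValueError("global batch must be a multiple of tokens per micro-step")
    return total_batch_tokens // tokens_per_micro_step


TOTAL_BATCH_TOKENS = 524_288  # 512K tokens (assumed global target)
SEQ_LEN = 2_048               # assumed sequence length

for batch_size in (32, 16, 8, 4):
    steps = grad_accum_steps(TOTAL_BATCH_TOKENS, batch_size, SEQ_LEN, world_size=8)
    print(f"device-batch-size={batch_size}: {steps} accumulation step(s)")
```

Each halving of the per-device batch doubles the accumulation steps, so the effective batch, and therefore the optimization trajectory, stays the same while peak memory drops.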
Adjust target-param-data-ratio (tokens per parameter) alongside for optimal scaling (e.g., 8.25 for undertraining larger models).
| Setting | Default | Accepted Values | What It Controls |
|---|---|---|---|
| device-batch-size | 32 | Powers of 2 (e.g., 32, 16, 8, 4) | Per-device batch; auto-triggers accumulation to hit global target. Lower values prevent OOM. |
| target-param-data-ratio | 10.5 | Positive float (e.g., 8.25, 12) | Training horizon (tokens/param); tune down with FP8, up for compute-optimal. |
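To make the tokens-per-parameter ratio concrete, the sketch below converts it into a token budget and a step count. The 1B parameter count and the 524,288-token global batch are hypothetical inputs, and `training_horizon` is an illustrative helper rather than part of nanochat.

```python
from typing import Tuple


def training_horizon(num_params: int, tokens_per_param: float,
                     total_batch_tokens: int) -> Tuple[int, int]:
    """Return (token budget, optimizer steps) for a tokens-per-parameter ratio."""
    token_budget = round(num_params * tokens_per_param)
    steps = token_budget // total_batch_tokens
    return token_budget, steps


NUM_PARAMS = 1_000_000_000    # hypothetical model size
TOTAL_BATCH_TOKENS = 524_288  # assumed global batch in tokens

for ratio in (8.25, 10.5, 12):
    tokens, steps = training_horizon(NUM_PARAMS, ratio, TOTAL_BATCH_TOKENS)
    print(f"ratio={ratio}: {tokens:,} tokens over {steps:,} steps")
```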
Model Depth Integration
Pair with --depth (e.g., 26 for ~GPT-2 scale) from 3.1 Configuring Model Size and Training Horizon. Larger depths need smaller batches or FP8.
Tuning Workflow
Follow this process to optimize for your hardware:
graph TB
subgraph "Initial Setup"
A["Set **--depth**<br/>e.g., 24-26"] --> B["Enable **--fp8**<br/>if supported"]
end
subgraph "Batch Tuning"
B --> C["Try **device-batch-size=32**"]
C --> D{"Fits?"}
D -->|Yes| E["Run training<br/>Monitor MFU >40%"]
D -->|"No (OOM)"| F["Halve to 16<br/>Rerun"]
F --> G{"Still OOM?"}
G -->|Yes| H["Halve again (8/4)<br/>Or reduce depth"]
G -->|No| E
end
E --> I["Tune **target-param-data-ratio**<br/>e.g., 8.25-12 for CORE target"]
subgraph "Validation"
I --> J["Check CORE >0.256<br/>via eval logs"]
end
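The batch-tuning branch of the workflow amounts to a halving search. The sketch below assumes a caller-supplied `fits` callback that reports whether a trial run at a given batch size avoids OOM; both the callback and `find_device_batch_size` are hypothetical.

```python
from typing import Callable, Optional


def find_device_batch_size(fits: Callable[[int], bool],
                           start: int = 32, minimum: int = 4) -> Optional[int]:
    """Halve the batch size from `start` until `fits` succeeds.

    Returns None if nothing down to `minimum` fits, in which case the
    workflow's remaining option is to reduce depth.
    """
    batch_size = start
    while batch_size >= minimum:
        if fits(batch_size):
            return batch_size
        batch_size //= 2
    return None


# Stand-in for a real trial run: pretend anything above 8 runs out of memory.
print(find_device_batch_size(lambda batch_size: batch_size <= 8))
```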
Troubleshooting
Common issues stem from memory limits during training.
| Message | Severity | Meaning |
|---|---|---|
| Out of memory (OOM) during forward/backward pass | Error | Batch too large for GPU VRAM. Reduce device-batch-size (e.g., 32→16), enable --fp8, or lower --depth. Restart from last checkpoint. |
| CUDA out of memory. Tried to allocate X GB (terminal) | Error | Same as above; common on consumer GPUs (<24 GB). Lower device-batch-size; gradient accumulation adjusts automatically to keep the global batch. |
| Autodetected device type: cpu (unexpected) | Info | No GPU/MPS found. Install CUDA or use Apple Silicon; training slows 10-100x. |
| MFU <30% in progress logs | Warning | Low utilization. Check multi-GPU launch (nproc_per_node), set OMP_NUM_THREADS=1, or verify FP8 support. |
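For reading the MFU figures in the logs, a back-of-the-envelope estimate can help. This sketch uses the common approximation of about 6 FLOPs per parameter per token (attention FLOPs ignored) and an assumed peak of 989e12 FLOP/s per GPU for dense bfloat16 on an H100; nanochat's own MFU calculation may differ, and the throughput and model size below are hypothetical.

```python
def mfu(tokens_per_sec: float, num_params: int, n_gpus: int,
        peak_flops_per_gpu: float) -> float:
    """Model FLOPs utilization as a fraction of aggregate peak throughput."""
    achieved_flops = 6 * num_params * tokens_per_sec  # ~6 FLOPs per param per token
    return achieved_flops / (n_gpus * peak_flops_per_gpu)


H100_BF16_PEAK = 989e12  # assumed dense bfloat16 peak, FLOP/s per GPU

# Hypothetical run: 1B parameters, 600K tokens/s across 8 GPUs.
utilization = mfu(600_000, 1_000_000_000, 8, H100_BF16_PEAK)
print(f"MFU: {utilization:.1%}")
```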
[!WARNING]
An unrecoverable OOM can corrupt checkpoints, so set --save-every to a low value initially.
Summary
- Enable --fp8 for speed on modern GPUs; fallback to bfloat16 is automatic.
- Tune device-batch-size (start at 32, halve on OOM) with auto-gradient accumulation.
- Use multi-GPU via torchrun --nproc_per_node=N for scaling.
- Integrate with 3.1 Configuring Model Size and Training Horizon (--depth) and 10 Leaderboard and Optimization for leaderboards.
- Monitor via logs; target MFU >40% and CORE >0.256 for GPT-2 parity. See 3.2 Monitoring and Checkpoints for details.