Hardware and Precision Options

This section covers hardware and precision configuration options for training and inference in nanochat. It is aimed at users optimizing runs on their available hardware, from a single GPU or CPU to multi-GPU setups. These settings affect memory use, training speed, and model quality through the choice of precision, batch size, and device count. They interact with model sizing in 3.1 Configuring Model Size and Training Horizon and performance tuning in 10 Leaderboard and Optimization. For full training workflows, see 3 Training Base Models; for evaluation impacts, see 6 Model Evaluation.

Overview

Hardware and precision options let you adapt nanochat to your setup, such as enabling low-precision training to fit larger models or tuning batch sizes to avoid memory errors. Key capabilities include:

Automatic device detection (GPU, Apple Silicon, or CPU).
Precision modes for faster training with minimal quality loss.
Dynamic batch size adjustments with gradient accumulation to maintain effective training scale.
Multi-GPU scaling via standard launchers.

These ensure efficient use of resources while targeting metrics like the CORE score.

Device Selection

nanochat automatically detects and uses the best available hardware:

Primary GPU (NVIDIA CUDA) if present.
Apple Silicon (MPS) as fallback.
CPU for lightweight testing or unsupported setups.

You see a startup message like “Autodetected device type: cuda”. Override via environment variables if needed (e.g., for testing CPU-only runs).
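
The detection order above can be sketched roughly as follows. This is a minimal illustration, not nanochat's actual code: the function names `autodetect_device_type` and `probe_hardware` are hypothetical, and the sketch relies only on PyTorch's standard `torch.cuda.is_available()` and `torch.backends.mps.is_available()` checks.

```python
from typing import Tuple


def autodetect_device_type(has_cuda: bool, has_mps: bool) -> str:
    """Return the preferred device type: CUDA first, then MPS, then CPU."""
    if has_cuda:
        return "cuda"
    if has_mps:
        return "mps"
    return "cpu"


def probe_hardware() -> Tuple[bool, bool]:
    """Report (has_cuda, has_mps); both are False if PyTorch is not installed."""
    try:
        import torch
    except ImportError:
        return False, False
    has_cuda = torch.cuda.is_available()
    # Older PyTorch builds may not ship the MPS backend at all.
    mps_backend = getattr(torch.backends, "mps", None)
    has_mps = mps_backend is not None and mps_backend.is_available()
    return has_cuda, has_mps


if __name__ == "__main__":
    device_type = autodetect_device_type(*probe_hardware())
    print(f"Autodetected device type: {device_type}")
```

Keeping the priority logic separate from the hardware probe makes the fallback order easy to test on a machine without a GPU.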

For multi-GPU, launch with a distributed command like torchrun --standalone --nproc_per_node=N, where N is your GPU count (e.g., 8). nanochat distributes data automatically across GPUs.

[!NOTE]
Set OMP_NUM_THREADS=1 before multi-GPU launches to avoid threading conflicts and maximize throughput.
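
As a hedged illustration of the launch described above, the following sketch assembles the torchrun command and environment from Python. The helper name `build_launch` and the script name `train.py` are placeholders invented for this example; substitute the actual nanochat training entry point.

```python
import os
import shlex
from typing import Dict, List, Tuple


def build_launch(num_gpus: int, train_args: List[str]) -> Tuple[Dict[str, str], List[str]]:
    """Build the environment and argv for a single-node torchrun launch."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    # Copy the current environment and pin OpenMP to one thread per process.
    env = dict(os.environ, OMP_NUM_THREADS="1")
    cmd = ["torchrun", "--standalone", f"--nproc_per_node={num_gpus}", *train_args]
    return env, cmd


# "train.py" is a placeholder for the real training script.
env, cmd = build_launch(8, ["train.py", "--depth=26"])
print(shlex.join(cmd))
```

Passing `cmd` and `env` to `subprocess.run(cmd, env=env, check=True)` would start the run; the sketch only prints the command so it can be inspected first.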

Precision Options

Choose between high-speed low-precision (FP8) and stable higher-precision (bfloat16) modes:

FP8: Enabled on supported GPUs (e.g., H100); reduces memory use and speeds up training, allowing larger models or batches. Quality is slightly lower, so the training horizon may need a minor adjustment (see target-param-data-ratio).
bfloat16 (default): Used automatically when FP8 is unsupported; more stable, and the right choice for older GPUs.

Toggle with the --fp8 flag. When FP8 is active, expect faster step times and a log line like “Using FP8 precision” during training.

Setting	Default	Options	What It Controls
--fp8	off	on/off	Activates FP8 training for memory and speed gains; falls back to bfloat16 automatically if unsupported. The training horizon may need a small adjustment to reach the same quality.
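
The fallback behavior can be pictured with the sketch below. It is an assumption-laden illustration rather than nanochat's actual check: the function name `choose_precision` is hypothetical, and the threshold assumes that FP8 tensor cores are available from CUDA compute capability 8.9 (Ada) and 9.0 (Hopper, e.g., H100) upward.

```python
from typing import Optional, Tuple

# Assumption: GPUs with compute capability >= 8.9 expose FP8 tensor cores.
FP8_MIN_COMPUTE_CAPABILITY = (8, 9)


def choose_precision(fp8_requested: bool,
                     compute_capability: Optional[Tuple[int, int]]) -> str:
    """Return "fp8" only when requested and supported; otherwise "bfloat16"."""
    if not fp8_requested:
        return "bfloat16"
    if compute_capability is None:
        # No CUDA device (CPU or MPS): FP8 is not available in this sketch.
        return "bfloat16"
    if compute_capability >= FP8_MIN_COMPUTE_CAPABILITY:
        return "fp8"
    return "bfloat16"


print(choose_precision(True, (9, 0)))  # Hopper-class GPU
print(choose_precision(True, (8, 0)))  # Ampere-class GPU
```

On a CUDA machine, the capability tuple could come from `torch.cuda.get_device_capability()`, which returns a `(major, minor)` pair.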

Batch Size and Gradient Accumulation

device-batch-size sets how many sequences each device processes per forward/backward pass. nanochat targets a global batch size (e.g., ~500K tokens for GPT-2 scale) and automatically computes the number of gradient accumulation steps needed to reach it:

Ideal: 32 per device (powers of 2 keep the arithmetic clean).
If memory-limited, reduce to 16, 8, etc.; nanochat then accumulates gradients over more micro-steps to match the target.

Example: device-batch-size=16 on 8 GPUs with accumulation=2 yields an effective batch of 512K tokens, assuming 2,048-token sequences (16 × 2,048 × 8 × 2 = 524,288).
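
The accumulation arithmetic can be checked with a short sketch. The function name `grad_accum_steps` is hypothetical, and the 2,048-token sequence length is an assumption inferred from the example above rather than a documented default.

```python
def grad_accum_steps(total_batch_tokens: int, device_batch_size: int,
                     seq_len: int, world_size: int) -> int:
    """Number of micro-steps needed to reach the global batch size in tokens."""
    tokens_per_micro_step = device_batch_size * seq_len * world_size
    if total_batch_tokens % tokens_per_micro_step != 0:
        raise ValueError("global batch must be a multiple of the per-step token count")
    return total_batch_tokens // tokens_per_micro_step


TOTAL_BATCH_TOKENS = 524_288  # 512K tokens, the global target
SEQ_LEN = 2048                # assumed sequence length
WORLD_SIZE = 8                # number of GPUs

for device_batch_size in (32, 16, 8, 4):
    steps = grad_accum_steps(TOTAL_BATCH_TOKENS, device_batch_size, SEQ_LEN, WORLD_SIZE)
    print(f"device-batch-size={device_batch_size}: accumulation={steps}")
```

Under these assumptions, halving the per-device batch doubles the accumulation steps, so the effective batch size stays constant while peak memory drops.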

Adjust target-param-data-ratio (tokens per parameter) alongside for optimal scaling (e.g., 8.25 for undertraining larger models).

Setting	Default	Accepted Values	What It Controls
device-batch-size	32	Powers of 2 (e.g., 32, 16, 8, 4)	Per-device batch; auto-triggers accumulation to hit global target. Lower values prevent OOM.
target-param-data-ratio	10.5	Positive float (e.g., 8.25, 12)	Training horizon (tokens/param); tune down with FP8, up for compute-optimal.
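
To see how target-param-data-ratio translates into a training horizon, consider the sketch below. The function name `training_horizon` and the one-billion-parameter model are hypothetical, chosen only to make the arithmetic concrete; nanochat's own step calculation may differ in detail.

```python
from typing import Tuple


def training_horizon(num_params: int, target_param_data_ratio: float,
                     total_batch_tokens: int) -> Tuple[int, int]:
    """Return (training tokens, optimizer steps) for a tokens-per-parameter ratio."""
    target_tokens = int(target_param_data_ratio * num_params)
    num_steps = target_tokens // total_batch_tokens
    return target_tokens, num_steps


NUM_PARAMS = 1_000_000_000    # hypothetical model size
TOTAL_BATCH_TOKENS = 524_288  # 512K-token global batch

for ratio in (8.25, 10.5, 12.0):
    tokens, steps = training_horizon(NUM_PARAMS, ratio, TOTAL_BATCH_TOKENS)
    print(f"ratio={ratio}: {tokens:,} tokens over {steps:,} steps")
```

Lowering the ratio shortens the run proportionally, which is why it is the main knob for trading training time against quality at a fixed depth.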

Model Depth Integration

Pair with --depth (e.g., 26 for ~GPT-2 scale) from 3.1 Configuring Model Size and Training Horizon. Larger depths need smaller per-device batches or FP8 to fit in memory.

Tuning Workflow

Follow this process to optimize for your hardware:

graph TB
    subgraph "Initial Setup"
        A["Set **--depth**<br/>e.g., 24-26"] --> B["Enable **--fp8**<br/>if supported"]
    end
    subgraph "Batch Tuning"
        B --> C["Try **device-batch-size=32**"]
        C --> D{"Fits?"}
        D -->|Yes| E["Run training<br/>Monitor MFU >40%"]
        D -->|"No (OOM)"| F["Halve to 16<br/>Rerun"]
        F --> G{"Still OOM?"}
        G -->|Yes| H["Halve again (8/4)<br/>Or reduce depth"]
        G -->|No| E
    end
    E --> I["Tune **target-param-data-ratio**<br/>e.g., 8.25-12 for CORE target"]
    subgraph "Validation"
        I --> J["Check CORE >0.256<br/>via eval logs"]
    end
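
The batch-tuning loop in the diagram amounts to halving until a trial run fits. A minimal sketch, assuming a caller-supplied `fits` callback (hypothetical) that reports whether a short trial at a given device-batch-size completes without running out of memory:

```python
from typing import Callable, Optional


def find_device_batch_size(fits: Callable[[int], bool],
                           start: int = 32, minimum: int = 1) -> Optional[int]:
    """Halve the per-device batch size from `start` until `fits` succeeds.

    Returns the first size that fits, or None if even `minimum` does not,
    in which case reducing depth is the remaining option.
    """
    if minimum < 1:
        raise ValueError("minimum must be at least 1")
    size = start
    while size >= minimum:
        if fits(size):
            return size
        size //= 2
    return None


# Stand-in for a real trial run: pretend anything above 8 runs out of memory.
print(find_device_batch_size(lambda size: size <= 8))
```

In practice, `fits` might wrap a few training steps in a try/except around `torch.cuda.OutOfMemoryError`; that wiring is left out here because it depends on the training script.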

Troubleshooting

Common issues stem from memory limits during training.

Message	Severity	Meaning
Out of memory (OOM) during forward/backward pass	Error	Batch too large for GPU VRAM. Reduce device-batch-size (e.g., 32→16), enable --fp8, or lower --depth. Restart from the last checkpoint.
CUDA out of memory. Tried to allocate X GB (shown in the terminal)	Error	Same as above; common on consumer GPUs (<24GB). Lower device-batch-size; gradient accumulation adjusts automatically to keep the global batch size.
Autodetected device type: cpu (unexpected)	Info	No GPU/MPS found. Install CUDA or use Apple Silicon; training slows 10-100x.
MFU <30% in progress logs	Warning	Low utilization. Check multi-GPU launch (nproc_per_node), set OMP_NUM_THREADS=1, or verify FP8 support.
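
For reading the MFU figures, a rough estimate can be sketched as below. It uses the common approximation of 6 FLOPs per parameter per training token, which ignores attention cost, so nanochat's logged MFU may be computed differently. The function name `estimate_mfu` and all numbers in the example are hypothetical.

```python
def estimate_mfu(num_params: float, tokens_per_second: float,
                 peak_flops_per_second: float) -> float:
    """Approximate model FLOPs utilization as achieved FLOPs over peak FLOPs.

    Uses ~6 FLOPs per parameter per token (forward plus backward pass).
    `peak_flops_per_second` is the aggregate peak across all GPUs in the run.
    """
    achieved_flops_per_second = 6.0 * num_params * tokens_per_second
    return achieved_flops_per_second / peak_flops_per_second


# Hypothetical numbers: 1B parameters, 100K tokens/s, 1 PFLOP/s aggregate peak.
mfu = estimate_mfu(1e9, 1e5, 1e15)
print(f"Estimated MFU: {mfu:.0%}")
```

An estimate well below the logged value or the 30% warning threshold points to a throughput problem rather than a model-size problem.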

[!WARNING]
An unrecoverable OOM crash can corrupt checkpoints, so set --save-every to a low value initially.

Summary

Enable --fp8 for speed on modern GPUs; fallback to bfloat16 is automatic.
Tune device-batch-size (start at 32, halve on OOM); gradient accumulation adjusts automatically.
Use multi-GPU via torchrun --nproc_per_node=N for scaling.
Combine with --depth from 3.1 Configuring Model Size and Training Horizon, and see 10 Leaderboard and Optimization for leaderboard-oriented tuning.
Monitor via logs; target MFU >40% and CORE >0.256 for GPT-2 parity. See 3.2 Monitoring and Checkpoints for details.
