This section covers the CLI Chat, a lightweight, terminal-based interface for conversing with your trained nanochat models. It’s designed for end users who prefer command-line workflows, quick model testing, headless servers, or integration into scripts. The CLI Chat maintains full conversation context across multiple turns, enabling natural multi-turn dialogues, and complements the browser-based Web Chat UI for graphical interactions. To chat with a model, first train or download one via Training Base Models, Training Chat Models, or Leaderboard and Optimization. For hardware tweaks during chatting, see Hardware and Precision Options.
Overview
The CLI Chat provides a simple, interactive text interface in your terminal for testing model responses. Key capabilities include:
- Real-time message exchange: Type user messages and receive instant model replies.
- Automatic context tracking: Conversation history is preserved up to the model’s maximum context length, influencing future responses.
- Inline controls: Special commands to adjust settings, manage history, or export sessions without exiting.
- Streaming responses: Model outputs appear word-by-word for a responsive feel.
It’s optimized for single-user sessions on CPU or GPU, with low overhead for rapid iteration during model evaluation (see Model Evaluation).
Launching and Interface
To begin, open your terminal in the project directory and start the CLI Chat by specifying your trained model. The interface launches immediately, displaying a welcome banner with the model details (e.g., size, training run name) and current settings.
You’ll see:
- User: prompt (blinking cursor) for your input.
- Assistant: prefixed responses from the model.
- A status line at the top showing active context length, temperature, and device usage.
Example interaction flow:
- Type your message at the User: prompt (e.g., “Hello, what can you do?”).
- Press Enter to send.
- Watch the Assistant: response stream in real-time.
- The prompt returns to User: for the next turn.
To exit, type /exit or press Ctrl+C.
[!NOTE]
Conversations are ephemeral by default but can be saved/loaded for reuse.
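The interaction flow above can be sketched as a small read-respond loop. This is a minimal illustration, not the actual nanochat implementation: `run_chat` and `echo_model` are hypothetical names, and `echo_model` is a stand-in that simply streams the user's words back in place of a real model.

```python
import sys
from typing import Callable, Iterable, Iterator, List, Tuple

Turn = Tuple[str, str]  # (user message, assistant reply)


def echo_model(history: List[Turn], message: str) -> Iterator[str]:
    """Hypothetical stand-in for the model: streams the message back word by word."""
    for word in message.split():
        yield word + " "


def run_chat(lines: Iterable[str],
             generate: Callable[[List[Turn], str], Iterator[str]] = echo_model,
             write: Callable[[str], object] = sys.stdout.write) -> List[Turn]:
    """Run one chat session over an iterable of input lines; return the history."""
    history: List[Turn] = []
    try:
        for line in lines:
            text = line.strip()
            if not text:
                continue  # ignore empty input
            if text == "/exit":
                break
            if text == "/clear":
                history.clear()
                write("Context cleared.\n")
                continue
            if text.startswith("/"):
                # Commands are handled locally and never reach the model.
                write(f"Unknown command: {text}\n")
                continue
            write("Assistant: ")
            chunks = []
            for chunk in generate(history, text):
                write(chunk)  # emit each piece as soon as it is produced
                chunks.append(chunk)
            write("\n")
            history.append((text, "".join(chunks).strip()))
    except KeyboardInterrupt:
        write("\n")  # Ctrl+C ends the session like /exit
    return history
```

Calling `run_chat(sys.stdin)` would read messages from the terminal one line at a time. The sketch omits the User: prompt and the status line to stay short.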
Special Commands
During a chat, prefix commands with / at the User: prompt and press Enter. No model response is generated for commands.
| Command | Example | Description |
|---|---|---|
| /help | /help | Lists all available commands and current settings. |
| /clear | /clear | Resets conversation context to empty (starts fresh). |
| /save | /save my_chat.jsonl | Exports full conversation history to a JSONL file (filename required). Each line is a turn with user/assistant messages. |
| /load | /load my_chat.jsonl | Loads conversation history from a JSONL file (filename required), appending to current context. |
| /temp | /temp 1.2 | Sets sampling temperature (0.0 to 2.0; default 0.8). Higher values increase creativity/randomness. |
| /topk | /topk 50 | Sets top-K sampling (1 to 100; default 50). Limits token choices to the K most probable ones. |
| /topp | /topp 0.9 | Sets top-P (nucleus) sampling (0.0 to 1.0; default 0.9). Samples from the smallest set of tokens whose probabilities sum to P. |
| /len | /len 1024 | Sets maximum response length in tokens (64 to model’s context limit; default 512). |
| /status | /status | Shows current context token count, settings, and device memory usage. |
Changes apply immediately to the next response.
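To make the three sampling settings concrete, here is a minimal pure-Python sketch of how temperature, top-K, and top-P can be combined when picking the next token. The function name `sample_token`, the order of the filtering steps, and the treatment of temperature 0.0 as greedy decoding are assumptions for illustration; nanochat's own sampler may differ.

```python
import math
import random
from typing import List, Optional


def sample_token(logits: List[float], temperature: float = 0.8,
                 top_k: int = 50, top_p: float = 0.9,
                 rng: Optional[random.Random] = None) -> int:
    """Pick a token index from raw logits using temperature, top-K, and top-P."""
    rng = rng or random.Random()
    if temperature <= 0.0:
        # Assumption: temperature 0 means greedy decoding (always the best token).
        return max(range(len(logits)), key=lambda i: logits[i])

    # Temperature rescales logits: values above 1 flatten the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # subtract peak for stability
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-K: keep only the K most probable tokens.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    ranked = ranked[:max(1, top_k)]

    # Top-P: keep the smallest prefix whose cumulative probability reaches P.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Sample from the surviving tokens in proportion to their probabilities.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]
```

With `top_k=1` the function always returns the most probable token, which is one way to see why lower settings make replies more predictable.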
Configuration Options
Launch the CLI Chat with optional flags to set defaults. These values apply from the start of the session and stay in effect until you change them with an in-chat command such as /temp.
| Setting | Default | Options | What It Controls |
|---|---|---|---|
| Model Tag | none | Name of a trained model (e.g., d26_feb2_fp8_ratio8.25) | Loads the specified checkpoint for chatting. Required for first use. |
| Model Path | none | Full path to checkpoint directory | Alternative to Model Tag for custom/local models. |
| Device | auto | cpu, cuda | Runs inference on CPU (slower) or GPU (faster). Detects available GPU if auto. |
| Temperature | 0.8 | 0.0 to 2.0 | Initial randomness for response generation. |
| Max Context | model max (e.g., 2048 tokens) | 512 to model max | Limits total conversation history to prevent overflow. Older messages are trimmed. |
| Batch Size | 1 | 1 to 8 | Number of parallel inferences (GPU only; increases throughput for repeated prompts). |
| Precision | auto | bf16, fp8, fp32 | Inference data type (fp8 fastest on supported GPUs). |
[!WARNING]
Exceeding device memory (e.g., large model on small GPU) causes automatic fallback to CPU. Monitor with /status.
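The ranges in the table above can be checked before a session starts. The sketch below is illustrative only: `ChatSettings`, its field names, and `resolve_device` are hypothetical and do not correspond to actual nanochat flags or classes; only the ranges and defaults come from the table.

```python
from dataclasses import dataclass


@dataclass
class ChatSettings:
    """Hypothetical container for launch settings; field names are illustrative."""
    temperature: float = 0.8
    max_context: int = 2048
    batch_size: int = 1
    device: str = "auto"
    precision: str = "auto"

    def validate(self, model_max_context: int = 2048) -> None:
        """Raise ValueError if any setting falls outside the documented range."""
        if not 0.0 <= self.temperature <= 2.0:
            raise ValueError("temperature must be between 0.0 and 2.0")
        if not 512 <= self.max_context <= model_max_context:
            raise ValueError(f"max_context must be between 512 and {model_max_context}")
        if not 1 <= self.batch_size <= 8:
            raise ValueError("batch_size must be between 1 and 8")
        if self.device not in ("auto", "cpu", "cuda"):
            raise ValueError("device must be auto, cpu, or cuda")
        if self.precision not in ("auto", "bf16", "fp8", "fp32"):
            raise ValueError("precision must be auto, bf16, fp8, or fp32")


def resolve_device(requested: str, gpu_available: bool) -> str:
    """Turn the 'auto' Device setting into a concrete device name."""
    if requested == "auto":
        return "cuda" if gpu_available else "cpu"
    return requested  # explicit choices are passed through unchanged
```

For example, `ChatSettings(temperature=2.5).validate()` raises a `ValueError`, while `resolve_device("auto", gpu_available=False)` returns `"cpu"`.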
Managing Context
Context is the full history of user/assistant turns, tokenized and fed to the model each response. It grows with each exchange until hitting Max Context.
- Automatic trimming: When full, oldest turns are dropped to fit new messages.
- Visual indicator: Status line shows used/total tokens (e.g., “Context: 1450/2048”).
- Save/load workflow:
  - Type /save session.jsonl.
  - Resume later: Launch CLI Chat, type /load session.jsonl.
  - Context merges seamlessly.
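Automatic trimming can be pictured as keeping the newest turns that still fit within the token budget. The following is a minimal sketch under stated assumptions: `trim_context` and `count_tokens` are hypothetical names, whole turns are dropped at a time, and the word-count tokenizer is a stand-in for the model's real tokenizer.

```python
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (user message, assistant reply)


def count_tokens(text: str) -> int:
    """Stand-in tokenizer: counts whitespace-separated words, not real tokens."""
    return len(text.split())


def trim_context(history: List[Turn], max_tokens: int,
                 counter: Callable[[str], int] = count_tokens) -> List[Turn]:
    """Drop the oldest turns until the remaining history fits in max_tokens."""
    kept: List[Turn] = []
    used = 0
    # Walk from newest to oldest so the most recent turns are kept first.
    for user, assistant in reversed(history):
        cost = counter(user) + counter(assistant)
        if used + cost > max_tokens:
            break
        kept.append((user, assistant))
        used += cost
    kept.reverse()  # restore chronological order
    return kept
```

Dropping whole turns rather than cutting mid-message keeps every remaining user/assistant pair intact, at the cost of sometimes leaving part of the budget unused.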
graph TB
subgraph "New Session"
A["Launch CLI Chat<br/>Specify **Model Tag**"] --> B["**User:** prompt appears<br/>Context: 0 tokens"]
end
subgraph "Chatting"
B --> C["Type message<br/>Press **Enter**"] --> D["**Assistant:** streams response<br/>Context grows"]
D --> C
end
subgraph "Context Actions"
C --> E{"Special command?"}
E -->|**/clear**| F["Context reset to 0"]
E -->|**/save <file>**| G["Export JSONL<br/>Context unchanged"]
E -->|**/load <file>**| H["Append history<br/>Context updated"]
F --> C
G --> C
H --> C
end
C --> I["**/exit** or **Ctrl+C**<br/>Session ends"]
For long-term sessions, save frequently to avoid data loss on crashes.
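A save/load round trip in the spirit of /save and /load might look like the sketch below. The exact file schema is not specified in this section, so the one-object-per-line layout with `user` and `assistant` keys is an assumption, as are the function names `save_history` and `load_history`.

```python
import json
from typing import List, Tuple

Turn = Tuple[str, str]  # (user message, assistant reply)


def save_history(path: str, history: List[Turn]) -> None:
    """Write one JSON object per line, one line per turn (assumed schema)."""
    with open(path, "w", encoding="utf-8") as f:
        for user, assistant in history:
            record = {"user": user, "assistant": assistant}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_history(path: str, current: List[Turn]) -> List[Turn]:
    """Read turns from a JSONL file and append them to the current history."""
    loaded: List[Turn] = []
    with open(path, "r", encoding="utf-8") as f:
        for number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
                loaded.append((record["user"], record["assistant"]))
            except (json.JSONDecodeError, KeyError, TypeError) as err:
                raise ValueError(f"Invalid JSONL on line {number}") from err
    return current + loaded
```

Because the loaded turns are appended, the combined history can exceed Max Context, in which case the oldest turns would be trimmed as described above.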
Troubleshooting
Common issues and user-observable messages:
| Message | Severity | Meaning |
|---|---|---|
| “Model tag not found. List available with…” | Warning | Specified Model Tag doesn’t exist. Check trained models in your checkpoints directory or use Model Path. Relaunch with valid tag. |
| “Context overflow: trimming oldest turns.” | Info | History exceeded Max Context. The response still generates; use /clear to start fresh, or raise Max Context to retain more history. |
| “Falling back to CPU: insufficient GPU memory.” | Warning | GPU OOM during load/inference. Reduce Batch Size, use smaller model, or switch to Device: cpu. |
| “Invalid JSONL: corrupted or wrong format.” | Error | /load failed. Ensure file has valid chat turns (user/assistant pairs). Regenerate with /save. |
| “No GPU detected, using CPU (slower responses).” | Info | Device: auto but no CUDA GPU. Expected on non-GPU machines; responses take longer (~10x). |
[!NOTE]
Check terminal output for device stats. Slow responses? Verify GPU usage with /status and see Hardware and Precision Options.
Summary
- CLI Chat offers terminal-based, context-aware conversations for efficient model testing.
- Interact via User:/Assistant: prompts, with / commands for controls like /clear, /save, and /temp.
- Configure at launch: Model Tag, Temperature, Device, etc., for customized sessions.
- Manage context visually with token counts; export to JSONL for persistence.
- Ideal companion to Web Chat UI; evaluate post-training with Model Evaluation or submit to Leaderboard and Optimization. For base/chat model prep, see Training Base Models and Training Chat Models.