Brandon-Tiny 10M is an ultra-small instruction-following language model that outperforms models 5x its size on standard benchmarks. Built on a deep-narrow Llama 2 architecture with DenseFormer, Value Residual, and Register Tokens, trained entirely on a single RTX 3090.
## Benchmark Results
| Benchmark | Score | Random | Delta |
|---|---|---|---|
| BLiMP (Grammar) | 73.3% | 50.0% | +23.3 |
| HellaSwag (Commonsense) | 32.4% | 25.0% | +7.4 |
| ARC-Easy (Science) | 30.6% | 25.0% | +5.6 |
| PIQA (Physical Intuition) | 54.7% | 50.0% | +4.7 |
| LAMBADA (Last Word) | 8.8% | 0.0% | +8.8 |
| Wikitext-2 PPL | 224.2 | n/a | n/a |
For comparison: the Super Tiny Language Models paper's 50M model (5x our size) scored 25.6% on HellaSwag and 21% on ARC-Easy. Our 10.7M model beats both. (Note: differences in tokenizer, eval subset size, and data may contribute to the gap.)
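Multiple-choice benchmarks like HellaSwag, ARC-Easy, and PIQA are commonly scored by comparing the length-normalized log-likelihood the model assigns to each candidate answer. A minimal sketch of that scoring rule (the function name and example log-probs are illustrative, not from the repo):

```python
def pick_choice(logprobs_per_choice):
    """Select the answer whose length-normalized log-likelihood is highest.

    `logprobs_per_choice` maps each candidate continuation to the list of
    per-token log-probabilities the model assigned to it (hypothetical data).
    """
    scores = {
        choice: sum(lps) / len(lps)  # average log-prob per token
        for choice, lps in logprobs_per_choice.items()
    }
    return max(scores, key=scores.get)

# Hypothetical per-token log-probs for one 4-way item:
example = {
    "A": [-2.1, -0.9],
    "B": [-1.2, -0.8, -0.7],  # best average log-prob
    "C": [-3.0],
    "D": [-2.5, -2.5],
}
print(pick_choice(example))  # -> B
```

Length normalization matters here: without it, shorter answers would be systematically favored because they accumulate fewer negative log-prob terms.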
## 3-Phase Training Pipeline
The key innovation: each phase compensates for the limitations of the previous one.
1. **Pretrain**: Wiki + SmolLM + Synthetic
2. **Distill**: 30M teacher → 10M student
3. **Finetune**: 75K examples + anti-repetition
Result: 10M Optimal (val_loss 2.40) outperforms all 30M models (best: 2.61). Training methodology > parameter count.
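The distillation phase (30M teacher → 10M student) typically blends a soft-label KL term against the teacher's temperature-scaled distribution with the usual hard-label cross-entropy. A minimal sketch of that loss, assuming standard knowledge distillation; the `T` and `alpha` values are illustrative, not the pipeline's actual hyperparameters:

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    """Blend soft-label KL against the teacher with hard cross-entropy.

    student_logits, teacher_logits: (batch, vocab); targets: (batch,).
    T (temperature) and alpha (mixing weight) are illustrative values.
    """
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients match the unscaled loss
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1 - alpha) * hard

s = torch.randn(4, 8192)            # student logits over the 8,192-token vocab
t = torch.randn(4, 8192)            # teacher (30M) logits
y = torch.randint(0, 8192, (4,))    # hard labels
loss = distill_loss(s, t, y)
```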
## Architecture
| Property | Value |
|---|---|
| Type | Llama 2 decoder-only, deep-narrow (MobileLLM) |
| Parameters | 10,706,776 |
| Dimensions | dim=256, hidden=720 |
| Layers | 24 (12 unique, block sharing) |
| Attention | 8 heads, 2 KV heads (GQA 4:1) |
| Enhancements | DenseFormer + Value Residual + Register Tokens |
| Tokenizer | 8,192 BPE (SentencePiece), ChatML format |
| Max Sequence | 512 tokens |
| Model Size | 42.8 MB (fp32) / 21.4 MB (bf16) |
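The 4:1 grouped-query attention above (8 query heads sharing 2 KV heads) can be sketched by expanding the KV heads to match the query heads. This is a minimal illustration of GQA with the table's dimensions, not the repo's implementation:

```python
import torch

dim, n_heads, n_kv_heads = 256, 8, 2
head_dim = dim // n_heads  # 32

def gqa(q, k, v):
    """Grouped-query attention: 8 query heads share 2 KV heads (4:1).

    q: (batch, n_heads, seq, head_dim); k, v: (batch, n_kv_heads, seq, head_dim).
    """
    rep = n_heads // n_kv_heads           # 4 query heads per KV head
    k = k.repeat_interleave(rep, dim=1)   # expand KV heads to match queries
    v = v.repeat_interleave(rep, dim=1)
    att = (q @ k.transpose(-2, -1)) / head_dim ** 0.5
    return att.softmax(dim=-1) @ v

q = torch.randn(1, n_heads, 16, head_dim)
k = torch.randn(1, n_kv_heads, 16, head_dim)
v = torch.randn(1, n_kv_heads, 16, head_dim)
out = gqa(q, k, v)
print(out.shape)  # torch.Size([1, 8, 16, 32])
```

Sharing KV heads shrinks the KV cache by 4x relative to full multi-head attention, which is where most of GQA's memory savings come from.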
## Quick Start
```bash
# Install
pip install torch sentencepiece pyyaml numpy datasets

# Generate text
python scripts/chat.py --checkpoint checkpoints/10m_optimal/phase3_finetune/best.pt

# Run the full 3-phase training pipeline
python scripts/train_10m_optimal.py

# Run benchmarks
python scripts/benchmark_serious.py
```
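Since the tokenizer uses the ChatML format, prompts for the model follow the standard ChatML turn structure. A minimal sketch of prompt construction, assuming the conventional `<|im_start|>`/`<|im_end|>` delimiters (the repo's exact special tokens may differ):

```python
def chatml_prompt(user_message, system=None):
    """Build a ChatML-style prompt string.

    Uses the conventional ChatML delimiters; the repo's tokenizer
    special tokens are assumed to match, not verified.
    """
    parts = []
    if system:
        parts.append(f"<|im_start|>system\n{system}<|im_end|>")
    parts.append(f"<|im_start|>user\n{user_message}<|im_end|>")
    parts.append("<|im_start|>assistant\n")  # model completes from here
    return "\n".join(parts)

print(chatml_prompt("What is the capital of France?"))
```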
## Model Variants
We trained 8 variants to understand what works at this scale:
| Model | Params | Finetune Loss | Notes |
|---|---|---|---|
| 10M Optimal | 10.7M | 2.40 | 3-phase pipeline, best overall |
| 10M Enhanced v2 | 10.7M | 2.92 | DenseFormer+VR+Registers |
| 10M Dream | 10.7M | 2.98 | Ternary weights (failed) |
| 10M Synthetic-only | 10.7M | 3.62 | Synthetic data = poor transfer |
| 30M v2 Original | 30.0M | 2.61 | Best 30M, but 10M Optimal beats it |
| 30M v2 Wiki | 30.0M | 2.80 | Wikipedia pretrain |
| 30M Dream | 31.1M | 4.22 | Ternary, worst overall |
## Technical Report
Read the full technical report with architecture details, ablation studies, and analysis:
## Citation
```bibtex
@misc{brandon-tiny-2026,
  title={Brandon-Tiny 10M: A 3-Phase Training Pipeline for
         Ultra-Small Instruction-Following Language Models},
  author={Samuel Cortes},
  year={2026},
  url={https://xaskasdf.github.io/brandon-tiny/}
}
```