Brandon-Tiny 10M is an ultra-small instruction-following language model that outperforms models 5x its size on standard benchmarks. Built on a deep-narrow Llama 2 architecture with DenseFormer, Value Residual, and Register Tokens, trained entirely on a single RTX 3090.
## Benchmark Results
| Benchmark | Score | Random | Delta |
|---|---|---|---|
| BLiMP (Grammar) | 73.3% | 50.0% | +23.3 |
| HellaSwag (Commonsense) | 32.4% | 25.0% | +7.4 |
| ARC-Easy (Science) | 30.6% | 25.0% | +5.6 |
| PIQA (Physical Intuition) | 54.7% | 50.0% | +4.7 |
| LAMBADA (Last Word) | 8.8% | 0.0% | +8.8 |
| Wikitext-2 PPL | 224.2 | n/a | n/a |
For comparison: the Super Tiny Language Models paper's 50M model (5x our size) scored 25.6% on HellaSwag and 21% on ARC-Easy. Our 10.7M model beats both. (Note: differences in tokenizer, eval subset size, and data may contribute to the gap.)
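Multiple-choice benchmarks like HellaSwag, ARC-Easy, and PIQA are commonly scored by comparing the length-normalized log-likelihood the model assigns to each candidate answer. A minimal sketch of that scoring rule (the function name and example log-probs are illustrative, not from the repo):

```python
def pick_choice(logprobs_per_choice):
    """Select the answer whose length-normalized log-likelihood is highest.

    `logprobs_per_choice` maps each candidate continuation to the list of
    per-token log-probabilities the model assigned to it (hypothetical data).
    """
    scores = {
        choice: sum(lps) / len(lps)  # average log-prob per token
        for choice, lps in logprobs_per_choice.items()
    }
    return max(scores, key=scores.get)

# Hypothetical per-token log-probs for one 4-way item:
example = {
    "A": [-2.1, -0.9],
    "B": [-1.2, -0.8, -0.7],  # best average log-prob
    "C": [-3.0],
    "D": [-2.5, -2.5],
}
print(pick_choice(example))  # -> B
```

Length normalization matters here: without it, shorter answers would be systematically favored because they accumulate fewer negative log-prob terms.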
## 3-Phase Training Pipeline
The key innovation: each phase compensates for the limitations of the previous one.
1. **Pretrain**: Wiki + SmolLM + Synthetic
2. **Distill**: 30M teacher → 10M student
3. **Finetune**: 75K examples + anti-repetition
Result: 10M Optimal (val_loss 2.40) outperforms all 30M models (best: 2.61). Training methodology > parameter count.
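The distillation phase (30M teacher → 10M student) typically blends a soft-label KL term against the teacher's temperature-scaled distribution with the usual hard-label cross-entropy. A minimal sketch of that loss, assuming standard knowledge distillation; the `T` and `alpha` values are illustrative, not the pipeline's actual hyperparameters:

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    """Blend soft-label KL against the teacher with hard cross-entropy.

    student_logits, teacher_logits: (batch, vocab); targets: (batch,).
    T (temperature) and alpha (mixing weight) are illustrative values.
    """
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients match the unscaled loss
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1 - alpha) * hard

s = torch.randn(4, 8192)            # student logits over the 8,192-token vocab
t = torch.randn(4, 8192)            # teacher (30M) logits
y = torch.randint(0, 8192, (4,))    # hard labels
loss = distill_loss(s, t, y)
```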
## Architecture
| Property | Value |
|---|---|
| Type | Llama 2 decoder-only, deep-narrow (MobileLLM) |
| Parameters | 10,706,776 |
| Dimensions | dim=256, hidden=720 |
| Layers | 24 (12 unique, block sharing) |
| Attention | 8 heads, 2 KV heads (GQA 4:1) |
| Enhancements | DenseFormer + Value Residual + Register Tokens |
| Tokenizer | 8,192 BPE (SentencePiece), ChatML format |
| Max Sequence | 512 tokens |
| Model Size | 42.8 MB (fp32) / 21.4 MB (bf16) |
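The 4:1 grouped-query attention above (8 query heads sharing 2 KV heads) can be sketched by expanding the KV heads to match the query heads. This is a minimal illustration of GQA with the table's dimensions, not the repo's implementation:

```python
import torch

dim, n_heads, n_kv_heads = 256, 8, 2
head_dim = dim // n_heads  # 32

def gqa(q, k, v):
    """Grouped-query attention: 8 query heads share 2 KV heads (4:1).

    q: (batch, n_heads, seq, head_dim); k, v: (batch, n_kv_heads, seq, head_dim).
    """
    rep = n_heads // n_kv_heads           # 4 query heads per KV head
    k = k.repeat_interleave(rep, dim=1)   # expand KV heads to match queries
    v = v.repeat_interleave(rep, dim=1)
    att = (q @ k.transpose(-2, -1)) / head_dim ** 0.5
    return att.softmax(dim=-1) @ v

q = torch.randn(1, n_heads, 16, head_dim)
k = torch.randn(1, n_kv_heads, 16, head_dim)
v = torch.randn(1, n_kv_heads, 16, head_dim)
out = gqa(q, k, v)
print(out.shape)  # torch.Size([1, 8, 16, 32])
```

Sharing KV heads shrinks the KV cache by 4x relative to full multi-head attention, which is where most of GQA's memory savings come from.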
## Quick Start
```bash
# Install
pip install torch sentencepiece pyyaml numpy datasets

# Generate text
python scripts/chat.py --checkpoint checkpoints/10m_optimal/phase3_finetune/best.pt

# Run the full 3-phase training pipeline
python scripts/train_10m_optimal.py

# Run benchmarks
python scripts/benchmark_serious.py
```
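Since the tokenizer uses the ChatML format, prompts for the model follow the standard ChatML turn structure. A minimal sketch of prompt construction, assuming the conventional `<|im_start|>`/`<|im_end|>` delimiters (the repo's exact special tokens may differ):

```python
def chatml_prompt(user_message, system=None):
    """Build a ChatML-style prompt string.

    Uses the conventional ChatML delimiters; the repo's tokenizer
    special tokens are assumed to match, not verified.
    """
    parts = []
    if system:
        parts.append(f"<|im_start|>system\n{system}<|im_end|>")
    parts.append(f"<|im_start|>user\n{user_message}<|im_end|>")
    parts.append("<|im_start|>assistant\n")  # model completes from here
    return "\n".join(parts)

print(chatml_prompt("What is the capital of France?"))
```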
## Model Variants
We trained 8 variants to understand what works at this scale:
| Model | Params | Finetune Loss | Notes |
|---|---|---|---|
| 10M Optimal | 10.7M | 2.40 | 3-phase pipeline, best overall |
| 10M Enhanced v2 | 10.7M | 2.92 | DenseFormer+VR+Registers |
| 10M Dream | 10.7M | 2.98 | Ternary weights (failed) |
| 10M Synthetic-only | 10.7M | 3.62 | Synthetic data = poor transfer |
| 30M v2 Original | 30.0M | 2.61 | Best 30M, but 10M Optimal beats it |
| 30M v2 Wiki | 30.0M | 2.80 | Wikipedia pretrain |
| 30M Dream | 31.1M | 4.22 | Ternary, worst overall |
## Technical Report
Read the full technical report with architecture details, ablation studies, and analysis:
## Citation
```bibtex
@misc{brandon-tiny-2026,
  title={Brandon-Tiny 10M: A 3-Phase Training Pipeline for
         Ultra-Small Instruction-Following Language Models},
  author={Samuel Cortes},
  year={2026},
  url={https://xaskasdf.github.io/brandon-tiny/}
}
```