Brandon-Tiny

10.7M params, runs on a PS2

- 10.7M parameters
- 73.3% BLiMP grammar
- 32.4% HellaSwag
- ~7 hrs training time

Brandon-Tiny 10M is an ultra-small instruction-following language model that outperforms models 5x its size on standard benchmarks. Built on a deep-narrow Llama 2 architecture with DenseFormer, Value Residual, and Register Tokens, trained entirely on a single RTX 3090.

Benchmark Results

| Benchmark | Score | Random | Delta |
|---|---|---|---|
| BLiMP (Grammar) | 73.3% | 50.0% | +23.3 |
| HellaSwag (Commonsense) | 32.4% | 25.0% | +7.4 |
| ARC-Easy (Science) | 30.6% | 25.0% | +5.6 |
| PIQA (Physical Intuition) | 54.7% | 50.0% | +4.7 |
| LAMBADA (Last Word) | 8.8% | 0.0% | +8.8 |
| Wikitext-2 PPL | 224.2 | — | — |

For comparison: the Super Tiny Language Models paper's 50M model (5x our size) scored 25.6% on HellaSwag and 21% on ARC-Easy. Our 10.7M model beats both. (Note: differences in tokenizer, eval subset size, and data may contribute to the gap.)

3-Phase Training Pipeline

The key innovation: each phase compensates for the limitations of the previous one.

- Phase 1 (Foundation Pretrain): 15K steps, WSD schedule, Wiki + SmolLM + synthetic data
- Phase 2 (Knowledge Distillation): 7.5K steps, reverse KLD, 30M teacher → 10M student
- Phase 3 (Instruction Finetune): 12K steps, cosine LR, 75K examples + anti-repetition
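
Phase 1's WSD (warmup-stable-decay) schedule can be sketched as follows. The warmup and decay fractions here are illustrative assumptions, not the values used in training:

```python
def wsd_lr(step, total_steps=15_000, peak_lr=3e-3,
           warmup_frac=0.05, decay_frac=0.2):
    """Warmup-Stable-Decay schedule: linear warmup, flat plateau,
    then linear decay to zero over the final fraction of training.
    warmup_frac and decay_frac are illustrative, not the trained values."""
    warmup = int(total_steps * warmup_frac)
    decay_start = int(total_steps * (1 - decay_frac))
    if step < warmup:                 # linear warmup
        return peak_lr * step / warmup
    if step < decay_start:            # stable plateau at peak LR
        return peak_lr
    # linear decay to zero
    return peak_lr * (total_steps - step) / (total_steps - decay_start)
```

Unlike cosine, WSD keeps the LR at its peak for most of training, so the decay phase can be re-run from a plateau checkpoint when extending training.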

Result: 10M Optimal (val_loss 2.40) outperforms all 30M models (best: 2.61). Training methodology > parameter count.
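
Phase 2's reverse KLD objective, KL(student || teacher), is mode-seeking: the student concentrates mass on the teacher's high-probability tokens instead of smearing it across the whole distribution. A minimal NumPy sketch (illustrative, not the training code):

```python
import numpy as np

def log_softmax(x):
    # numerically stable log-softmax over the last axis
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def reverse_kld(student_logits, teacher_logits):
    """Reverse KL, KL(student || teacher), averaged over positions."""
    s_logp = log_softmax(student_logits)
    t_logp = log_softmax(teacher_logits)
    s_p = np.exp(s_logp)
    return (s_p * (s_logp - t_logp)).sum(axis=-1).mean()
```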

Architecture

| Spec | Value |
|---|---|
| Type | Llama 2 decoder-only, deep-narrow (MobileLLM) |
| Parameters | 10,706,776 |
| Dimensions | dim=256, hidden=720 |
| Layers | 24 (12 unique, block sharing) |
| Attention | 8 heads, 2 KV heads (GQA 4:1) |
| Enhancements | DenseFormer + Value Residual + Register Tokens |
| Tokenizer | 8,192 BPE (SentencePiece), ChatML format |
| Max sequence | 512 tokens |
| Model size | 42.8 MB (fp32) / 21.4 MB (bf16) |
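
The parameter count can be roughly reproduced from the dimensions above. This sketch assumes tied embeddings, a Llama-style SwiGLU MLP, and head_dim = dim / heads; the small remainder is plausibly the DenseFormer mixing weights and register-token embeddings:

```python
# Back-of-envelope parameter count for the 10M Optimal config.
# Assumptions: tied input/output embeddings, SwiGLU MLP, two RMSNorms
# per layer, head_dim = dim // n_heads. Weights are counted once per
# unique layer; block sharing reuses them for the 24 effective layers.
vocab, dim, hidden = 8192, 256, 720
n_unique_layers = 12
n_heads, n_kv_heads = 8, 2
head_dim = dim // n_heads          # 32

attn = dim * (n_heads * head_dim)          # q_proj
attn += 2 * dim * (n_kv_heads * head_dim)  # k_proj, v_proj (GQA)
attn += (n_heads * head_dim) * dim         # o_proj

mlp = 3 * dim * hidden                     # gate, up, down (SwiGLU)
norms = 2 * dim                            # two RMSNorms per layer

total = n_unique_layers * (attn + mlp + norms) + vocab * dim + dim
print(f"{total:,}")  # 10,705,152 — within ~1.6K of the reported 10,706,776
```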

Quick Start

# Install
pip install torch sentencepiece pyyaml numpy datasets

# Generate text
python scripts/chat.py --checkpoint checkpoints/10m_optimal/phase3_finetune/best.pt

# Run the full 3-phase training pipeline
python scripts/train_10m_optimal.py

# Run benchmarks
python scripts/benchmark_serious.py
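
The tokenizer uses the ChatML format, so prompts for chat.py follow the standard ChatML layout. A minimal prompt-assembly helper (the `<|im_start|>`/`<|im_end|>` specials are the standard ChatML tokens, assumed here to match this tokenizer's):

```python
def chatml(messages):
    """Assemble a ChatML prompt from [{'role': ..., 'content': ...}, ...],
    ending with an open assistant turn for the model to complete."""
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = chatml([{"role": "user", "content": "What is the capital of France?"}])
```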

Model Variants

We trained 8 variants to understand what works at this scale:

| Model | Params | Finetune Loss | Notes |
|---|---|---|---|
| 10M Optimal | 10.7M | 2.40 | 3-phase pipeline, best overall |
| 10M Enhanced v2 | 10.7M | 2.92 | DenseFormer + VR + Registers |
| 10M Dream | 10.7M | 2.98 | Ternary weights (failed) |
| 10M Synthetic-only | 10.7M | 3.62 | Synthetic data = poor transfer |
| 30M v2 Original | 30.0M | 2.61 | Best 30M, but 10M Optimal beats it |
| 30M v2 Wiki | 30.0M | 2.80 | Wikipedia pretrain |
| 30M Dream | 31.1M | 4.22 | Ternary, worst overall |

Technical Report

Read the full technical report with architecture details, ablation studies, and analysis:

Brandon-Tiny 10M: A 3-Phase Training Pipeline for Ultra-Small Instruction-Following Language Models →

Citation

@misc{brandon-tiny-2026,
  title={Brandon-Tiny 10M: A 3-Phase Training Pipeline for
         Ultra-Small Instruction-Following Language Models},
  author={Samuel Cortes},
  year={2026},
  url={https://xaskasdf.github.io/brandon-tiny/}
}