Small Models, Big Questions
The Real Bottleneck Isn't the Model — It's the Taxonomy
We wanted to know: can a tiny local LLM — running on a Mac Studio with no API calls — accurately classify real businesses into industry categories?
We ran five rounds of benchmarks across four models on real Boise businesses -- 20 in the early rounds, 30 by v0.6. We tested different input strategies, prompt designs, taxonomy systems, and pipeline architectures. The results tell a clear story about where the real bottleneck lives -- and how we broke through it.
The Contenders
Four local models, all run via Ollama: qwen2.5:0.5b, gemma 2b, phi3 mini, and llama3.1:8b.
Three Rounds of Experiments
Each round changed a different variable. Same 20 businesses, same 4 models. We isolated input richness, then taxonomy design.
- **v0.1:** 30 generic categories, name-only input. Do small models understand business types at all?
- **v0.2:** Same 30 categories, but 3 input types: name only, +description, full context. Does more data help?
- **v0.3:** 3 taxonomy systems: exact DB categories (425), NAICS-6, SIC-4. Does matching the target taxonomy matter?
Accuracy Heatmap
Model vs. strategy across all 3 rounds. The taxonomy dimension (v0.3) reveals the real story.
| Model | Name (v0.2) | +Desc (v0.2) | Full (v0.2) | Exact (v0.3) | NAICS (v0.3) | SIC (v0.3) |
|---|---|---|---|---|---|---|
| qwen 0.5b | 15% | 10% | 15% | 15% | 0% | 0% |
| gemma 2b | 30% | 25% | 35% | 20% | 5% | 0% |
| phi3 mini | 25% | 30% | 30% | 35% | 0% | 0% |
| llama 8b | 30% | 30% | 30% | 50% | 0% | 0% |
Read the heatmap: The v0.2 columns (input type) are flat -- more data barely helps. The v0.3 columns (taxonomy) tell the real story: exact category match jumps llama to 50%, but NAICS/SIC codes are a wall of red. The bottleneck is taxonomy design.
Best Accuracy by Model
Peak accuracy across all experiments. llama3.1:8b takes the lead in v0.3 with exact category matching.
50% With Exact Categories, But 425 Is Too Many
When we gave llama the actual 425 categories from our database, accuracy jumped from 30% to 50%. But the ceiling is still low -- here's why.
What changed: In v0.1/v0.2, compound categories like hotel_bar didn't exist in the prompt's 30 generic options, so every compound answer was wrong. In v0.3, we gave the model all 425 real categories, and llama went from 30% to 50%. But 425 categories is still too many for a small model to pick from reliably -- the remaining errors come from the model hedging between similar options or inventing slight variations (brewery/restaurant instead of brewery).
The NAICS/SIC Question
Can small models map business names to 6-digit NAICS codes or 4-digit SIC codes? The answer is unambiguous.
Key insight: Small models can classify by name ("this is a brewery") but cannot map to industry codes ("this is NAICS 312120"). Industry codes are memorization tasks, not reasoning tasks. The models hallucinate plausible-looking 6-digit numbers that bear no relation to actual NAICS/SIC tables. The solution: classify first, then map via a deterministic lookup table.
The Breakthrough: v0.6
We said fine-tuning would get us to 90%. Turns out, better taxonomy design got us to 70% with ZERO training data.
425 raw database categories collapsed into 35 super-categories. Instead of asking the model to distinguish between "wine_bar", "cocktail_bar", "dive_bar", and "sports_bar" -- we just ask: is it a bar?
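The consolidation step can be sketched as a simple mapping from raw categories to buckets. The category and super-category names below are hypothetical examples for illustration, not the project's actual 425-entry taxonomy:

```python
# Illustrative sketch of the 425 -> 35 consolidation. These names are
# hypothetical examples, not the real taxonomy.
SUPER_CATEGORY = {
    "wine_bar": "bar",
    "cocktail_bar": "bar",
    "dive_bar": "bar",
    "sports_bar": "bar",
    "hotel_bar": "bar",       # compound categories collapse into one bucket
    "brewery": "brewery",
    "brewpub": "brewery",
    # ... plus the remaining raw categories
}

def to_super_category(raw: str) -> str:
    """Collapse a raw DB category into its super-category bucket."""
    return SUPER_CATEGORY.get(raw, "other")
```

The prompt then only ever lists the ~35 bucket names, which is a far smaller target for a small model to hit.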
The model classifies into a super-category. A deterministic lookup table maps that to NAICS/SIC. The model handles what it's good at (semantics). The table handles what it can't do (codes).
The Two-Step Pipeline
Model handles semantics. Lookup table handles codes. No NAICS numbers in the prompt at all.
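A minimal sketch of the two-step pipeline, assuming a generic `ask_model` callable (e.g. a wrapper around a local Ollama call). The lookup table below is illustrative -- 312120 (breweries) comes from this write-up, and the other codes would need verification against the official NAICS tables:

```python
from typing import Callable, Optional, Tuple

# Step-2 lookup: super-category -> NAICS-6. Deterministic; never shown
# to the model. Codes here are illustrative assumptions.
NAICS_LOOKUP = {
    "brewery": "312120",
    "restaurant": "722511",
    "bar": "722410",
}

SUPER_CATEGORIES = sorted(NAICS_LOOKUP)

def classify(name: str, ask_model: Callable[[str], str]) -> Tuple[str, Optional[str]]:
    """Step 1: the model picks a super-category (pure semantics).
    Step 2: a dict lookup maps that bucket to a NAICS code."""
    prompt = (
        "Classify this business into exactly one category from the list.\n"
        f"Business: {name}\n"
        f"Categories: {', '.join(SUPER_CATEGORIES)}\n"
        "Answer with the category name only."
    )
    super_cat = ask_model(prompt).strip().lower()
    return super_cat, NAICS_LOOKUP.get(super_cat)
```

When the model names the right bucket, the code is correct by construction; when it names something outside the taxonomy, the lookup returns `None` and the record can be flagged for review instead of silently mis-coded.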
NAICS Accuracy Progression (llama3.1:8b)
- Asked for NAICS codes directly: the model hallucinated every one (0%).
- Classified into 425 exact categories, then looked up NAICS: better, but 425 categories is still too many.
- Consolidated to 35 super-categories: the model picks the right bucket; the lookup gives the right code (70%).
v0.6 Results: All 4 Models
Why it works: When super-category classification is correct, NAICS/SIC is automatically correct -- the lookup table is deterministic. The model only needs to answer "what kind of business is this?" in broad terms. It never sees a NAICS code. The 30 businesses tested in v0.6 include the original 20 plus 10 new ones across services, healthcare, and trades.
Rate Card: Cost per 1M Classifications
What does it actually cost to classify a million businesses? Local models run free. API models add up fast.
The economics: Even if claude-opus achieves 95% accuracy, the cost is $15,000 per million classifications. A fine-tuned qwen2.5:0.5b at 80% accuracy costs $0 and finishes in under 2 days. For batch classification at scale, local wins decisively.
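The arithmetic behind that comparison, using only the numbers quoted in this write-up (42 hours local, $15,000 and 58 days for claude-opus); hardware amortization and electricity are ignored:

```python
# Back-of-envelope economics for 1M classifications.
N = 1_000_000

local_hours = 42
local_ms_per_item = local_hours * 3600 * 1000 / N   # per-item latency budget

api_cost = 15_000
api_cost_per_item = api_cost / N                    # dollars per classification

print(f"local: {local_ms_per_item:.0f} ms/item, $0 total")
print(f"api:   ${api_cost_per_item:.3f}/item, ${api_cost:,} total")
```

At roughly 151 ms per item, the local pipeline clears a million rows over a weekend; the API route costs 1.5 cents per row before retries.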
Response Time
Average milliseconds per classification. All models ran locally on a Mac Studio M2 Ultra.
What We Learned
Split semantics from codes -- 0% to 70% NAICS
The v0.6 two-step pipeline (model classifies into super-category, lookup table gives NAICS/SIC) took llama3.1:8b from 0% to 70% NAICS accuracy with zero training data. The model never sees a NAICS code.
35 super-categories beat 425 exact categories
Consolidating 425 raw DB categories into 35 super-categories made the classification task tractable. With exact match, llama hit 50%. With super-categories, it hits 70% -- and gets NAICS/SIC for free.
Better taxonomy design > more training data
We got more improvement from restructuring the taxonomy (0% to 70%) than we expected from fine-tuning. The architecture matters more than the weights. Fine-tuning should push us from 70% to 90%+.
Input richness barely moves the needle
Adding descriptions and full context produced only a 2.5pp average improvement in v0.2. The bottleneck is prompt/category design, not the input data.
Local inference is absurdly cheap at scale
Classifying 1M businesses with qwen costs $0 and takes 42 hours. The same task via claude-opus would cost $15,000 and take 58 days. For batch workloads, local wins by orders of magnitude.
The Path Forward
Better taxonomy design already got us to 70% with zero training data. Fine-tuning on 3,000 labeled examples should push us to 90%+. The architecture is proven: the model handles semantics, the lookup table handles codes.
Methodology
Hardware
Mac Studio M2 Ultra, 192GB RAM. All models via Ollama (local inference, no API calls).
Dataset
30 real businesses from Boise, Idaho. Hand-labeled with primary categories, super-categories, NAICS-6, and SIC-4 codes.
Prompt
Zero-shot classification. No examples. v0.1/v0.2 used 30 generic categories. v0.3 used exact DB categories, NAICS-6, and SIC-4. v0.6 used 35 super-categories + deterministic lookup.
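A hypothetical reconstruction of the v0.6 zero-shot prompt shape -- the exact wording used in the benchmarks is not shown in this write-up, so treat this as an assumption about its structure (no examples, category list inline, answer-only instruction):

```python
# Hypothetical v0.6-style zero-shot prompt builder. Wording is assumed,
# not taken from the actual benchmark harness.
def build_prompt(name: str, super_categories: list) -> str:
    return (
        "You are a business classifier. Pick exactly one category.\n"
        f"Business name: {name}\n"
        f"Categories: {', '.join(super_categories)}\n"
        "Respond with the category name only, nothing else."
    )
```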
Timing
First-call latency excluded. Averages computed from steady-state responses per model per taxonomy.
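The timing rule above -- drop the cold first call, average the rest -- reduces to a one-liner; `steady_state_avg` is an illustrative helper name, not part of the benchmark harness:

```python
# Steady-state latency, per the methodology: the first call (model load /
# cold cache) is excluded, and the remaining latencies are averaged.
def steady_state_avg(latencies_ms: list) -> float:
    warm = latencies_ms[1:]  # drop the cold first call
    return sum(warm) / len(warm)
```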