Research

Small Models, Big Questions

The Real Bottleneck Isn't the Model — It's the Taxonomy

Cam Fortin · Product Hacker · March 2026

We wanted to know: can a tiny local LLM — running on a Mac Studio with no API calls — accurately classify real businesses into industry categories?

We ran five rounds of benchmarks across four models (plus a late-addition gemma4:e2b) on 30 real Boise businesses. We tested different input strategies, prompt designs, taxonomy systems, and pipeline architectures. The results tell a clear story about where the real bottleneck lives -- and how we broke through it.

83% · Best NAICS · gemma4:e2b + super-cats
0% → 83% · NAICS/SIC · v0.3 → v0.7 (gemma4) progression
35 · Super-categories · consolidated from 425
$0 · API cost · 100% local inference

The Contenders

qwen2.5:0.5b · Alibaba · 0.5 billion params · 397 MB
gemma2:2b · Google · 2 billion params · 1.6 GB
phi3:mini · Microsoft · 3.8 billion params · 2.2 GB
llama3.1:8b · Meta · 8 billion params · 4.9 GB
gemma4:e2b (new) · Google · 5.1 billion params (effective 2B) · 7.2 GB

Three Rounds of Experiments

Each round changed one variable at a time: same 20 businesses, same 4 models. We isolated input richness first, then taxonomy design.

v0.1 Baseline · 30 generic categories, name-only input. Do small models understand business types at all? Best: 30% (gemma2, phi3)

v0.2 Input Richness · Same 30 categories, but 3 input types: name only, +description, full context. Does more data help? Best: 35% (gemma2 + full)

v0.3 Taxonomy · 3 taxonomy systems: exact DB categories (425), NAICS-6, SIC-4. Does matching the target taxonomy matter? Best: 50% (llama + exact)

Accuracy Heatmap

Model vs. strategy across rounds v0.2 and v0.3. The taxonomy dimension (v0.3) reveals the real story.

               | v0.2 Input Type      | v0.3 Taxonomy
Model          | Name   +Desc   Full  | Exact   NAICS   SIC
qwen2.5:0.5b   | 15%    10%     15%   | 15%     0%      0%
gemma2:2b      | 30%    25%     35%   | 20%     5%      0%
phi3:mini      | 25%    30%     30%   | 35%     0%      0%
llama3.1:8b    | 30%    30%     30%   | 50%     0%      0%

Read the heatmap: the v0.2 columns (input type) are flat -- more input data barely helps. The v0.3 columns (taxonomy) tell the real story: exact category matching jumps llama to 50%, while NAICS/SIC codes sit at or near 0% for every model. The bottleneck is taxonomy design.

Best Accuracy by Model

Peak accuracy across the first three rounds. llama3.1:8b takes the lead in v0.3 with exact category matching.

qwen2.5:0.5b (397 MB) · 15% (3/20)
gemma2:2b (1.6 GB) · 35% (7/20)
phi3:mini (2.2 GB) · 35% (7/20)
llama3.1:8b (4.9 GB) · 50% (10/20)

50% With Exact Categories, But 425 Is Too Many

When we gave llama the actual 425 categories from our database, accuracy jumped from 30% to 50%. But the ceiling is still low -- here's why.

Business             Actual          Predicted           Match
Bar Gernika          bar_restaurant  bar_restaurant      Y
Modern Hotel & Bar   hotel_bar       hotel_bar           Y
Idaho Candy Co       candy_shop      candy_shop          Y
10 Barrel Brewing    brewery         brewery/restaurant  N
Negranti Creamery    dessert         ice cream shop      N

What changed: In v0.1/v0.2, compound categories like hotel_bar didn't exist in the prompt's 30 generic options, so every compound answer was wrong. In v0.3, we gave the model all 425 real categories, and llama went from 30% to 50%. But 425 categories is still too many for a small model to pick from reliably -- the errors in the remaining 50% come from the model hedging between similar options or inventing slight variations ("brewery/restaurant" instead of "brewery").
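The strict scoring behind these misses can be sketched as follows. This is an illustrative reconstruction, not the actual evaluation harness, which the post doesn't publish:

```python
# Strict exact-match scoring: any deviation from the hand label is a miss,
# including near misses like invented variants and synonyms.
def score(predictions: dict[str, str], labels: dict[str, str]) -> float:
    """Fraction of businesses whose prediction exactly equals the hand label."""
    hits = sum(predictions[biz] == labels[biz] for biz in labels)
    return hits / len(labels)

labels = {"10 Barrel Brewing": "brewery", "Negranti Creamery": "dessert"}
predictions = {
    "10 Barrel Brewing": "brewery/restaurant",  # invented variant: counts as a miss
    "Negranti Creamery": "ice cream shop",      # synonym: also a miss under strict match
}
print(score(predictions, labels))  # 0.0
```

Both answers are semantically close, but strict matching gives them no credit, which is exactly the failure mode the table above shows.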

The NAICS/SIC Question

Can small models map business names to 6-digit NAICS codes or 4-digit SIC codes? The answer is unambiguous.

NAICS-6 results (0-5% accuracy):
qwen2.5:0.5b: 0%
gemma2:2b: 5% (1 lucky guess)
phi3:mini: 0%
llama3.1:8b: 0%

SIC-4 results (0% accuracy):
qwen2.5:0.5b: 0%
gemma2:2b: 0%
phi3:mini: 0%
llama3.1:8b: 0%

Business            Model Output (NAICS)   Reality
10 Barrel Brewing   312019                 Hallucinated: not a real NAICS code
Chip Cookies        012345                 Placeholder: model gave up
ALAVITA             "to det"               Truncated: model confused

Key insight: Small models can classify by name ("this is a brewery") but cannot map to industry codes ("this is NAICS 312120"). Industry codes are memorization tasks, not reasoning tasks. The models hallucinate plausible-looking 6-digit numbers that bear no relation to actual NAICS/SIC tables. The solution: classify first, then map via a deterministic lookup table.
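Hallucinated codes are easy to catch mechanically: validate the output's shape, then check membership in the official code list. A sketch, where VALID_NAICS is a tiny illustrative subset of the real table:

```python
# Audit a model's raw NAICS output before scoring it.
# VALID_NAICS is a tiny illustrative subset; the real 2022 table has ~1,000 codes.
VALID_NAICS = {"312120", "722410", "311811"}  # breweries, drinking places, retail bakeries

def audit(raw: str) -> str:
    """Label a model's raw NAICS output: valid, hallucinated, or malformed."""
    code = raw.strip()
    if not (code.isdigit() and len(code) == 6):
        return "malformed"      # e.g. "to det" (truncated, confused output)
    if code not in VALID_NAICS:
        return "hallucinated"   # plausible-looking digits, not a real code
    return "valid"

print(audit("312019"))  # hallucinated
print(audit("to det"))  # malformed
print(audit("312120"))  # valid
```

Running this audit over the v0.3 outputs is how "a wall of hallucinated codes" becomes a measurable claim rather than an impression.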

Breakthrough

The Breakthrough: v0.6

We said fine-tuning would get us to 90%. Turns out, better taxonomy design got us to 70% with zero training data.

1 Consolidate the taxonomy

425 raw database categories collapsed into 35 super-categories. Instead of asking the model to distinguish between "wine_bar", "cocktail_bar", "dive_bar", and "sports_bar" -- we just ask: is it a bar?

2 Split semantics from codes

The model classifies into a super-category. A deterministic lookup table maps that to NAICS/SIC. The model handles what it's good at (semantics). The table handles what it can't do (codes).
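The consolidation in step 1 is just a many-to-one mapping table. A sketch, where the fine-category names are illustrative stand-ins for the real 425-entry table:

```python
# Sketch of the 425 -> 35 consolidation. Many fine-grained DB categories
# collapse into one super-category; names here are illustrative.
SUPER_CATEGORY = {
    "wine_bar": "bar", "cocktail_bar": "bar", "dive_bar": "bar", "sports_bar": "bar",
    "brewery": "brewery", "brewpub": "brewery",
    "candy_shop": "specialty_food", "ice_cream_shop": "specialty_food",
}

def consolidate(fine_category: str) -> str:
    # Unmapped categories fall through to a catch-all bucket.
    return SUPER_CATEGORY.get(fine_category, "other")

print(consolidate("dive_bar"))  # bar
```

Instead of distinguishing four kinds of bar, the model only ever has to say "bar".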

The Two-Step Pipeline

Input: "10 Barrel Brewing"  →  LLM (Step 1): brewery  →  Lookup (Step 2): NAICS 312120
Model handles semantics. Lookup table handles codes. No NAICS numbers in the prompt at all.
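A minimal sketch of the two-step pipeline. The LLM call is stubbed out here (a production version would call the local model via Ollama), and the SIC entries in the lookup table are illustrative assumptions, not values from the post:

```python
# Two-step pipeline: LLM picks a super-category, a deterministic table maps
# it to industry codes. No NAICS/SIC numbers ever appear in the prompt.
LOOKUP = {
    "brewery": {"naics": "312120", "sic": "2082"},  # SIC entries are illustrative
    "bar":     {"naics": "722410", "sic": "5813"},
}

def classify_with_llm(business_name: str) -> str:
    """Step 1 (stubbed): the model picks one of the 35 super-categories."""
    return "brewery"  # a real call would prompt the local model with the 35 options

def classify(business_name: str) -> dict:
    super_cat = classify_with_llm(business_name)  # semantics: the model's job
    codes = LOOKUP[super_cat]                     # codes: deterministic lookup
    return {"business": business_name, "super_category": super_cat, **codes}

print(classify("10 Barrel Brewing"))
# {'business': '10 Barrel Brewing', 'super_category': 'brewery', 'naics': '312120', 'sic': '2082'}
```

Because step 2 is a plain dictionary lookup, the code-level accuracy can never be worse than the super-category accuracy.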

NAICS Accuracy Progression (llama3.1:8b)

v0.3 · Direct NAICS prompting · 0%
Asked for NAICS codes directly. The model hallucinated every one.

v0.4 · Exact cats + lookup table · 48% (~14/30)
Classified into 425 exact categories, then looked up NAICS. Better, but 425 categories is still too many.

v0.6 · 35 super-cats + lookup table · 70% (21/30)
Consolidated to 35 super-categories. The model picks the right bucket; the lookup gives the right code.

v0.6 Results: All 4 Models

Model          Super-cat   NAICS   SIC    Avg Speed
llama3.1:8b    70%         70%     70%    401ms
gemma2:2b      53%         53%     57%    229ms
phi3:mini      53%         53%     53%    204ms
qwen2.5:0.5b   30%         30%     30%    164ms

Why it works: when the super-category is correct, the NAICS/SIC codes are automatically correct -- the lookup table is deterministic. The model only needs to answer "what kind of business is this?" in broad terms; it never sees a NAICS code. (gemma2's SIC score edging above its super-cat score is likely because the coarser SIC-4 codes let an occasional wrong super-category still land on the right code.) The 30 businesses tested in v0.6 include the original 20 plus 10 new ones across services, healthcare, and trades.

Rate Card: Cost per 1M Classifications

What does it actually cost to classify a million businesses? Local models run free. API models add up fast.

Model          Latency   Time for 1M    Cost       Type
qwen2.5:0.5b   150ms     41.7 hours     $0         Local
gemma2:2b      300ms     83.3 hours     $0         Local
phi3:mini      400ms     111 hours      $0         Local
llama3.1:8b    500ms     139 hours      $0         Local
gemma4:e2b     205ms     57 hours       $0         Local
claude-sonnet  ~2s       556 hours      ~$3,000    API
claude-opus    ~5s       1,389 hours    ~$15,000   API

The economics: Even if claude-opus achieves 95% accuracy, the cost is $15,000 per million classifications. A fine-tuned qwen2.5:0.5b at 80% accuracy costs $0 and finishes in under 2 days. For batch classification at scale, local wins decisively.
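The rate-card arithmetic is simple sequential throughput; a quick sanity check:

```python
# Wall-clock hours to run 1M classifications back to back at a given latency.
def hours_for_million(latency_ms: float) -> float:
    return latency_ms * 1_000_000 / 1000 / 3600  # ms -> seconds -> hours

print(round(hours_for_million(150), 1))  # qwen2.5:0.5b -> 41.7
print(round(hours_for_million(5000)))    # claude-opus  -> 1389
```

These are single-stream numbers; batching or parallel workers would shrink the wall-clock time for either option, but not the API bill.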

Response Time

Average milliseconds per classification. All models ran locally on a Mac Studio M2 Ultra.

qwen2.5:0.5b: 150ms
gemma2:2b: 300ms
phi3:mini: 400ms
gemma4:e2b (new): 205ms
llama3.1:8b: 558ms

What We Learned

1

Split semantics from codes -- 0% to 70% NAICS

The v0.6 two-step pipeline (model classifies into super-category, lookup table gives NAICS/SIC) took llama3.1:8b from 0% to 70% NAICS accuracy with zero training data. The model never sees a NAICS code.

2

35 super-categories beat 425 exact categories

Consolidating 425 raw DB categories into 35 super-categories made the classification task tractable. With exact match, llama hit 50%. With super-categories, it hits 70% -- and gets NAICS/SIC for free.

3

Better taxonomy design > more training data

We got more improvement from restructuring the taxonomy (0% to 70%) than we expected from fine-tuning. The architecture matters more than the weights. Fine-tuning should push us from 70% to 90%+.

4

Input richness barely moves the needle

Adding descriptions and full context produced only a 2.5pp average improvement in v0.2. The bottleneck is prompt/category design, not the input data.
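The 2.5pp figure falls directly out of averaging the v0.2 heatmap columns:

```python
# v0.2 accuracies per model (qwen, gemma2, phi3, llama) for two input strategies.
name_only    = [15, 30, 25, 30]
full_context = [15, 35, 30, 30]

delta = sum(full_context) / 4 - sum(name_only) / 4
print(delta)  # 2.5 (percentage points)
```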

5

Local inference is absurdly cheap at scale

Classifying 1M businesses with qwen costs $0 and takes 42 hours. The same task via claude-opus would cost $15,000 and take 58 days. For batch workloads, local wins by orders of magnitude.

The Path Forward

We said fine-tuning would get us to 90%. Turns out, better taxonomy design got us to 70% with zero training data. Fine-tuning on 3,000 labeled examples should push us to 90%+. The architecture is proven: model handles semantics, lookup table handles codes.

0% (v0.3 NAICS) → 70% (v0.6, zero-shot) → 90%+ (fine-tuned, next)

Methodology

Hardware

Mac Studio M2 Ultra, 192GB RAM. All models via Ollama (local inference, no API calls).

Dataset

30 real businesses from Boise, Idaho. Hand-labeled with primary categories, super-categories, NAICS-6, and SIC-4 codes.

Prompt

Zero-shot classification. No examples. v0.1/v0.2 used 30 generic categories. v0.3 used exact DB categories, NAICS-6, and SIC-4. v0.6 used 35 super-categories + deterministic lookup.
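The post doesn't publish the exact prompt, but a v0.6-style zero-shot prompt might look like the following sketch. The wording and the five listed categories are illustrative:

```python
# Illustrative v0.6-style prompt builder: only super-category names appear,
# never NAICS/SIC codes. SUPER_CATS shows 5 of the 35 buckets.
SUPER_CATS = ["bar", "brewery", "restaurant", "specialty_food", "hotel"]

def build_prompt(business_name: str) -> str:
    options = ", ".join(SUPER_CATS)
    return (f"Classify the business '{business_name}' into exactly one of these "
            f"categories: {options}. Reply with the category name only.")

print(build_prompt("10 Barrel Brewing"))
```

Asking for "the category name only" keeps the output short enough to match against the lookup table without extra parsing.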

Timing

First-call latency excluded. Averages computed from steady-state responses per model per taxonomy.
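The timing policy can be sketched with a stub classifier; in practice classify_fn would wrap a local Ollama call, and the first invocation (model load / warm-up) is discarded:

```python
import time

def steady_state_avg_ms(classify_fn, inputs: list[str]) -> float:
    """Average latency in ms, excluding the first (warm-up) call."""
    latencies = []
    for name in inputs:
        start = time.perf_counter()
        classify_fn(name)
        latencies.append((time.perf_counter() - start) * 1000)
    return sum(latencies[1:]) / len(latencies[1:])  # drop the first call

# Stub classifier stands in for a real local-model request.
avg = steady_state_avg_ms(lambda name: name.lower(), ["a", "b", "c", "d"])
print(f"{avg:.3f} ms")
```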

Building something similar?

Let's talk on LinkedIn

Product Hacker · Boise, Idaho · 2026