Research

Small Models, Big Questions

The Real Bottleneck Isn't the Model — It's the Taxonomy

Cam Fortin · Product Hacker · March 2026

We wanted to know: can a tiny local LLM — running on a Mac Studio with no API calls — accurately classify real businesses into industry categories?

We ran five rounds of benchmarks across four models (plus a late-addition gemma4:e2b) on 30 real Boise businesses. We tested different input strategies, prompt designs, taxonomy systems, and pipeline architectures. The results tell a clear story about where the real bottleneck lives -- and how we broke through it.

83% · Best NAICS · gemma4:e2b + super-cats
0% → 83% · NAICS/SIC · v0.3 → v0.7 (gemma4) progression
35 · Super-categories · consolidated from 425
$0 · API cost · 100% local inference

The Contenders

qwen2.5:0.5b · Alibaba · 0.5 billion params · 397 MB
gemma2:2b · Google · 2 billion params · 1.6 GB
phi3:mini · Microsoft · 3.8 billion params · 2.2 GB
llama3.1:8b · Meta · 8 billion params · 4.9 GB
gemma4:e2b (new) · Google · 5.1 billion params (effective 2B) · 7.2 GB

Three Rounds of Experiments

Each round changed one variable at a time: same 20 businesses, same 4 models. We isolated input richness first, then taxonomy design.

v0.1 Baseline · 30 generic categories, name-only input. Do small models understand business types at all? Best: 30% (gemma2, phi3)

v0.2 Input Richness · Same 30 categories, but 3 input types: name only, +description, full context. Does more data help? Best: 35% (gemma2 + full)

v0.3 Taxonomy · 3 taxonomy systems: exact DB categories (425), NAICS-6, SIC-4. Does matching the target taxonomy matter? Best: 50% (llama + exact)

Accuracy Heatmap

Model vs. strategy across rounds v0.2 and v0.3. The taxonomy dimension (v0.3) reveals the real story.

               | v0.2 Input Type      | v0.3 Taxonomy
Model          | Name   +Desc   Full  | Exact   NAICS   SIC
qwen2.5:0.5b   | 15%    10%     15%   | 15%     0%      0%
gemma2:2b      | 30%    25%     35%   | 20%     5%      0%
phi3:mini      | 25%    30%     30%   | 35%     0%      0%
llama3.1:8b    | 30%    30%     30%   | 50%     0%      0%

Read the heatmap: the v0.2 columns (input type) are flat -- more input data barely helps. The v0.3 columns (taxonomy) tell the real story: exact category matching jumps llama to 50%, while NAICS/SIC codes sit at or near 0% for every model. The bottleneck is taxonomy design.

Best Accuracy by Model

Peak accuracy across the first three rounds. llama3.1:8b takes the lead in v0.3 with exact category matching.

qwen2.5:0.5b (397 MB) · 15% (3/20)
gemma2:2b (1.6 GB) · 35% (7/20)
phi3:mini (2.2 GB) · 35% (7/20)
llama3.1:8b (4.9 GB) · 50% (10/20)

50% With Exact Categories, But 425 Is Too Many

When we gave llama the actual 425 categories from our database, accuracy jumped from 30% to 50%. But the ceiling is still low -- here's why.

Business             Actual          Predicted           Match
Bar Gernika          bar_restaurant  bar_restaurant      Y
Modern Hotel & Bar   hotel_bar       hotel_bar           Y
Idaho Candy Co       candy_shop      candy_shop          Y
10 Barrel Brewing    brewery         brewery/restaurant  N
Negranti Creamery    dessert         ice cream shop      N

What changed: In v0.1/v0.2, compound categories like hotel_bar didn't exist in the prompt's 30 generic options, so every compound answer was wrong. In v0.3, we gave the model all 425 real categories, and llama went from 30% to 50%. But 425 categories is still too many for a small model to pick from reliably -- the errors in the remaining 50% come from the model hedging between similar options or inventing slight variations ("brewery/restaurant" instead of "brewery").
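The strict scoring behind these misses can be sketched as follows. This is an illustrative reconstruction, not the actual evaluation harness, which the post doesn't publish:

```python
# Strict exact-match scoring: any deviation from the hand label is a miss,
# including near misses like invented variants and synonyms.
def score(predictions: dict[str, str], labels: dict[str, str]) -> float:
    """Fraction of businesses whose prediction exactly equals the hand label."""
    hits = sum(predictions[biz] == labels[biz] for biz in labels)
    return hits / len(labels)

labels = {"10 Barrel Brewing": "brewery", "Negranti Creamery": "dessert"}
predictions = {
    "10 Barrel Brewing": "brewery/restaurant",  # invented variant: counts as a miss
    "Negranti Creamery": "ice cream shop",      # synonym: also a miss under strict match
}
print(score(predictions, labels))  # 0.0
```

Both answers are semantically close, but strict matching gives them no credit, which is exactly the failure mode the table above shows.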

The NAICS/SIC Question

Can small models map business names to 6-digit NAICS codes or 4-digit SIC codes? The answer is unambiguous.

NAICS-6 results (0-5% accuracy):
qwen2.5:0.5b: 0%
gemma2:2b: 5% (1 lucky guess)
phi3:mini: 0%
llama3.1:8b: 0%

SIC-4 results (0% accuracy):
qwen2.5:0.5b: 0%
gemma2:2b: 0%
phi3:mini: 0%
llama3.1:8b: 0%

Business            Model Output (NAICS)   Reality
10 Barrel Brewing   312019                 Hallucinated: not a real NAICS code
Chip Cookies        012345                 Placeholder: model gave up
ALAVITA             "to det"               Truncated: model confused

Key insight: Small models can classify by name ("this is a brewery") but cannot map to industry codes ("this is NAICS 312120"). Industry codes are memorization tasks, not reasoning tasks. The models hallucinate plausible-looking 6-digit numbers that bear no relation to actual NAICS/SIC tables. The solution: classify first, then map via a deterministic lookup table.
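Hallucinated codes are easy to catch mechanically: validate the output's shape, then check membership in the official code list. A sketch, where VALID_NAICS is a tiny illustrative subset of the real table:

```python
# Audit a model's raw NAICS output before scoring it.
# VALID_NAICS is a tiny illustrative subset; the real 2022 table has ~1,000 codes.
VALID_NAICS = {"312120", "722410", "311811"}  # breweries, drinking places, retail bakeries

def audit(raw: str) -> str:
    """Label a model's raw NAICS output: valid, hallucinated, or malformed."""
    code = raw.strip()
    if not (code.isdigit() and len(code) == 6):
        return "malformed"      # e.g. "to det" (truncated, confused output)
    if code not in VALID_NAICS:
        return "hallucinated"   # plausible-looking digits, not a real code
    return "valid"

print(audit("312019"))  # hallucinated
print(audit("to det"))  # malformed
print(audit("312120"))  # valid
```

Running this audit over the v0.3 outputs is how "a wall of hallucinated codes" becomes a measurable claim rather than an impression.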

Breakthrough

The Breakthrough: v0.6

We said fine-tuning would get us to 90%. Turns out, better taxonomy design got us to 70% with zero training data.

1 Consolidate the taxonomy

425 raw database categories collapsed into 35 super-categories. Instead of asking the model to distinguish between "wine_bar", "cocktail_bar", "dive_bar", and "sports_bar" -- we just ask: is it a bar?

2 Split semantics from codes

The model classifies into a super-category. A deterministic lookup table maps that to NAICS/SIC. The model handles what it's good at (semantics). The table handles what it can't do (codes).
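The consolidation in step 1 is just a many-to-one mapping table. A sketch, where the fine-category names are illustrative stand-ins for the real 425-entry table:

```python
# Sketch of the 425 -> 35 consolidation. Many fine-grained DB categories
# collapse into one super-category; names here are illustrative.
SUPER_CATEGORY = {
    "wine_bar": "bar", "cocktail_bar": "bar", "dive_bar": "bar", "sports_bar": "bar",
    "brewery": "brewery", "brewpub": "brewery",
    "candy_shop": "specialty_food", "ice_cream_shop": "specialty_food",
}

def consolidate(fine_category: str) -> str:
    # Unmapped categories fall through to a catch-all bucket.
    return SUPER_CATEGORY.get(fine_category, "other")

print(consolidate("dive_bar"))  # bar
```

Instead of distinguishing four kinds of bar, the model only ever has to say "bar".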

The Two-Step Pipeline

Input: "10 Barrel Brewing"  →  LLM (Step 1): brewery  →  Lookup (Step 2): NAICS 312120
Model handles semantics. Lookup table handles codes. No NAICS numbers in the prompt at all.
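A minimal sketch of the two-step pipeline. The LLM call is stubbed out here (a production version would call the local model via Ollama), and the SIC entries in the lookup table are illustrative assumptions, not values from the post:

```python
# Two-step pipeline: LLM picks a super-category, a deterministic table maps
# it to industry codes. No NAICS/SIC numbers ever appear in the prompt.
LOOKUP = {
    "brewery": {"naics": "312120", "sic": "2082"},  # SIC entries are illustrative
    "bar":     {"naics": "722410", "sic": "5813"},
}

def classify_with_llm(business_name: str) -> str:
    """Step 1 (stubbed): the model picks one of the 35 super-categories."""
    return "brewery"  # a real call would prompt the local model with the 35 options

def classify(business_name: str) -> dict:
    super_cat = classify_with_llm(business_name)  # semantics: the model's job
    codes = LOOKUP[super_cat]                     # codes: deterministic lookup
    return {"business": business_name, "super_category": super_cat, **codes}

print(classify("10 Barrel Brewing"))
# {'business': '10 Barrel Brewing', 'super_category': 'brewery', 'naics': '312120', 'sic': '2082'}
```

Because step 2 is a plain dictionary lookup, the code-level accuracy can never be worse than the super-category accuracy.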

NAICS Accuracy Progression (llama3.1:8b)

v0.3 · Direct NAICS prompting · 0%
Asked for NAICS codes directly. The model hallucinated every one.

v0.4 · Exact cats + lookup table · 48% (~14/30)
Classified into 425 exact categories, then looked up NAICS. Better, but 425 categories is still too many.

v0.6 · 35 super-cats + lookup table · 70% (21/30)
Consolidated to 35 super-categories. The model picks the right bucket; the lookup gives the right code.

v0.6 Results: All 4 Models

Model          Super-cat   NAICS   SIC    Avg Speed
llama3.1:8b    70%         70%     70%    401ms
gemma2:2b      53%         53%     57%    229ms
phi3:mini      53%         53%     53%    204ms
qwen2.5:0.5b   30%         30%     30%    164ms

Why it works: when the super-category is correct, the NAICS/SIC codes are automatically correct -- the lookup table is deterministic. The model only needs to answer "what kind of business is this?" in broad terms; it never sees a NAICS code. (gemma2's SIC score edging above its super-cat score is likely because the coarser SIC-4 codes let an occasional wrong super-category still land on the right code.) The 30 businesses tested in v0.6 include the original 20 plus 10 new ones across services, healthcare, and trades.

Rate Card: Cost per 1M Classifications

What does it actually cost to classify a million businesses? Local models run free. API models add up fast.

Model          Latency   Time for 1M    Cost       Type
qwen2.5:0.5b   150ms     41.7 hours     $0         Local
gemma2:2b      300ms     83.3 hours     $0         Local
phi3:mini      400ms     111 hours      $0         Local
llama3.1:8b    500ms     139 hours      $0         Local
gemma4:e2b     205ms     57 hours       $0         Local
claude-sonnet  ~2s       556 hours      ~$3,000    API
claude-opus    ~5s       1,389 hours    ~$15,000   API

The economics: Even if claude-opus achieves 95% accuracy, the cost is $15,000 per million classifications. A fine-tuned qwen2.5:0.5b at 80% accuracy costs $0 and finishes in under 2 days. For batch classification at scale, local wins decisively.
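The rate-card arithmetic is simple sequential throughput; a quick sanity check:

```python
# Wall-clock hours to run 1M classifications back to back at a given latency.
def hours_for_million(latency_ms: float) -> float:
    return latency_ms * 1_000_000 / 1000 / 3600  # ms -> seconds -> hours

print(round(hours_for_million(150), 1))  # qwen2.5:0.5b -> 41.7
print(round(hours_for_million(5000)))    # claude-opus  -> 1389
```

These are single-stream numbers; batching or parallel workers would shrink the wall-clock time for either option, but not the API bill.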

Response Time

Average milliseconds per classification. All models ran locally on a Mac Studio M2 Ultra.

qwen2.5:0.5b: 150ms
gemma2:2b: 300ms
phi3:mini: 400ms
gemma4:e2b (new): 205ms
llama3.1:8b: 558ms

What We Learned

1

Split semantics from codes -- 0% to 70% NAICS

The v0.6 two-step pipeline (model classifies into super-category, lookup table gives NAICS/SIC) took llama3.1:8b from 0% to 70% NAICS accuracy with zero training data. The model never sees a NAICS code.

2

35 super-categories beat 425 exact categories

Consolidating 425 raw DB categories into 35 super-categories made the classification task tractable. With exact match, llama hit 50%. With super-categories, it hits 70% -- and gets NAICS/SIC for free.

3

Better taxonomy design > more training data

We got more improvement from restructuring the taxonomy (0% to 70%) than we expected from fine-tuning. The architecture matters more than the weights. Fine-tuning should push us from 70% to 90%+.

4

Input richness barely moves the needle

Adding descriptions and full context produced only a 2.5pp average improvement in v0.2. The bottleneck is prompt/category design, not the input data.
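The 2.5pp figure falls directly out of averaging the v0.2 heatmap columns:

```python
# v0.2 accuracies per model (qwen, gemma2, phi3, llama) for two input strategies.
name_only    = [15, 30, 25, 30]
full_context = [15, 35, 30, 30]

delta = sum(full_context) / 4 - sum(name_only) / 4
print(delta)  # 2.5 (percentage points)
```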

5

Local inference is absurdly cheap at scale

Classifying 1M businesses with qwen costs $0 and takes 42 hours. The same task via claude-opus would cost $15,000 and take 58 days. For batch workloads, local wins by orders of magnitude.

The Path Forward

We said fine-tuning would get us to 90%. Turns out, better taxonomy design got us to 70% with zero training data. Fine-tuning on 3,000 labeled examples should push us to 90%+. The architecture is proven: model handles semantics, lookup table handles codes.

0% (v0.3 NAICS) → 70% (v0.6, zero-shot) → 90%+ (fine-tuned, next)

Methodology

Hardware

Mac Studio M2 Ultra, 192GB RAM. All models via Ollama (local inference, no API calls).

Dataset

30 real businesses from Boise, Idaho. Hand-labeled with primary categories, super-categories, NAICS-6, and SIC-4 codes.

Prompt

Zero-shot classification. No examples. v0.1/v0.2 used 30 generic categories. v0.3 used exact DB categories, NAICS-6, and SIC-4. v0.6 used 35 super-categories + deterministic lookup.
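The post doesn't publish the exact prompt, but a v0.6-style zero-shot prompt might look like the following sketch. The wording and the five listed categories are illustrative:

```python
# Illustrative v0.6-style prompt builder: only super-category names appear,
# never NAICS/SIC codes. SUPER_CATS shows 5 of the 35 buckets.
SUPER_CATS = ["bar", "brewery", "restaurant", "specialty_food", "hotel"]

def build_prompt(business_name: str) -> str:
    options = ", ".join(SUPER_CATS)
    return (f"Classify the business '{business_name}' into exactly one of these "
            f"categories: {options}. Reply with the category name only.")

print(build_prompt("10 Barrel Brewing"))
```

Asking for "the category name only" keeps the output short enough to match against the lookup table without extra parsing.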

Timing

First-call latency excluded. Averages computed from steady-state responses per model per taxonomy.
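The timing policy can be sketched with a stub classifier; in practice classify_fn would wrap a local Ollama call, and the first invocation (model load / warm-up) is discarded:

```python
import time

def steady_state_avg_ms(classify_fn, inputs: list[str]) -> float:
    """Average latency in ms, excluding the first (warm-up) call."""
    latencies = []
    for name in inputs:
        start = time.perf_counter()
        classify_fn(name)
        latencies.append((time.perf_counter() - start) * 1000)
    return sum(latencies[1:]) / len(latencies[1:])  # drop the first call

# Stub classifier stands in for a real local-model request.
avg = steady_state_avg_ms(lambda name: name.lower(), ["a", "b", "c", "d"])
print(f"{avg:.3f} ms")
```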

Building something similar?

Let's talk on LinkedIn

Product Hacker · Boise, Idaho · 2026