The Prompt That Changed Everything
How Chain-of-Thought Took B2B Classification from 17% to 77%
The Experiment
Can a tiny local model figure out if a business sells to other businesses (B2B) or to consumers (B2C)? It sounds simple. Three possible answers: b2b, b2c, or both.
We tested five local models on 30 real Boise businesses with three different prompt strategies. No API calls. Everything running on a Mac Studio.
Why does this matter? If you sell business data to commercial real estate firms, the B2B/B2C tag determines the entire product. CRE firms want foot-traffic data for B2C tenants and lease-rate data for B2B tenants. Getting this wrong means showing the wrong product to the wrong customer.
Three Prompt Strategies
Same models, same businesses. The only variable: how we asked the question.
Strategy 1: Name Only

```
Business: "Chip Cookies"
Answer with one word: b2b, b2c, or both.
```
Older models defaulted to "both" or "b2b" as a hedge. gemma4:e2b hit 73% on name alone — matching what other models needed reasoning prompts to achieve.
Strategy 2: + Description

```
Business: "Chip Cookies"
Type: bakery
Answer with one word: b2b, b2c, or both.
```
Adding the business type helped qwen jump to 37%, but most models still over-predicted "both".
Strategy 3: + Reasoning

```
Business: "Chip Cookies"
Type: bakery
First, explain in one sentence who the primary
customer is. Then answer: b2b, b2c, or both.
```
The key insight: asking the model to explain before answering forced it to reason about the actual customer, not just pattern-match on the name.
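The three strategies differ only in the prompt template. A minimal sketch of how the experiment could be driven against a local model (the `/api/generate` endpoint is Ollama's standard generate API; the `TEMPLATES`, `build_prompt`, and `classify` names are illustrative, not from the original pipeline):

```python
import json
import urllib.request

# Prompt templates mirroring the three strategies above.
TEMPLATES = {
    "name_only": 'Business: "{name}"\nAnswer with one word: b2b, b2c, or both.',
    "description": (
        'Business: "{name}"\nType: {biz_type}\n'
        "Answer with one word: b2b, b2c, or both."
    ),
    "reasoning": (
        'Business: "{name}"\nType: {biz_type}\n'
        "First, explain in one sentence who the primary\n"
        "customer is. Then answer: b2b, b2c, or both."
    ),
}

def build_prompt(strategy: str, name: str, biz_type: str = "") -> str:
    """Fill the chosen strategy's template with a business's name and type."""
    return TEMPLATES[strategy].format(name=name, biz_type=biz_type)

def classify(model: str, prompt: str) -> str:
    """Send one prompt to a local Ollama server (assumes Ollama is running)."""
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(
            {"model": model, "prompt": prompt, "stream": False}
        ).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]
```

Swapping strategies is then just `classify("gemma2:2b", build_prompt("reasoning", "Chip Cookies", "bakery"))` against the same model and business.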
Accuracy Heatmap
5 models x 3 prompt strategies. Strict = exact match. Lenient = "both" counted as partial match.
| Model | Size | Name Only | + Description | + Reasoning | Delta |
|---|---|---|---|---|---|
| qwen2.5:0.5b | 397 MB | 17% | 37% | 40% | +23 |
| gemma2:2b | 1.6 GB | 17% | 20% | 77% | +60 |
| phi3:mini | 2.2 GB | 10% | 10% | 43% | +33 |
| llama3.1:8b | 4.9 GB | 3% | 20% | 73% | +70 |
| gemma4:e2b | NEW | 73% | 70% | 53% | -20 |
Read the table: the Reasoning column is dramatically stronger. All four original models improved. gemma2:2b jumped 60 points. llama3.1:8b jumped 70 points. (The exception is gemma4:e2b, which was already strong on name alone and lost ground with reasoning.) The Delta column tells the story: chain-of-thought is a multiplier, not an increment.
The Chain-of-Thought Effect
Why does asking a model to explain its reasoning before answering work so well? Before-and-after on the same businesses tells the story.
Why it works
Forced decomposition. When a model must explain who the customer is, it activates the reasoning pathway instead of the pattern-matching pathway. "Cookies are sold to consumers" forces b2c.
Breaks the "both" default. Without reasoning, models hedge by saying "both." With reasoning, the model has to commit to a primary customer, which usually eliminates the ambiguity.
Works even on tiny models. qwen at 397 MB jumped from 17% to 40%. You do not need a massive model to benefit from chain-of-thought.
Speed vs. Accuracy
Chain-of-thought trades roughly 2x latency for up to 4x accuracy.
The Lenient Story
Models almost never get B2B/B2C completely backwards. They just say "both" too much. Lenient scoring (counting "both" as a partial match) reveals that even the weakest prompts have a usable signal.
The signal is there. llama3.1:8b at 3% strict / 100% lenient means it answered "both" for nearly everything -- but never said a B2B business was B2C or vice versa. The model understood the direction; it just needed help committing to one answer.
Model Reasoning in Action
When we asked the models to explain before answering, they produced surprisingly coherent reasoning -- even at 2 billion parameters.
"A bakery primarily sells baked goods directly to individual consumers who walk in to purchase cookies, cakes, and pastries for personal consumption."
"Clearwater Analytics provides investment accounting and reporting software to institutional investors, asset managers, and insurance companies -- all business clients."
"A real estate brokerage primarily serves other businesses and agents in commercial and residential property transactions, acting as an intermediary between parties."
"A bar serves drinks to walk-in consumers but also hosts private events for corporate groups, making it both B2B and B2C."
Rate Card: Cost per 1M B2B/B2C Classifications
| Model | Accuracy | Avg Latency | 1M Time | Cost | Strategy |
|---|---|---|---|---|---|
| gemma4:e2b NEW | 73% | 236ms | 66 hours | $0 | Name Only |
| gemma2:2b | 77% | 499ms | 138 hours | $0 | Reasoning |
| llama3.1:8b | 73% | 801ms | 222 hours | $0 | Reasoning |
| phi3:mini | 43% | 611ms | 170 hours | $0 | Reasoning |
| qwen2.5:0.5b | 40% | 262ms | 73 hours | $0 | Reasoning |
| claude-sonnet | ~95%* | ~2s | 556 hours | ~$3,000 | API |
| claude-opus | ~98%* | ~5s | 1,389 hours | ~$15,000 | API |
*API model estimates based on comparable classification tasks. Local models run via Ollama on Mac Studio M2 Ultra.
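The "1M Time" column is just average latency scaled up, and the same arithmetic gives daily throughput. A quick sanity check (helper names are illustrative):

```python
def hours_per_million(avg_latency_ms: float) -> float:
    """Wall-clock hours to run 1M sequential classifications at a given latency."""
    return 1_000_000 * avg_latency_ms / 1000 / 3600

def per_day(avg_latency_ms: float) -> int:
    """Sequential classifications per 24-hour day at a given latency."""
    return int(24 * 3600 * 1000 / avg_latency_ms)

# gemma2:2b with reasoning: 499 ms/call -> ~138.6 hours per million,
# ~173K businesses/day on one machine (the ~170K/day figure in the takeaways).
```

This assumes a single sequential stream; batching or running models in parallel would shrink the wall-clock numbers further.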
Why This Matters
For data companies selling to commercial real estate firms, the B2B/B2C tag determines the entire product.
For B2C tenants, CRE firms want foot-traffic data, consumer demographics, peak hours, and retail sales metrics. Think restaurants, salons, gyms.
For B2B tenants, CRE firms want lease rates, employee counts, growth trajectories, and industry classification. Think law firms, SaaS companies, agencies.
Tag a law firm as B2C and your system shows them foot-traffic heatmaps. Tag a bakery as B2B and your system shows them enterprise lease comps. Both are useless. The B2B/B2C label is the routing decision that determines which product the customer sees.
Takeaways
Chain-of-thought is the single biggest lever for classification accuracy.
Bigger models, more data, and better descriptions all helped marginally. Asking the model to explain first was a 4x improvement.
A 1.6 GB model can do serious work with the right prompt.
gemma2:2b at 77% with reasoning is competitive enough for production pipelines, especially with a human review step for the remaining 23%.
Models rarely get the direction wrong. They just hedge with "both."
Lenient accuracy of 90-100% means the underlying signal is strong. The problem is decisiveness, not comprehension. Chain-of-thought fixes this.
The 397 MB qwen model at 40% with reasoning is impressive for its size.
If you need to run on constrained hardware (edge devices, low-memory VMs), qwen + chain-of-thought is a viable starting point.
The latency trade-off is acceptable.
Chain-of-thought roughly doubles inference time (181ms to 499ms for gemma). For a batch classification pipeline, 500ms per business is still fast enough to process 170K businesses per day on a single machine.
Methodology
Hardware
Mac Studio M2 Ultra, 192GB RAM. All models via Ollama (local inference, no API calls).
Dataset
30 real businesses from Boise, Idaho. Hand-labeled: 10 B2B, 19 B2C, 1 Both.
Scoring
Strict = exact label match. Lenient = "both" counts as partial match (e.g., predicting "both" for a B2C business is lenient-correct).
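The two scoring rules can be written down directly. A sketch, assuming (consistent with the 3% strict / 100% lenient example above) that a hedged "both" is lenient-correct against any true label:

```python
def strict_correct(pred: str, truth: str) -> bool:
    """Strict: the predicted label must match the hand label exactly."""
    return pred == truth

def lenient_correct(pred: str, truth: str) -> bool:
    """Lenient: exact matches count, and "both" counts as a partial match
    against any true label. An all-"both" model can therefore score 100%
    lenient while scoring near 0% strict."""
    return pred == truth or pred == "both"
```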
Timing
First-call latency excluded. Averages from steady-state responses per model per prompt strategy.
Building classification pipelines for business data?
Let's talk on LinkedIn.

Product Hacker · Boise, Idaho · 2026