
The Prompt That Changed Everything

How Chain-of-Thought Took B2B Classification from 17% to 77%

Cam Fortin · Product Hacker · April 2026
gemma2:2b, a 1.6 GB model: 17% accuracy with a name-only prompt, 77% with chain-of-thought.

The Experiment

Can a tiny local model figure out if a business sells to other businesses (B2B) or to consumers (B2C)? It sounds simple. Three possible answers: b2b, b2c, or both.

We tested 5 local models (the original 4 plus a newer gemma4:e2b) on 30 real Boise businesses with 3 different prompt strategies. No API calls. Everything ran on a Mac Studio.

Why does this matter? If you sell business data to commercial real estate firms, the B2B/B2C tag determines the entire product. CRE firms want foot-traffic data for B2C tenants and lease-rate data for B2B tenants. Getting this wrong means showing the wrong product to the wrong customer.

5 models · 30 businesses · 3 prompts · $0 in API cost

Three Prompt Strategies

Same models, same businesses. The only variable: how we asked the question.

v1 Name Only Best: 73% (gemma4)
Classify this business as b2b, b2c, or both.
Business: "Chip Cookies"
Answer with one word: b2b, b2c, or both.

Older models defaulted to "both" or "b2b" as a hedge. gemma4:e2b hit 73% on name alone — matching what other models needed reasoning prompts to achieve.

v2 + Description Best: 37% (original four models; gemma4:e2b reached 70%)
Classify this business as b2b, b2c, or both.
Business: "Chip Cookies"
Type: bakery
Answer with one word: b2b, b2c, or both.

Adding the business type helped qwen jump to 37%, but most models still over-predicted "both".

v3 + Chain-of-Thought Reasoning Best: 77%
Classify this business as b2b, b2c, or both.
Business: "Chip Cookies"
Type: bakery

First, explain in one sentence who the primary
customer is. Then answer: b2b, b2c, or both.

The key insight: asking the model to explain before answering forced it to reason about the actual customer, not just pattern-match on the name.
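
For readers who want to reproduce this, here is a minimal Python sketch of the three prompt templates and a call to a local Ollama server. It assumes Ollama's default endpoint at localhost:11434 and its /api/generate route; the PROMPTS dict and the classify() helper are our names, not part of the original pipeline.

import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint

# The three prompt templates from the experiment, parameterized per business.
PROMPTS = {
    "name_only": (
        "Classify this business as b2b, b2c, or both.\n"
        'Business: "{name}"\n'
        "Answer with one word: b2b, b2c, or both."
    ),
    "with_description": (
        "Classify this business as b2b, b2c, or both.\n"
        'Business: "{name}"\n'
        "Type: {btype}\n"
        "Answer with one word: b2b, b2c, or both."
    ),
    "with_reasoning": (
        "Classify this business as b2b, b2c, or both.\n"
        'Business: "{name}"\n'
        "Type: {btype}\n\n"
        "First, explain in one sentence who the primary\n"
        "customer is. Then answer: b2b, b2c, or both."
    ),
}

def classify(name, btype, strategy, model="gemma2:2b"):
    """Send one prompt to a local Ollama model and return its raw text response."""
    prompt = PROMPTS[strategy].format(name=name, btype=btype)
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]

# Example: the chain-of-thought prompt on one business.
print(classify("Chip Cookies", "bakery", "with_reasoning"))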

Accuracy Heatmap

5 models x 3 prompt strategies. Strict = exact match; lenient = "both" counted as a partial match. The values shown here are strict accuracy.

Model          Size     Name Only   + Description   + Reasoning   Delta
qwen2.5:0.5b   397 MB   17%         37%             40%           +23
gemma2:2b      1.6 GB   17%         20%             77%           +60
phi3:mini      2.2 GB   10%         10%             43%           +33
llama3.1:8b    4.9 GB   3%          20%             73%           +70
gemma4:e2b     (new)    73%         70%             53%           -20

Read the heatmap: the "Reasoning" column is dramatically stronger. Every one of the original four models improved: gemma2:2b jumped 60 points and llama3.1:8b jumped 70. The lone exception is the newer gemma4:e2b, which was already strong on the name alone and lost ground with reasoning. For the rest, the delta column tells the story: chain-of-thought is a multiplier, not an increment.

The Chain-of-Thought Effect

Why does asking a model to explain its reasoning before answering work so well? Before-and-after on the same businesses tells the story.

Without Reasoning (gemma2:2b, name_only: 17% accuracy)
Chip Cookies (bakery): b2b
The Grove Hotel (hotel): both
10th Street Barber Shop (barbershop): both
Drake Cooper (agency): both
Boise CrossFit (gym): both

With Reasoning (gemma2:2b, with_reasoning: 77% accuracy)
Chip Cookies (bakery): b2c
The Grove Hotel (hotel): b2c
10th Street Barber Shop (barbershop): b2c
Drake Cooper (agency): b2b
Boise CrossFit (gym): b2c

Why it works

01 Forced decomposition. When a model must explain who the customer is, it activates the reasoning pathway instead of the pattern-matching pathway. "Cookies are sold to consumers" forces b2c.

02 Breaks the "both" default. Without reasoning, models hedge by saying "both." With reasoning, the model has to commit to a primary customer, which usually eliminates the ambiguity (a parsing sketch follows this list).

03 Works even on tiny models. qwen at 397 MB jumped from 17% to 40%. You do not need a massive model to benefit from chain-of-thought.
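
Committing to a final answer also makes the output easy to parse. A small sketch, assuming the reasoning prompt shown earlier (the model explains first and gives its label last), so the last label mentioned is treated as the verdict; the function name is ours:

import re

def extract_label(response):
    """Return the last b2b/b2c/both mention in a chain-of-thought response.

    The reasoning prompt asks the model to explain first and answer last,
    so the final label in the text is taken as the verdict.
    """
    matches = re.findall(r"\b(b2b|b2c|both)\b", response.lower())
    return matches[-1] if matches else None

# "A bakery primarily sells baked goods ... for personal consumption. b2c" -> "b2c"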

Speed vs. Accuracy

Bubble size = model size; filled markers = chain-of-thought runs. The trade-off: roughly 2-3x latency for roughly 4x accuracy.

[Scatter chart: average latency (100-900ms, x-axis) vs. accuracy (0-80%, y-axis) for qwen, gemma, phi3, and llama, name-only vs. + reasoning runs]

The Lenient Story

Models almost never get B2B/B2C completely backwards. They just say "both" too much. Lenient scoring (counting "both" as a partial match) reveals that even the weakest prompts have a usable signal.

Strict vs. Lenient Accuracy (Name Only)

Model          Strict   Lenient
qwen2.5:0.5b   17%      57%
gemma2:2b      17%      97%
phi3:mini      10%      100%
llama3.1:8b    3%       100%

The signal is there. llama3.1:8b at 3% strict / 100% lenient means it answered "both" for nearly everything -- but never said a B2B business was B2C or vice versa. The model understood the direction; it just needed help committing to one answer.
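
As a rough illustration of the scoring rule (not the original harness), strict and lenient accuracy over the 30 hand-labeled businesses can be computed like this; the dict shapes and names are ours:

def score(predictions, gold):
    """Strict and lenient accuracy over {business: label} dicts.

    Strict: the prediction must match the hand label exactly.
    Lenient: predicting "both" for a b2b or b2c business also counts,
    per the write-up's definition of a partial match.
    """
    strict = lenient = 0
    for biz, truth in gold.items():
        pred = predictions.get(biz)
        if pred == truth:
            strict += 1
            lenient += 1
        elif pred == "both":
            lenient += 1
    n = len(gold)
    return strict / n, lenient / n

# llama3.1:8b, name only: near-universal "both" -> 3% strict, 100% lenient.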

Model Reasoning in Action

When we asked the models to explain before answering, they produced surprisingly coherent reasoning -- even at 2 billion parameters.

gemma2:2b on "Chip Cookies" (bakery) b2c

"A bakery primarily sells baked goods directly to individual consumers who walk in to purchase cookies, cakes, and pastries for personal consumption."

Correct. Actual: b2c
gemma2:2b on "Clearwater Analytics" (technology) b2b

"Clearwater Analytics provides investment accounting and reporting software to institutional investors, asset managers, and insurance companies -- all business clients."

Correct. Actual: b2b
llama3.1:8b on "Coldwell Banker Tomlinson" (real_estate_brokerage) b2b

"A real estate brokerage primarily serves other businesses and agents in commercial and residential property transactions, acting as an intermediary between parties."

Correct. Actual: b2b
gemma2:2b on "Bar Gibbon at The Lively" (bar) both

"A bar serves drinks to walk-in consumers but also hosts private events for corporate groups, making it both B2B and B2C."

Incorrect. Actual: b2c. The reasoning was plausible but the model over-broadened.

Rate Card: Cost per 1M B2B/B2C Classifications

Model           Accuracy   Avg Latency   Time per 1M   Cost       Strategy
gemma4:e2b      73%        236ms         66 hours      $0         Name Only
gemma2:2b       77%        499ms         138 hours     $0         Reasoning
llama3.1:8b     73%        801ms         222 hours     $0         Reasoning
phi3:mini       43%        611ms         170 hours     $0         Reasoning
qwen2.5:0.5b    40%        262ms         73 hours      $0         Reasoning
claude-sonnet   ~95%*      ~2s           556 hours     ~$3,000    API
claude-opus     ~98%*      ~5s           1,389 hours   ~$15,000   API

*API model estimates based on comparable classification tasks. Local models run via Ollama on Mac Studio M2 Ultra.
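
The "Time per 1M" column is straightforward arithmetic: average latency times one million sequential calls. A quick check, with the unit conversion written out:

def hours_per_million(avg_latency_ms):
    """Sequential wall-clock time to run 1M classifications at a given average latency."""
    return avg_latency_ms * 1_000_000 / 1000 / 3600  # ms -> seconds -> hours

print(hours_per_million(499))  # gemma2:2b, reasoning: ~138.6 hours
print(hours_per_million(236))  # gemma4:e2b, name only: ~65.6 hours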

Why This Matters

For data companies selling to commercial real estate firms, the B2B/B2C tag determines the entire product.

B2C Businesses

CRE firms want foot traffic data, consumer demographics, peak hours, and retail sales metrics. Think restaurants, salons, gyms.

B2B Businesses

CRE firms want lease rates, employee counts, growth trajectories, and industry classification. Think law firms, SaaS companies, agencies.

Tag a law firm as B2C and your system shows them foot-traffic heatmaps. Tag a bakery as B2B and your system shows them enterprise lease comps. Both are useless. The B2B/B2C label is the routing decision that determines which product the customer sees.
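
As a hypothetical sketch of that routing decision (the metric names below are illustrative, not from any real catalog), the tag can literally be the dictionary key that picks which product view a CRE customer sees:

# Hypothetical mapping from classification tag to the metrics shown for a tenant.
PRODUCTS_BY_TAG = {
    "b2c": ["foot_traffic", "consumer_demographics", "peak_hours", "retail_sales"],
    "b2b": ["lease_rates", "employee_counts", "growth_trajectory", "industry_classification"],
    "both": ["foot_traffic", "lease_rates"],  # blended view for mixed businesses
}

def products_for(tag):
    """Pick the product view for a tenant based on its B2B/B2C tag."""
    return PRODUCTS_BY_TAG.get(tag, [])

# A law firm mistagged "b2c" gets routed to foot-traffic data it has no use for.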

Takeaways

1. Chain-of-thought is the single biggest lever for classification accuracy.

Bigger models, more data, and better descriptions all helped marginally. Asking the model to explain first was a 4x improvement.

2. A 1.6 GB model can do serious work with the right prompt.

gemma2:2b at 77% with reasoning is competitive enough for production pipelines, especially with a human review step for the remaining 23%.

3. Models rarely get the direction wrong. They just hedge with "both."

Lenient accuracy of 90-100% means the underlying signal is strong. The problem is decisiveness, not comprehension. Chain-of-thought fixes this.

4. The 397 MB qwen model at 40% with reasoning is impressive for its size.

If you need to run on constrained hardware (edge devices, low-memory VMs), qwen + chain-of-thought is a viable starting point.

5. The latency trade-off is acceptable.

Chain-of-thought takes gemma2:2b from 181ms to 499ms per call, a bit under 3x. For a batch classification pipeline, 500ms per business is still fast enough to process about 170K businesses per day on a single machine.

Path Forward

77%: current (zero-shot chain-of-thought)
90%+: fine-tuned CoT
95%+: plus human review on low-confidence predictions

Methodology

Hardware

Mac Studio M2 Ultra, 192GB RAM. All models via Ollama (local inference, no API calls).

Dataset

30 real businesses from Boise, Idaho. Hand-labeled: 10 B2B, 19 B2C, 1 Both.

Scoring

Strict = exact label match. Lenient = "both" counts as partial match (e.g., predicting "both" for a B2C business is lenient-correct).

Timing

First-call latency excluded. Averages from steady-state responses per model per prompt strategy.
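
A sketch of how that steady-state timing could be collected, reusing the classify() helper from the prompt sketch above and dropping the first call so model-load time is not counted; the function name is ours:

import statistics
import time

def steady_state_latency_ms(businesses, strategy, model, warmup=1):
    """Average per-call latency in ms, excluding warm-up calls that include model load."""
    timings = []
    for i, (name, btype) in enumerate(businesses):
        start = time.perf_counter()
        classify(name, btype, strategy, model)  # helper from the prompt-template sketch
        elapsed_ms = (time.perf_counter() - start) * 1000
        if i >= warmup:  # skip first-call latency
            timings.append(elapsed_ms)
    return statistics.mean(timings)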

Building classification pipelines for business data?

Let's talk on LinkedIn

Product Hacker · Boise, Idaho · 2026