The Prompt That Changed Everything
How Chain-of-Thought Took B2B Classification from 17% to 77%
The Experiment
Can a tiny local model figure out if a business sells to other businesses (B2B) or to consumers (B2C)? It sounds simple. Three possible answers: b2b, b2c, or both.
We tested five local models on 30 real Boise businesses with three different prompt strategies. No API calls. Everything running on a Mac Studio.
Why does this matter? If you sell business data to commercial real estate firms, the B2B/B2C tag determines the entire product. CRE firms want foot-traffic data for B2C tenants and lease-rate data for B2B tenants. Getting this wrong means showing the wrong product to the wrong customer.
Three Prompt Strategies
Same models, same businesses. The only variable: how we asked the question.
Strategy 1: Name Only

```
Business: "Chip Cookies"
Answer with one word: b2b, b2c, or both.
```
Older models defaulted to "both" or "b2b" as a hedge. gemma4:e2b hit 73% on name alone — matching what other models needed reasoning prompts to achieve.
Strategy 2: + Description

```
Business: "Chip Cookies"
Type: bakery
Answer with one word: b2b, b2c, or both.
```
Adding the business type helped qwen jump to 37%, but most models still over-predicted "both".
Strategy 3: + Reasoning

```
Business: "Chip Cookies"
Type: bakery
First, explain in one sentence who the primary
customer is. Then answer: b2b, b2c, or both.
```
The key insight: asking the model to explain before answering forced it to reason about the actual customer, not just pattern-match on the name.
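The three strategies differ only in the prompt template. A minimal sketch of how the experiment could be driven against a local model (the `/api/generate` endpoint is Ollama's standard generate API; the `TEMPLATES`, `build_prompt`, and `classify` names are illustrative, not from the original pipeline):

```python
import json
import urllib.request

# Prompt templates mirroring the three strategies above.
TEMPLATES = {
    "name_only": 'Business: "{name}"\nAnswer with one word: b2b, b2c, or both.',
    "description": (
        'Business: "{name}"\nType: {biz_type}\n'
        "Answer with one word: b2b, b2c, or both."
    ),
    "reasoning": (
        'Business: "{name}"\nType: {biz_type}\n'
        "First, explain in one sentence who the primary\n"
        "customer is. Then answer: b2b, b2c, or both."
    ),
}

def build_prompt(strategy: str, name: str, biz_type: str = "") -> str:
    """Fill the chosen strategy's template with a business's name and type."""
    return TEMPLATES[strategy].format(name=name, biz_type=biz_type)

def classify(model: str, prompt: str) -> str:
    """Send one prompt to a local Ollama server (assumes Ollama is running)."""
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(
            {"model": model, "prompt": prompt, "stream": False}
        ).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]
```

Swapping strategies is then just `classify("gemma2:2b", build_prompt("reasoning", "Chip Cookies", "bakery"))` against the same model and business.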
Accuracy Heatmap
5 models x 3 prompt strategies. Strict = exact match. Lenient = "both" counted as partial match.
| Model | Size | Name Only | + Description | + Reasoning | Delta |
|---|---|---|---|---|---|
| qwen2.5:0.5b | 397 MB | 17% | 37% | 40% | +23 |
| gemma2:2b | 1.6 GB | 17% | 20% | 77% | +60 |
| phi3:mini | 2.2 GB | 10% | 10% | 43% | +33 |
| llama3.1:8b | 4.9 GB | 3% | 20% | 73% | +70 |
| gemma4:e2b | NEW | 73% | 70% | 53% | -20 |
Read the table: the Reasoning column is dramatically stronger. All four original models improved. gemma2:2b jumped 60 points. llama3.1:8b jumped 70 points. (The exception is gemma4:e2b, which was already strong on name alone and lost ground with reasoning.) The Delta column tells the story: chain-of-thought is a multiplier, not an increment.
The Chain-of-Thought Effect
Why does asking a model to explain its reasoning before answering work so well? Before-and-after on the same businesses tells the story.
Why it works
Forced decomposition. When a model must explain who the customer is, it activates the reasoning pathway instead of the pattern-matching pathway. "Cookies are sold to consumers" forces b2c.
Breaks the "both" default. Without reasoning, models hedge by saying "both." With reasoning, the model has to commit to a primary customer, which usually eliminates the ambiguity.
Works even on tiny models. qwen at 397 MB jumped from 17% to 40%. You do not need a massive model to benefit from chain-of-thought.
Speed vs. Accuracy
Chain-of-thought trades roughly 2x latency for up to 4x accuracy.
The Lenient Story
Models almost never get B2B/B2C completely backwards. They just say "both" too much. Lenient scoring (counting "both" as a partial match) reveals that even the weakest prompts have a usable signal.
The signal is there. llama3.1:8b at 3% strict / 100% lenient means it answered "both" for nearly everything -- but never said a B2B business was B2C or vice versa. The model understood the direction; it just needed help committing to one answer.
Model Reasoning in Action
When we asked the models to explain before answering, they produced surprisingly coherent reasoning -- even at 2 billion parameters.
"A bakery primarily sells baked goods directly to individual consumers who walk in to purchase cookies, cakes, and pastries for personal consumption."
"Clearwater Analytics provides investment accounting and reporting software to institutional investors, asset managers, and insurance companies -- all business clients."
"A real estate brokerage primarily serves other businesses and agents in commercial and residential property transactions, acting as an intermediary between parties."
"A bar serves drinks to walk-in consumers but also hosts private events for corporate groups, making it both B2B and B2C."
Rate Card: Cost per 1M B2B/B2C Classifications
| Model | Accuracy | Avg Latency | 1M Time | Cost | Strategy |
|---|---|---|---|---|---|
| gemma4:e2b NEW | 73% | 236ms | 66 hours | $0 | Name Only |
| gemma2:2b | 77% | 499ms | 138 hours | $0 | Reasoning |
| llama3.1:8b | 73% | 801ms | 222 hours | $0 | Reasoning |
| phi3:mini | 43% | 611ms | 170 hours | $0 | Reasoning |
| qwen2.5:0.5b | 40% | 262ms | 73 hours | $0 | Reasoning |
| claude-sonnet | ~95%* | ~2s | 556 hours | ~$3,000 | API |
| claude-opus | ~98%* | ~5s | 1,389 hours | ~$15,000 | API |
*API model estimates based on comparable classification tasks. Local models run via Ollama on Mac Studio M2 Ultra.
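The "1M Time" column is just average latency scaled up, and the same arithmetic gives daily throughput. A quick sanity check (helper names are illustrative):

```python
def hours_per_million(avg_latency_ms: float) -> float:
    """Wall-clock hours to run 1M sequential classifications at a given latency."""
    return 1_000_000 * avg_latency_ms / 1000 / 3600

def per_day(avg_latency_ms: float) -> int:
    """Sequential classifications per 24-hour day at a given latency."""
    return int(24 * 3600 * 1000 / avg_latency_ms)

# gemma2:2b with reasoning: 499 ms/call -> ~138.6 hours per million,
# ~173K businesses/day on one machine (the ~170K/day figure in the takeaways).
```

This assumes a single sequential stream; batching or running models in parallel would shrink the wall-clock numbers further.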
Why This Matters
For data companies selling to commercial real estate firms, the B2B/B2C tag determines the entire product.
For B2C tenants, CRE firms want foot-traffic data, consumer demographics, peak hours, and retail sales metrics. Think restaurants, salons, gyms.
For B2B tenants, CRE firms want lease rates, employee counts, growth trajectories, and industry classification. Think law firms, SaaS companies, agencies.
Tag a law firm as B2C and your system shows them foot-traffic heatmaps. Tag a bakery as B2B and your system shows them enterprise lease comps. Both are useless. The B2B/B2C label is the routing decision that determines which product the customer sees.
Takeaways
Chain-of-thought is the single biggest lever for classification accuracy.
Bigger models, more data, and better descriptions all helped marginally. Asking the model to explain first was a 4x improvement.
A 1.6 GB model can do serious work with the right prompt.
gemma2:2b at 77% with reasoning is competitive enough for production pipelines, especially with a human review step for the remaining 23%.
Models rarely get the direction wrong. They just hedge with "both."
Lenient accuracy of 90-100% means the underlying signal is strong. The problem is decisiveness, not comprehension. Chain-of-thought fixes this.
The 397 MB qwen model at 40% with reasoning is impressive for its size.
If you need to run on constrained hardware (edge devices, low-memory VMs), qwen + chain-of-thought is a viable starting point.
The latency trade-off is acceptable.
Chain-of-thought roughly doubles inference time (181ms to 499ms for gemma). For a batch classification pipeline, 500ms per business is still fast enough to process 170K businesses per day on a single machine.
Methodology
Hardware
Mac Studio M2 Ultra, 192GB RAM. All models via Ollama (local inference, no API calls).
Dataset
30 real businesses from Boise, Idaho. Hand-labeled: 10 B2B, 19 B2C, 1 Both.
Scoring
Strict = exact label match. Lenient = "both" counts as partial match (e.g., predicting "both" for a B2C business is lenient-correct).
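The two scoring rules can be written down directly. A sketch, assuming (consistent with the 3% strict / 100% lenient example above) that a hedged "both" is lenient-correct against any true label:

```python
def strict_correct(pred: str, truth: str) -> bool:
    """Strict: the predicted label must match the hand label exactly."""
    return pred == truth

def lenient_correct(pred: str, truth: str) -> bool:
    """Lenient: exact matches count, and "both" counts as a partial match
    against any true label. An all-"both" model can therefore score 100%
    lenient while scoring near 0% strict."""
    return pred == truth or pred == "both"
```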
Timing
First-call latency excluded. Averages from steady-state responses per model per prompt strategy.
Building classification pipelines for business data?
Let's talk on LinkedIn.

Product Hacker · Boise, Idaho · 2026