Research & Benchmarks

How we build open data infrastructure for the agent economy. Local LLMs, entity resolution, and classification at scale.

The token-maxing canyon.
Subscription buffets trained a generation of token-maxers. The buffet is closing — Claude Code is out of the $20 plan, weekly caps are tightening, and rolling windows are shrinking. Here's the math behind why it had to, and the incentive flip we should actually want.
The night we wrote our data's axioms.
From a 'Not found' page to 9 of 9 hard data invariants enforced. A six-hour autopilot cleanup that collapsed 411 duplicate company rows, canonicalized 1,042 ugly slugs, and backfilled 21,968 missing profiles — and the daily job that keeps it that way.
OnlyData already labeled 90 companies AI-native. The AR scorer ignored them.
Of 90 companies the ai_native_v1 classifier tagged as genuinely AI-native, only ONE scored above AR 50. Character.AI scored 0. Cognition/Devin scored 11. The fix isn't a learned model — it's a one-line floor.
The Shape of AI: How Similarity Scores Reveal Hidden Patterns
We embedded 24K companies into 384-dimensional vector space and explored what the geometry tells us about industry boundaries, agent readiness, and invisible connections. Interactive D3 scatters, spider charts, and force-directed graphs.
Fixing the Vibe Coder Stack: We Asked Claude to Audit Our Dataset Through Our Own MCP
Our auto-classified Vibe Coder Stack had 94 companies in 38 categories, including a gas station and an optometrist. We pointed Claude at it through the OnlyData MCP, got a brutal audit, and rebuilt it as a 90-company curated list across 9 sublists.
Eating Our Own Dogfood: How We Used Our Own MCP to Improve Our Data
We shipped the OnlyData MCP, then pointed it at our own catalog and found a massive blind spot in the agentic AI layer. 199 companies promoted from a private list to a public dataset — 155+ of them net new.
Agent Readiness: The Attribute Nobody Has
We scanned 8,250+ business websites for AI agent readiness with Algorithm E (Spread, v3). Six categories, subdomain probing, 0-100 score. Average 24. Only 9.3% grade A. Just 3% cross the well-behaved-SaaS ceiling of 50. Digits tops the list at 93.
The Prompt That Changed Everything: B2B Classification from 17% to 77%
We tested 5 local LLMs with 3 prompt strategies on 30 real Boise businesses. Chain-of-thought reasoning took accuracy from 17% to 77%. gemma4:e2b hits 73% on name alone.
Small Models, Big Questions: The Real Bottleneck Isn't the Model
5 models, 5 rounds, 30 real businesses. We went from 0% NAICS accuracy to 83% by splitting semantics from codes. The taxonomy was the bottleneck, not the model.
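
A rough sketch of what "splitting semantics from codes" could look like, under our own assumptions: the model only picks a plain-language label, and a deterministic table supplies the six-digit code. The table `NAICS_BY_LABEL`, the label wording, and the helper `resolve_naics` are hypothetical names for illustration, not the pipeline from the post; the NAICS codes themselves are real.

```python
from typing import Optional

# Hypothetical label-to-code table. The model never emits digits;
# it only chooses one of these plain-language labels.
NAICS_BY_LABEL = {
    "full-service restaurant": "722511",
    "law office": "541110",
    "plumbing or hvac contractor": "238220",
    "general auto repair": "811111",
    "dentist office": "621210",
}


def resolve_naics(model_label: str) -> Optional[str]:
    """Map a model-produced label to a NAICS code, or None if unknown."""
    # Normalize case and whitespace so small formatting drift still matches.
    key = " ".join(model_label.lower().split())
    return NAICS_BY_LABEL.get(key)
```

With this split, a small model is graded on whether it understood the business, and an unknown label returns `None` instead of a hallucinated code.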