Fixing the Vibe Coder Stack
We asked Claude to audit our own dataset through the OnlyData MCP. The verdict was honest, the gaps were obvious, and the fix shipped the same day.
The original list was clever — and broken
The Vibe Coder Stack started as an experiment in auto-curation. We had a classification pipeline that could read a company's website, decide if it was AI-native, and tag it as a tool_provider, infrastructure, or platform. Run that filter across our 24,000+ company catalog and you get a list that looks like a builder-tool universe — for free, no human curation needed.
It pulled in 94 companies. Some of them were great: Replit, Groq, Humanloop, Vast.ai, BentoML, Snyk, Apify, Slack. Real signal. The agent readiness scores worked. The B2B/B2C tags worked. The auto-pipeline was doing real work.
And then it had a gas station in it.
We asked Claude to audit it
Instead of eyeballing the list ourselves, we did the dogfood thing: we opened a Claude session, connected it to our own MCP server, and asked it to analyze the dataset and tell us what was broken. Claude pulled the data through mcp__onlydata__query_custom_dataset, ran the math, and produced this artifact — which it then saved back to our profile via mcp__onlydata__create_artifact:
Embedded artifact: Vibe Coder Stack, OnlyData dataset analysis. 94 companies, 38 categories, with interactive charts, insights, and improvement opportunities.
Headline numbers: 94 companies (screened from 24k+); 38 categories (too many for 94 cos); average AR score 25 out of 100 (median ~22); noise rate ~17% (16 off-topic entries, 78 on-topic).
AR score range: top score 100 (OnlyData Club); lowest 2 (4 companies at the floor).
Charts: B2B vs B2C split; taxonomy quality; top categories by company count; all 38 categories (the fragmentation problem); company signal quality, with red borders marking geographic data bleed (Boise/NC IT MSPs, medical offices, gas station, construction PM tools); AR score distribution; top 15 by AR score.
Written sections: What's genuinely great; What to fix; Missing big players (the obvious omissions for a 2026 vibe coder stack list).
The diagnosis was specific
Claude's audit was unflinching. The shape of the problem:
- 38 categories for 94 companies. 28 of those categories had exactly one company in them. "Technology" was the catch-all bucket holding 38% of the entire list. There was no taxonomy — there was just whatever the classifier decided to write down.
- ~17% noise rate. 16 of the 94 companies were geographic data bleed: Boise IT MSPs, North Carolina managed service providers, a gas station, an optometrist, a construction PM tool. Real businesses, but not what a vibe coder would ever look up.
- Generic descriptions. "Cutting-edge company specializing in innovative solutions" appeared verbatim ~30 times. The classifier had filled in placeholder text instead of reading the actual site.
- Missing every name that mattered. No Cursor. No Windsurf. No Vercel. No Supabase. No Anthropic. No OpenAI. No Pinecone. No LangChain. No Linear. The list of obvious omissions was longer than a lot of competing "best of" lists.
The kind thing Claude said was that the underlying signal was real. AR scoring worked, the seed companies were legitimate, the B2B/B2C tagging was useful. The infrastructure was sound. The list, as a published artifact, was not.
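The numbers in that diagnosis are all simple counts over the dataset rows. Here is a minimal sketch of how they could be reproduced, assuming each row is a dict with hypothetical `name`, `category`, and `description` fields and that the off-topic entries have already been flagged by a reviewer. It is not the analysis Claude actually ran.

```python
from collections import Counter

def audit_rows(rows, off_topic_names=frozenset()):
    """Summarize taxonomy fragmentation, noise, and repeated descriptions.

    Field names ("name", "category", "description") are assumptions made
    for this sketch, not the real OnlyData schema.
    """
    total = len(rows)
    per_category = Counter(r["category"] for r in rows)
    # Categories that hold exactly one company.
    singletons = sum(1 for n in per_category.values() if n == 1)
    top_category, top_count = per_category.most_common(1)[0]
    # Normalize descriptions so verbatim repeats collapse together.
    descriptions = Counter(r["description"].strip().lower() for r in rows)
    repeated = {d: n for d, n in descriptions.items() if n > 1}
    noise = sum(1 for r in rows if r["name"] in off_topic_names)
    return {
        "companies": total,
        "categories": len(per_category),
        "singleton_categories": singletons,
        "top_category": top_category,
        "top_category_share": top_count / total,
        "noise_rate": noise / total,
        "repeated_descriptions": repeated,
    }

# Toy rows for illustration only.
placeholder = "Cutting-edge company specializing in innovative solutions"
rows = [
    {"name": "A", "category": "Technology", "description": placeholder},
    {"name": "B", "category": "Technology", "description": placeholder},
    {"name": "C", "category": "AI Infrastructure", "description": "GPU inference cloud"},
    {"name": "D", "category": "Fuel Retail", "description": "Gas station and convenience store"},
]
report = audit_rows(rows, off_topic_names={"D"})
```

On the real list, the same counts would correspond to 38 categories, 28 singleton categories, a "Technology" share of about 38%, and a noise rate of about 17%.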
How we used our own MCP to do this
Claude pulled live data from our own catalog, ran the analysis, then wrote it back as an artifact
Three OnlyData MCP tools, one Claude session, no manual exports:
- mcp__onlydata__query_custom_dataset: Claude pulled the live Vibe Coder Stack rows directly from our Supabase, including AR scores, super_categories, descriptions, and the agent_ecosystem_role field that drove the auto-classification.
- mcp__onlydata__semantic_search: to spot-check whether obvious cos like Cursor and Vercel were missing entirely or just mis-tagged. (They were missing entirely.)
- mcp__onlydata__create_artifact: Claude generated the analysis as a single HTML artifact and saved it to our profile under onlydata-club/artifacts/. The exact file you saw embedded above is the unmodified output.
The whole loop took about ten minutes from "audit this dataset" to "here's a ranked verdict with charts." No scripts, no jq, no spreadsheet exports. Claude reads the data through the MCP tools we publish to everyone else, runs whatever analysis makes sense, and saves the result back where humans can see it. That's the dogfood.
The fix
We took Claude's recommendations literally:
- Collapsed taxonomy to 10 sublists: AI Code Editors & Copilots, AI App Builders, LLM Providers & Foundation Models, Deploy & Cloud Infra, Database & BaaS, Vector & RAG Infrastructure, AI Agent Frameworks, AI Observability & Eval, AI Inference & GPU Cloud, Dev Workflow & Productivity. Every bucket has 8–10 cos. None has 1. None is called "Technology."
- Switched from auto-classification to a curated source: the dataset now reads from source = vibe_coder_curated instead of the agent_ecosystem_role auto-tag. The 16 noise companies stop showing up because they were never on the curated list.
- Added all the missing names: Cursor, Windsurf, GitHub Copilot, Zed, Continue, Cline, Aider, Bolt.new, Lovable, v0, Anthropic, OpenAI, Mistral, DeepSeek, Vercel, Netlify, Railway, Fly.io, Cloudflare Workers, Supabase, Neon, Turso, Convex, Pinecone, Weaviate, Qdrant, Chroma, LangChain, LlamaIndex, CrewAI, Mastra, LangSmith, Helicone, Langfuse, Braintrust, OpenRouter, Hugging Face, Linear, Figma, Warp, Raycast, Sentry, PostHog. All 90 hand-curated.
- Replaced every generic description — each company gets a one-line curated description in the CSV that beats the LLM placeholder text. Then full enrichment runs on top to refresh AR scores and team data.
- Sent everything through the same pipeline: the curated rows aren't a static seed file. They get queued into od_enrichment_queue and flow through all nine enrichment models (verify, AR scoring, industry, b2b, description refresh, ai_native, agent_role, employee, mx_profile). Plus the headless team extractor runs on Stu to pull leadership from /team and /leadership pages.
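The switch from the auto-tag to the curated source boils down to changing which field the list filters on. The sketch below contrasts the two selection rules on toy rows. The row layout, the `ai_native` flag as a column, and the `"auto"` source value are assumptions for illustration; only `agent_ecosystem_role`, its three role values, and `vibe_coder_curated` come from the post.

```python
AUTO_ROLES = ("tool_provider", "infrastructure", "platform")

def select_auto(rows, roles=AUTO_ROLES):
    """Old rule: anything the classifier tagged AI-native with a builder role."""
    return [
        r for r in rows
        if r.get("ai_native") and r.get("agent_ecosystem_role") in roles
    ]

def select_curated(rows, source="vibe_coder_curated"):
    """New rule: only rows that came in through the curated seed list."""
    return [r for r in rows if r.get("source") == source]

# Toy rows; names and the "auto" source value are hypothetical.
rows = [
    {"name": "Example Gas Station", "ai_native": True,
     "agent_ecosystem_role": "tool_provider", "source": "auto"},
    {"name": "Example GPU Cloud", "ai_native": True,
     "agent_ecosystem_role": "infrastructure", "source": "vibe_coder_curated"},
    {"name": "Example Code Editor", "source": "vibe_coder_curated"},
]

auto_names = [r["name"] for r in select_auto(rows)]
curated_names = [r["name"] for r in select_curated(rows)]
```

In this toy data the misclassified gas station passes the old filter but never appears under the new one, while the code editor that the classifier never tagged shows up only once it is on the curated list.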
Before and after
Before · auto-classified
- 17% off-topic noise
- 36 cos in "Technology" catch-all
- 28 categories with 1 company each
- 0 AI code editors
- 0 LLM providers
- 0 vector databases
- 1 gas station
After · curated
- 0% noise — every entry hand-picked
- 10 cos in AI Code Editors (Cursor, Windsurf, Copilot, Zed…)
- 8–10 per sublist, all named meaningfully
- 10 LLM providers (Anthropic, OpenAI, Google, Mistral…)
- 8 vector dbs (Pinecone, Weaviate, Qdrant, Chroma…)
- 249 enrichment jobs queued
- 0 gas stations
What we actually learned
Auto-classification is a recall tool, not a curation tool. The pipeline is great at saying "out of 24,000 companies, here are the 94 that look AI-native enough to belong on a builder list." That's recall. What it cannot do is judgment — deciding whether a Boise IT MSP that scored as a "tool provider" actually belongs on a Vibe Coder Stack list. Recall is a machine job. Judgment is a human job (or, in this case, a dogfood-the-MCP job).
Curation doesn't replace the pipeline — it sits on top of it. Every curated row still flows through the full enrichment pipeline. We don't lose the AR scoring, the team extraction, the industry classification, or the description refresh. We just get to control the seed list. That distinction matters a lot when you're shipping a public dataset.
The MCP is the fastest audit tool we have. Asking Claude to look at our own data through the same MCP we ship to customers turned a "we should clean up the vibe coder list someday" into a 3-hour ship. The artifact above isn't a one-off — it's a pattern. If you publish data, you should be auditing your own data through the same interface your customers use.
Part 2 — Does the taxonomy hold up against the embeddings?
The first half of this post was about killing the auto-classification noise and standing up a clean, hand-curated 10-sublist taxonomy. Once that landed, the obvious next question is whether the taxonomy we just hand-built actually matches the semantic structure of the companies we put in it. Are AI Code Editors really one tight cluster in embedding space? Or is "Cursor" semantically closer to "Vercel" than it is to "Aider"?
To find out, we ran every one of the 90 curated companies through Stu's all-minilm embedding model (384 dimensions), then trained three classifiers leave-one-out and asked them to re-predict the human label from the embedding alone. We also ran k-means with k=10 (matching the number of sublists) for an unsupervised second opinion.
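For readers who want the mechanics, here is a minimal leave-one-out sketch of the centroid classifier described above: hold one company out, average the remaining embeddings per sublist, and predict the sublist whose centroid has the highest cosine similarity. It uses plain Python lists and 2-dimensional toy vectors in place of the real 384-dimensional embeddings, and the function names are illustrative rather than taken from the actual scripts.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def centroids(vectors, labels, skip=None):
    """Mean embedding per label, optionally leaving out one row index."""
    sums, counts = {}, {}
    for i, (vec, lab) in enumerate(zip(vectors, labels)):
        if i == skip:
            continue
        if lab not in sums:
            sums[lab] = [0.0] * len(vec)
            counts[lab] = 0
        sums[lab] = [s + x for s, x in zip(sums[lab], vec)]
        counts[lab] += 1
    return {lab: [s / counts[lab] for s in total] for lab, total in sums.items()}

def loo_centroid_predictions(vectors, labels):
    """Predict each row's label from centroids built without that row.

    If a label has a single member, holding it out leaves no centroid
    for that label, so that row is necessarily predicted elsewhere.
    """
    preds = []
    for i, vec in enumerate(vectors):
        cents = centroids(vectors, labels, skip=i)
        preds.append(max(cents, key=lambda lab: cosine(vec, cents[lab])))
    return preds

# Toy 2-D "embeddings" standing in for the 384-dim all-minilm vectors.
vectors = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0],
           [0.1, 1.0], [0.2, 0.9], [0.0, 1.0]]
labels = ["AI Code Editors & Copilots"] * 3 + ["Vector & RAG Infrastructure"] * 3

preds = loo_centroid_predictions(vectors, labels)
accuracy = sum(p == t for p, t in zip(preds, labels)) / len(labels)
```

The toy clusters are cleanly separated, so every held-out point is recovered; the real sublists overlap, which is where the mis-predictions come from.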
Headline numbers: 93 companies, all 93 fully embedded; 10 sublists in the curated taxonomy; best accuracy 62% (centroid, leave-one-out); best macro F1 0.62 (centroid wins).
Chart: Macro F1 by classifier. Three classifiers, each evaluated leave-one-out on the 93 curated companies. Centroid = mean embedding per sublist + cosine similarity. kNN = k=5 cosine. LogReg = one-vs-rest L2 on raw 384-dim vectors.
Chart: Per-sublist F1, all 3 models side by side.
Chart: PCA projection of 384-dim all-minilm embeddings into 2D. PC1 + PC2 capture ~13% of variance; the structure is real but high-dimensional. The coloring can be toggled to compare ground-truth labels against what the model thinks.
Chart: Centroid classifier confusion matrix (best model). Rows = true sublist, columns = predicted sublist. Diagonal = correct. Off-diagonal hot spots show where the embedding space pulls one sublist toward another, usually because their company descriptions share too much vocabulary.
Table: Companies the centroid model predicted into a different sublist than the human curator. Most are not "wrong"; they're cross-cutting companies whose embeddings legitimately span multiple buckets (Vercel ships an AI SDK, OpenAI ships an inference API, Hugging Face is both a model hub and an inference cloud). Each one is a candidate for cross-list membership in od_business_lists.
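The confusion matrix, the macro F1 score, and the list of mismatches can all be derived from the same pair of label lists. Below is a small stdlib-only sketch under that assumption, with made-up predictions. Macro F1 here is the unweighted mean of per-sublist F1, which is the usual definition; the post does not spell out its exact formula.

```python
from collections import Counter

def confusion(true, pred):
    """Count (true label, predicted label) pairs."""
    return Counter(zip(true, pred))

def macro_f1(true, pred):
    """Unweighted mean of per-label F1 over the labels present in `true`."""
    cm = confusion(true, pred)
    scores = []
    for lab in sorted(set(true)):
        tp = cm[(lab, lab)]
        fp = sum(n for (t, p), n in cm.items() if p == lab and t != lab)
        fn = sum(n for (t, p), n in cm.items() if t == lab and p != lab)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def mismatches(names, true, pred):
    """Companies predicted into a different sublist than the curated one."""
    return [(n, t, p) for n, t, p in zip(names, true, pred) if t != p]

# Toy example: six companies, three sublists, one disagreement.
names = ["a", "b", "c", "d", "e", "f"]
true = ["Editors", "Editors", "Deploy", "Deploy", "LLM", "LLM"]
pred = ["Editors", "Editors", "Deploy", "LLM", "LLM", "LLM"]

cm = confusion(true, pred)
score = macro_f1(true, pred)
candidates = mismatches(names, true, pred)
```

Each tuple returned by `mismatches` is a candidate for cross-list membership rather than an automatic relabel.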
k-means summary: 10 clusters (unsupervised); overall purity 54% against the taxonomy; 3 pure clusters (≥80% one sublist).
k-means run on raw embeddings with k = number of sublists, then each cluster scored by its dominant true label. 100% purity means every company in that cluster belongs to the same human-curated sublist — strong agreement between embedding geometry and taxonomy. Lower purity = more mixing.
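Purity is straightforward to compute once every company has a cluster id and a curated label. The sketch below assumes overall purity is the standard definition (the sum of each cluster's dominant-label count divided by the total number of companies); the post may aggregate it differently. The cluster assignments are toy values, not k-means output.

```python
from collections import Counter, defaultdict

def cluster_purity(cluster_ids, labels):
    """Return (overall purity, {cluster id: (dominant label, purity)})."""
    members = defaultdict(list)
    for cid, lab in zip(cluster_ids, labels):
        members[cid].append(lab)
    per_cluster = {}
    dominant_total = 0
    for cid, labs in members.items():
        # The dominant label is the most frequent curated label in the cluster.
        lab, n = Counter(labs).most_common(1)[0]
        per_cluster[cid] = (lab, n / len(labs))
        dominant_total += n
    return dominant_total / len(labels), per_cluster

# Toy assignments: three clusters over ten companies.
cluster_ids = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
labels = ["Builders", "Builders", "Builders", "Editors",
          "Editors", "Editors", "Editors",
          "LLM", "Vector", "LLM"]

overall, per_cluster = cluster_purity(cluster_ids, labels)
# Clusters where at least 80% of members share one sublist.
pure = [cid for cid, (_, p) in per_cluster.items() if p >= 0.8]
```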
What the model accidentally taught us
Centroid wins, and 62% accuracy is honest. The simplest possible model — one mean embedding per sublist, cosine similarity to predict — beat both kNN and logistic regression. That's a tell: the sublists are real (not noise), but they overlap enough that more flexible models start overfitting to the wrong axes. The 384-dim space has more directions than 93 cos can pin down.
The "wrong" answers are the most interesting part. Of the 35 companies the centroid classifier mis-predicted, most aren't actually wrong — they're cross-cutting:
- Vercel → AI Agent Frameworks. Labeled Deploy & Cloud, but the AI SDK is so prominent in their site copy that the embedding pulls them toward agent infrastructure. Both labels are correct.
- OpenAI / xAI / DeepSeek → AI Inference & GPU Cloud. Labeled LLM Providers, but they describe themselves heavily in API/tokens/latency language. The embedding can't tell "we make the model" from "we serve the model."
- Hugging Face → LLM Providers. Labeled AI Inference (because of Inference Endpoints), but the model hub identity dominates the embedding.
- Chroma → AI Code Editors. A vector DB that gets pulled toward IDEs because both heavily use "embedding" and "context" in their copy.
None of these are mis-tags. They're candidates for cross-list membership — exactly the case the new od_business_lists table is designed for. A company can have a curated entry in one list and a predicted entry in another, both legitimate.
k-means cluster purity = 54%
Without any labels, k-means recovers about half of the curated structure. Some clusters are dead-on: the AI App Builder cluster is 82% pure (Bolt, Lovable, v0, Replit all stick together), AI Code Editors hits 80% (Cursor, Windsurf, Copilot, Zed). But the AI Inference + LLM Providers + Vector clusters all bleed together because their company descriptions share too much vocabulary. That's the embedding model's ceiling — all-minilm is small (384 dims) and trained on general web text, so it can tell "AI agent" from "restaurant" easily but struggles to separate "vector database" from "inference cloud" when both companies describe themselves with the same 30 words.
How we used MCP this time
This second analysis ran through a different MCP path than the first one. The original audit used query_custom_dataset to read the Vibe Coder Stack rows. This one needed the raw 384-dim vectors, which the public MCP doesn't expose (and probably shouldn't — vectors are big, opaque, and rarely useful to a Claude session). So we did this analysis locally against Supabase, then wrote the centroid-per-sublist model out to scripts/centroids-v1.json and stood up predict_list_membership as the production interface. The MCP gets the human-readable result ("this company looks like Vector & RAG Infrastructure with 81% confidence"); the raw vectors stay in the database. That's the right division of labor.
The v1 prediction model is shipping with this post
Every curated list now has its own centroid file (scripts/centroids-v1.json, scoped by list_slug). When a new company gets ingested:
- Stu generates the embedding (Ollama all-minilm, ~50ms)
- The centroid predictor finds the nearest sublist for each curated list
- Predictions with confidence ≥ 0.55 get written to od_business_lists with source='predicted', model='centroid_v1', and the cosine similarity as the confidence score
- Curated entries (source='curated') always win on display, but predicted entries surface in cross-list discovery and "you might also belong in…" suggestions
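The steps above can be sketched as a small function that scores a new embedding against each list's centroids and emits rows only above the confidence floor. This is an illustration under assumptions: the centroid layout (a dict keyed by list slug, then sublist), the `vibe-coder-stack` slug, and the `company_id`, `sublist`, and `confidence` column names are hypothetical. The 0.55 threshold and the `source` and `model` values come from the post.

```python
import math

CONFIDENCE_FLOOR = 0.55  # threshold stated in the post

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def predict_membership(company_id, embedding, centroids_by_list,
                       floor=CONFIDENCE_FLOOR):
    """Build one candidate od_business_lists row per curated list."""
    rows = []
    for list_slug, sublists in centroids_by_list.items():
        # Nearest sublist centroid within this curated list.
        best, score = max(
            ((name, cosine(embedding, cent)) for name, cent in sublists.items()),
            key=lambda pair: pair[1],
        )
        if score >= floor:
            rows.append({
                "company_id": company_id,
                "list_slug": list_slug,
                "sublist": best,
                "source": "predicted",
                "model": "centroid_v1",
                "confidence": round(score, 4),
            })
    return rows

def display_entry(entries):
    """Curated entries always win; otherwise take the most confident prediction."""
    curated = [e for e in entries if e["source"] == "curated"]
    return curated[0] if curated else max(entries, key=lambda e: e["confidence"])

# Toy 2-D centroids standing in for the 384-dim centroid file.
centroids_by_list = {
    "vibe-coder-stack": {
        "Vector & RAG Infrastructure": [0.0, 1.0],
        "AI Code Editors & Copilots": [1.0, 0.0],
    }
}
confident = predict_membership("co_1", [0.1, 0.9], centroids_by_list)
too_weak = predict_membership("co_2", [-1.0, 0.2], centroids_by_list)
```

Keeping the display rule in its own function mirrors the point in the list above: predictions are stored alongside curated entries, and the precedence is applied only when deciding what to show.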
The schema rule from yesterday's incident still holds: industry classification lives in super_category and is owned by industry_v2; list memberships live in od_business_lists and are owned by curators (source='curated') or the centroid model (source='predicted'). The two never collide.
See the rebuilt Vibe Coder Stack
90 companies, 10 sublists, every entry hand-curated and pipeline-enriched. Click any sublist to see the full ranked list with AR scores, B2B/B2C tags, and team data.