Dogfood · Re-curated

Fixing the Vibe Coder Stack

We asked Claude to audit our own dataset through the OnlyData MCP. The verdict was honest, the gaps were obvious, and the fix shipped the same day.

April 12, 2026 · Cam Fortin · 5 min read

The original list was clever — and broken

The Vibe Coder Stack started as an experiment in auto-curation. We had a classification pipeline that could read a company's website, decide if it was AI-native, and tag it as a tool_provider, infrastructure, or platform. Run that filter across our 24,000+ company catalog and you get a list that looks like a builder-tool universe — for free, no human curation needed.
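A minimal sketch of what that filter amounts to, assuming each catalog row carries a classifier verdict and a role tag. The `Company` fields and the example rows are hypothetical; only the three role tags come from the post.

```python
from dataclasses import dataclass

# Role tags named in the post; the field names below are hypothetical.
BUILDER_ROLES = {"tool_provider", "infrastructure", "platform"}


@dataclass
class Company:
    name: str
    ai_native: bool  # the classifier's verdict, not ground truth
    role: str        # classifier-assigned role tag


def auto_curate(catalog):
    """Keep rows the classifier marked AI-native with a builder-facing role."""
    return [c for c in catalog if c.ai_native and c.role in BUILDER_ROLES]


catalog = [
    Company("Replit", True, "platform"),
    Company("Example Boise MSP", True, "tool_provider"),  # hypothetical misclassified row
    Company("Example Bakery", False, "other"),
]
selected = [c.name for c in auto_curate(catalog)]
```

The filter trusts whatever the classifier wrote, so a misclassified row passes as easily as a correct one.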

It pulled in 94 companies. Some of them were great: Replit, Groq, Humanloop, Vast.ai, BentoML, Snyk, Apify, Slack. Real signal. The agent readiness scores worked. The B2B/B2C tags worked. The auto-pipeline was doing real work.

And then it had a gas station in it.

We asked Claude to audit it

Instead of eyeballing the list ourselves, we did the dogfood thing: we opened a Claude session, connected it to our own MCP server, and asked it to analyze the dataset and tell us what was broken. Claude pulled the data through mcp__onlydata__query_custom_dataset, ran the math, and produced this artifact — which it then saved back to our profile via mcp__onlydata__create_artifact:

Original analysis · April 12, 2026 · generated by Claude via the OnlyData MCP

Vibe Coder Stack — OnlyData dataset analysis. 94 companies, 38 categories with interactive charts, insights, and improvement opportunities.

Companies: 94 (screened from 24k+)
Categories: 38 (too many for 94 cos)
Avg AR score: 25 (out of 100)
Noise rate: ~17% (off-topic entries)

B2B vs B2C split

B2C: 73, B2B: 20, Both: 1

Taxonomy quality

Catch-all: 41, Specific: 53

Top categories by company count

Technology 36 and Other 5 (both catch-all), Communication 4, several at 3.

All 38 categories — the fragmentation problem

Technology: 36, Other: 5, 28 categories with 1 company each.

Company signal quality

Off-topic: 16, On-topic: 78

Red-bordered = geographic data bleed (Boise/NC IT MSPs, medical offices, gas station, construction PM tools).

Top score: 100 (OnlyData Club)
Average: ~25 (median ~22)
Lowest: 2 (4 companies at floor)

AR score distribution

Low (1–10): 6, Mid (11–30): 63, High (31–100): 25
By bin: 1–10: 6, 11–20: 36, 21–30: 27, 31–40: 6, 41–50: 16, 51+: 3

Top 15 by AR score

OnlyData Club 100, Increase 60, Product Hacker 52, multiple at 45.

What's genuinely great

Auto-curation actually works — pulling 94 relevant companies from 24k+ is real signal. The seed set is legitimate.
AR score is the killer feature — a 0–100 agent-readiness dimension doesn't exist anywhere else. This alone makes the list worth having.
Specialty category depth — AI Observability, Silicon Photonics, Edge AI Silicon, AI Neoclouds. These show real domain awareness.
Strong anchor companies — Slack, Replit, Groq, Humanloop, Vast.ai, BentoML, Snyk, Apify. The signal is real.
B2B/B2C tagging — useful filter for devs who only care about tools with public APIs.

What to fix

Collapse taxonomy to 8–10 buckets — AI core, dev tools, deploy & cloud, data, security, comms, vertical AI, other. 38 categories for 94 companies is incoherent.
Remove geographic noise: the 16 off-topic entries (Boise/NC MSPs, a gas station, an optometrist) dilute the list and hurt credibility.
"Technology" is not a category — 38% of all companies in one bucket. Break it apart with clear intent.
Add API access + free tier fields — the two things vibe coders check first. Missing entirely.
Purge generic descriptions — "cutting-edge company specializing in innovative solutions" appears ~30 times. Replace with crisp one-liners.

Missing big players

The obvious omissions for a 2026 vibe coder stack list:

The diagnosis was specific

Claude's audit was unflinching. The shape of the problem:

The kind thing Claude said was that the underlying signal was real. AR scoring worked, the seed companies were legitimate, the B2B/B2C tagging was useful. The infrastructure was sound. The list, as a published artifact, was not.
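The audit's headline figures can be re-derived from the raw counts it reports. The sketch below only redoes that arithmetic; the variable names are ours, and every count is copied from the artifact above.

```python
# Counts copied from the audit artifact.
total_companies = 94
off_topic = 16
technology = 36
other = 5
histogram = {"1-10": 6, "11-20": 36, "21-30": 27, "31-40": 6, "41-50": 16, "51+": 3}

# Percentages quoted in the audit, rounded to whole numbers.
noise_rate_pct = round(100 * off_topic / total_companies)
technology_share_pct = round(100 * technology / total_companies)

# Taxonomy quality split: the two catch-all buckets versus everything else.
catch_all = technology + other
specific = total_companies - catch_all

# Low / Mid / High roll-up of the AR score histogram.
low = histogram["1-10"]
mid = histogram["11-20"] + histogram["21-30"]
high = histogram["31-40"] + histogram["41-50"] + histogram["51+"]

print(noise_rate_pct, technology_share_pct, catch_all, specific, low, mid, high)
```

The results match the figures in the artifact: 17% noise, 38% of companies in "Technology", a 41/53 catch-all split, and a 6/63/25 score distribution that sums to 94.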

How we used our own MCP to do this

The dogfood loop

Claude pulled live data from our own catalog, ran the analysis, then wrote it back as an artifact

Three OnlyData MCP tools, one Claude session, no manual exports:

The whole loop took about ten minutes from "audit this dataset" to "here's a ranked verdict with charts." No scripts, no jq, no spreadsheet exports. Claude reads the data through the MCP tools we publish to everyone else, runs whatever analysis makes sense, and saves the result back where humans can see it. That's the dogfood.

The fix

We took Claude's recommendations literally:

Before and after

Before · auto-classified

94 / 38
companies / categories
  • 17% off-topic noise
  • 36 cos in "Technology" catch-all
  • 28 categories with 1 company each
  • 0 AI code editors
  • 0 LLM providers
  • 0 vector databases
  • 1 gas station

After · curated

90 / 9
companies / sublists
  • 0% noise — every entry hand-picked
  • 10 cos in AI Code Editors (Cursor, Windsurf, Copilot, Zed…)
  • 8–10 per sublist, all named meaningfully
  • 10 LLM providers (Anthropic, OpenAI, Google, Mistral…)
  • 8 vector dbs (Pinecone, Weaviate, Qdrant, Chroma…)
  • 249 enrichment jobs queued
  • 0 gas stations

What we actually learned

Auto-classification is a recall tool, not a curation tool. The pipeline is great at saying "out of 24,000 companies, here are the 94 that look AI-native enough to belong on a builder list." That's recall. What it cannot do is judgment — deciding whether a Boise IT MSP that scored as a "tool provider" actually belongs on a Vibe Coder Stack list. Recall is a machine job. Judgment is a human job (or, in this case, a dogfood-the-MCP job).

Curation doesn't replace the pipeline — it sits on top of it. Every curated row still flows through the full enrichment pipeline. We don't lose the AR scoring, the team extraction, the industry classification, or the description refresh. We just get to control the seed list. That distinction matters a lot when you're shipping a public dataset.

The MCP is the fastest audit tool we have. Asking Claude to look at our own data through the same MCP we ship to customers turned a "we should clean up the vibe coder list someday" into a 3-hour ship. The artifact above isn't a one-off — it's a pattern. If you publish data, you should be auditing your own data through the same interface your customers use.

Part 2 — Does the taxonomy hold up against the embeddings?

The first half of this post was about killing the auto-classification noise and standing up a clean, hand-curated 9-sublist taxonomy. Once that landed, the obvious next question is whether the taxonomy we just hand-built actually matches the semantic structure of the companies we put in it. Are AI Code Editors really one tight cluster in embedding space? Or is "Cursor" semantically closer to "Vercel" than it is to "Aider"?

To find out, we ran every one of the 93 curated companies through Stu's all-minilm embedding model (384 dimensions), then evaluated three classifiers leave-one-out, asking each to re-predict the human label from the embedding alone. We also ran k-means with k=10 (matching the number of sublists) for an unsupervised second opinion.

Embeddings × taxonomy analysis · April 13, 2026 · 3 classifiers, 93 cos, 384-dim all-minilm

Companies: 93 (all 93 fully embedded)
Sublists: 10 (curated taxonomy)
Best accuracy: 62% (centroid LOO)
Best macro F1: 0.62 (centroid wins)

Macro F1 by classifier

Three classifiers, each evaluated leave-one-out on the 93 curated companies. Centroid = mean embedding per sublist + cosine similarity. kNN = k=5 cosine. LogReg = one-vs-rest L2 on raw 384-dim vectors.
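A minimal sketch of the centroid classifier under leave-one-out, in pure Python so it runs without dependencies. The 2-D vectors and the two labels are toy stand-ins for the 384-dim embeddings and the real sublists, and the function names are illustrative rather than taken from the production code.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def loo_centroid_predictions(vectors, labels):
    """Predict each item's label from centroids built without that item."""
    predictions = []
    for i, held_out in enumerate(vectors):
        by_label = defaultdict(list)
        for j, (vec, label) in enumerate(zip(vectors, labels)):
            if j != i:  # leave the evaluated item out of its own centroid
                by_label[label].append(vec)
        centroids = {label: mean_vector(vs) for label, vs in by_label.items()}
        predictions.append(
            max(centroids, key=lambda label: cosine(held_out, centroids[label]))
        )
    return predictions


# Toy data: the last "editors" item sits closer to the "vector" group.
vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9], [0.6, 0.8]]
labels = ["editors", "editors", "vector", "vector", "editors"]
predictions = loo_centroid_predictions(vectors, labels)
accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(labels)
```

On this toy set the last item is predicted into the other group, which is the same kind of disagreement the post describes for cross-cutting companies. A sublist with a single member has no centroid once that member is held out, so it can never be predicted correctly under this scheme.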

Per-sublist F1 — all 3 models side by side

Series: Centroid, kNN (k=5), LogReg

PCA projection of 384-dim all-minilm embeddings into 2D. PC1 + PC2 capture ~13% of variance, so the structure is real but high-dimensional. Points can be colored by ground-truth label or by model prediction.

Centroid classifier confusion matrix (best model). Rows = true sublist, columns = predicted sublist. Diagonal = correct. Off-diagonal hot spots show where the embedding space pulls one sublist toward another — usually because their company descriptions share too much vocabulary.
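For readers who want to reproduce the metrics, here is a small sketch of how confusion counts and macro F1 can be derived from a list of true and predicted labels. It assumes the standard definition of macro F1 (the unweighted mean of per-class F1); the labels are toy values, not the post's data.

```python
from collections import Counter


def confusion_counts(true_labels, predicted_labels):
    """Map (true, predicted) pairs to how often they occur."""
    return Counter(zip(true_labels, predicted_labels))


def macro_f1(true_labels, predicted_labels):
    """Unweighted mean of per-class F1 over every class that appears."""
    counts = confusion_counts(true_labels, predicted_labels)
    classes = sorted(set(true_labels) | set(predicted_labels))
    scores = []
    for cls in classes:
        tp = counts[(cls, cls)]
        fp = sum(n for (t, p), n in counts.items() if p == cls and t != cls)
        fn = sum(n for (t, p), n in counts.items() if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


true_labels = ["editors", "editors", "vector", "vector", "editors"]
predicted = ["editors", "editors", "vector", "vector", "vector"]

matrix = confusion_counts(true_labels, predicted)
score = macro_f1(true_labels, predicted)
```

Here `matrix[("editors", "vector")]` is the single off-diagonal cell: one true "editors" item predicted as "vector". Both classes score F1 = 0.8, so the macro average is 0.8.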

Companies the centroid model predicted into a different sublist than the human curator. Most are not "wrong" — they're cross-cutting companies whose embeddings legitimately span multiple buckets (Vercel ships an AI SDK, OpenAI ships an inference API, Hugging Face is both a model hub and an inference cloud). Each one is a candidate for cross-list membership in od_business_lists.

k-means clusters: 10 (unsupervised)
Overall purity: 54% (vs taxonomy)
Pure clusters: 3 (≥80% one sublist)

k-means run on raw embeddings with k = number of sublists, then each cluster scored by its dominant true label. 100% purity means every company in that cluster belongs to the same human-curated sublist — strong agreement between embedding geometry and taxonomy. Lower purity = more mixing.
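The purity score described above can be sketched as follows. This assumes the usual definition, where overall purity is the number of companies that carry their cluster's dominant label divided by the total. The cluster ids and labels are toy values, and the clustering step itself is left out.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, labels):
    """Return overall purity plus (dominant label, purity) for each cluster."""
    members = defaultdict(list)
    for cluster_id, label in zip(cluster_ids, labels):
        members[cluster_id].append(label)

    per_cluster = {}
    dominant_total = 0
    for cluster_id, cluster_labels in members.items():
        label, count = Counter(cluster_labels).most_common(1)[0]
        per_cluster[cluster_id] = (label, count / len(cluster_labels))
        dominant_total += count
    return dominant_total / len(labels), per_cluster


# Toy assignment: cluster 0 is 4/5 "builders", cluster 1 is 3/4 "editors".
cluster_ids = [0, 0, 0, 0, 0, 1, 1, 1, 1]
labels = ["builders"] * 4 + ["editors"] + ["editors"] * 3 + ["vector"]

overall, per_cluster = cluster_purity(cluster_ids, labels)
pure_clusters = [cid for cid, (_, purity) in per_cluster.items() if purity >= 0.8]
```

With these toy inputs, 7 of 9 items match their cluster's dominant label, and only cluster 0 clears the ≥80% bar used for "pure clusters" in the summary.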

What the model accidentally taught us

Centroid wins, and 62% accuracy is honest. The simplest possible model — one mean embedding per sublist, cosine similarity to predict — beat both kNN and logistic regression. That's a tell: the sublists are real (not noise), but they overlap enough that more flexible models start overfitting to the wrong axes. The 384-dim space has more directions than 93 cos can pin down.

The "wrong" answers are the most interesting part. Of the 35 companies the centroid classifier mis-predicted, most aren't actually wrong — they're cross-cutting:

None of these are mis-tags. They're candidates for cross-list membership — exactly the case the new od_business_lists table is designed for. A company can have a curated entry in one list and a predicted entry in another, both legitimate.

k-means cluster purity = 54%

Without any labels, k-means recovers about half of the curated structure. Some clusters are dead-on: the AI App Builder cluster is 82% pure (Bolt, Lovable, v0, Replit all stick together), AI Code Editors hits 80% (Cursor, Windsurf, Copilot, Zed). But the AI Inference + LLM Providers + Vector clusters all bleed together because their company descriptions share too much vocabulary. That's the embedding model's ceiling — all-minilm is small (384 dims) and trained on general web text, so it can tell "AI agent" from "restaurant" easily but struggles to separate "vector database" from "inference cloud" when both companies describe themselves with the same 30 words.

How we used MCP this time

This second analysis ran through a different MCP path than the first one. The original audit used query_custom_dataset to read the Vibe Coder Stack rows. This one needed the raw 384-dim vectors, which the public MCP doesn't expose (and probably shouldn't — vectors are big, opaque, and rarely useful to a Claude session). So we did this analysis locally against Supabase, then wrote the centroid-per-sublist model out to scripts/centroids-v1.json and stood up predict_list_membership as the production interface. The MCP gets the human-readable result ("this company looks like Vector & RAG Infrastructure with 81% confidence"); the raw vectors stay in the database. That's the right division of labor.
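A rough sketch of how a centroid-per-sublist file could be built and serialized. The JSON layout (list slug, then sublist name, then mean vector), the slug "vibe-coder-stack", and the function name are assumptions for illustration; the post does not describe the actual format of scripts/centroids-v1.json.

```python
import json
from collections import defaultdict


def build_centroids(list_slug, rows):
    """rows: iterable of (sublist_name, embedding) pairs for one curated list."""
    grouped = defaultdict(list)
    for sublist, embedding in rows:
        grouped[sublist].append(embedding)
    centroids = {
        sublist: [sum(col) / len(vecs) for col in zip(*vecs)]
        for sublist, vecs in grouped.items()
    }
    # Assumed layout: one top-level key per curated list.
    return {list_slug: centroids}


# Toy 2-D embeddings standing in for the 384-dim vectors.
rows = [
    ("AI Code Editors", [1.0, 0.0]),
    ("AI Code Editors", [0.0, 1.0]),
    ("Vector & RAG Infrastructure", [0.5, 1.5]),
]
payload = build_centroids("vibe-coder-stack", rows)
serialized = json.dumps(payload, indent=2, sort_keys=True)
```

Only the per-sublist means are serialized, which fits the division of labor described above: the raw company vectors never leave the database.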

The v1 prediction model is shipping with this post

Every curated list now has its own centroid file (scripts/centroids-v1.json, scoped by list_slug). When a new company gets ingested:

  1. Stu generates the embedding (Ollama all-minilm, ~50ms)
  2. The centroid predictor finds the nearest sublist for each curated list
  3. Predictions with confidence ≥ 0.55 get written to od_business_lists with source='predicted', model='centroid_v1', and the cosine similarity as the confidence score
  4. Curated entries (source='curated') always win on display, but predicted entries surface in cross-list discovery and "you might also belong in…" suggestions
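Steps 2 and 3 above can be sketched roughly like this. The 0.55 threshold and the source and model values come from the post; the row field names other than those, the centroid layout, and the function name are assumptions, and the database write is replaced by returning plain dictionaries.

```python
import math

CONFIDENCE_THRESHOLD = 0.55  # cutoff quoted in the post


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def predict_memberships(company_id, embedding, centroids_by_list):
    """Build one candidate row per curated list whose best match clears the threshold."""
    rows = []
    for list_slug, sublists in centroids_by_list.items():
        if not sublists:
            continue
        # Nearest sublist centroid within this list, by cosine similarity.
        score, sublist = max(
            (cosine(embedding, centroid), name) for name, centroid in sublists.items()
        )
        if score >= CONFIDENCE_THRESHOLD:
            rows.append({
                "company_id": company_id,   # hypothetical column name
                "list_slug": list_slug,
                "sublist": sublist,         # hypothetical column name
                "source": "predicted",
                "model": "centroid_v1",
                "confidence": round(score, 4),
            })
    return rows


centroids = {
    "vibe-coder-stack": {
        "AI Code Editors": [1.0, 0.0],
        "Vector & RAG Infrastructure": [0.0, 1.0],
    }
}
confident = predict_memberships("co_1", [0.2, 0.9], centroids)
rejected = predict_memberships("co_2", [-1.0, -1.0], centroids)
```

The first company lands in "Vector & RAG Infrastructure" with a similarity well above the cutoff; the second matches nothing closely enough and produces no rows. The display rule in step 4 (curated beats predicted) would live in the read path and is not modeled here.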

The schema rule from yesterday's incident still holds: industry classification lives in super_category and is owned by industry_v2; list memberships live in od_business_lists and are owned by curators (source='curated') or the centroid model (source='predicted'). The two never collide.

See the rebuilt Vibe Coder Stack

90 companies, 9 sublists, every entry hand-curated and pipeline-enriched. Click any sublist to see the full ranked list with AR scores, B2B/B2C tags, and team data.