Data quality

The night we wrote our data's axioms.

From a 'Not found' page to 9 of 9 hard data invariants enforced. A six-hour autopilot cleanup that collapsed 411 duplicate company rows, canonicalized 1,042 ugly slugs, and backfilled 21,968 missing profiles — and the daily job that keeps it that way.

Cam Fortin · April 2026

The night started with a single broken page. onlydata.club/graph returned Not found. I'd merged the in-memory graph explorer earlier in the day. Push, deploy, refresh — 404. Easy fix incoming, I figured.

Six hours later we had eleven written-down data invariants, a runner that audits the database against every one of them, autofix scripts for the safe ones, a duplicate detector that knows the difference between a duplicate and a chain restaurant location, and a daily job baked into server boot that re-runs the whole suite forever.

This is what the night looked like.

The peeling onion: it was never just one bug

The 404 was a route-order issue. The Express catch-all app.get('/:slug') at line 7142 was matching /graph before the explicit app.get('/graph') handler at line 16017. Trivial. Move the route up, push, refresh — graph page loads, but the stats sidebar shows zeros.

Zeros is worse than 404. The page was working; the data wasn't. /api/graph/stats returned builtAt set to a real timestamp but totalNodes: 0. The build was running and "succeeding" with no data.

The pattern that hid this is one of the worst lines in our codebase:

const { data, error } = await supabase.from('od_profiles').select(...)
if (error || !data || data.length === 0) break

If a query errors, the loop silently breaks. The error never propagates. The build finishes, reports stats, exits clean — and the graph is empty. We were referencing five columns and a table that don't exist in production: ai_focus, parent_business_id, linked_business_id, od_dataset_rows, and a visibility filter on the wrong column. Every batch errored. The loop swallowed it. The graph stayed empty.

Fix the column names, throw on error instead of swallowing, push. Now the graph is populated. 17,963 nodes, 22,933 edges. Open the page. Run the "Industry Network" canned query for "Foundation Models." There are four results. Two of them are the same AI21 Labs row, listed twice.

The duplicate that taught us about pagination

The graph builder paginated od_businesses in 1000-row chunks using .range(off, off+999) — but with no .order() clause. PostgREST gives no guaranteed order without an explicit ORDER BY. The same row could appear in two different pages. addNode deduped by id (so the node count was right), but addEdge didn't — so the AI21 → Foundation Models industry edge got added twice. The canned query traversed the industry's incoming edges and faithfully returned the same node twice.

Two fixes: stable .order('id') on every pagination call, and addEdge deduplication via a key set. AI21 stops appearing twice. So do thousands of other phantom duplicates we hadn't even noticed yet.

This was when the framing changed for me. Every fix was a one-liner, but every fix surfaced a deeper issue. We didn't have a graph bug. We had a data quality bug, and we had no systematic way to know how many more were lurking.

The slug nightmare

I clicked through to my own profile, /camfortin. I had six job entries in od_team_links — D&B, NetWise, Wine.com, Bain, Product Hacker, ZoomInfo. The graph showed two of them. The other four had company_slug values like producthacker and winecom. Looked up producthacker.ai in od_businesses — the row's slug was producthacker-ai-enk. wine.com was at wine-com-xzp. bain.com resolved to a row called "Bain Vector" (a curated AI-education entry that had grabbed bain.com's domain by accident).

The team_link said producthacker. The business said producthacker-ai-enk. The graph join on slug dropped silently. So did Jody's identical row. So did everyone else's at any company whose business row had been ingested with a random suffix.

Where did the random suffixes come from? grep:

$ grep -n 'Math.random().toString(36).slice(2, 5)' server.js
8651:        slug: slug + '-' + Math.random().toString(36).slice(2, 6),
11161:        slug: slug + '-' + Math.random().toString(36).slice(2, 5),
11200:        slug: slug + '-' + Math.random().toString(36).slice(2, 5),
11900:        slug: slug + '-' + Math.random().toString(36).slice(2, 5),
12563:        slug: slug + '-' + Math.random().toString(36).slice(2, 5),

Five copy-pasted sites. Every single new business row appended a random three-character suffix to its slug, regardless of whether the clean slug was free. So OpenAI shipped as openai-006. Anthropic shipped as anthropic-9njc. Google as google-lvv. Cognition as cognition-ai-devin. The team_links table — written by a different code path that did use the canonical names — had been pointing into the void for who knows how long.

From "go fix it" to axioms

The temptation here is to grep for similar bugs and fix them one by one. We did some of that. But the deeper problem isn't bugs — it's that we had no written-down rules for what the data must look like. Every developer assumed something. Most of those assumptions were correct. The wrong ones rotted silently for months.

So we wrote them down. data/axioms.json — eleven invariants, machine-readable, each one with a description, a check spec, an autofix path, and a severity:

ID	Name	Severity	What it enforces
`A1`	one_domain_one_business	hard	At most one active row per domain. Sub-locations and products get NULL.
`A2`	unique_active_slug	hard	At most one active row per slug.
`A3`	no_ugly_ingest_suffix	soft	Slug should equal slugified name when the canonical form is free.
`A4`	unique_linkedin_url	hard	At most one person profile per LinkedIn URL.
`A5`	team_link_company_resolves	hard	Every team_link.company_slug must reference an active business slug.
`A6`	team_link_person_resolves	hard	Every team_link.person_slug must reference a person profile.
`A7`	at_most_one_primary_per_person	hard	One is_primary=true team_link per person.
`A10`	active_business_has_name_and_slug	hard	No active row may have empty name or slug.
`A11`	active_business_has_profile	hard	Every active business must have a matching company profile (FK target).
`D1`	domain_format_clean	hard	No protocol, www., path, or query in stored domain.
`DS1`	dataset_visibility_enum	hard	visibility ∈ {public, private}.

The severity field matters. hard means the data is wrong if this fails. soft is aspirational — the data isn't broken, it's just ugly. The runner exits non-zero on hard violations so a CI hook can fail the build; soft violations are reported and not gating.

Then a runner: scripts/check-axioms.mjs. It loads every axiom, runs the check, reports counts and samples, and (with --execute) applies the autofix where the axiom has a safe one. The first audit:

A1  ✓ hard   one_domain_one_business              v=0
A2  ✓ hard   unique_active_slug                   v=0
A3  ✗ soft   no_ugly_ingest_suffix                v=17,419
A4  ✓ hard   unique_linkedin_url                  v=0
A5  ✗ hard   team_link_company_resolves           v=1,478
A6  ✓ hard   team_link_person_resolves            v=0
A7  ✗ hard   at_most_one_primary_per_person       v=41
A10 ✗ hard   active_business_has_name_and_slug    v=19
A11 ✗ hard   active_business_has_profile          v=21,968
D1  ✓ hard   domain_format_clean                  v=0
DS1 ✓ hard   dataset_visibility_enum              v=0

Not great. 21,968 active businesses had no matching company profile, which meant od_team_links's FK to od_profiles.slug couldn't resolve, which meant the graph join was structurally broken for almost every employer in the database.

The loop

What followed was a four-hour autopilot. I told Claude Code /loop with the goal "keep cleaning until the audit holds," and stepped away. Each iteration:

Read the current axiom audit.
Pick the highest-impact violation.
Write or extend an autofix script.
Dry-run on a sample, verify the diff is sane.
Execute, re-audit, commit.
Sleep 25 minutes (cache budget), wake up, repeat.

Some highlights from the trail:

1,042 ugly slugs canonicalized. The --all-ugly mode of canonicalize-business-slugs.mjs uses a slugify-name heuristic: a slug is "ugly" only if removing a trailing -XXX leaves exactly the slugified name. That distinguishes anthropic-9njc (ingest suffix on "Anthropic") from a-cut-above-barber-shop (where -shop is part of the actual name). 76 inactive collisions auto-freed in the process.
20,468 company profiles backfilled. A11's autofix bulk-inserts a minimal company profile for every active business missing one. Three rows kept failing on a chk_company_headline constraint until we added per-row fallback to surface the actual error and trim headlines to 140 chars.
411 active duplicate businesses merged via detect-active-dupes.mjs + merge-active-dupes.mjs. The detector buckets active rows by normalized name and domain stem, then suggests winners. The merger applies three safety gates before collapsing anything: domains must agree (or only one set), cities must agree (or only one set), and the loser slug must look like an ingest variant of the winner. microsoft-3ch and microsoft--m83lm2 collapsed into microsoft. hugging-face into huggingface-co. google-deepmind into deepmind-google.
159 brand chains correctly excluded from merge. McDonald's appears 66 times in the database — but those are 66 different physical locations, not 65 duplicates. The detector skips any group with ≥5 same-name rows and routes them to a separate domain-hygiene pass instead.
16 known orgs explicitly seeded (Hugging Face, MIT, Stanford, AI2, Google DeepMind, Carnegie Mellon, etc.) because their team_link references named real institutions that simply didn't exist in od_businesses yet.

Where we ended up

Before · 8 PM

2 / 11

axioms holding

17,419 ugly business slugs
1,478 broken team_links
21,968 missing company profiles
41 people with multiple primary employers
19 active rows with no name or slug
/graph page returning empty data
AI21 Labs duplicated in canned queries

After · 2 AM

9 / 9

hard axioms holding

15 ugly slugs (soft, edge cases)
~660 broken team_links — orphan refs needing manual review
0 missing company profiles
0 people with multi-primary
0 rows missing name/slug
/graph populated and indexed
0 phantom duplicates anywhere

1,042

ugly slugs cleaned

411

dupes auto-merged

20,468

profiles backfilled

9 / 9

hard axioms holding

Going forward: enforcement, not vigilance

None of this is worth anything if it rots back. The whole point of axioms is that they get checked automatically, not that some human remembers to run a script.

So the last commit of the night wires it all into server boot:

app.listen(PORT, () => {
  buildGraph(supabase).then(...)

  const runDailyDQ = async () => {
    const child = spawn('node',
      ['scripts/daily-data-quality-check.mjs', '--execute'],
      { cwd: __dirname, env: process.env })
    child.on('close', code => {
      console.log(`[dq] daily check exit=${code}`)
      if (code === 1) buildGraph(supabase) // rebuild after fixes
    })
  }
  setTimeout(runDailyDQ, 5 * 60 * 1000)
  setInterval(runDailyDQ, 24 * 60 * 60 * 1000)
})

Five minutes after every server boot, then every 24 hours after that, the daily orchestrator runs in-process. It executes the full axiom suite with --execute (so safe autofixes apply), runs the slug canonicalizer in --all-ugly mode, heals any team_link aliases pointing at known-canonical slugs, re-audits, and rebuilds the graph if anything changed. The whole report goes to Railway's stdout where it's grep-able.

Exit code is the contract: 0 means clean, 1 means some hard axiom is still violating after autofix and a human needs to look. The code path is the same one you can run by hand:

SUPABASE_SERVICE_KEY=… node scripts/daily-data-quality-check.mjs --execute

Tier 1: prevent at write time

The daily job catches drift. The real fix for the slug nightmare was upstream: replace those five copy-pasted Math.random sites with a single assignBusinessSlug() helper that does the right thing once and gets reused everywhere.

async function assignBusinessSlug(desired) {
  const base = slugify(desired)
  const { data: holders } = await supabase
    .from('od_businesses').select('id, is_active, merged_into_id')
    .eq('slug', base)
  if (!holders?.length) return base                  // free → take it
  if (holders.every(h => !h.is_active || h.merged_into_id)) {
    // only inactive/merged rows are blocking — free them and take the slug
    for (const h of holders) {
      await supabase.from('od_businesses')
        .update({ slug: `${base}-deprecated-${h.id}` })
        .eq('id', h.id)
    }
    return base
  }
  // an active, non-merged row genuinely owns this slug — append a suffix
  return `${base}-${Math.random().toString(36).slice(2, 5)}`
}

Now new businesses get clean URLs by default. The ugly suffix only appears when a different active company actually owns the canonical slug. Every team_link written against a canonical name resolves on insert instead of joining a future cleanup pass.

Tier 2: detect, don't auto-fix what's irreversible

The duplicate detector is intentionally separated from the merger. Detection runs in the daily job and is logged. Merging is not automatic when it's ambiguous — the detector surfaces 996 candidate pairs after the safe ones were collapsed, and the next round needs human eyes. Auto-merging two real different companies is much worse than leaving two slightly-different rows.

Tier 3: write down the rules

The most under-rated change is the smallest one: data/axioms.json exists. New invariants get a JSON entry, a check function, optionally an autofix. The runner picks them up automatically. Every contributor and every Claude session can see the rules in one file, evaluate them in one command, and watch the same exit code.

The lesson

Silent error-swallowing is worse than crashing.

Every "fix" tonight uncovered another error that a previous "fix" had hidden. The catch-all if (error || …) break pattern made every batch look like a successful empty result. Throwing on errors would have caught the missing columns, the missing profiles, and the pagination drift in the first hour we shipped them. We're now linting for that pattern across the codebase.

Why this matters for an open-data company

OnlyData's pitch is that we're the entity layer for the agent economy — agents and humans hit the same canonical IDs, the same domain, the same person. That promise is worth zero if our own internal data is full of anthropic-9njc and seven Microsoft duplicates. Tonight wasn't fun. Doing it once means we never have to do it again.

The axioms file is the source of truth. The runner is at scripts/check-axioms.mjs. The daily orchestrator is at scripts/daily-data-quality-check.mjs. Pull requests welcome.

Use the data we cleaned

Every company on OnlyData now has a canonical slug, a profile, and zero duplicates. Hit the MCP, browse the graph, query the API.

Explore the graph → Read the API docs