Data quality

The night we wrote our data's axioms.

From a 'Not found' page to 9 of 9 hard data invariants enforced. A six-hour autopilot cleanup that collapsed 411 duplicate company rows, canonicalized 1,042 ugly slugs, and backfilled 21,968 missing profiles — and the daily job that keeps it that way.

Cam Fortin · April 2026

The night started with a single broken page. onlydata.club/graph returned Not found. I'd merged the in-memory graph explorer earlier in the day. Push, deploy, refresh — 404. Easy fix incoming, I figured.

Six hours later we had eleven written-down data invariants, a runner that audits the database against every one of them, autofix scripts for the safe ones, a duplicate detector that knows the difference between a duplicate and a chain restaurant location, and a daily job baked into server boot that re-runs the whole suite forever.

This is what the night looked like.

The peeling onion: it was never just one bug

The 404 was a route-order issue. The Express catch-all app.get('/:slug') at line 7142 was matching /graph before the explicit app.get('/graph') handler at line 16017. Trivial. Move the route up, push, refresh — graph page loads, but the stats sidebar shows zeros.

Zeros is worse than 404. The page was working; the data wasn't. /api/graph/stats returned builtAt set to a real timestamp but totalNodes: 0. The build was running and "succeeding" with no data.

The pattern that hid this is one of the worst lines in our codebase:

const { data, error } = await supabase.from('od_profiles').select(...)
if (error || !data || data.length === 0) break

If a query errors, the loop silently breaks. The error never propagates. The build finishes, reports stats, exits clean — and the graph is empty. We were referencing five columns and a table that don't exist in production: ai_focus, parent_business_id, linked_business_id, od_dataset_rows, and a visibility filter on the wrong column. Every batch errored. The loop swallowed it. The graph stayed empty.

Fix the column names, throw on error instead of swallowing, push. Now the graph is populated. 17,963 nodes, 22,933 edges. Open the page. Run the "Industry Network" canned query for "Foundation Models." There are four results. Two of them are the same AI21 Labs row, listed twice.

The duplicate that taught us about pagination

The graph builder paginated od_businesses in 1000-row chunks using .range(off, off+999) — but with no .order() clause. PostgREST gives no guaranteed order without an explicit ORDER BY. The same row could appear in two different pages. addNode deduped by id (so the node count was right), but addEdge didn't — so the AI21 → Foundation Models industry edge got added twice. The canned query traversed the industry's incoming edges and faithfully returned the same node twice.

Two fixes: stable .order('id') on every pagination call, and addEdge deduplication via a key set. AI21 stops appearing twice. So do thousands of other phantom duplicates we hadn't even noticed yet.

This was when the framing changed for me. Every fix was a one-liner, but every fix surfaced a deeper issue. We didn't have a graph bug. We had a data quality bug, and we had no systematic way to know how many more were lurking.

The slug nightmare

I clicked through to my own profile, /camfortin. I had six job entries in od_team_links — D&B, NetWise, Wine.com, Bain, Product Hacker, ZoomInfo. The graph showed two of them. The other four had company_slug values like producthacker and winecom. Looked up producthacker.ai in od_businesses — the row's slug was producthacker-ai-enk. wine.com was at wine-com-xzp. bain.com resolved to a row called "Bain Vector" (a curated AI-education entry that had grabbed bain.com's domain by accident).

The team_link said producthacker. The business said producthacker-ai-enk. The graph join on slug dropped silently. So did Jody's identical row. So did everyone else's at any company whose business row had been ingested with a random suffix.

Where did the random suffixes come from? grep:

$ grep -n 'Math.random().toString(36).slice(2, 5)' server.js
8651:        slug: slug + '-' + Math.random().toString(36).slice(2, 6),
11161:        slug: slug + '-' + Math.random().toString(36).slice(2, 5),
11200:        slug: slug + '-' + Math.random().toString(36).slice(2, 5),
11900:        slug: slug + '-' + Math.random().toString(36).slice(2, 5),
12563:        slug: slug + '-' + Math.random().toString(36).slice(2, 5),

Five copy-pasted sites. Every single new business row appended a random three-character suffix to its slug, regardless of whether the clean slug was free. So OpenAI shipped as openai-006. Anthropic shipped as anthropic-9njc. Google as google-lvv. Cognition as cognition-ai-devin. The team_links table — written by a different code path that did use the canonical names — had been pointing into the void for who knows how long.

From "go fix it" to axioms

The temptation here is to grep for similar bugs and fix them one by one. We did some of that. But the deeper problem isn't bugs — it's that we had no written-down rules for what the data must look like. Every developer assumed something. Most of those assumptions were correct. The wrong ones rotted silently for months.

So we wrote them down. data/axioms.json — eleven invariants, machine-readable, each one with a description, a check spec, an autofix path, and a severity:

IDNameSeverityWhat it enforces
A1one_domain_one_businesshardAt most one active row per domain. Sub-locations and products get NULL.
A2unique_active_slughardAt most one active row per slug.
A3no_ugly_ingest_suffixsoftSlug should equal slugified name when the canonical form is free.
A4unique_linkedin_urlhardAt most one person profile per LinkedIn URL.
A5team_link_company_resolveshardEvery team_link.company_slug must reference an active business slug.
A6team_link_person_resolveshardEvery team_link.person_slug must reference a person profile.
A7at_most_one_primary_per_personhardOne is_primary=true team_link per person.
A10active_business_has_name_and_slughardNo active row may have empty name or slug.
A11active_business_has_profilehardEvery active business must have a matching company profile (FK target).
D1domain_format_cleanhardNo protocol, www., path, or query in stored domain.
DS1dataset_visibility_enumhardvisibility ∈ {public, private}.

The severity field matters. hard means the data is wrong if this fails. soft is aspirational — the data isn't broken, it's just ugly. The runner exits non-zero on hard violations so a CI hook can fail the build; soft violations are reported and not gating.

Then a runner: scripts/check-axioms.mjs. It loads every axiom, runs the check, reports counts and samples, and (with --execute) applies the autofix where the axiom has a safe one. The first audit:

A1  ✓ hard   one_domain_one_business              v=0
A2  ✓ hard   unique_active_slug                   v=0
A3  ✗ soft   no_ugly_ingest_suffix                v=17,419
A4  ✓ hard   unique_linkedin_url                  v=0
A5  ✗ hard   team_link_company_resolves           v=1,478
A6  ✓ hard   team_link_person_resolves            v=0
A7  ✗ hard   at_most_one_primary_per_person       v=41
A10 ✗ hard   active_business_has_name_and_slug    v=19
A11 ✗ hard   active_business_has_profile          v=21,968
D1  ✓ hard   domain_format_clean                  v=0
DS1 ✓ hard   dataset_visibility_enum              v=0

Not great. 21,968 active businesses had no matching company profile, which meant od_team_links's FK to od_profiles.slug couldn't resolve, which meant the graph join was structurally broken for almost every employer in the database.

The loop

What followed was a four-hour autopilot. I told Claude Code /loop with the goal "keep cleaning until the audit holds," and stepped away. Each iteration:

  1. Read the current axiom audit.
  2. Pick the highest-impact violation.
  3. Write or extend an autofix script.
  4. Dry-run on a sample, verify the diff is sane.
  5. Execute, re-audit, commit.
  6. Sleep 25 minutes (cache budget), wake up, repeat.

Some highlights from the trail:

Where we ended up

Before · 8 PM

2 / 11
axioms holding
  • 17,419 ugly business slugs
  • 1,478 broken team_links
  • 21,968 missing company profiles
  • 41 people with multiple primary employers
  • 19 active rows with no name or slug
  • /graph page returning empty data
  • AI21 Labs duplicated in canned queries

After · 2 AM

9 / 9
hard axioms holding
  • 15 ugly slugs (soft, edge cases)
  • ~660 broken team_links — orphan refs needing manual review
  • 0 missing company profiles
  • 0 people with multi-primary
  • 0 rows missing name/slug
  • /graph populated and indexed
  • 0 phantom duplicates anywhere
1,042
ugly slugs cleaned
411
dupes auto-merged
20,468
profiles backfilled
9 / 9
hard axioms holding

Going forward: enforcement, not vigilance

None of this is worth anything if it rots back. The whole point of axioms is that they get checked automatically, not that some human remembers to run a script.

So the last commit of the night wires it all into server boot:

app.listen(PORT, () => {
  buildGraph(supabase).then(...)

  const runDailyDQ = async () => {
    const child = spawn('node',
      ['scripts/daily-data-quality-check.mjs', '--execute'],
      { cwd: __dirname, env: process.env })
    child.on('close', code => {
      console.log(`[dq] daily check exit=${code}`)
      if (code === 1) buildGraph(supabase) // rebuild after fixes
    })
  }
  setTimeout(runDailyDQ, 5 * 60 * 1000)
  setInterval(runDailyDQ, 24 * 60 * 60 * 1000)
})

Five minutes after every server boot, then every 24 hours after that, the daily orchestrator runs in-process. It executes the full axiom suite with --execute (so safe autofixes apply), runs the slug canonicalizer in --all-ugly mode, heals any team_link aliases pointing at known-canonical slugs, re-audits, and rebuilds the graph if anything changed. The whole report goes to Railway's stdout where it's grep-able.

Exit code is the contract: 0 means clean, 1 means some hard axiom is still violating after autofix and a human needs to look. The code path is the same one you can run by hand:

SUPABASE_SERVICE_KEY=… node scripts/daily-data-quality-check.mjs --execute

Tier 1: prevent at write time

The daily job catches drift. The real fix for the slug nightmare was upstream: replace those five copy-pasted Math.random sites with a single assignBusinessSlug() helper that does the right thing once and gets reused everywhere.

async function assignBusinessSlug(desired) {
  const base = slugify(desired)
  const { data: holders } = await supabase
    .from('od_businesses').select('id, is_active, merged_into_id')
    .eq('slug', base)
  if (!holders?.length) return base                  // free → take it
  if (holders.every(h => !h.is_active || h.merged_into_id)) {
    // only inactive/merged rows are blocking — free them and take the slug
    for (const h of holders) {
      await supabase.from('od_businesses')
        .update({ slug: `${base}-deprecated-${h.id}` })
        .eq('id', h.id)
    }
    return base
  }
  // an active, non-merged row genuinely owns this slug — append a suffix
  return `${base}-${Math.random().toString(36).slice(2, 5)}`
}

Now new businesses get clean URLs by default. The ugly suffix only appears when a different active company actually owns the canonical slug. Every team_link written against a canonical name resolves on insert instead of joining a future cleanup pass.

Tier 2: detect, don't auto-fix what's irreversible

The duplicate detector is intentionally separated from the merger. Detection runs in the daily job and is logged. Merging is not automatic when it's ambiguous — the detector surfaces 996 candidate pairs after the safe ones were collapsed, and the next round needs human eyes. Auto-merging two real different companies is much worse than leaving two slightly-different rows.

Tier 3: write down the rules

The most under-rated change is the smallest one: data/axioms.json exists. New invariants get a JSON entry, a check function, optionally an autofix. The runner picks them up automatically. Every contributor and every Claude session can see the rules in one file, evaluate them in one command, and watch the same exit code.

The lesson

Silent error-swallowing is worse than crashing.

Every "fix" tonight uncovered another error that a previous "fix" had hidden. The catch-all if (error || …) break pattern made every batch look like a successful empty result. Throwing on errors would have caught the missing columns, the missing profiles, and the pagination drift in the first hour we shipped them. We're now linting for that pattern across the codebase.

Why this matters for an open-data company

OnlyData's pitch is that we're the entity layer for the agent economy — agents and humans hit the same canonical IDs, the same domain, the same person. That promise is worth zero if our own internal data is full of anthropic-9njc and seven Microsoft duplicates. Tonight wasn't fun. Doing it once means we never have to do it again.

The axioms file is the source of truth. The runner is at scripts/check-axioms.mjs. The daily orchestrator is at scripts/daily-data-quality-check.mjs. Pull requests welcome.

Use the data we cleaned

Every company on OnlyData now has a canonical slug, a profile, and zero duplicates. Hit the MCP, browse the graph, query the API.

Explore the graph → Read the API docs