Enrichment Pipeline

How a domain becomes a fully profiled entity. Each stage feeds the next. Changes cascade.

Legend: Stu (local LLM), Railway (HTTP), Derived (computed), Entry point

Stage 1 — Entity Creation

Entity Ingestion (entry point)
Company enters the system via upload, AR scan, MCP tool, or curated dataset load. Domain is normalized and matched against existing entities.
Creates
name, domain, website, website_raw, address, slug
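
The normalize-and-match step described above could look roughly like the following. This is a minimal sketch, not the pipeline's actual code: the function names and the in-memory `entity_ids_by_domain` index are assumptions made for illustration, and the exact normalization rules (for example, whether `www.` is stripped) are not stated in this document.

```python
from typing import Dict, Optional
from urllib.parse import urlsplit


def normalize_domain(raw: str) -> str:
    """Reduce a URL or bare hostname to a lowercase host without 'www.'."""
    value = raw.strip().lower()
    # urlsplit only fills in .hostname when the input has a scheme or '//'.
    if "://" not in value:
        value = "//" + value
    host = urlsplit(value).hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]
    return host.rstrip(".")


def find_existing_entity(raw: str, entity_ids_by_domain: Dict[str, int]) -> Optional[int]:
    """Return the id of an entity already stored under the same normalized domain."""
    return entity_ids_by_domain.get(normalize_domain(raw))


known = {"example.com": 42}  # hypothetical index of existing entities
print(find_existing_entity("https://WWW.Example.com/about?ref=1", known))  # 42
print(find_existing_entity("other.org", known))  # None
```

Keeping both `website_raw` and the normalized `domain` means the original input stays available if the normalization rules change later.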

Stage 2 — Classification (Stu)

Industry Classification (industry_v2, stu)
Deterministic TYPE_MAP lookup, then LLM fallback (gemma4:e4b), then NAICS/SIC code lookup. 92-99.5% accuracy.
Reads
name, type, description
Writes
super_category, naics6, naics_desc, sic4, sic_desc
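
The lookup-then-fallback order described above can be sketched as follows. The table contents, the `llm_classify` callable, and the `method` field are all hypothetical. The real TYPE_MAP and code tables are not shown in this document, and mapping a category straight to one NAICS code is a simplification. SIC lookup is omitted for brevity.

```python
from typing import Callable, Dict, Optional, Tuple

# Illustrative tables only; the real TYPE_MAP and NAICS data are not shown here.
TYPE_MAP: Dict[str, str] = {
    "restaurant": "Food & Beverage",
    "law_firm": "Professional Services",
}
NAICS_BY_CATEGORY: Dict[str, Tuple[str, str]] = {
    "Food & Beverage": ("722511", "Full-Service Restaurants"),
}


def classify_industry(
    name: str,
    entity_type: Optional[str],
    description: Optional[str],
    llm_classify: Callable[[str], Optional[str]],
) -> dict:
    """Deterministic lookup first, LLM fallback second, then code lookup."""
    key = (entity_type or "").strip().lower()
    category = TYPE_MAP.get(key)
    method = "type_map"
    if category is None:
        # Only call the model when the deterministic map has no answer.
        category = llm_classify(f"{name}: {description or ''}")
        method = "llm"
    naics6, naics_desc = NAICS_BY_CATEGORY.get(category, (None, None))
    return {
        "super_category": category,
        "naics6": naics6,
        "naics_desc": naics_desc,
        "method": method,  # hypothetical field, useful when auditing accuracy
    }
```

Trying the deterministic map first keeps the common cases cheap and reproducible, and reserves the model for types the map does not cover.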
B2B/B2C Classification (b2b_v2, stu)
LLM classification with category context. Chain-of-thought prompting.
Reads
name, super_category, description
Writes
b2b_b2c
Employee Estimation (employee_v1, stu)
LLM estimation from name, category, and description context.
Reads
name, super_category
Writes
employee_estimate
Description Generation (description_v1, stu)
LLM-generated 1-2 sentence business description from name, category, address.
Reads
name, super_category, address, city
Writes
description
AI Classification (ai_native_v1 + agent_role_v1, stu)
Classifies AI adoption level (native/integrated/exploring/none) and agent ecosystem role (platform/tool_provider/infrastructure/consumer).
Reads
name, description, super_category
Writes
ai_native, agent_role, ai_stack

Stage 3 — External Signals (Railway / HTTP)

Agent Readiness Scan (scanDomain(), http)
Live HTTP scan: robots.txt, sitemap, JSON-LD, OpenAPI, MCP manifest, llms.txt, ai-plugin.json. Soft-404 + Cloudflare detection. Also probes the developer, docs, api, and developers subdomains in parallel for agent endpoints. Algorithm E (Spread, v3) scoring: webBase 30 + AI bonuses, with a ceiling of about 50 for well-behaved SaaS and 100 for true agent-native.
Reads
domain
Writes
agent_readiness_score, agent_readiness_details, domain_status, domain_last_checked
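
To make the shape of the score concrete, here is a small sketch of a "web base plus AI bonuses" calculation. Only the 30-point web base and the 100-point maximum come from the description above. Every individual weight and signal key below is an assumption, chosen so that a site with the basic web signals plus an OpenAPI spec lands at 50. The real Algorithm E weights are not given in this document.

```python
from typing import Iterable

# Hypothetical weights; only the totals (30 web base, 100 max) come from the text.
WEB_SIGNALS = {"robots_txt": 10, "sitemap": 10, "json_ld": 10}
AI_BONUSES = {"openapi": 20, "llms_txt": 15, "ai_plugin": 10, "mcp_manifest": 25}


def agent_readiness_score(found_signals: Iterable[str]) -> int:
    """Sum the web base and AI bonuses for the signals a scan found."""
    signals = set(found_signals)
    web_base = sum(w for name, w in WEB_SIGNALS.items() if name in signals)
    ai_bonus = sum(w for name, w in AI_BONUSES.items() if name in signals)
    return min(100, web_base + ai_bonus)


print(agent_readiness_score(["robots_txt", "sitemap", "json_ld", "openapi"]))  # 50
```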
MX Profiling (mx_profile_v2, railway)
DNS MX + SPF + DKIM + DMARC lookup. Provider detection, underlying mailbox intelligence.
Reads
domain
Writes
mx_provider, mx_score, mx_has_mx, mx_profile
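
Provider detection from MX records usually comes down to matching host suffixes. The sketch below assumes the DNS answers have already been fetched (no resolver is called here) and uses illustrative provider labels. The suffix table and function names are not taken from mx_profile_v2. DKIM is left out because it requires knowing a selector.

```python
from typing import Iterable, List, Optional

# Well-known MX host suffixes; the provider labels are illustrative.
MX_SUFFIXES = {
    ".google.com": "google_workspace",
    ".googlemail.com": "google_workspace",
    ".protection.outlook.com": "microsoft_365",
}


def detect_mx_provider(mx_hosts: Iterable[str]) -> Optional[str]:
    """Map MX hostnames to a provider label, or None when there are no MX records."""
    hosts = [h.lower().rstrip(".") for h in mx_hosts]
    for host in hosts:
        for suffix, provider in MX_SUFFIXES.items():
            if host.endswith(suffix):
                return provider
    return "other" if hosts else None


def mail_auth_flags(root_txt: List[str], dmarc_txt: List[str]) -> dict:
    """Check TXT records at the domain (SPF) and at _dmarc.<domain> (DMARC)."""
    return {
        "has_spf": any(t.lower().startswith("v=spf1") for t in root_txt),
        "has_dmarc": any(t.lower().startswith("v=dmarc1") for t in dmarc_txt),
    }


print(detect_mx_provider(["ASPMX.L.GOOGLE.COM."]))  # google_workspace
```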

Stage 4 — Derived Attributes

Vector Embedding (nomic-embed-text, stu)
384-dim embedding from concatenated name + category + description + city/state + b2b_b2c. Used for semantic search and similarity matching.
Reads
name, super_category, description, city, b2b_b2c
Writes
embedding
Triggers on
description changed, super_category changed, name changed
Data Tier Assignment (rules engine, derived)
tier_1 (curated/scanned/uploaded), tier_2 (regional with domain), tier_3 (no domain, archived).
Reads
source, domain, website
Writes
data_tier
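
The tier rules above read naturally as an ordered set of checks. The following is a sketch under assumptions: the `source` values in `TIER_1_SOURCES` are hypothetical names for "curated/scanned/uploaded", and treating a website as equivalent to a domain for tier_2 is a guess based on the Reads list.

```python
from typing import Optional

# Hypothetical source values standing in for curated / scanned / uploaded.
TIER_1_SOURCES = {"curated", "ar_scan", "upload"}


def assign_data_tier(source: Optional[str], domain: Optional[str], website: Optional[str]) -> str:
    """Apply the tier rules in priority order; the first matching rule wins."""
    if source in TIER_1_SOURCES:
        return "tier_1"
    if domain or website:
        return "tier_2"
    return "tier_3"  # no domain: archived


print(assign_data_tier("regional_import", "example.com", None))  # tier_2
```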
Entity Deduplication (matchRow(), derived)
Match scores: domain match (95), phone (85), name + address (80), fuzzy name with Dice coefficient ≥ 0.7 (50-65), city boost (+10). Merges are soft and recorded in od_merge_audit.
Reads
domain, name, address, phone, city
Writes
is_active, merged_into_id
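
A scoring function in the spirit of the rules above might look like this. It is a sketch, not matchRow() itself: the bigram-based Dice coefficient, the linear mapping of similarity 0.7-1.0 onto 50-65, applying the city boost to every match type, and the cap at 100 are all assumptions that this document does not confirm.

```python
from collections import Counter


def bigrams(text: str) -> Counter:
    """Character bigrams of the alphanumeric, lowercased text."""
    cleaned = "".join(ch for ch in text.lower() if ch.isalnum())
    return Counter(cleaned[i:i + 2] for i in range(len(cleaned) - 1))


def dice(a: str, b: str) -> float:
    """Dice coefficient: 2 * shared bigrams / total bigrams."""
    x, y = bigrams(a), bigrams(b)
    total = sum(x.values()) + sum(y.values())
    if total == 0:
        return 0.0
    return 2 * sum((x & y).values()) / total


def same(a: dict, b: dict, field: str) -> bool:
    """True when both records have the field and the values match case-insensitively."""
    left, right = a.get(field), b.get(field)
    return bool(left) and bool(right) and left.strip().lower() == right.strip().lower()


def match_score(a: dict, b: dict) -> int:
    if same(a, b, "domain"):
        base = 95
    elif same(a, b, "phone"):
        base = 85
    elif same(a, b, "name") and same(a, b, "address"):
        base = 80
    else:
        similarity = dice(a.get("name") or "", b.get("name") or "")
        if similarity < 0.7:
            return 0
        # Assumed linear scale: 0.7 maps to 50, 1.0 maps to 65.
        base = 50 + min(15, round((similarity - 0.7) / 0.3 * 15))
    if same(a, b, "city"):
        base += 10  # city boost
    return min(100, base)
```

Checking the strongest signal first means a domain match is never downgraded by a weak name similarity.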

Cascade Rules

When an upstream attribute changes, these downstream attributes must be re-computed:

description changes → re-embed (vector), re-classify b2b_b2c, re-classify ai_native
super_category changes → re-embed (vector), re-classify b2b_b2c, update data_tier, re-estimate employee
name changes → re-embed (vector), re-match (dedup), re-classify industry, regenerate slug
domain changes → re-scan AR, re-profile MX, re-match (dedup), update data_tier
address changes → re-match (dedup), re-embed if city changed
agent_readiness changes → update AR100 ranking, update data_tier if first scan
ai_native changes → compute ai_stack (if native/integrated), update dataset membership
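
One way to keep these rules in a single place is a lookup table from changed field to downstream tasks. The sketch below is illustrative only. The task names are invented labels, the city rule is modeled as its own `city` key, and the other conditional parts ("if first scan", "if native/integrated") are left to the task handlers rather than encoded here.

```python
from typing import Dict, Iterable, List

# Task names are hypothetical labels for the recomputations listed above.
CASCADE: Dict[str, List[str]] = {
    "description": ["re_embed", "classify_b2b_b2c", "classify_ai_native"],
    "super_category": ["re_embed", "classify_b2b_b2c", "update_data_tier", "estimate_employees"],
    "name": ["re_embed", "dedup_match", "classify_industry", "regenerate_slug"],
    "domain": ["scan_agent_readiness", "profile_mx", "dedup_match", "update_data_tier"],
    "address": ["dedup_match"],
    "city": ["re_embed"],  # "re-embed if city changed"
    "agent_readiness": ["update_ar100_ranking", "update_data_tier"],
    "ai_native": ["compute_ai_stack", "update_dataset_membership"],
}


def downstream_tasks(changed_fields: Iterable[str]) -> List[str]:
    """Collect the tasks triggered by a set of changes, without duplicates."""
    tasks: List[str] = []
    for field in changed_fields:
        for task in CASCADE.get(field, []):
            if task not in tasks:
                tasks.append(task)
    return tasks


print(downstream_tasks(["address", "city"]))  # ['dedup_match', 're_embed']
```

Deduplicating the task list matters when several fields change in one update, since a name and description edit should trigger only one re-embed.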

Contact Pipeline

Contacts (people) flow through a strict pipeline with provenance and matching rules. We treat contact data carefully because identity claims have higher stakes than company classifications.

Sources of Contacts
first_party — Person claimed their own profile. Highest trust level.
company_team_extraction — Crawled from a company's own /about, /team, or /leadership pages. First-party from the company's perspective.
linkedin_import — From a CSV upload. Admin-only, never auto-published. Requires explicit approval.
admin — Manually created by an OnlyData admin.
Minimum Viable Record (MVR)
A contact must have a name (real human name, 3-60 chars) and a title or role.
Contacts must NOT include email addresses, phone numbers, or guessed LinkedIn URLs obtained from external sources. We attach those only when they come directly from the person.
This rule keeps OnlyData from becoming a scraper-aggregated contact database. We're a first-party identity layer.
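
The MVR rules above can be expressed as a small validator. This is a sketch with assumed field names (`title`, `role`, `source`, `linkedin_url_guessed`). It checks only the length of the name, since how a "real human name" is recognized is not described here.

```python
from typing import List


def validate_mvr(contact: dict) -> List[str]:
    """Return a list of MVR violations; an empty list means the record passes."""
    errors: List[str] = []
    name = (contact.get("name") or "").strip()
    if not 3 <= len(name) <= 60:
        errors.append("name must be 3-60 characters")
    if not (contact.get("title") or contact.get("role")):
        errors.append("title or role is required")
    if contact.get("source") != "first_party":
        # Contact details are accepted only from the person directly.
        for field in ("email", "phone"):
            if contact.get(field):
                errors.append(f"{field} is only accepted from the person directly")
        if contact.get("linkedin_url") and contact.get("linkedin_url_guessed"):
            errors.append("guessed LinkedIn URLs are not accepted")
    return errors


print(validate_mvr({"name": "Jane Doe", "title": "CTO", "source": "company_team_extraction"}))  # []
```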
LinkedIn URL = Canonical Person Identity (1:1)
LinkedIn URL is to a person what a domain is to a company — the canonical key.
One LinkedIn URL maps to exactly one person profile in our database. This is enforced at the database layer.
If we discover a LinkedIn URL after creating a profile, we update — we never duplicate.
Normalization: lowercase, strip trailing slash, strip query params.
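
The three normalization steps listed above fit in a few lines. This is a minimal sketch; the function name is an assumption, and dropping the URL fragment along with the query string goes slightly beyond what the rule states.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_linkedin_url(url: str) -> str:
    """Lowercase, strip the trailing slash, and strip query params."""
    parts = urlsplit(url.strip().lower())
    path = parts.path.rstrip("/")
    # Empty query and fragment components are omitted by urlunsplit.
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


print(normalize_linkedin_url("https://www.LinkedIn.com/in/Jane-Doe/?trk=abc"))
# https://www.linkedin.com/in/jane-doe
```

Storing only the normalized form is what makes a database-level uniqueness check meaningful, since two spellings of the same URL would otherwise count as different keys.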
Matching Pipeline
1. LinkedIn URL exact match — strongest signal, immediate dedup.
2. Name + company match — same normalized name + same company.
3. Name fuzzy match — same normalized name across any company. Warn, never auto-merge.
Normalized names strip suffixes (Jr, Sr, III, IV, PhD, CFA, MBA, MD, Esq, P.E.) and punctuation.
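
The three matching tiers and the name normalization can be sketched together. The profile fields (`company_id`, `linkedin_url`) and the tier labels are hypothetical. The sketch assumes LinkedIn URLs are already normalized and strips suffixes only from the end of a name. How the first two tiers are acted on is left to the caller. The only behavior taken from the rules above is that a name-only match warns and is never merged automatically.

```python
import re
from typing import List, Optional, Tuple

SUFFIXES = {"jr", "sr", "iii", "iv", "phd", "cfa", "mba", "md", "esq", "pe"}


def normalize_name(name: str) -> str:
    """Lowercase, drop punctuation, and strip trailing suffix tokens."""
    cleaned = re.sub(r"[^\w\s]", "", name.lower())  # "P.E." becomes "pe"
    tokens = cleaned.split()
    while len(tokens) > 1 and tokens[-1] in SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def match_contact(candidate: dict, profiles: List[dict]) -> Tuple[str, Optional[dict]]:
    """Return (tier, profile) for the strongest match, or ("none", None)."""
    url = candidate.get("linkedin_url")
    name = normalize_name(candidate["name"])
    if url:
        for profile in profiles:
            if profile.get("linkedin_url") == url:
                return "linkedin_url", profile
    same_name = [p for p in profiles if normalize_name(p["name"]) == name]
    for profile in same_name:
        if profile.get("company_id") == candidate.get("company_id"):
            return "name_company", profile
    if same_name:
        return "name_only", same_name[0]  # warn only, never auto-merge
    return "none", None


print(normalize_name("Jane Doe, Ph.D."))  # jane doe
```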
Contact-Company Relationships
People work on a lot of different things. We model that explicitly so the data reflects reality.
Relationship types: employee, founder, cofounder, board_member, advisor, contractor, consultant, investor, fractional, volunteer, sole_proprietor
Status: active, former, advisory, on_leave
Other fields: is_primary (their main gig), start_date, end_date, source, display_order
Recruiters can filter on primary employment. Researchers can find board members. Investors can find sole operators. The model serves all of them.
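
The relationship fields listed above map onto a simple record type. This is a sketch of one possible shape, not the actual schema: the class names, `person_id` and `company_id` keys, and default values are assumptions.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import List, Optional


class RelationshipType(str, Enum):
    EMPLOYEE = "employee"
    FOUNDER = "founder"
    COFOUNDER = "cofounder"
    BOARD_MEMBER = "board_member"
    ADVISOR = "advisor"
    CONTRACTOR = "contractor"
    CONSULTANT = "consultant"
    INVESTOR = "investor"
    FRACTIONAL = "fractional"
    VOLUNTEER = "volunteer"
    SOLE_PROPRIETOR = "sole_proprietor"


class RelationshipStatus(str, Enum):
    ACTIVE = "active"
    FORMER = "former"
    ADVISORY = "advisory"
    ON_LEAVE = "on_leave"


@dataclass
class ContactCompanyRelationship:
    person_id: int   # hypothetical key
    company_id: int  # hypothetical key
    relationship_type: RelationshipType
    status: RelationshipStatus = RelationshipStatus.ACTIVE
    is_primary: bool = False
    start_date: Optional[date] = None
    end_date: Optional[date] = None
    source: str = "admin"
    display_order: int = 0


def board_members(rels: List[ContactCompanyRelationship]) -> List[ContactCompanyRelationship]:
    """Example filter: active board seats, as a researcher might query."""
    return [
        r for r in rels
        if r.relationship_type is RelationshipType.BOARD_MEMBER
        and r.status is RelationshipStatus.ACTIVE
    ]
```

Because one person can hold several rows at once, `is_primary` is a per-row flag rather than a property of the person.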
Pipeline Stages
1. Crawl — fetch /about, /team, /leadership pages
2. Extract — local LLM extracts name+title pairs from page text
3. Match — check against existing profiles (LinkedIn URL → name+company → fuzzy)
4. MVR validation — reject anything missing name or title
5. Stage — insert into review queue (not yet a profile)
6. Review — admin reviews and approves individual members
7. Publish — approved members become unclaimed person profiles
8. Link — auto-create company relationship
9. Verify — profile becomes "verified" on email check, "claimed" when the real person signs up
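
Stages 5 through 9 amount to a record moving through a fixed set of states. The sketch below models that as a transition table. The state names `staged`, `approved`, and `rejected` are invented for illustration. Whether a profile can go straight from unclaimed to claimed is also an assumption.

```python
# Allowed transitions; state names other than unclaimed/verified/claimed are hypothetical.
ALLOWED_TRANSITIONS = {
    "staged": {"approved", "rejected"},   # admin review
    "approved": {"unclaimed"},            # publish
    "unclaimed": {"verified", "claimed"}, # email check or sign-up
    "verified": {"claimed"},              # the real person signs up
}


def advance(current: str, target: str) -> str:
    """Move a record to the target state, or raise if the step is not allowed."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"cannot move from {current!r} to {target!r}")
    return target


state = "staged"
for step in ("approved", "unclaimed", "verified", "claimed"):
    state = advance(state, step)
print(state)  # claimed
```

Rejecting unknown transitions keeps a staged record from being published without passing review.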

Embedding Composition

The embedding vector is computed from a concatenation of multiple fields. Richer input = better semantic matching.

embedding_input = {name} | {super_category} | {description} | {city}, {state} | {b2b_b2c}

// Future: add ai_native, agent_role, funding_stage for richer signal
// Consider: separate embeddings for different search intents
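
Building the input string from the template above could look like this. It is a sketch: the function name is an assumption, and skipping empty fields rather than leaving empty slots is a design choice this document does not specify.

```python
def build_embedding_input(entity: dict) -> str:
    """Join name | super_category | description | city, state | b2b_b2c."""
    location = ", ".join(
        part.strip() for part in (entity.get("city"), entity.get("state")) if part and part.strip()
    )
    parts = [
        entity.get("name"),
        entity.get("super_category"),
        entity.get("description"),
        location,
        entity.get("b2b_b2c"),
    ]
    # Assumption: missing fields are skipped instead of emitting empty segments.
    return " | ".join(part.strip() for part in parts if part and part.strip())


print(build_embedding_input({
    "name": "Acme",
    "super_category": "Software",
    "description": "Builds developer tools.",
    "city": "Austin",
    "state": "TX",
    "b2b_b2c": "B2B",
}))
# Acme | Software | Builds developer tools. | Austin, TX | B2B
```

Because the string is deterministic for a given set of field values, comparing it with the previously embedded string is one cheap way to decide whether a re-embed is needed.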