Enrichment Pipeline

How a domain becomes a fully profiled entity. Each stage feeds the next. Changes cascade.

Legend: Stu (local LLM), Railway (HTTP), Derived (computed), Entry point

Stage 1 — Entity Creation

Entity Ingestion (entry)

A company enters the system via upload, Agent Readiness (AR) scan, MCP tool, or curated dataset load. Its domain is normalized and matched against existing entities.

Creates: name, domain, website, website_raw, address, slug
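
The normalize-then-match step can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function names, the choice to drop a leading "www.", and the in-memory lookup table are all assumptions.

```python
from typing import Dict, Optional
from urllib.parse import urlsplit


def normalize_domain(raw: str) -> str:
    """Reduce a raw website value to a bare, lowercase hostname (assumed rules)."""
    value = raw.strip().lower()
    # urlsplit only populates the host when a scheme separator is present.
    if "://" not in value:
        value = "//" + value
    host = urlsplit(value).hostname or ""
    # Assumption: "www." is treated as noise so both spellings match.
    if host.startswith("www."):
        host = host[4:]
    return host.rstrip(".")


def find_existing(raw: str, known: Dict[str, int]) -> Optional[int]:
    """Return the id of an existing entity with the same normalized domain."""
    return known.get(normalize_domain(raw))
```

For example, `normalize_domain("HTTPS://WWW.Example.com/about?x=1")` and `normalize_domain("example.com/")` both reduce to `example.com`, so the second upload would match the first entity instead of creating a new one.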

Stage 2 — Classification (Stu)

Industry Classification (industry_v2, stu)

Deterministic TYPE_MAP lookup, then LLM fallback (gemma4:e4b), then NAICS/SIC code lookup. 92-99.5% accuracy.

Reads: name, type, description

Writes: super_category, naics6, naics_desc, sic4, sic_desc
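
The lookup-then-fallback order can be sketched as below. The two tables are hypothetical excerpts (the real TYPE_MAP and code tables are not shown here), the category labels are invented for illustration, and `llm` stands in for whatever callable wraps the local model.

```python
from typing import Callable, Dict, Optional

# Hypothetical excerpts; the real tables are not part of this document.
TYPE_MAP: Dict[str, str] = {
    "restaurant": "Food & Beverage",
    "law_firm": "Professional Services",
}
NAICS_BY_CATEGORY: Dict[str, str] = {
    "Food & Beverage": "722511",        # Full-Service Restaurants
    "Professional Services": "541110",  # Offices of Lawyers
}


def classify_industry(
    name: str,
    entity_type: Optional[str],
    description: Optional[str],
    llm: Callable[[str], Optional[str]],
) -> dict:
    """Try the deterministic map first, then the LLM, then attach a code."""
    key = (entity_type or "").strip().lower()
    category = TYPE_MAP.get(key)
    method = "type_map"
    if category is None:
        # Only reached when the deterministic lookup has no answer.
        category = llm(f"Classify the industry of {name!r}: {description or ''}")
        method = "llm" if category else "unclassified"
    naics6 = NAICS_BY_CATEGORY.get(category) if category else None
    return {"super_category": category, "naics6": naics6, "method": method}
```

Recording which branch produced the answer (`method`) is a design choice of this sketch; it makes it possible to measure the deterministic and LLM paths separately.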

B2B/B2C Classification (b2b_v2, stu)

LLM classification with category context. Chain-of-thought prompting.

Reads: name, super_category, description

Writes: b2b_b2c

Employee Estimation (employee_v1, stu)

LLM estimation from name, category, and description context.

Reads: name, super_category

Writes: employee_estimate

Description Generation (description_v1, stu)

LLM-generated 1-2 sentence business description from name, category, and address.

Reads: name, super_category, address, city

Writes: description

AI Classification (ai_native_v1 + agent_role_v1, stu)

Classifies AI adoption level (native/integrated/exploring/none) and agent ecosystem role (platform/tool_provider/infrastructure/consumer).

Reads: name, description, super_category

Writes: ai_native, agent_role, ai_stack

Stage 3 — External Signals (Railway / HTTP)

Agent Readiness Scan (scanDomain(), http)

Live HTTP scan of robots.txt, sitemap, JSON-LD, OpenAPI, MCP manifest, llms.txt, and ai-plugin.json, with soft-404 and Cloudflare detection. In parallel, it probes the developer, docs, api, and developers subdomains for agent endpoints. Scoring uses Algorithm E (Spread, v3): webBase 30 plus AI bonuses, with a ceiling of about 50 for well-behaved SaaS and 100 for true agent-native sites.

Reads: domain

Writes: agent_readiness_score, agent_readiness_details, domain_status, domain_last_checked
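
As a rough illustration of the scan's fan-out, the sketch below builds the list of URLs to probe for one domain. It is not scanDomain() itself: the function name is hypothetical, only signals with a conventional fixed path are included (OpenAPI and MCP manifest locations vary by site), and probing every path on every subdomain is an assumption.

```python
from typing import List, Tuple

# Conventional locations for signals that have a fixed path.
SIGNAL_PATHS = {
    "robots": "/robots.txt",
    "sitemap": "/sitemap.xml",
    "llms_txt": "/llms.txt",
    "ai_plugin": "/.well-known/ai-plugin.json",
}
# Subdomains named in the scan description.
SUBDOMAINS = ("developer", "docs", "api", "developers")


def probe_targets(domain: str) -> List[Tuple[str, str, str]]:
    """Return (host, signal, url) triples for the apex domain and subdomains."""
    hosts = [domain] + [f"{sub}.{domain}" for sub in SUBDOMAINS]
    return [
        (host, signal, f"https://{host}{path}")
        for host in hosts
        for signal, path in SIGNAL_PATHS.items()
    ]
```

The triples can then be handed to any concurrent HTTP client; keeping URL construction separate from fetching makes the probe set easy to inspect and test.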

MX Profiling (mx_profile_v2, railway)

DNS lookup of MX, SPF, DKIM, and DMARC records. Provider detection, underlying mailbox intelligence.

Reads: domain

Writes: mx_provider, mx_score, mx_has_mx, mx_profile
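
Provider detection and DMARC parsing can be sketched on already-resolved records, leaving the DNS lookup itself out. The suffix table is a small illustrative subset and the function names are hypothetical; mx_profile_v2 may work quite differently.

```python
from typing import Iterable, Optional

# Illustrative subset: MX hostname suffix -> provider label.
PROVIDER_SUFFIXES = (
    ("google.com", "Google Workspace"),
    ("googlemail.com", "Google Workspace"),
    ("outlook.com", "Microsoft 365"),
)


def detect_provider(mx_hosts: Iterable[str]) -> Optional[str]:
    """Map resolved MX hostnames to a provider; None when there is no MX."""
    hosts = [h.lower().rstrip(".") for h in mx_hosts]
    for host in hosts:
        for suffix, provider in PROVIDER_SUFFIXES:
            if host == suffix or host.endswith("." + suffix):
                return provider
    return "unknown" if hosts else None


def dmarc_policy(txt_record: str) -> Optional[str]:
    """Extract the p= policy from a DMARC TXT record, or None if not DMARC."""
    tags = {}
    for part in txt_record.split(";"):
        if "=" in part:
            key, value = part.split("=", 1)
            tags[key.strip().lower()] = value.strip()
    if tags.get("v") != "DMARC1":
        return None
    return tags.get("p")
```

For instance, an MX host of `aspmx.l.google.com.` maps to Google Workspace, and the record `v=DMARC1; p=reject` yields the policy `reject`.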

Stage 4 — Derived Attributes

Vector Embedding (nomic-embed-text, stu)

384-dim embedding from the concatenated name, category, description, city, and B2B/B2C label (see Embedding Composition). Used for semantic search and similarity matching.

Reads: name, super_category, description, city, b2b_b2c

Writes: embedding

Triggers on: description changed, super_category changed, name changed

Data Tier Assignment (rules engine, derived)

Assigns tier_1 (curated/scanned/uploaded), tier_2 (regional with domain), or tier_3 (no domain, archived).

Reads: source, domain, website

Writes: data_tier
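
A minimal sketch of the tier rules, assuming the source labels below (the real values of `source` are not listed here) and assuming tier_1 takes precedence over the domain check:

```python
from typing import Optional

# Hypothetical source labels standing in for "curated/scanned/uploaded".
TIER_1_SOURCES = {"curated", "ar_scan", "upload"}


def assign_tier(source: str, domain: Optional[str], website: Optional[str]) -> str:
    """Apply the three tier rules in priority order."""
    if source in TIER_1_SOURCES:
        return "tier_1"
    # Any remaining record with a usable domain or website is tier_2.
    if domain or website:
        return "tier_2"
    # No domain at all: archived tier.
    return "tier_3"
```

Because the function depends only on `source`, `domain`, and `website`, it can be re-run cheaply whenever one of those fields changes.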

Entity Deduplication (matchRow(), derived)

Match scores: domain match (95), phone (85), name + address (80), fuzzy name with Dice ≥ 0.7 (50-65), city boost (+10). Merges are soft and recorded in od_merge_audit.

Reads: domain, name, address, phone, city

Writes: is_active, merged_into_id
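
The scoring ladder can be sketched as below. This is not matchRow() itself: the bigram-based Dice coefficient, the linear mapping of Dice 0.7-1.0 onto 50-65, and applying the city boost only to fuzzy matches are all assumptions made for illustration.

```python
from collections import Counter


def _bigrams(text: str) -> Counter:
    """Character bigrams over lowercase alphanumerics only."""
    s = "".join(ch for ch in text.lower() if ch.isalnum())
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def dice(a: str, b: str) -> float:
    """Sorensen-Dice coefficient on character bigrams."""
    x, y = _bigrams(a), _bigrams(b)
    total = sum(x.values()) + sum(y.values())
    return 2 * sum((x & y).values()) / total if total else 0.0


def _same(a: dict, b: dict, field: str) -> bool:
    """True when both rows have the field and it matches case-insensitively."""
    va, vb = a.get(field), b.get(field)
    return bool(va) and bool(vb) and va.strip().lower() == vb.strip().lower()


def match_score(a: dict, b: dict) -> int:
    """Return a 0-95 match score following the ladder described above."""
    if _same(a, b, "domain"):
        return 95
    if _same(a, b, "phone"):
        return 85
    if _same(a, b, "name") and _same(a, b, "address"):
        return 80
    d = dice(a.get("name") or "", b.get("name") or "")
    if d < 0.7:
        return 0
    # Assumed linear interpolation: Dice 0.7 -> 50, Dice 1.0 -> 65.
    score = 50 + min(1.0, (d - 0.7) / 0.3) * 15
    if _same(a, b, "city"):
        score += 10
    return round(score)
```

Under these assumptions, "Acme Corp" and "Acme Corp." in the same city score 75: the names differ as strings but share every bigram once punctuation is ignored.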

Cascade Rules

When an upstream attribute changes, these downstream attributes must be re-computed:

description changes → re-embed (vector), re-classify b2b_b2c, re-classify ai_native

super_category changes → re-embed (vector), re-classify b2b_b2c, update data_tier, re-estimate employee

name changes → re-embed (vector), re-match (dedup), re-classify industry, regenerate slug

domain changes → re-scan AR, re-profile MX, re-match (dedup), update data_tier

address changes → re-match (dedup), re-embed if city changed

agent_readiness changes → update AR100 ranking, update data_tier if first scan

ai_native changes → compute ai_stack (if native/integrated), update dataset membership
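
These rules map naturally onto a dependency table walked breadth-first. The sketch below is one way to express that; the job names are shorthand invented here, the conditional rules (city changed, first scan, native/integrated) are left out for brevity, and following a chain when a recomputed attribute is itself a rule key is an assumption.

```python
from collections import deque
from typing import Dict, List

# Upstream attribute -> downstream jobs, transcribed from the rules above.
CASCADE: Dict[str, List[str]] = {
    "description": ["embedding", "b2b_b2c", "ai_native"],
    "super_category": ["embedding", "b2b_b2c", "data_tier", "employee_estimate"],
    "name": ["embedding", "dedup", "industry", "slug"],
    "domain": ["agent_readiness", "mx_profile", "dedup", "data_tier"],
    "address": ["dedup"],
    "agent_readiness": ["ar100_ranking"],
    "ai_native": ["ai_stack", "dataset_membership"],
}


def downstream(changed: str) -> List[str]:
    """List every job to re-run after `changed`, in breadth-first order."""
    seen = {changed}
    order: List[str] = []
    queue = deque(CASCADE.get(changed, []))
    while queue:
        job = queue.popleft()
        if job in seen:
            continue
        seen.add(job)
        order.append(job)
        # A recomputed attribute may itself trigger further work.
        queue.extend(CASCADE.get(job, []))
    return order
```

With this table, a domain change schedules the AR scan, MX profile, dedup, and data tier, and then the AR100 ranking as a second-order effect of the new scan.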

Contact Pipeline

Contacts (people) flow through a strict pipeline with provenance and matching rules. We treat contact data carefully because identity claims have higher stakes than company classifications.

Sources of Contacts

first_party — Person claimed their own profile. Highest trust level.

company_team_extraction — Crawled from a company's own /about, /team, or /leadership pages. First-party from the company's perspective.

linkedin_import — From a CSV upload. Admin-only, never auto-published. Requires explicit approval.

admin — Manually created by an OnlyData admin.

Minimum Viable Record (MVR)

A contact must have a name (real human name, 3-60 chars) and a title or role.

Contacts from external sources must NOT include email addresses, phone numbers, or guessed LinkedIn URLs. We attach those only when they come directly from the person.

This rule keeps OnlyData from becoming a scraper-aggregated contact database. We're a first-party identity layer.
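
A minimal sketch of an MVR check, under stated assumptions: the field names (`name`, `title`, `role`, `email`, `phone`) are hypothetical, only the length half of the "real human name" rule is checked, and the guessed-LinkedIn-URL rule is omitted because how a guess is detected is not described here.

```python
from typing import List

# Fields that may only come from the person themselves (assumed names).
PERSON_ONLY_FIELDS = ("email", "phone")


def validate_mvr(contact: dict, source: str) -> List[str]:
    """Return a list of MVR violations; an empty list means the record passes."""
    errors: List[str] = []
    name = (contact.get("name") or "").strip()
    if not 3 <= len(name) <= 60:
        errors.append("name must be 3-60 characters")
    if not (contact.get("title") or contact.get("role")):
        errors.append("title or role is required")
    if source != "first_party":
        for field in PERSON_ONLY_FIELDS:
            if contact.get(field):
                errors.append(f"{field} is not allowed from source {source}")
    return errors
```

Returning all violations at once, rather than failing on the first, gives a reviewer the full picture for each rejected record.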

LinkedIn URL = Canonical Person Identity (1:1)

LinkedIn URL is to a person what a domain is to a company — the canonical key.

One LinkedIn URL maps to exactly one person profile in our database. This is enforced at the database layer.

If we discover a LinkedIn URL after creating a profile, we update — we never duplicate.

Normalization: lowercase, strip trailing slash, strip query params.
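
The normalization rule and the update-never-duplicate behavior can be sketched as follows. The function names are hypothetical, dropping the URL fragment goes slightly beyond the stated rule, and the dictionary stands in for what the database enforces with a unique constraint.

```python
from typing import Dict
from urllib.parse import urlsplit, urlunsplit


def normalize_linkedin_url(url: str) -> str:
    """Lowercase, strip the trailing slash, and drop query params."""
    parts = urlsplit(url.strip().lower())
    path = parts.path.rstrip("/")
    # Empty query and fragment: the fragment drop is an extra assumption.
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


def upsert_profile(profiles: Dict[str, dict], url: str, fields: dict) -> str:
    """Update the profile keyed by the normalized URL, creating it if absent."""
    key = normalize_linkedin_url(url)
    profiles.setdefault(key, {}).update(fields)
    return key
```

Two spellings such as `https://www.LinkedIn.com/in/Jane-Doe/?trk=abc` and `https://www.linkedin.com/in/jane-doe` normalize to the same key, so the second write updates the first profile rather than creating another.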

Matching Pipeline

1. LinkedIn URL exact match — strongest signal, immediate dedup.

2. Name + company match — same normalized name + same company.

3. Name fuzzy match — same normalized name across any company. Warn, never auto-merge.

Normalized names strip suffixes (Jr, Sr, III, IV, PhD, CFA, MBA, MD, Esq, P.E.) and punctuation.
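
The three matching tiers and the name normalization can be sketched together. Assumptions in this sketch: suffixes are stripped only from the end of the name, profiles are plain dictionaries with `name`, `company`, and `linkedin_url` keys, and LinkedIn URLs are already normalized.

```python
import re
from typing import List, Optional, Tuple

# Suffix list from the rule above, after punctuation removal ("P.E." -> "pe").
SUFFIXES = {"jr", "sr", "iii", "iv", "phd", "cfa", "mba", "md", "esq", "pe"}


def normalize_name(name: str) -> str:
    """Lowercase, drop punctuation, and strip trailing suffix tokens."""
    tokens = re.sub(r"[^\w\s]", "", name.lower()).split()
    while tokens and tokens[-1] in SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def match_contact(candidate: dict, profiles: List[dict]) -> Tuple[str, Optional[dict], bool]:
    """Return (tier, matched profile, warn_only) for the strongest match."""
    url = candidate.get("linkedin_url")
    if url:
        for profile in profiles:
            if profile.get("linkedin_url") == url:
                return ("linkedin_url", profile, False)
    name = normalize_name(candidate["name"])
    same_name = [p for p in profiles if normalize_name(p["name"]) == name]
    for profile in same_name:
        if profile.get("company") == candidate.get("company"):
            return ("name_company", profile, False)
    if same_name:
        # Tier 3: surface a warning for review, never merge automatically.
        return ("name_only", same_name[0], True)
    return ("none", None, False)
```

For example, "John Smith Jr., P.E." normalizes to "john smith", so it matches an existing "John Smith" at the same company on tier 2, and only raises a warning if the company differs.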

Contact-Company Relationships

People work on a lot of different things. We model that explicitly so the data reflects reality.

Relationship types: employee, founder, cofounder, board_member, advisor, contractor, consultant, investor, fractional, volunteer, sole_proprietor

Status: active, former, advisory, on_leave

Other fields: is_primary (their main gig), start_date, end_date, source, display_order

Recruiters can filter on primary employment. Researchers can find board members. Investors can find sole operators. The model serves all of them.
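
One way to model these relationships is a small record type with enumerated values. The class and field names below (`ContactCompanyLink`, `person_id`, `company_id`) and the defaults are illustrative assumptions; only the enumerated values and the listed fields come from the description above.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import List, Optional


class RelationshipType(str, Enum):
    EMPLOYEE = "employee"
    FOUNDER = "founder"
    COFOUNDER = "cofounder"
    BOARD_MEMBER = "board_member"
    ADVISOR = "advisor"
    CONTRACTOR = "contractor"
    CONSULTANT = "consultant"
    INVESTOR = "investor"
    FRACTIONAL = "fractional"
    VOLUNTEER = "volunteer"
    SOLE_PROPRIETOR = "sole_proprietor"


class Status(str, Enum):
    ACTIVE = "active"
    FORMER = "former"
    ADVISORY = "advisory"
    ON_LEAVE = "on_leave"


@dataclass
class ContactCompanyLink:
    person_id: int   # hypothetical key names
    company_id: int
    relationship: RelationshipType
    status: Status = Status.ACTIVE
    is_primary: bool = False
    start_date: Optional[date] = None
    end_date: Optional[date] = None
    source: str = "admin"
    display_order: int = 0


def primary_employment(links: List[ContactCompanyLink]) -> List[ContactCompanyLink]:
    """Links a recruiter-style filter would keep: active and marked primary."""
    return [l for l in links if l.is_primary and l.status is Status.ACTIVE]


def board_members(links: List[ContactCompanyLink]) -> List[ContactCompanyLink]:
    """Links a researcher-style filter would keep."""
    return [l for l in links if l.relationship is RelationshipType.BOARD_MEMBER]
```

Keeping one row per person-company pair, rather than a single employer field on the person, is what lets the same person be a founder at one company and a board member at another.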

Pipeline Stages

1. Crawl — fetch /about, /team, /leadership pages

2. Extract — local LLM extracts name+title pairs from page text

3. Match — check against existing profiles (LinkedIn URL → name+company → fuzzy)

4. MVR validation — reject anything missing name or title

5. Stage — insert into review queue (not yet a profile)

6. Review — admin reviews and approves individual members

7. Publish — approved members become unclaimed person profiles

8. Link — auto-create company relationship

9. Verify — profile becomes "verified" on email check, "claimed" when the real person signs up

Embedding Composition

The embedding vector is computed from a concatenation of multiple fields. Richer input = better semantic matching.

embedding_input = {name} | {super_category} | {description} | {city}, {state} | {b2b_b2c}

// Future: add ai_native, agent_role, funding_stage for richer signal
// Consider: separate embeddings for different search intents
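
The template above can be sketched as a small builder. The function name is hypothetical, and skipping empty fields (rather than leaving blank segments between separators) is an assumption about how missing data is handled.

```python
from typing import Optional


def embedding_input(entity: dict) -> str:
    """Join the embedding fields with ' | ', skipping any that are missing."""

    def clean(value: Optional[str]) -> str:
        return (value or "").strip()

    # City and state share one segment: "{city}, {state}".
    location = ", ".join(
        part for part in (clean(entity.get("city")), clean(entity.get("state"))) if part
    )
    segments = [
        clean(entity.get("name")),
        clean(entity.get("super_category")),
        clean(entity.get("description")),
        location,
        clean(entity.get("b2b_b2c")),
    ]
    return " | ".join(segment for segment in segments if segment)
```

For an entity with every field set this yields a string such as `Acme | Manufacturing | Makes anvils. | Tucson, AZ | B2B`, which is then passed to the embedding model.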