FULL-COMPANY ARCHIVE
Company Archive Intelligence
Complete operating histories of real companies - code, business data, communications, documents, and databases - as a training corpus for frontier models.
See the data
Representative records in the exact shape we deliver. Real provenance and full slices are shared under license.
Application database surface (anonymized)
Representative surface-inventory row, not raw data. A database is represented by counts, schema, dates, and size - never document payloads.
{
"archive_id": "arc_company_a",
"surface_id": "srf_0042",
"surface_class": "database",
"source_system": "mongodb",
"record_count": 4200000,
"count_basis": "estimate",
"date_start": "2021-02-01",
"date_end": "2026-05-01",
"storage_bytes": 18800000000,
"schema_summary": "{collections:[users,sessions,events,orders],indexes:12}",
"linkability": "joins to code surface via account_id and to comms via user_id",
"sensitivity": "pii_present_deidentification_required",
"provenance": "founder_owned_consented",
"confidence": "confirmed"
}
Code surface integrity-checked by the evaluator
Representative. count_basis is exact here because it comes from a real repo evaluator run (eval_kit v0.6.2).
{
"archive_id": "arc_company_a",
"surface_id": "srf_0008",
"surface_class": "code",
"source_system": "git",
"record_count": 955,
"count_basis": "exact",
"date_start": "2024-05-01",
"date_end": "2026-06-01",
"storage_bytes": 60290000000,
"schema_summary": "{repos:3,files:982,primary_language:TypeScript,has_test_runner:true}",
"linkability": "issue refs join to comms and project surfaces",
"sensitivity": "secrets_stripped",
"provenance": "founder_owned_consented_evaluated",
"confidence": "confirmed"
}
Broker / trade data surface
Representative. The ambient exhaust an archive deliberately captures - orders, fills, positions, P&L - that ordinary workspace exports miss.
{
"archive_id": "arc_company_b",
"surface_id": "srf_0117",
"surface_class": "market_trade_data",
"source_system": "broker_export",
"record_count": 310000,
"count_basis": "estimate",
"date_start": "2023-01-01",
"date_end": "2025-12-31",
"storage_bytes": 2400000000,
"schema_summary": "{tables:[orders,fills,positions,pnl_snapshots],fields:64}",
"linkability": "strategy configs join to code surface; research notebooks join to docs",
"sensitivity": "no_pii_account_ids_masked",
"provenance": "founder_owned_consented",
"confidence": "needs_review"
}
Record shape
Every field, its type, whether it can be null, and a representative value.
| Field | Type | Constraint | Description |
|---|---|---|---|
| archive_id | string | required | Stable id for one company archive (one founder-owned entity). e.g. arc_company_a |
| surface_id | string | required | Stable id for one data surface within the archive (a repo, DB, mailbox, log store, dataset). e.g. srf_0042 |
| surface_class | string | required | Category: code, database, object_storage, logs, comms, docs, scraped_data, market_trade_data, legal_finance, llm_eval_traces. e.g. database |
| source_system | string | nullable | Originating system or tool. e.g. mongodb |
| record_count | int | nullable | Count of the primary record (commits, rows, files, emails, messages, objects, orders). e.g. 4200000 |
| count_basis | string | required | Whether record_count is exact or an estimate. Estimates are never presented as exact. e.g. estimate |
| date_start | string · ISO-8601 date | nullable | Coverage start. e.g. 2021-02-01 |
| date_end | string · ISO-8601 date | nullable | Coverage end. e.g. 2026-05-01 |
| storage_bytes | int · bytes | nullable | Approximate logical footprint of the surface. e.g. 18800000000 |
| schema_summary | string · json | nullable | Field, table, or collection inventory: names, types, indexes - structure without payload. e.g. {collections:[users,orders],indexes:12} |
| linkability | string | nullable | How this surface joins to others in the archive (shared user/account/ticket/repo keys). e.g. joins to code via issue refs |
| sensitivity | string | required | PII or confidentiality level and de-identification path. e.g. pii_present_deidentified |
| provenance | string | required | Ownership and consent plus how the surface was accessed; founder-owned, consented, sanitized. e.g. founder_owned_consented |
| confidence | string | required | Confidence in company association: confirmed, likely, needs_review. e.g. confirmed |
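A minimal validation sketch for the record shape above, assuming records arrive as parsed JSON objects. The function name `validate_surface_record` and the wording of the problem messages are hypothetical; the field names, required/nullable constraints, and enumerated values are taken from the table. Treating negative counts and a `date_start` later than `date_end` as errors is an added assumption.

```python
from datetime import date

# Required fields per the table; every other field may be null or absent.
REQUIRED = ("archive_id", "surface_id", "surface_class",
            "count_basis", "sensitivity", "provenance", "confidence")

# Enumerated values listed in the table descriptions.
ENUMS = {
    "surface_class": {"code", "database", "object_storage", "logs", "comms",
                      "docs", "scraped_data", "market_trade_data",
                      "legal_finance", "llm_eval_traces"},
    "count_basis": {"exact", "estimate"},
    "confidence": {"confirmed", "likely", "needs_review"},
}


def validate_surface_record(rec: dict) -> list[str]:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for field in REQUIRED:
        value = rec.get(field)
        if not isinstance(value, str) or not value:
            problems.append(f"{field}: required non-empty string")
    for field, allowed in ENUMS.items():
        value = rec.get(field)
        if isinstance(value, str) and value and value not in allowed:
            problems.append(f"{field}: unexpected value {value!r}")
    for field in ("record_count", "storage_bytes"):
        value = rec.get(field)
        # bool is a subclass of int in Python, so reject it explicitly.
        if value is not None and (
            isinstance(value, bool) or not isinstance(value, int) or value < 0
        ):
            problems.append(f"{field}: expected null or a non-negative int")
    parsed = {}
    for field in ("date_start", "date_end"):
        value = rec.get(field)
        if value is None:
            continue
        try:
            parsed[field] = date.fromisoformat(value)
        except (TypeError, ValueError):
            problems.append(f"{field}: expected null or an ISO-8601 date")
    if len(parsed) == 2 and parsed["date_start"] > parsed["date_end"]:
        problems.append("date_start is after date_end")
    return problems
```

Returning a list of problems rather than raising on the first one lets a reviewer see every issue with an inventory row in a single pass.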
Operational Record
Email, chat, documents, and project history from real companies - how work actually happened, end to end.
Systems & Data
Application databases, logs, warehouse queries, and broker/market data - the structured exhaust of a running business.
Decision Trail
Strategy docs, research, and legal/finance materials that connect decisions to outcomes across a company lifecycle.
How it is built
- 01
Founder-owned, consented sourcing
Archives are assembled only from companies the founder owns or founded, with consent. This is the property that separates whole-company corpora from scraped data.
- 02
Whole-surface inventory
Every accessible data surface is inventoried - code and repo history, application databases, logs and telemetry, communications, docs, stored datasets, market and trade data, legal and finance, and LLM/eval traces - represented by counts, schemas, date ranges, sizes, and de-identification notes, never raw payloads.
- 03
Cross-surface linkage
Surfaces are joined where they share keys (a repo to the issues that reference it, a database account to the contract that names it, comms to the work they coordinate) so the archive reads as one operating history, not disconnected dumps.
- 04
Sanitization and de-identification
Credentialed remotes, secrets, and raw PII are stripped; sensitive categories are represented through counts, schema, date ranges, and reviewed sample pathways. Buyer-facing names are sanitized.
- 05
Provenance chain capture
For datasets that went through real buyer diligence, the full chain is preserved: request, DDQ, data room, scoped trial, delivery, compliance confirmation, and legal closeout.
- 06
Confidence labeling and point-in-time
Each surface is tagged confirmed, likely, or needs_review with an exact-vs-estimate basis; histories are dated so an operating period can be replayed point-in-time.
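The confidence and count-basis labels from step 06 lend themselves to simple filtering. The sketch below is illustrative only: `select_surfaces` is a hypothetical helper, and the ordering confirmed > likely > needs_review is an assumption about how the labels rank, not something the record shape states.

```python
# Assumed ranking of the three confidence labels (higher is stronger).
CONFIDENCE_RANK = {"needs_review": 0, "likely": 1, "confirmed": 2}


def select_surfaces(surfaces: list[dict],
                    min_confidence: str = "likely",
                    exact_only: bool = False) -> list[dict]:
    """Keep surfaces at or above a confidence floor, optionally exact counts only."""
    floor = CONFIDENCE_RANK[min_confidence]
    kept = []
    for surface in surfaces:
        # Unknown or missing labels rank below every known label.
        rank = CONFIDENCE_RANK.get(surface.get("confidence"), -1)
        if rank < floor:
            continue
        if exact_only and surface.get("count_basis") != "exact":
            continue
        kept.append(surface)
    return kept
```

Applied to the three representative records above with the defaults, this would keep srf_0042 and srf_0008 and drop srf_0117, which is labeled needs_review.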
How we validate
What each evaluation measures and how it is run. Where no benchmark is published, we show the methodology and say so.
Agentic Operating-History Replay
Measures
Whether an agent, given a company archive up to a point in time, would have taken the operating action the company actually took next - the decision, the code change, the customer reply, the trade.
Method
Reconstruct the linked archive as of a cut date; pose the in-flight situation; withhold subsequent surfaces; compare the agent action to the recorded outcome across code, ops, and comms.
Result
Methodology-stage. No trained-model or benchmark result is reported; this is a data-and-methodology contribution.
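The cut-date step in the method can be sketched at the inventory level. This is an illustration under assumptions, not the evaluation harness: `split_at_cut` is a hypothetical helper that works only from `date_start` and `date_end` in the surface inventory, whereas a real replay would cut individual records inside each surface. Withholding undated surfaces and nulling counts on clipped surfaces are conservative choices made for this sketch.

```python
from datetime import date


def split_at_cut(surfaces: list[dict], cut: str) -> tuple[list[dict], list[dict]]:
    """Partition inventory rows into (visible, withheld) as of an ISO-8601 cut date."""
    cut_day = date.fromisoformat(cut)
    visible, withheld = [], []
    for surface in surfaces:
        start = surface.get("date_start")
        # Undated surfaces cannot be placed in time; withhold them to avoid leakage.
        if start is None or date.fromisoformat(start) > cut_day:
            withheld.append(surface)
            continue
        clipped = dict(surface)  # copy so the caller's rows are not mutated
        end = surface.get("date_end")
        if end is None or date.fromisoformat(end) > cut_day:
            clipped["date_end"] = cut
            # Counts and sizes describe the full range, so they are unknown after clipping.
            clipped["record_count"] = None
            clipped["storage_bytes"] = None
        visible.append(clipped)
    return visible, withheld
```

With the two arc_company_a records above and a cut of 2023-06-30, the database surface would stay visible with its coverage clipped to the cut date, while the code surface (starting 2024-05-01) would be withheld.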
Codebase Integrity / Packaging Check
Measures
Whether a company's repositories are real, substantial, and eval-grade before inclusion.
Method
Run the reference codebase evaluator (eval_kit v0.6.2) per repo to extract commit, PR, issue, CI, and test metrics plus production-quality signals; review before scaling to more repos.
Result
Methodology-stage. These are packaging and integrity checks on real repositories, not an agent-performance benchmark.
Ground truth
What correct means for this data, and how it is established.
Ground truth
The company's own recorded operating history - the code that shipped, the decisions taken, the customer and counterparty interactions, the trades and database state that actually occurred - reconstructed point-in-time from the linked archive.
How it is established
Replay-based comparison against the real recorded outcome. Integrity is anchored to founder-ownership, explicit consent, and a preserved provenance chain (request, DDQ, data room, trial, delivery, compliance, legal closeout). Codebase surfaces are integrity-checked with the reference evaluator before inclusion.
Agreement
Correctness is anchored to the production record and to consent and provenance rather than a separate human-rater pass. No inter-rater agreement figure is published at this stage.
Frontier-Model Training
Whole-company corpora - code plus the business context around it - for models that reason about how real organizations operate.
Agentic Eval Ground Truth
Replay real operational histories to test whether an agent would have matched what a real company actually did.
Acquisition-Ready Packages
Curated, consented, provenance-clean archives assembled for AI-training-data buyers and acquisition programs.
How you load it
Delivery
S3, Parquet, Custom export, Restricted data room / acquisition package
Formats
JSON, Parquet, mbox, SQL, Git bundle
Auth
Restricted acquisition or training-data package under a signed license. Founder-owned, consented, PII-scrubbed; credentialed remotes and secrets stripped; buyer-facing identities sanitized. Sensitive categories are represented as counts, schema, and date ranges with reviewed sample pathways. Full provenance under NDA.
Cadence
One-time archive of the whole-company operating history, optionally re-snapshotted.
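A hedged loading sketch for the JSON format, assuming the surface inventory is delivered as JSON Lines (one record per line). That layout, the helper names `load_inventory` and `bytes_by_class`, and the file name in the comment are all assumptions for illustration; the actual package layout is defined under license.

```python
import json
from collections import defaultdict
from typing import Iterable


def load_inventory(lines: Iterable[str]) -> list[dict]:
    """Parse one JSON object per non-blank line (JSON Lines layout is assumed)."""
    return [json.loads(line) for line in lines if line.strip()]


def bytes_by_class(records: list[dict]) -> dict[str, int]:
    """Sum storage_bytes per surface_class, skipping surfaces with a null size."""
    totals: dict[str, int] = defaultdict(int)
    for rec in records:
        size = rec.get("storage_bytes")
        if size is not None:
            totals[rec["surface_class"]] += size
    return dict(totals)


# Hypothetical usage; "surfaces.jsonl" is a placeholder file name:
#     with open("surfaces.jsonl", encoding="utf-8") as fh:
#         records = load_inventory(fh)
#     print(bytes_by_class(records))
```

Because `load_inventory` accepts any iterable of strings, the same function works on an open file handle or on lines streamed from object storage.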
Request access.
Restricted-scope evaluation access for qualified teams. We share real samples, full schema, and provenance under a mutual NDA.