FULL-COMPANY ARCHIVE

Company Archive Intelligence

Complete operating histories of real companies - code, business data, communications, documents, and databases - as a training corpus for frontier models.

FULL HISTORY
JSON · Parquet · mbox · SQL
One-time archive
Whole-company operating history
Founder-owned, consented + sanitized
10+ surface classes
7-step provenance chain
01 Sample

See the data

Representative records in the exact shape we deliver. Real provenance and full slices are shared under license.

Application database surface (anonymized)

Representative surface-inventory row, not raw data. A database is represented by counts, schema, dates, and size - never document payloads.

archive_manifest.jsonl (representative)
{
  "archive_id": "arc_company_a",
  "surface_id": "srf_0042",
  "surface_class": "database",
  "source_system": "mongodb",
  "record_count": 4200000,
  "count_basis": "estimate",
  "date_start": "2021-02-01",
  "date_end": "2026-05-01",
  "storage_bytes": 18800000000,
  "schema_summary": "{collections:[users,sessions,events,orders],indexes:12}",
  "linkability": "joins to code surface via account_id and to comms via user_id",
  "sensitivity": "pii_present_deidentification_required",
  "provenance": "founder_owned_consented",
  "confidence": "confirmed"
}

Code surface integrity-checked by the evaluator

Representative. count_basis is exact here because it comes from a real repo evaluator run (eval_kit v0.6.2).

archive_manifest.jsonl (representative)
{
  "archive_id": "arc_company_a",
  "surface_id": "srf_0008",
  "surface_class": "code",
  "source_system": "git",
  "record_count": 955,
  "count_basis": "exact",
  "date_start": "2024-05-01",
  "date_end": "2026-06-01",
  "storage_bytes": 60290000000,
  "schema_summary": "{repos:3,files:982,primary_language:TypeScript,has_test_runner:true}",
  "linkability": "issue refs join to comms and project surfaces",
  "sensitivity": "secrets_stripped",
  "provenance": "founder_owned_consented_evaluated",
  "confidence": "confirmed"
}

Broker / trade data surface

Representative. The ambient exhaust an archive deliberately captures - orders, fills, positions, P&L - that ordinary workspace exports miss.

archive_manifest.jsonl (representative)
{
  "archive_id": "arc_company_b",
  "surface_id": "srf_0117",
  "surface_class": "market_trade_data",
  "source_system": "broker_export",
  "record_count": 310000,
  "count_basis": "estimate",
  "date_start": "2023-01-01",
  "date_end": "2025-12-31",
  "storage_bytes": 2400000000,
  "schema_summary": "{tables:[orders,fills,positions,pnl_snapshots],fields:64}",
  "linkability": "strategy configs join to code surface; research notebooks join to docs",
  "sensitivity": "no_pii_account_ids_masked",
  "provenance": "founder_owned_consented",
  "confidence": "needs_review"
}
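
Rows in this shape can be read with the standard library alone. The sketch below assumes the manifest is delivered as one JSON object per line, as the .jsonl name suggests; `load_manifest` and `display_count` are illustrative names, not part of any delivered tooling. It also shows one way a consumer might honor `count_basis` so that an estimate is never rendered as an exact figure.

```python
import json
from typing import Iterable


def load_manifest(lines: Iterable[str]) -> list[dict]:
    """Parse JSONL text: one surface-inventory row per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def display_count(row: dict) -> str:
    """Render record_count so that an estimate is never shown as exact."""
    count = row.get("record_count")
    if count is None:  # record_count is nullable in the schema
        return "unknown"
    if row["count_basis"] == "exact":
        return f"{count:,}"
    return f"~{count:,} (estimate)"


# Two rows abbreviated from the representative samples above.
sample = """
{"surface_id": "srf_0042", "surface_class": "database", "record_count": 4200000, "count_basis": "estimate"}
{"surface_id": "srf_0008", "surface_class": "code", "record_count": 955, "count_basis": "exact"}
"""

rows = load_manifest(sample.splitlines())
for row in rows:
    print(row["surface_id"], row["surface_class"], display_count(row))
```

In a real load, the `lines` argument would typically be an open file handle rather than an inline string.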
02 Schema

Record shape

Every field, its type, whether it can be null, and a representative value.

| Field | Type | Constraint | Description | Example |
|---|---|---|---|---|
| archive_id | string | required | Stable id for one company archive (one founder-owned entity). | arc_company_a |
| surface_id | string | required | Stable id for one data surface within the archive (a repo, DB, mailbox, log store, dataset). | srf_0042 |
| surface_class | string | required | Category: code, database, object_storage, logs, comms, docs, scraped_data, market_trade_data, legal_finance, llm_eval_traces. | database |
| source_system | string | nullable | Originating system or tool. | mongodb |
| record_count | int | nullable | Count of the primary record (commits, rows, files, emails, messages, objects, orders). | 4200000 |
| count_basis | string | required | Whether record_count is exact or an estimate. Estimates are never presented as exact. | estimate |
| date_start | string · ISO-8601 date | nullable | Coverage start. | 2021-02-01 |
| date_end | string · ISO-8601 date | nullable | Coverage end. | 2026-05-01 |
| storage_bytes | int · bytes | nullable | Approximate logical footprint of the surface. | 18800000000 |
| schema_summary | string · json | nullable | Field, table, or collection inventory: names, types, indexes - structure without payload. | {collections:[users,orders],indexes:12} |
| linkability | string | nullable | How this surface joins to others in the archive (shared user/account/ticket/repo keys). | joins to code via issue refs |
| sensitivity | string | required | PII or confidentiality level and de-identification path. | pii_present_deidentified |
| provenance | string | required | Ownership and consent plus how the surface was accessed; founder-owned, consented, sanitized. | founder_owned_consented |
| confidence | string | required | Confidence in company association: confirmed, likely, needs_review. | confirmed |
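
The field list above translates fairly directly into a row check. What follows is a minimal sketch, not a shipped validator: `validate_row` is a hypothetical helper, the allowed values are copied from the field descriptions, and the rule that `date_start` must not fall after `date_end` is an assumption the schema does not state.

```python
from datetime import date

REQUIRED = ("archive_id", "surface_id", "surface_class", "count_basis",
            "sensitivity", "provenance", "confidence")
INT_FIELDS = ("record_count", "storage_bytes")
DATE_FIELDS = ("date_start", "date_end")
# Allowed values as listed in the field descriptions.
ENUMS = {
    "surface_class": ("code", "database", "object_storage", "logs", "comms",
                      "docs", "scraped_data", "market_trade_data",
                      "legal_finance", "llm_eval_traces"),
    "count_basis": ("exact", "estimate"),
    "confidence": ("confirmed", "likely", "needs_review"),
}


def validate_row(row: dict) -> list[str]:
    """Return a list of problems; an empty list means the row fits the schema."""
    errors = []
    for field in REQUIRED:
        if row.get(field) in (None, ""):
            errors.append(f"{field}: required")
    for field, allowed in ENUMS.items():
        value = row.get(field)
        if value not in (None, "") and value not in allowed:
            errors.append(f"{field}: unexpected value {value!r}")
    for field in INT_FIELDS:
        value = row.get(field)
        # type() check rather than isinstance() so that booleans are rejected.
        if value is not None and (type(value) is not int or value < 0):
            errors.append(f"{field}: expected a non-negative int")
    parsed = {}
    for field in DATE_FIELDS:
        value = row.get(field)
        if value is None:  # nullable
            continue
        try:
            parsed[field] = date.fromisoformat(value)
        except (TypeError, ValueError):
            errors.append(f"{field}: expected an ISO-8601 date")
    # Assumption: coverage must not end before it starts.
    if len(parsed) == 2 and parsed["date_start"] > parsed["date_end"]:
        errors.append("date_end: earlier than date_start")
    return errors


good = {
    "archive_id": "arc_company_a", "surface_id": "srf_0042",
    "surface_class": "database", "source_system": "mongodb",
    "record_count": 4200000, "count_basis": "estimate",
    "date_start": "2021-02-01", "date_end": "2026-05-01",
    "storage_bytes": 18800000000,
    "sensitivity": "pii_present_deidentification_required",
    "provenance": "founder_owned_consented", "confidence": "confirmed",
}
print(validate_row(good))
print(validate_row(dict(good, confidence="maybe", date_end="2020-01-01")))
```

Returning a list of messages instead of raising on the first failure lets a reviewer see every problem with a row in one pass.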
03 What's included

Operational Record

Email, chat, documents, and project history from real companies - how work actually happened, end to end.

Systems & Data

Application databases, logs, warehouse queries, and broker/market data - the structured exhaust of a running business.

Decision Trail

Strategy docs, research, and legal/finance materials that connect decisions to outcomes across a company lifecycle.

04 Methodology

How it is built

  1. Founder-owned, consented sourcing

    Archives are assembled only from companies the founder owns or founded, with consent. This is the property that separates whole-company corpora from scraped data.

  2. Whole-surface inventory

    Every accessible data surface is inventoried - code and repo history, application databases, logs and telemetry, communications, docs, stored datasets, market and trade data, legal and finance, and LLM/eval traces - represented by counts, schemas, date ranges, sizes, and de-identification notes, never raw payloads.

  3. Cross-surface linkage

    Surfaces are joined where they share keys (a repo to the issues that reference it, a database account to the contract that names it, comms to the work they coordinate) so the archive reads as one operating history, not disconnected dumps.

  4. Sanitization and de-identification

    Credentialed remotes, secrets, and raw PII are stripped; sensitive categories are represented through counts, schema, date ranges, and reviewed sample pathways. Buyer-facing names are sanitized.

  5. Provenance chain capture

    For datasets that went through real buyer diligence, the full chain is preserved: request, DDQ, data room, scoped trial, delivery, compliance confirmation, and legal closeout.

  6. Confidence labeling and point-in-time

    Each surface is tagged confirmed, likely, or needs-review with an exact-vs-estimate basis; histories are dated so an operating period can be replayed point-in-time.
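
The cross-surface linkage step can be pictured as pairing surfaces that share a key. The sketch below is an illustration under stated assumptions: the manifest's `linkability` field is free text, so the structured `join_keys` list, the `link_surfaces` helper, and the `srf_comms_example` surface are all hypothetical and exist only to make the idea concrete.

```python
from collections import defaultdict
from itertools import combinations


def link_surfaces(surfaces: list[dict]) -> dict[tuple[str, str], set[str]]:
    """Map each pair of surface ids to the set of join keys they share."""
    # Index: join key -> ids of the surfaces that carry it.
    by_key = defaultdict(set)
    for surface in surfaces:
        for key in surface["join_keys"]:
            by_key[key].add(surface["surface_id"])
    # Every pair of surfaces under the same key is linked by that key.
    links = defaultdict(set)
    for key, ids in by_key.items():
        for a, b in combinations(sorted(ids), 2):
            links[(a, b)].add(key)
    return dict(links)


# Hypothetical structured form of the free-text linkability notes.
surfaces = [
    {"surface_id": "srf_0042", "join_keys": ["account_id", "user_id"]},
    {"surface_id": "srf_0008", "join_keys": ["account_id", "issue_ref"]},
    {"surface_id": "srf_comms_example", "join_keys": ["user_id", "issue_ref"]},
]

for pair, keys in sorted(link_surfaces(surfaces).items()):
    print(pair, sorted(keys))
```

Surfaces that never appear in any pair are the disconnected ones, which may be worth flagging for review before an archive is treated as one operating history.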

05 Evals

How we validate

What each evaluation measures and how it is run. Where no benchmark is published, we show the methodology and say so.

Agentic Operating-History Replay

Measures

Whether an agent, given a company archive up to a point in time, would have taken the operating action the company actually took next - the decision, the code change, the customer reply, the trade.

Method

Reconstruct the linked archive as of a cut date; pose the in-flight situation; withhold subsequent surfaces; compare the agent action to the recorded outcome across code, ops, and comms.
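
As a rough illustration of the cut-and-withhold step, the sketch below splits a dated event stream at a cut date and compares a proposed action with the next recorded one. Everything in it is assumed for illustration: the event fields (`at`, `surface_class`, `action`), the made-up events, the treatment of the cut date as inclusive, and the exact-match comparison, which stands in for whatever comparison a real replay would use.

```python
from datetime import date


def split_at_cut(events: list[dict], cut: date) -> tuple[list[dict], list[dict]]:
    """Return (visible, withheld): events dated on or before the cut, and after it."""
    ordered = sorted(events, key=lambda e: e["at"])  # ISO dates sort lexically
    visible = [e for e in ordered if date.fromisoformat(e["at"]) <= cut]
    withheld = [e for e in ordered if date.fromisoformat(e["at"]) > cut]
    return visible, withheld


def matches_next_action(agent_action: str, withheld: list[dict]) -> bool:
    """Placeholder comparison: exact match against the next recorded action."""
    return bool(withheld) and agent_action == withheld[0]["action"]


# Made-up events for illustration only.
events = [
    {"at": "2025-03-04", "surface_class": "comms", "action": "reply_with_fix_eta"},
    {"at": "2025-03-01", "surface_class": "comms", "action": "customer_reports_bug"},
    {"at": "2025-03-02", "surface_class": "code", "action": "open_fix_pr"},
]

visible, withheld = split_at_cut(events, date(2025, 3, 1))
print([e["action"] for e in visible])
print(matches_next_action("open_fix_pr", withheld))
```

Only `visible` would be shown to the agent; `withheld` stays with the grader so the recorded outcome cannot leak into the prompt.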

Result

Methodology-stage. No trained-model or benchmark result is reported; this is a data-and-methodology contribution.

Codebase Integrity / Packaging Check

Measures

Whether a company's repositories are real, substantial, and eval-grade before inclusion.

Method

Run the reference codebase evaluator (eval_kit v0.6.2) per repo to extract commit, PR, issue, CI, and test metrics plus production-quality signals; review before scaling to more repos.

Result

Methodology-stage. These are packaging and integrity checks on real repositories, not an agent-performance benchmark.

06 Graders

Ground truth

What correct means for this data, and how it is established.

Ground truth

The company's own recorded operating history - the code that shipped, the decisions taken, the customer and counterparty interactions, the trades and database state that actually occurred - reconstructed point-in-time from the linked archive.

How it is established

Replay-based comparison against the real recorded outcome. Integrity is anchored to founder-ownership, explicit consent, and a preserved provenance chain (request, DDQ, data room, trial, delivery, compliance, legal closeout). Codebase surfaces are integrity-checked with the reference evaluator before inclusion.

Agreement

Correctness is anchored to the production record and to consent and provenance rather than a separate human-rater pass. No inter-rater agreement figure is published at this stage.

07 Application

Frontier-Model Training

Whole-company corpora - code plus the business context around it - for models that reason about how real organizations operate.

Agentic Eval Ground Truth

Replay real operational histories to test whether an agent would have matched what a real company actually did.

Acquisition-Ready Packages

Curated, consented, provenance-clean archives assembled for AI-training-data buyers and acquisition programs.

08 Environment & integration

How you load it

Delivery

S3, Parquet, Custom export, Restricted data room / acquisition package

Formats

JSON, Parquet, mbox, SQL, Git bundle

Auth

Restricted acquisition or training-data package under a signed license. Founder-owned, consented, PII-scrubbed; credentialed remotes and secrets stripped; buyer-facing identities sanitized. Sensitive categories are represented as counts, schema, and date ranges with reviewed sample pathways. Full provenance under NDA.

Cadence

One-time archive of the whole-company operating history, optionally re-snapshotted.

Request access.

Restricted-scope evaluation access for qualified teams. We share real samples, full schema, and provenance under a mutual NDA.