FULL-COMPANY ARCHIVE

Company Archive Intelligence

Name: Company Archive Intelligence
Creator: Gerra

Complete operating histories of real companies - code, business data, communications, documents, and databases - as a training corpus for frontier models.

FULL HISTORY · JSON · Parquet · mbox · SQL · One-time archive

Whole-company: operating history

Founder-owned: consented + sanitized

10+ surface classes

7-step provenance chain


01 Sample

See the data

Representative records in the exact shape we deliver. Real provenance and full slices are shared under license.

Application database surface (anonymized)

Representative surface-inventory row, not raw data. A database is represented by counts, schema, dates, and size - never document payloads.

archive_manifest.jsonl (representative)

{
  "archive_id": "arc_company_a",
  "surface_id": "srf_0042",
  "surface_class": "database",
  "source_system": "mongodb",
  "record_count": 4200000,
  "count_basis": "estimate",
  "date_start": "2021-02-01",
  "date_end": "2026-05-01",
  "storage_bytes": 18800000000,
  "schema_summary": "{collections:[users,sessions,events,orders],indexes:12}",
  "linkability": "joins to code surface via account_id and to comms via user_id",
  "sensitivity": "pii_present_deidentification_required",
  "provenance": "founder_owned_consented",
  "confidence": "confirmed"
}

Code surface integrity-checked by the evaluator

Representative. count_basis is exact here because it comes from a real repo evaluator run (eval_kit v0.6.2).

archive_manifest.jsonl (representative)

{
  "archive_id": "arc_company_a",
  "surface_id": "srf_0008",
  "surface_class": "code",
  "source_system": "git",
  "record_count": 955,
  "count_basis": "exact",
  "date_start": "2024-05-01",
  "date_end": "2026-06-01",
  "storage_bytes": 60290000000,
  "schema_summary": "{repos:3,files:982,primary_language:TypeScript,has_test_runner:true}",
  "linkability": "issue refs join to comms and project surfaces",
  "sensitivity": "secrets_stripped",
  "provenance": "founder_owned_consented_evaluated",
  "confidence": "confirmed"
}

Broker / trade data surface

Representative. The ambient exhaust an archive deliberately captures - orders, fills, positions, P&L - that ordinary workspace exports miss.

archive_manifest.jsonl (representative)

{
  "archive_id": "arc_company_b",
  "surface_id": "srf_0117",
  "surface_class": "market_trade_data",
  "source_system": "broker_export",
  "record_count": 310000,
  "count_basis": "estimate",
  "date_start": "2023-01-01",
  "date_end": "2025-12-31",
  "storage_bytes": 2400000000,
  "schema_summary": "{tables:[orders,fills,positions,pnl_snapshots],fields:64}",
  "linkability": "strategy configs join to code surface; research notebooks join to docs",
  "sensitivity": "no_pii_account_ids_masked",
  "provenance": "founder_owned_consented",
  "confidence": "needs_review"
}
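The manifest rows above can be read with a few lines of Python. This is a minimal sketch, assuming the delivered `archive_manifest.jsonl` holds one JSON object per line; the two inline rows are abbreviated copies of the samples above, and the helper names `read_manifest` and `bytes_by_class` are illustrative, not part of any shipped tooling. The `schema_summary` field is left out because the sample values are not strict JSON and are best treated as opaque strings.

```python
import json
from collections import defaultdict

# Abbreviated copies of two sample rows shown above (fields trimmed for brevity).
MANIFEST_JSONL = """\
{"archive_id": "arc_company_a", "surface_id": "srf_0042", "surface_class": "database", "record_count": 4200000, "count_basis": "estimate", "storage_bytes": 18800000000, "confidence": "confirmed"}
{"archive_id": "arc_company_b", "surface_id": "srf_0117", "surface_class": "market_trade_data", "record_count": 310000, "count_basis": "estimate", "storage_bytes": 2400000000, "confidence": "needs_review"}
"""


def read_manifest(text):
    """Parse JSON Lines text: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def bytes_by_class(rows):
    """Sum storage_bytes per surface_class, treating null as zero."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["surface_class"]] += row.get("storage_bytes") or 0
    return dict(totals)


rows = read_manifest(MANIFEST_JSONL)

# Surfaces whose company association still needs review.
needs_review = [r["surface_id"] for r in rows if r["confidence"] == "needs_review"]

# Estimated counts should stay labeled as estimates when reported.
estimated = [r["surface_id"] for r in rows if r["count_basis"] == "estimate"]
```

Filtering on `confidence` and `count_basis` before any aggregation keeps estimated or unreviewed surfaces from being reported as confirmed, exact figures.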

02 Schema

Record shape

Every field, its type, whether it can be null, and a representative value.

Field	Type	Constraint	Description	Example
archive_id	string	required	Stable id for one company archive (one founder-owned entity).	arc_company_a
surface_id	string	required	Stable id for one data surface within the archive (a repo, DB, mailbox, log store, dataset).	srf_0042
surface_class	string	required	Category: code, database, object_storage, logs, comms, docs, scraped_data, market_trade_data, legal_finance, llm_eval_traces.	database
source_system	string	nullable	Originating system or tool.	mongodb
record_count	int	nullable	Count of the primary record (commits, rows, files, emails, messages, objects, orders).	4200000
count_basis	string	required	Whether record_count is exact or an estimate. Estimates are never presented as exact.	estimate
date_start	string · ISO-8601 date	nullable	Coverage start.	2021-02-01
date_end	string · ISO-8601 date	nullable	Coverage end.	2026-05-01
storage_bytes	int · bytes	nullable	Approximate logical footprint of the surface.	18800000000
schema_summary	string · json	nullable	Field, table, or collection inventory: names, types, indexes - structure without payload.	{collections:[users,orders],indexes:12}
linkability	string	nullable	How this surface joins to others in the archive (shared user/account/ticket/repo keys).	joins to code via issue refs
sensitivity	string	required	PII or confidentiality level and de-identification path.	pii_present_deidentified
provenance	string	required	Ownership and consent plus how the surface was accessed; founder-owned, consented, sanitized.	founder_owned_consented
confidence	string	required	Confidence in company association: confirmed, likely, needs_review.	confirmed
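The constraints in the table can be checked mechanically before a manifest is used. The sketch below is one possible validator, not a published one: the function name `validate_row` and the exact error strings are assumptions, and the allowed values are copied from the descriptions of `surface_class`, `count_basis`, and `confidence` above. Free-text fields such as `sensitivity` and `provenance` are only checked for presence because the table does not enumerate their values.

```python
from datetime import date

# Fields marked "required" in the schema table.
REQUIRED = ("archive_id", "surface_id", "surface_class", "count_basis",
            "sensitivity", "provenance", "confidence")

# Enumerations taken from the field descriptions in the schema table.
ALLOWED = {
    "surface_class": {"code", "database", "object_storage", "logs", "comms",
                      "docs", "scraped_data", "market_trade_data",
                      "legal_finance", "llm_eval_traces"},
    "count_basis": {"exact", "estimate"},
    "confidence": {"confirmed", "likely", "needs_review"},
}


def validate_row(row):
    """Return a list of problems; an empty list means the row passes."""
    problems = []
    for field in REQUIRED:
        if row.get(field) in (None, ""):
            problems.append(f"missing required field: {field}")
    for field, allowed in ALLOWED.items():
        value = row.get(field)
        if value is not None and value not in allowed:
            problems.append(f"{field}: unexpected value {value!r}")
    for field in ("record_count", "storage_bytes"):
        value = row.get(field)
        if value is not None and (not isinstance(value, int) or value < 0):
            problems.append(f"{field}: expected a non-negative int or null")
    dates = {}
    for field in ("date_start", "date_end"):
        value = row.get(field)
        if value is not None:
            try:
                dates[field] = date.fromisoformat(value)
            except (TypeError, ValueError):
                problems.append(f"{field}: not an ISO-8601 date")
    if len(dates) == 2 and dates["date_start"] > dates["date_end"]:
        problems.append("date_start is after date_end")
    return problems


# The database sample row from the Sample section, minus nullable text fields.
sample = {
    "archive_id": "arc_company_a", "surface_id": "srf_0042",
    "surface_class": "database", "record_count": 4200000,
    "count_basis": "estimate", "date_start": "2021-02-01",
    "date_end": "2026-05-01", "storage_bytes": 18800000000,
    "sensitivity": "pii_present_deidentification_required",
    "provenance": "founder_owned_consented", "confidence": "confirmed",
}
print(validate_row(sample))
```

Returning a list of problems rather than raising on the first one lets a reviewer see every issue in a row at once.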

03 What's included

Operational Record

Email, chat, documents, and project history from real companies - how work actually happened, end to end.

Systems & Data

Application databases, logs, warehouse queries, and broker/market data - the structured exhaust of a running business.

Decision Trail

Strategy docs, research, and legal/finance materials that connect decisions to outcomes across a company lifecycle.

04 Methodology

How it is built

01
Founder-owned, consented sourcing
Archives are assembled only from companies the founder owns or founded, with consent. This is the property that separates whole-company corpora from scraped data.
02
Whole-surface inventory
Every accessible data surface is inventoried - code and repo history, application databases, logs and telemetry, communications, docs, stored datasets, market and trade data, legal and finance, and LLM/eval traces - represented by counts, schemas, date ranges, sizes, and de-identification notes, never raw payloads.
03
Cross-surface linkage
Surfaces are joined where they share keys (a repo to the issues that reference it, a database account to the contract that names it, comms to the work they coordinate) so the archive reads as one operating history, not disconnected dumps.
04
Sanitization and de-identification
Credentialed remotes, secrets, and raw PII are stripped; sensitive categories are represented through counts, schema, date ranges, and reviewed sample pathways. Buyer-facing names are sanitized.
05
Provenance chain capture
For datasets that went through real buyer diligence, the full chain is preserved: request, DDQ, data room, scoped trial, delivery, compliance confirmation, and legal closeout.
06
Confidence labeling and point-in-time
Each surface is tagged confirmed, likely, or needs_review, with an exact-vs-estimate count basis; histories are dated so an operating period can be replayed point-in-time.
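The cross-surface linkage step (03) can be pictured as finding pairs of surfaces that share a join key. The sketch below is illustrative only: the manifest stores `linkability` as free text, so the structured `join_keys` sets, the comms surface id `srf_0201`, and the function `link_surfaces` are all hypothetical. The key names follow the sample rows (account_id, user_id, issue references).

```python
from itertools import combinations

# Hypothetical structured view of `linkability`; the real field is free text.
# srf_0201 is an invented comms surface used only for this example.
surfaces = {
    "srf_0008": {"surface_class": "code", "join_keys": {"account_id", "issue_ref"}},
    "srf_0042": {"surface_class": "database", "join_keys": {"account_id", "user_id"}},
    "srf_0201": {"surface_class": "comms", "join_keys": {"user_id", "issue_ref"}},
}


def link_surfaces(surfaces):
    """Return {(surface_a, surface_b): shared_keys} for each pair sharing a key."""
    links = {}
    for a, b in combinations(sorted(surfaces), 2):
        shared = surfaces[a]["join_keys"] & surfaces[b]["join_keys"]
        if shared:
            links[(a, b)] = shared
    return links


links = link_surfaces(surfaces)
for (a, b), keys in links.items():
    print(a, "<->", b, "via", sorted(keys))
```

A surface that appears in no pair is a disconnected dump in the sense described above, which is a useful thing to flag during inventory.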

05 Evals

How we validate

What each evaluation measures and how it is run. Where no benchmark is published, we show the methodology and say so.

Agentic Operating-History Replay

Measures

Whether an agent, given a company archive up to a point in time, would have taken the operating action the company actually took next - the decision, the code change, the customer reply, the trade.

Method

Reconstruct the linked archive as of a cut date; pose the in-flight situation; withhold subsequent surfaces; compare the agent action to the recorded outcome across code, ops, and comms.
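The cut-date mechanics of this method can be sketched in a few lines. Everything here is an assumption made for illustration: the event records, their field names (`when`, `surface`, `action`), and the exact-match comparison in `matches` are hypothetical, and the source does not specify how an agent action is actually compared to the recorded outcome.

```python
from datetime import date

# Hypothetical, simplified event log drawn from linked surfaces.
events = [
    {"when": date(2025, 3, 1), "surface": "comms", "action": "customer_reply"},
    {"when": date(2025, 3, 4), "surface": "code", "action": "code_change"},
    {"when": date(2025, 3, 9), "surface": "comms", "action": "decision"},
]


def split_at_cut(events, cut):
    """Partition events into (visible, withheld) around the cut date."""
    visible = [e for e in events if e["when"] <= cut]
    withheld = [e for e in events if e["when"] > cut]
    return visible, withheld


def next_recorded_action(withheld):
    """The earliest withheld event is the outcome the agent is compared to."""
    return min(withheld, key=lambda e: e["when"]) if withheld else None


def matches(agent_action, recorded):
    """Placeholder comparison: exact match on the action label."""
    return recorded is not None and agent_action == recorded["action"]


visible, withheld = split_at_cut(events, date(2025, 3, 2))
target = next_recorded_action(withheld)
print(len(visible), "visible;", "target:", target["action"])
```

The important property is that nothing dated after the cut reaches the agent; a realistic comparison would need something richer than label equality.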

Result

Methodology-stage. No trained-model or benchmark result is reported; this is a data-and-methodology contribution.

Codebase Integrity / Packaging Check

Measures

Whether a company's repositories are real, substantial, and eval-grade before inclusion.

Method

Run the reference codebase evaluator (eval_kit v0.6.2) per repo to extract commit, PR, issue, CI, and test metrics plus production-quality signals; review before scaling to more repos.

Result

Methodology-stage. These are packaging and integrity checks on real repositories, not an agent-performance benchmark.

06 Graders

Ground truth

What correct means for this data, and how it is established.

Ground truth

The company's own recorded operating history - the code that shipped, the decisions taken, the customer and counterparty interactions, the trades and database state that actually occurred - reconstructed point-in-time from the linked archive.

How it is established

Replay-based comparison against the real recorded outcome. Integrity is anchored to founder-ownership, explicit consent, and a preserved provenance chain (request, DDQ, data room, trial, delivery, compliance, legal closeout). Codebase surfaces are integrity-checked with the reference evaluator before inclusion.

Agreement

Correctness is anchored to the production record and to consent and provenance rather than a separate human-rater pass. No inter-rater agreement figure is published at this stage.

07 Application

Frontier-Model Training

Whole-company corpora - code plus the business context around it - for models that reason about how real organizations operate.

Agentic Eval Ground Truth

Replay real operational histories to test whether an agent would have matched what a real company actually did.

Acquisition-Ready Packages

Curated, consented, provenance-clean archives assembled for AI-training-data buyers and acquisition programs.

08 Environment & integration

How you load it

Delivery

S3, Parquet, Custom export, Restricted data room / acquisition package

Formats

JSON, Parquet, mbox, SQL, Git bundle

Auth

Restricted acquisition or training-data package under a signed license. Founder-owned, consented, PII-scrubbed; credentialed remotes and secrets stripped; buyer-facing identities sanitized. Sensitive categories are represented as counts, schema, and date ranges with reviewed sample pathways. Full provenance under NDA.

Cadence

One-time archive of the whole-company operating history, optionally re-snapshotted.

09 Related research

Operational Telemetry

Request access.

Restricted-scope evaluation access for qualified teams. We share real samples, full schema, and provenance under a mutual NDA.

Talk to us: team@gerra.com
