CODEBASE TRAINING DATA

Codebase Intelligence

Full-history private codebases - commits, pull requests, reviews, and linked issues - packaged as training data for coding agents and SWE evals.

FULL HISTORY | Git, JSON, Parquet | One-time archive or periodic snapshot

Overview

Repositories: 1.4K+ private, multi-company
Commits: 150K+ with full diff history
Artifacts: PRs, reviews, issues, CI, releases
Languages: Polyglot (web, ML, systems, contracts)
Provenance: Founder-owned, consented, sanitized
Delivery: S3, Git bundle, Parquet

What's included

Source & Review History

Commits, diffs, pull requests, and review threads with author and repo entity resolution. Real engineering decisions, not synthetic tasks.

Issue-to-Code Linkage

Tickets and project history joined to the commits and PRs that resolved them - end-to-end traces for SWE-agent evals.

Build & Release Signal

CI runs, release tags, and deploy history - the full lifecycle from issue to shipped code.

Applications

Coding-Agent Training

Real multi-year SDLC behavior across many codebases - grounded supervision for agents that read, modify, and ship inside real repos.

SWE Eval Harness

Point-in-time repo snapshots with the linked issue and the human PR that fixed it. Score an agent against what real engineers did.
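As a rough illustration of what scoring against the human fix could look like, the sketch below compares the files an agent edited with the files the human PR edited. Everything here is an assumption made for illustration: the `EvalTask` record, its field names, the sample values, and the file-overlap metric are hypothetical and are not the dataset's actual schema or scoring method.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class EvalTask:
    """Hypothetical eval record: snapshot, linked issue, and the human fix."""
    snapshot_commit: str         # commit the agent starts from (assumed field)
    issue_text: str              # linked issue the agent must resolve (assumed field)
    human_files: frozenset[str]  # files changed by the human PR (assumed field)


def file_overlap(task: EvalTask, agent_files: set[str]) -> float:
    """Jaccard overlap between agent-edited files and human-edited files."""
    union = task.human_files | agent_files
    if not union:
        # Neither side changed anything, so treat the result as a full match.
        return 1.0
    return len(task.human_files & agent_files) / len(union)


# Made-up sample values, for illustration only.
task = EvalTask(
    snapshot_commit="abc123",
    issue_text="Login fails when the email contains uppercase letters",
    human_files=frozenset({"auth/login.py", "tests/test_login.py"}),
)
score = file_overlap(task, {"auth/login.py", "auth/utils.py"})
```

File overlap is only a coarse localization signal. A fuller harness would more likely run the repository's tests against the agent's patch at the snapshot commit.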

Provenance-Clean Pretraining

A consented code corpus with surrounding history - distinct from scraped public code of unknown license.

Request trial access.

90-day trial with restricted scope for evaluation. No commitment required.