Abstract

We describe a corpus of 1,400+ private repositories with 150,000+ commits, assembled to ground the training and evaluation of coding agents in real software-engineering behavior rather than synthetic tasks. Each repository carries its full diff history along with pull requests, code reviews, linked issues, CI runs, and release tags.

The corpus is distinguished by two properties. First, issue-to-code linkage: tickets are joined to the specific commits and pull requests that resolved them, yielding end-to-end traces from a stated problem to the change that shipped. Second, clean provenance: the repositories are founder-owned, consented, and sanitized, which separates this corpus from scraped public code of unknown license. This paper describes the corpus and an evaluation methodology built on it. It is a data and methodology contribution and reports no model-training or benchmark results.

Motivation: Synthetic Benchmarks vs Real SDLC

Coding agents are increasingly trained and graded against synthetic or curated benchmarks. Those benchmarks are useful, but they compress software engineering into isolated, self-contained tasks. They tend to omit the surrounding software development life cycle (SDLC): the messy issue that motivated a change, the review comments that reshaped it, the CI signal that gated it, and the release that shipped it. An agent that performs well on a packaged task has not necessarily learned how change happens inside a living codebase.

We argue that the missing ingredient is grounded SDLC data. Real repositories record how engineers actually move from a problem to a merged solution under review and CI. A corpus that preserves those artifacts, and the links between them, can serve as training data that reflects genuine engineering behavior and as an evaluation substrate that asks an agent to reproduce outcomes that humans actually reached. This paper sets out such a corpus and a methodology for using it; it does not report results from either use.

Corpus Composition

The corpus spans more than 1,400 private repositories and more than 150,000 commits. Every commit retains its full diff, and each repository carries the surrounding development artifacts: pull requests with their discussion, code reviews with inline comments, the issues those changes resolved, CI runs, and release tags. Provenance is uniform across the corpus: every repository is founder-owned, consented, and sanitized.

Dimension	Value
Repositories	1,400+ private
Commits	150,000+
Diff history	Full, per-commit
Pull requests	Included, with discussion
Code reviews	Included, inline comments
Linked issues	Joined to resolving commits and PRs
CI runs	Included
Release tags	Included
Provenance	Founder-owned, consented, sanitized

1,400+ repositories: private, founder-owned and consented for this use.
150,000+ commits: each retained with full per-commit diff history.
Artifacts per repo: diffs, pull requests, code reviews, linked issues, CI runs, and release tags.

The intent of capturing all six artifact types together is to preserve context that is usually discarded when code is collected in isolation. A diff on its own loses the reason it was written and the scrutiny it passed through; kept alongside its issue, review, CI result, and release, the same diff becomes part of a complete account of a change.

Issue-to-Code Linkage

The defining feature of the corpus is that artifacts are joined rather than merely co-located. Tickets are linked to the specific commits and pull requests that resolved them, so each resolved issue carries an end-to-end trace from the stated problem to the change that addressed it, through the review and CI that accompanied it.

Trace stage	Artifact	Role in trace
Problem	Linked issue / ticket	States the task and acceptance context
Change	Commits + diffs	The exact edits that addressed the issue
Proposal	Pull request	Groups the change with author rationale
Scrutiny	Code review	Inline comments, requested changes, approvals
Verification	CI run	Build and test signal at merge time
Outcome	Release tag	Anchors the change in a shipped version

These traces are what make the corpus more than a pile of code. They turn an issue and its resolution into a single object: a problem statement, the exact edits that answered it, the discussion that refined those edits, the CI signal at merge, and the release that carried them. That structure is the basis for the evaluation methodology described next.
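To make the shape of a trace concrete, the following is a minimal sketch of how one resolved issue and its joined artifacts might be represented in memory. The class name `IssueTrace`, its field names, and the `missing_stages` helper are hypothetical and chosen for illustration; they are not the corpus's actual storage schema, and the source does not state whether every trace carries an artifact for every stage.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class IssueTrace:
    """One resolved issue joined to its artifacts (hypothetical schema)."""
    issue_id: str
    issue_text: str                             # Problem: the stated task
    commit_shas: List[str]                      # Change: resolving commits, oldest first
    pull_request_id: Optional[str] = None       # Proposal: the PR grouping the change
    review_comments: List[str] = field(default_factory=list)  # Scrutiny
    ci_passed: Optional[bool] = None            # Verification: CI signal at merge time
    release_tag: Optional[str] = None           # Outcome: the shipped version

    def missing_stages(self) -> List[str]:
        """Return the trace stages that have no recorded artifact."""
        checks = {
            "Problem": bool(self.issue_text),
            "Change": bool(self.commit_shas),
            "Proposal": self.pull_request_id is not None,
            "Scrutiny": bool(self.review_comments),
            "Verification": self.ci_passed is not None,
            "Outcome": self.release_tag is not None,
        }
        return [stage for stage, present in checks.items() if not present]


# Illustrative values only; none of these identifiers come from the corpus.
trace = IssueTrace(
    issue_id="ISSUE-1",
    issue_text="Crash on empty input",
    commit_shas=["abc123"],
    pull_request_id="PR-7",
    ci_passed=True,
)
print(trace.missing_stages())
```

A check such as `missing_stages` is one way to see which traces are complete from problem to release before they are used for training or evaluation.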

Point-in-Time SWE Evaluation Methodology

The issue-to-code traces support a point-in-time SWE evaluation. The idea is to reconstruct the repository as it stood while a given issue was still open, just before the resolving change landed, present the agent with the issue and that historical state, and then compare the agent's proposed change against the human pull request that actually resolved the issue.

The methodology proceeds along the recorded trace:

Reconstruct state. Use commit history and release tags to recover the repository at the point in time just before the resolving change, so the agent sees only what was available then.
Pose the task. Provide the linked issue as the problem statement, without exposing the resolving commits or pull request.
Score against the human PR. Compare the agent's output to the actual pull request that resolved the issue. That pull request is the ground-truth change: a human shipped it, and it passed review and CI.
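The first two steps can be sketched as follows, under simplifying assumptions that the source does not specify: the branch history is given as a linear list of commit identifiers, oldest first, and the resolving commits are known from the trace. The function names and the task dictionary layout are hypothetical. Real histories contain merges and would need a more careful choice of snapshot commit.

```python
from typing import Dict, Sequence


def first_resolving_index(history: Sequence[str], resolving: Sequence[str]) -> int:
    """Index of the earliest resolving commit in an oldest-first history."""
    resolving_set = set(resolving)
    for i, sha in enumerate(history):
        if sha in resolving_set:
            return i
    raise ValueError("no resolving commit found in history")


def build_task(issue_text: str, history: Sequence[str],
               resolving: Sequence[str]) -> Dict[str, object]:
    """Assemble what the agent may see: the issue and the pre-fix state."""
    first = first_resolving_index(history, resolving)
    if first == 0:
        raise ValueError("resolving change has no earlier commit to snapshot")
    # Everything from the first resolving commit onward is withheld.
    visible = list(history[:first])
    return {
        "issue": issue_text,
        "base_commit": visible[-1],   # repository state just before the fix
        "visible_history": visible,
    }


# Illustrative identifiers only.
task = build_task("Fix crash on empty input",
                  ["c1", "c2", "c3", "c4", "c5"],
                  resolving=["c4", "c5"])
print(task["base_commit"])
```

In practice the `base_commit` would then be checked out into a clean working tree so that the agent operates on the historical state rather than on the current head.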

Because the corpus records the real resolving change for each linked issue, this evaluation grades agents against outcomes that were genuinely reached in production engineering rather than against constructed answers. We describe the methodology here; we do not report scores from running it.
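The methodology does not fix a particular comparison function. As one hypothetical and deliberately coarse signal, the sketch below measures how much the set of files touched by the agent's patch overlaps with the set touched by the human pull request (Jaccard overlap). It assumes both changes are available as git-format diffs whose paths contain no spaces; the function names are illustrative, and this is not presented as the scoring rule of the methodology.

```python
from typing import Set


def touched_files(git_diff: str) -> Set[str]:
    """Collect file paths from the 'diff --git a/<path> b/<path>' header lines."""
    files = set()
    for line in git_diff.splitlines():
        if line.startswith("diff --git "):
            new_path = line.split(" ")[-1]          # the "b/<path>" token
            files.add(new_path[2:] if new_path.startswith("b/") else new_path)
    return files


def file_overlap(agent_diff: str, human_diff: str) -> float:
    """Jaccard overlap between the file sets of two diffs, in [0, 1]."""
    agent, human = touched_files(agent_diff), touched_files(human_diff)
    if not agent and not human:
        return 1.0  # two empty changes are treated as identical
    return len(agent & human) / len(agent | human)


# Illustrative diffs only; hunks are omitted because only headers are read.
human_pr = (
    "diff --git a/app/io.py b/app/io.py\n"
    "diff --git a/tests/test_io.py b/tests/test_io.py\n"
)
agent_patch = "diff --git a/app/io.py b/app/io.py\n"
print(file_overlap(agent_patch, human_pr))
```

File overlap says only whether the agent edited the same places as the human, not whether its edits are correct, so it would at most complement stronger checks.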

Provenance, Consent, and Privacy

The repositories are founder-owned, contributed with consent, and sanitized before use. This is the property that separates the corpus from scraped public code, where license status is frequently unknown and consent is absent. Here the chain of provenance is explicit: the owners agreed to the use, and the material was processed to remove sensitive content prior to inclusion.

Clean provenance matters for both intended uses. As training data, it gives a defensible basis for the material an agent learns from. As an evaluation substrate, it allows the issue-to-code traces to be used without the licensing ambiguity that clouds public-code benchmarks. We treat provenance and sanitization as first-class attributes of the corpus rather than afterthoughts.

Limitations & Conclusion

We are explicit about the scope of this work. This is a data and methodology contribution: we describe the composition of the corpus, the issue-to-code linkage that gives it end-to-end traces, and a point-in-time evaluation methodology built on those traces. We do not, in this paper, train any model on the corpus, run the evaluation, or report any scores; no such results exist yet.

Other limits follow from this scope. The corpus is founder-owned private code, which gives clean provenance but a particular and bounded distribution rather than a representative sample of all software. The value of the issue-to-code traces and of the point-in-time methodology is, at this stage, structural and described rather than measured. Quantifying that value, by training on the corpus and by running the evaluation, is future work.

In sum, we contribute a provenance-clean corpus of 1,400+ private repositories with 150,000+ commits and full SDLC history, the issue-to-code linkage that turns resolved tickets into end-to-end traces, and a methodology for point-in-time SWE evaluation that scores an agent against the human pull request that actually resolved an issue. The grounded training and evaluation these enable are the intended uses; results from them are deliberately left to subsequent work.

A Provenance-Clean Codebase Corpus for Coding Agents and SWE Evaluation