
Software Provenance Tracking at the Scale of Public Source Code

Abstract : We study the possibilities for tracking the provenance of software source code artifacts within the largest publicly accessible corpus of source code, the Software Heritage archive, with over 4 billion unique source code files and 1 billion commits capturing their development histories across 50 million software projects. We perform a systematic and generic estimate of the replication factor across the different layers of this corpus, analysing how often the same artifacts (e.g., SLOC, files or commits) appear in different contexts (e.g., files, commits or source code repositories). We observe a combinatorial explosion in the number of identical source code files across different commits. To discuss the implications of these findings, we benchmark different data models for capturing software provenance information at this scale, and we identify a viable solution, based on the properties of isochrone subgraphs, that is deployable on commodity hardware, is incremental, and appears to be maintainable for the foreseeable future. Using these properties, we quantify, at a scale never achieved previously, the growth rate of original (i.e., never-seen-before) source code files and commits, and find it to be exponential over a period of more than 40 years.
Document type : Journal articles

Cited literature [61 references]
Contributor : Stefano Zacchiroli
Submitted on : Wednesday, April 15, 2020 - 4:20:48 PM
Last modification on : Friday, August 5, 2022 - 11:54:58 AM


Files produced by the author(s)


  • HAL Id : hal-02543794, version 1



Guillaume Rousseau, Roberto Di Cosmo, Stefano Zacchiroli. Software Provenance Tracking at the Scale of Public Source Code. Empirical Software Engineering, Springer Verlag, 2020. ⟨hal-02543794⟩


