J.-F. Abramatic, R. Di Cosmo, and S. Zacchiroli, Building the universal archive of source code, Communications of the ACM, vol.61, issue.10, pp.29-31, 2018.
URL : https://hal.archives-ouvertes.fr/hal-02157125

R. Albert and A. Barabási, Statistical mechanics of complex networks, Reviews of modern physics, vol.74, issue.1, p.47, 2002.

C. V. Alexandru, S. Panichella, and H. C. Gall, Reducing redundancies in multi-revision code analysis, IEEE 24th International Conference on Software Analysis, Evolution and Reengineering, pp.148-159, 2017.

C. V. Alexandru, S. Panichella, S. Proksch, and H. C. Gall, Redundancy-free analysis of multi-revision software artifacts, Empirical Software Engineering, vol.24, issue.1, pp.332-380, 2019.

M. Allamanis and C. A. Sutton, Mining source code repositories at massive scale using language modeling, Proceedings of the 10th Working Conference on Mining Software Repositories, MSR '13, pp.207-216, 2013.

T. J. Bergin, A history of the history of programming languages, Commun. ACM, vol.50, issue.5, pp.69-74, 2007.

M. Biazzini and B. Baudry, May the fork be with you: novel metrics to analyze collaboration on github, Proceedings of the 5th International Workshop on Emerging Trends in Software Metrics, pp.37-43, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01085400

H. Borges, A. Hora, and M. T. Valente, Understanding the factors that impact the popularity of github repositories, 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp.334-344, 2016.

F. P. Brooks, The Mythical Man-Month: Essays on Software Engineering, 1978.

M. Caneill, D. M. Germán, and S. Zacchiroli, The Debsources dataset: two decades of free and open source software, Empirical Software Engineering, vol.22, issue.3, pp.1405-1437, 2017.

M. Capraro and D. Riehle, Inner source definition, benefits, and challenges, ACM Computing Surveys (CSUR), vol.49, issue.4, p.67, 2017.

K. Crowston, K. Wei, J. Howison, and A. Wiggins, Free/libre open-source software development: What we know and what we do not know, ACM Comput. Surv, vol.44, issue.2, 2008.

J. Davies, D. M. Germán, M. W. Godfrey, and A. Hindle, Software bertillonage: determining the provenance of software development artifacts, Empirical Software Engineering, vol.18, issue.6, pp.1195-1237, 2013.

R. Di Cosmo and S. Zacchiroli, Software Heritage: Why and how to preserve software source code, Proceedings of the 14th International Conference on Digital Preservation, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01590958

S. N. Dorogovtsev and J. F. F. Mendes, Evolution of networks, Advances in Physics, vol.51, pp.1079-1187, 2002.

R. Dyer, H. A. Nguyen, H. Rajan, and T. N. Nguyen, Boa: A language and infrastructure for analyzing ultra-large-scale software repositories, Proceedings of the 2013 International Conference on Software Engineering, pp.422-431, 2013.

D. M. Germán, M. Di Penta, Y.-G. Guéhéneuc, and G. Antoniol, Code siblings: Technical and legal implications of copying code between applications, Proceedings of the 6th International Working Conference on Mining Software Repositories, MSR 2009, pp.81-90, 2009.

A. Gkortzis, D. Mitropoulos, and D. Spinellis, VulinOSS: a dataset of security vulnerabilities in open-source systems, Proceedings of the 15th International Conference on Mining Software Repositories, MSR 2018, pp.18-21, 2018.

M. W. Godfrey, Understanding software artifact provenance, Sci. Comput. Program, vol.97, pp.86-90, 2015.

M. W. Godfrey, D. M. German, J. Davies, and A. Hindle, Determining the provenance of software artifacts, Proceedings of the 5th International Workshop on Software Clones, IWSC '11, pp.65-66, 2011.

M. W. Godfrey and J. Whitehead (eds.), Proceedings of the 6th International Working Conference on Mining Software Repositories, MSR 2009 (Co-located with ICSE), 2009.

G. Gousios, M. Pinzger, and A. van Deursen, An exploratory study of the pull-based software development model, Proceedings of the 36th International Conference on Software Engineering, pp.345-355, 2014.

G. Grieco, G. L. Grinblat, L. Uzal, S. Rawat, J. Feist et al., Toward large-scale vulnerability discovery using machine learning, Proceedings of the Sixth ACM Conference on Data and Application Security and Privacy, CODASPY '16, pp.85-96, 2016.

A. E. Hassan, The road ahead for mining software repositories, Frontiers of Software Maintenance, pp.48-57, 2008.

L. Hatton, D. Spinellis, and M. van Genuchten, The long-term growth rate of evolving software: Empirical results and implications, Journal of Software: Evolution and Process, vol.29, issue.5, 2017.

I. Herraiz, D. Rodríguez, G. Robles, and J. M. González-Barahona, The evolution of the laws of software evolution: A discussion based on a systematic literature review, ACM Comput. Surv, vol.46, issue.2, 2013.

T. Ishio, R. G. Kula, T. Kanda, D. M. German, and K. Inoue, Software Ingredients: Detection of Third-Party Component Reuse in Java Software Release, 2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR), pp.339-350, 2016.

J. Jiang, D. Lo, J. He, X. Xia, P. S. Kochhar et al., Why and how developers fork what from whom in GitHub, Empirical Software Engineering, vol.22, issue.1, pp.547-578, 2017.

M. M. Lehman, On understanding laws, evolution, and conservation in the large-program life cycle, Journal of Systems and Software, vol.1, pp.213-221, 1980.

J. Leskovec and R. Sosič, SNAP: A general-purpose network analysis and graph-mining library, ACM Transactions on Intelligent Systems and Technology (TIST), vol.8, issue.1, p.1, 2016.

D. A. Levin, P. M. Pedersen, and A. C. Shah, Resolving license dependencies for aggregations of legally protectable content, 2009.

F. Li and V. Paxson, A large-scale empirical study of security patches, Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, CCS '17, pp.2201-2215, 2017.

C. V. Lopes, P. Maj, P. Martins, V. Saini, D. Yang, J. Zitny, H. Sajnani, and J. Vitek, DéjàVu: a map of code duplicates on GitHub, PACMPL, vol.1, issue.OOPSLA, pp.84:1-84:28, 2017.

Y. Ma, C. Bogart, S. Amreen, R. Zaretzki, and A. Mockus, World of Code: an infrastructure for mining the universe of open source VCS data, Proceedings of the 16th International Conference on Mining Software Repositories, MSR 2019, pp.143-154, 2019.

V. Markovtsev and W. Long, Public Git Archive: a big code dataset for all, Proceedings of the 15th International Conference on Mining Software Repositories, MSR 2018, pp.34-37, 2018.

M. Martinez and M. Monperrus, Mining software repair models for reasoning on the search space of automated program fixing, Empirical Software Engineering, vol.20, issue.1, pp.176-205, 2015.
URL : https://hal.archives-ouvertes.fr/hal-00903804

R. C. Merkle, A digital signature based on a conventional encryption function, Advances in Cryptology - CRYPTO '87, A Conference on the Theory and Applications of Cryptographic Techniques, vol.293, pp.369-378, 1987.

A. Mockus, Amassing and indexing a large sample of version control systems: Towards the census of public source code history, Proceedings of the 2009 6th IEEE International Working Conference on Mining Software Repositories, MSR '09, pp.11-20, 2009.

M. Newman, A. Barabási, and D. J. Watts, The Structure and Dynamics of Networks (Princeton Studies in Complexity), 2006.

A. Pietri, D. Spinellis, and S. Zacchiroli, The Software Heritage graph dataset: public software development under one roof, Proceedings of the 16th International Conference on Mining Software Repositories, MSR 2019, pp.138-142, 2019.

A. Rastogi and N. Nagappan, Forking and the sustainability of the developer community participation - an empirical investigation on outcomes and reasons, IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER), vol.1, pp.102-111, 2016.

D. Rattan, R. Bhatia, and M. Singh, Software clone detection: A systematic review, Information and Software Technology, vol.55, issue.7, pp.1165-1199, 2013.

G. Rousseau and M. Biais, Computer Tool for Managing Digital Documents, 2010.

C. K. Roy and J. R. Cordy, A survey on software clone detection research, 2007.

Y. Semura, N. Yoshida, E. Choi, and K. Inoue, CCFinderSW: Clone detection tool with flexible multilingual tokenization, 24th Asia-Pacific Software Engineering Conference, pp.654-659, 2017.

D. Spinellis, A repository of Unix history and evolution, Empirical Software Engineering, vol.22, issue.3, pp.1372-1404, 2017.

M. Squire, The lives and deaths of open source code forges, Proceedings of the 13th International Symposium on Open Collaboration, vol.15, pp.1-15, 2017.

K.-J. Stol and B. Fitzgerald, Inner source - adopting open source development practices in organizations: a tutorial, IEEE Software, vol.32, issue.4, pp.60-67, 2014.

Proceedings of the 16th International Conference on Mining Software Repositories, MSR 2019, 2019.

J. Svajlenko and C. K. Roy, Fast and flexible large-scale clone detection with CloneWorks, Proceedings of the 39th International Conference on Software Engineering, pp.27-30, 2017.

S. Thummalapenta, L. Cerulo, L. Aversano, and M. Di Penta, An empirical study on the maintenance of source code clones, Empirical Software Engineering, vol.15, issue.1, pp.1-34, 2010.

F. Thung, T. F. Bissyandé, D. Lo, and L. Jiang, Network structure of social coding in GitHub, 2013 17th European Conference on Software Maintenance and Reengineering, pp.323-326, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00790772

N. M. Tiwari, G. Upadhyaya, and H. Rajan, Candoia: a platform and ecosystem for mining software repositories tools, Proceedings of the 38th International Conference on Software Engineering, ICSE 2016, pp.759-764, 2016.

T. Tuunanen, J. Koskinen, and T. Kärkkäinen, Automated software license analysis, Automated Software Engineering, vol.16, issue.3-4, pp.455-490, 2009.

C. Vendome, A large scale study of license usage on github, 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, vol.2, pp.772-774, 2015.

R. Waldin and J. Zhang, Determining a document similarity metric, 2009.

Y. Wu, Y. Manabe, T. Kanda, D. M. Germán, and K. Inoue, Analysis of license inconsistency in large collections of open source projects, Empirical Software Engineering, vol.22, issue.3, pp.1194-1222, 2017.

Proceedings of the 15th International Conference on Mining Software Repositories, MSR 2018, 2018.

T. Zimmermann, R. Premraj, and A. Zeller, Predicting defects for Eclipse, Predictor Models in Software Engineering, pp.9-9, 2007.

T. Zimmermann, P. Weißgerber, S. Diehl, and A. Zeller, Mining version histories to guide software changes, 26th International Conference on Software Engineering (ICSE), pp.563-572, 2004.