Published prose is edited and proofread. So one might expect developers to use many more words (some real, most not) than English books. And yet, this is not the case. The 151,717 methods in our dataset contain almost 3 million words. This code was written by hundreds of different developers, and together they managed to use only 8,211 unique words, or 5,693 unique stems. What's even more interesting is that 5,480 of those words are recognised as valid English words. Notice that the software we used for identifying English words recognised just 72% of the unique words from the Brown and Gutenberg corpora (of actual printed English texts), which means that its knowledge of English is very limited. And yet, this tool recognised around 67% of the unique letter sequences from Pharo's source code as valid English words. Despite having more words in total, the Pharo corpus has far fewer unique words than both natural English corpora. Both the Brown and Gutenberg corpora use around 40,000 unique words, while the identifier names of Pharo use only 8,211 unique words.

Identifier names are created using a very simplistic, limited, and highly repetitive vocabulary.
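Vocabulary figures of this kind come from splitting identifier names into words and counting the distinct ones. A minimal Python sketch of such a count follows. The sample identifiers, the helper names `split_identifier` and `vocabulary`, and the camelCase splitting rule are all assumptions made for illustration; they are not the tooling used in the study, and stemming and English-word recognition are left out.

```python
import re
from collections import Counter

# Hypothetical sample of Pharo-style selectors (not taken from the dataset).
IDENTIFIERS = ["addAll:", "removeFirst", "isEmpty", "addFirst:", "printOn:", "isEmptyOrNil"]


def split_identifier(name):
    """Split a camelCase identifier into lowercase words, dropping non-letters."""
    # First alternative: an uppercase run not followed by a lowercase letter (acronym).
    # Second alternative: an optionally capitalised run of lowercase letters.
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", name)
    return [part.lower() for part in parts]


def vocabulary(names):
    """Count how often each word occurs across all identifiers."""
    counts = Counter()
    for name in names:
        counts.update(split_identifier(name))
    return counts


counts = vocabulary(IDENTIFIERS)
total_words = sum(counts.values())   # all word occurrences
unique_words = len(counts)           # size of the vocabulary

print(total_words, unique_words)
print(counts.most_common(3))
```

On this made-up sample, 14 word occurrences reduce to 10 unique words, because words such as `add`, `is`, `empty`, and `first` recur across identifiers. The same effect, at a much larger scale, is what keeps the unique-word count low relative to the total.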

Interesting Findings

Here we summarize some interesting facts about the source code that we have discovered during our analysis.

• The three largest packages in our dataset are Bloc, Roassal2, and Iceberg-TipUI. They have 469, 403, and 287 classes respectively. However, these are extreme cases; most packages are much smaller.

• Out of 13,935 classes in our dataset, 25% have just 3 methods and 50% of classes have no more than 6 methods.

• Across the 151,717 methods from the internal projects in our dataset (excluding data methods), the average number of lines of code in Pharo methods is 5.8. However, this number is not representative because only 30% of methods have 6 or more lines. The distribution is right-skewed, so it is better to look at the median.
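The point about the average being unrepresentative for a right-skewed distribution can be illustrated with a small Python sketch. The list of method lengths is entirely hypothetical: it is constructed so that its mean is 5.8 and 30% of its values are 6 or more, matching the two figures quoted above. The median of this made-up sample is not the median found in the study.

```python
from statistics import mean, median

# Hypothetical method lengths in lines of code (not the dataset).
# A few long methods pull the mean well above the typical value.
method_lengths = [1, 1, 2, 2, 3, 3, 4, 6, 12, 24]

avg = mean(method_lengths)
med = median(method_lengths)

# Share of methods that are at least 6 lines long.
share_six_or_more = sum(n >= 6 for n in method_lengths) / len(method_lengths)

print(f"mean = {avg}, median = {med}, share with 6+ lines = {share_six_or_more:.0%}")
```

In this sample most methods are shorter than the mean, so the median describes a typical method better than the average does.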

• DataFrame has the highest proportion of test methods. Almost 56% of its methods are tests.

• About a quarter of source code is taken by message sends. In the source code of methods, message sends (method names) take 27.3% of tokens and 26

• 22.5% of source code are string literals and 19.4% are literal arrays. Together, literals take 44% of characters in source code.

Balmas, Software metric for Java and C++ practices (Squale deliverable 1.1), 2009.

A. Bergel, Agile Visualization, 2016.

Bergel, Classbox/J: Controlling the scope of change in Java, Proceedings of 20th International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA'05), pp.177-189, 2005.
URL : https://hal.archives-ouvertes.fr/inria-00533461

Pharo by Example. Square Bracket Associates, 2009.

P. Bunge, Scripting browsers with Glamour. Master's thesis, 2009.

S. R. Chidamber and C. F. Kemerer, A metrics suite for object oriented design, IEEE Transactions on Software Engineering, vol.20, issue.6, pp.476-493, 1994.

F. Deißenböck and M. Pizka, Concise and consistent naming, International Workshop on Program Comprehension (IWPC 2005), pp.97-106, 2005.

Demeyer, Object-Oriented Reengineering Patterns, 2002.

Gamma, Design Patterns: Elements of Reusable Object-Oriented Software, 1995.

Gamma, Design patterns: Abstraction and reuse of object-oriented design, Proceedings ECOOP '93, vol.707, pp.406-431, 1993.

A. Goldberg, Smalltalk 80: the Interactive Programming Environment, 1984.

A. Goldberg and D. Robson, Smalltalk 80: the Language and its Implementation, 1983.

A. Goldberg and D. Robson, Smalltalk-80: The Language, 1989.

D. H. Ingalls, Design principles behind Smalltalk, Byte, vol.6, issue.8, pp.286-298, 1981.

A. C. Kay, A personal computer for children of all ages, Proceedings of the ACM National Conference, 1972.

A. C. Kay, Microelectronics and the personal computer, Scientific American, vol.237, issue.3, pp.230-240, 1977.

A. C. Kay, The early history of Smalltalk, ACM SIGPLAN Notices, vol.28, pp.69-95, 1993.

H. Kucera and W. Francis, A standard corpus of present-day edited American English, for use with digital computers, 1967.

S. Lahiri, Complexity of word collocation networks: A preliminary structural analysis, Proceedings of the Student Research Workshop at the 14th Conference of the European Chapter of the Association for Computational Linguistics, pp.96-105, 2014.

R. C. Martin, Clean code: a handbook of agile software craftsmanship. Pearson Education, 2009.

Polito, Scoped extension methods in dynamically-typed languages, The Art, Science, and Engineering of Programming, vol.2, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01609310