Predicting Text Readability with Personal Pronouns

While the classic Readability Formula exploits word and sentence length, we aim to test whether Personal Pronouns (PPs) can be used to predict text readability with similar accuracy. With this motivation, we first calculated the readability scores of randomly selected texts of nine genres from the British National Corpus (BNC). Then we used Multiple Linear Regression (MLR) to determine the degree to which readability could be explained by any of the 38 individual or combinational subsets of various PPs in their orthographical forms (including I, me, we, us, you, he, him, she, her (the Objective Case), it, they and them). Results show that (1) subsets of plural PPs can be more predictive than those of singular ones; (2) subsets of Objective forms can make better predictions than those of Subjective ones; (3) both the subsets of first- and third-person PPs show stronger predictive power than those of second-person PPs; (4) adding the article the to the subsets could only improve the prediction slightly. Reevaluation with resampled texts from the BNC verifies the practicality of using PPs as an alternative approach to predicting text readability.


Introduction
The history of predicting textual readability quantitatively dates back to the 1940s, when several linguists including Rudolf [1], George [2], Dale and Chall [3] introduced readability formulas into the field, thus unleashing a wave of research and applications. By 2017, Web of Science had listed more than 11,000 studies on readability, and its applications have moved from education to administration, commerce, computing, the military, scientific research, etc. [4][5][6].
Traditional readability studies usually start with vocabulary and sentence complexity. For instance, the most widely recognized Flesch Reading Ease Formula uses word length (in syllables) and sentence length (in words) as variables to calculate readability; the Dale-Chall Readability Formula uses the number of words that are not in the Dale-Chall 3000 Vocabulary and sentence length as criteria for predicting readability; the Gunning Fog Formula [7] and the SMOG Formula [8] employ the number of polysyllabic words and sentence length as measures of readability.
As computer technologies improve, many other factors are taken into account, such as type-token ratio, numbers of affixes, prepositional phrases and clauses, cohesive ties, other linguistic features [9], and even L2 learners' reading experience [10]. While these studies are valuable and significant, they usually involve multiple indirect indices that are subjectively defined or difficult to calculate in large-scale analysis. For example, it is hard to tell whether a word such as factory, which has two or more phonetic variants, should be counted as 2 syllables (/ˈfæktrɪ/) or 3 syllables (/ˈfæktəri/). Besides, most of the classic formulae target texts in English (and some other syllabic languages); their applicability to non-syllabic languages such as Chinese remains untested.
In this research, we hope to test whether Personal Pronouns (hereinafter referred to as PPs) alone have any predictive power for readability. There are several reasons for trying them: (1) Given that PPs are always monosyllabic words used to replace full personal names or noun phrases, their usage in a text would affect its total word number, average sentence length and average word length; (2) PPs are often used anaphorically and can thus serve as cohesive ties to reduce redundancy and improve comprehension; (3) PPs were only tested collectively in [11] and [12] as part of linguistic features or cohesive ties, and those studies reached different conclusions on the role PPs play in readability prediction.
Since most languages have pronouns, we propose that PPs could be promising candidate indicators of readability across languages and deserve further investigation. In this study, we use a corpus-based approach to test the utility of individual PP forms in English texts of different genres. Specific research questions are as follows:
(1) Which person (first-, second-, or third-person, hereinafter referred to as 1P, 2P and 3P respectively) of PPs can predict text readability most accurately?
(2) Which number (Singular or Plural) of PPs can predict text readability more accurately?
(3) Which case (Subjective or Objective, with Possessive temporarily excluded) of PPs can predict text readability more accurately?
Section 2 and Section 3 will introduce our research methods and data processing, Section 4 will report the results from five aspects, Section 5 will reevaluate the results and Section 6 will summarize our major findings and limitations.

Materials and Methodology
This research uses a corpus-based method and examines the predictability of various subsets of the PP forms (as shown in

Corpus data
The British National Corpus (BNC) was chosen as our data source for the following reasons: (1) All text materials in the BNC were collected from native speakers as representative samples of Standard British English. Errors in pronoun use by non-native speakers have therefore been excluded to a large extent; variations in geographical and social dialects should have been reasonably controlled or avoided as well.
(2) The BNC contains approximately 100 million words, 90% of which are written materials collected from nine domains (also referred to as "genres" hereinafter). Given the influence of genres on usages of PPs [13], proper sampling of this balanced general corpus allows for control over the genre variable that may affect readability.
Text materials used in this study (Corpus I) consist of 1,091,347 words in total, randomly selected from each of the nine domains. Corpus II consists of 972,490 words in total.

Readability Formula
In the present study, we choose the Flesch Reading Ease Score, which is recognized as the most widely used, most tested and most reliable formula [6].
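For reference, the standard Flesch Reading Ease score is 206.835 - 1.015 × (words / sentences) - 84.6 × (syllables / words), with higher scores indicating easier text. The following minimal Python sketch illustrates the computation. It is not the Perl program used in this study; the function names and the vowel-group syllable heuristic are assumptions introduced only for illustration.

```python
import re


def flesch_from_counts(n_words, n_sentences, n_syllables):
    """Standard Flesch Reading Ease score from raw counts."""
    return (206.835
            - 1.015 * (n_words / n_sentences)
            - 84.6 * (n_syllables / n_words))


def count_syllables(word):
    """Crude heuristic (an assumption, not the study's method):
    count runs of consecutive vowel letters, dropping a final silent 'e'."""
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and n > 1:
        n -= 1
    return max(n, 1)


def flesch_reading_ease(text):
    """Score a text using naive sentence and word splitting."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word and one sentence")
    syllables = sum(count_syllables(w) for w in words)
    return flesch_from_counts(len(words), len(sentences), syllables)


if __name__ == "__main__":
    # 100 words, 5 sentences, 150 syllables
    print(round(flesch_from_counts(100, 5, 150), 3))  # 59.635
```

Because the syllable counter is heuristic, a word with variant pronunciations such as factory always receives one fixed count (three here), which illustrates the counting ambiguity noted in the Introduction.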

Data Processing
Data processing is divided into 4 steps:
(1) Use a Perl program to count word and sentence length;
(2) Calculate the Flesch Reading Ease scores of the sample texts of the nine genres respectively;
(3) Use AntConc to count the numbers of PP forms. Tokens of US as the abbreviation of the United States and tokens of the Possessive her are excluded during retrieval. After that, the densities of the individual pronouns (D(I), D(we), etc.), based on the total word number of each text domain, are calculated respectively;
(4) Use SPSS for multiple linear regression analysis. Take the density of each subset of PPs as an independent variable, and the Flesch Reading Ease score as the dependent variable. Use the Sig., coefficient of determination (R²) and adjusted R² values to determine which subset(s) of PPs may have better predictability. The criteria and process for determining moderate and strong fitting subsets are shown in Fig. 1.

Results in Fig. 2 show that the 3P group has the best fitting degrees, with 5 subsets (over 40%) of strong fitting and 2 (nearly 10%) of medium fitting. The mixed (1P+3P) group performs similarly well, with 3 subsets (nearly 30%) of strong fitting and another 2 (nearly 10%) of good fitting, far better than the 1P and 2P subsets. Therefore, it can be concluded that 3P subsets perform better than 1P and 2P subsets in both individual and mixed subsets, which means that adding 1P and 2P forms to the 3P subsets lowers their predictability.

First, we use D(the) to predict text readability and obtain a medium performance (Sig. = 0.019, R² = 0.570, adjusted R² = 0.509). Results in Fig. 5 show that subsets with the included perform slightly better than those without the in the good and strong fitting ranges. To test whether adding the to the PP subsets makes a significant difference, we use a chi-square test and conclude that the improvement is not significant (chi-square value = 0.213, df = 2, p = 0.899 > 0.05).
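The pronoun density, the adjusted R² criterion and the chi-square p-value mentioned above can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only and is not the SPSS procedure used in the study. The function names are hypothetical, density is assumed to be a plain count-to-word ratio, and the adjusted R² check assumes n = 9 observations (one per domain) with a single predictor D(the), which the text implies but does not state explicitly.

```python
import math


def pronoun_density(pronoun_count, total_words):
    """Density of a pronoun form, assumed here to be count / total words."""
    return pronoun_count / total_words


def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    dof = n_obs - n_predictors - 1
    if dof <= 0:
        raise ValueError("need more observations than predictors + 1")
    return 1 - (1 - r2) * (n_obs - 1) / dof


def chi2_p_value_df2(chi2):
    """Upper-tail p-value of a chi-square statistic with exactly 2 degrees
    of freedom, where the survival function reduces to exp(-x / 2)."""
    return math.exp(-chi2 / 2)


if __name__ == "__main__":
    # Assumed setup: 9 domains, 1 predictor (D(the)), reported R^2 = 0.570
    print(round(adjusted_r2(0.570, 9, 1), 3))   # 0.509
    # Reported chi-square value 0.213 with df = 2
    print(round(chi2_p_value_df2(0.213), 3))    # 0.899
```

Under these assumptions both computed values agree with the figures reported for D(the) and for the chi-square test.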

Reevaluation for Strong Fitting Subsets
All the subsets with a strong fitting degree are shown in Table 4. To explore whether subsets with strong predictive power perform consistently, we repeated the procedures in Section 3 with resampled texts from the BNC (Corpus II) and recalculated the pronoun and readability data in the new corpus. Results from both Corpus I and II are shown in Table 4.
Table 4 shows that there are still two subsets with a strong fitting degree in Corpus II, namely "he + him + she + her + it" and "I + me + he + him + she + her + it".
Although the fitting degree of the other subsets changes somewhat, most of them remain in the moderate fitting range, indicating fair predictability.

Conclusion
A corpus-based approach is used in this research to explore the readability predictability of predictability. Therefore, we believe that using specific subsets of PPs to predict text readability appears practical.
However, large-scale tests are needed before any solid conclusion can be drawn concerning the applicability of PPs for readability prediction. Detailed investigation into the predictability of Possessive PPs, and of it in the Subjective and Objective Cases, may be needed as well. Besides, it remains to be verified whether texts in other geographical varieties such as American English behave similarly to their British counterparts.

Fig. 2. Results for predictability of different Persons on readability in Corpus I

Number and Readability
The 38 individual and combinational subsets of PPs can be divided into three groups according to Number (singular PPs: 12 subsets, plural PPs: 9 subsets, singular + plural PPs: 17 subsets).

Fig. 3 shows that 50% of the singular-Number group offer good prediction (with strong and/or medium fitness), and nearly 45% (11.1% + 33.3%) of the plural-Number group show good prediction. The mixed-Number group does not perform as well.

Fig. 3. Results for predictability of different Numbers on readability in Corpus I

Case and Readability
The 38 individual and combinational subsets of PPs can be divided into three groups according to Case (Subjective PPs: 9 subsets; Objective PPs: 9 subsets; Subjective + Objective PPs: 20 subsets).

Fig. 4 shows that the Objective PP group has much stronger predictability than the Subjective group and the mixed-Case group, in both the good and the strong fitting ranges.

Fig. 4. Results for predictability of different Cases on readability in Corpus I

Fig. 5. Results for predictability of including and excluding the on readability

77 subsets with various personal pronoun forms and the definite article the. The results show that: (1) them has the best predictive power among individual pronoun forms; (2) 3P and 1P make better predictions than 2P; (3) plural PPs outperform singular ones only in the strong fitting range; (4) Objective PPs can predict more accurately than Subjective ones; (5) the definite article the may only improve the subsets' predictability slightly; (6) retesting results are consistent for those PP subsets with good

Table 1) on text readability in terms of Person, Number and Case. It should be noted that the Possessive Case is not taken into consideration in this research, nor does this paper look into the gender issue. Thus (he+she) and (him+her) are considered as the individual Subjective and Objective singular forms of 3P+HUMAN respectively; it is considered as the individual singular form of 3P-HUMAN with unclear Case; and you as the only 2P form, with unclear Number and Case. Consequently, there are 38 reasonable subsets of PP forms: 10 subsets with only individual PP forms, and 28 others with various Person/Number/Case combinations.

Table 1. Personal pronoun forms studied in this project

Table 2 shows that texts from the Belief, Arts and Imagination domains are the easiest to understand, with the highest readability scores among all texts from the nine domains; texts from Commerce, Natural Science, Applied Science and World Affairs are the most difficult to read, with the lowest scores.

Table 2. Readability results for nine domains in BNC

Table 4. Personal pronoun subsets with strong fitness in Corpus I and II