Text Data Mining of English Interviews

An “interview” is a technique for efficiently obtaining, through conversation, the particular information that the interviewer wants to know. In this paper, we metrically analyzed English interviews from Larry King Live on CNN and compared them with English news (CNN Live Today) and the inaugural addresses of three U.S. Presidents. Specifically, the frequency characteristics of character- and word-appearance were investigated using a program written in C++. These characteristics were approximated by an exponential function. Furthermore, we calculated the percentage of American basic vocabulary to obtain the difficulty level, as well as the K-characteristic, of each material.


Introduction
People talk with one another all the time. In everyday life we obtain information from others, using many effective techniques to elicit a cooperative response. An "interview" is a more specific way of talking: it is a technique for efficiently obtaining, through conversation, the particular information that the interviewer wants to know [1].

Method of Analysis and Materials
The materials analyzed here are as follows.
Larry King Live (Jan. 21, 2004 to July 13, 2004; 20 materials in total). Larry King Live is one of CNN's highest-rated shows, and Mr. King is regarded as the first American talk show host to have a worldwide audience. He was born in Brooklyn, New York, on November 19, 1933, and educated at Lafayette High School [2]. We selected 20 interviews and analyzed the interviewer's English, that is, the utterances of Mr. King. For reference, the interviewees' data are shown in Table 1. The interviewees are male in Materials 1 to 10 and female in Materials 11 to 20.
For comparison, we analyzed 20 English news materials from CNN Live Today aired on January 2-31, 2003, as well as the inaugural addresses of three U.S. Presidents: George Bush (Jan. 20, 1989), William J. Clinton (Jan. 21, 1993), and George W. Bush (Jan. 20, 2001).
The analysis program is written in C++. In addition to the characteristics of character- and word-appearance for each material, it can extract various other information, such as the "number of sentences," the "number of paragraphs," the "mean word length," and the "number of words per sentence" [3].

Characteristics of Character-appearance
First, the most frequently used characters in each material and their frequencies were derived. Then, the frequencies of the 50 most frequently used characters, including capital letters, small letters, and punctuation marks, were plotted in descending order. The vertical axis shows the frequency and the horizontal axis shows the rank order of character-appearance. The vertical axis is on a logarithmic scale. As an example, the result for Material 1 is shown in Fig. 1.
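The ranking step described above can be sketched as follows. This is only an illustration in Python, not the authors' C++ program: the function name `ranked_char_frequencies`, the decision to ignore whitespace, and the use of percentages as the frequency unit are all assumptions made for this sketch.

```python
from collections import Counter


def ranked_char_frequencies(text, top_n=50):
    """Return [(char, percent), ...] for the top_n most frequent characters."""
    # Assumption: whitespace is ignored; letters keep their case, so "A" and
    # "a" are counted separately, and punctuation marks are counted as well.
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return []
    counts = Counter(chars)
    # Sort by descending count; ties are broken by character for determinism.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    total = len(chars)
    return [(ch, 100.0 * n / total) for ch, n in ranked[:top_n]]


if __name__ == "__main__":
    sample = "Do you think so? Yes, I do."
    for rank, (ch, pct) in enumerate(ranked_char_frequencies(sample, 5), 1):
        print(rank, repr(ch), round(pct, 2))
```

Plotting the returned percentages against their rank on a logarithmic vertical axis would give a curve of the kind described for Fig. 1.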

Fig. 1. Frequency characteristics of character-appearance in Larry King Live.
There is an inflection point caused by the change in the rate of decrease between the 13th- and 14th-ranked characters, and the rate of decrease becomes slightly higher after the 26th character. This characteristic curve was approximated by the following exponential function:
$y = c \exp(-bx)$
where y is the frequency and x is the rank. From this function, we are able to derive the coefficients c and b [4]. In the case of Material 1, c is 11.181 and b is 0.1086. The distribution of the coefficients c and b extracted from each material is shown in Fig. 2.
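One way to obtain c and b for a function of the form y = c exp(-bx) is an ordinary least-squares fit on the logarithm of the frequencies. The text does not say which fitting procedure was used, so the sketch below is an assumption; the function name `fit_exponential` is hypothetical.

```python
import math


def fit_exponential(frequencies):
    """Fit y = c * exp(-b * x) to ranked frequencies (x = 1, 2, ...).

    Assumption: the fit is done by linear least squares on ln(y), which
    turns the model into ln(y) = ln(c) - b * x.
    """
    if len(frequencies) < 2 or any(y <= 0 for y in frequencies):
        raise ValueError("need at least two strictly positive frequencies")
    xs = list(range(1, len(frequencies) + 1))
    ln_ys = [math.log(y) for y in frequencies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_ln_y = sum(ln_ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (ly - mean_ln_y) for x, ly in zip(xs, ln_ys))
    slope = sxy / sxx
    intercept = mean_ln_y - slope * mean_x
    return math.exp(intercept), -slope  # (c, b)


if __name__ == "__main__":
    # Synthetic data generated from the Material 1 coefficients in the text.
    ys = [11.181 * math.exp(-0.1086 * x) for x in range(1, 51)]
    print(fit_exponential(ys))
```

A log-space fit weights small frequencies more heavily than a direct nonlinear fit would, so coefficients obtained this way on real data may differ slightly from those reported.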

There is a linear relationship between c and b across all 43 materials. Previously, we analyzed various English writings and reported that there is a positive correlation between the coefficients c and b, and that the more journalistic the material is, the lower the values of c and b are, whereas the more literary it is, the higher they are [5]. The values of c and b for the interviews are low: c ranges from 8.0567 (Material 5) to 11.605 (Material 11) and b from 0.0848 to 0.1099, compared with the CNN news (c from 10.009 to 13.548, b from 0.1039 to 0.1279) and the inaugural addresses (c from 13.484 to 15.461, b from 0.1309 to 0.1434). Thus, while the interviews show a tendency similar to journalism, the inaugural addresses are similar to literary writings.

Characteristics of Word-appearance
Next, the 20 most frequently used words in some of the materials are shown in Table 2.
Table 2. High-frequency words for each material.
The definite article THE, the personal pronouns YOU and I, and the auxiliary DO (DID) are often used in the interviews. In addition, interrogatives such as WHAT and WHO are also used frequently in Materials 11 and 13. The personal pronoun YOU ranks as the most frequently used word in 8 of the interviews in which the interviewee was female; the exceptions are Materials 14 and 19, in which YOU ranks 2nd. Thus, YOU tends to be used more often when the interviewee is female. For the interviews and the CNN news, some content words such as PRESIDENT and POLICE rank high because the number of words in each material is small.
Just as in the case of characters, the frequencies of the 50 most frequently used words in each material were plotted. Each characteristic curve was approximated by the same exponential function, $y = c \exp(-bx)$. The distribution of c and b is shown in Fig. 3.
As a method of characterizing the words used in a piece of writing, the statistician Udny Yule proposed an index called the "K-characteristic" in 1944 [6]. It expresses the richness of vocabulary in a writing by measuring the probability that any randomly selected pair of words is identical. Yule used this index in an attempt to identify the author of The Imitation of Christ. The K-characteristic is defined as follows:
$K = 10^4 \left( S_2 / S_1^2 - 1 / S_1 \right)$
where, if there are $f_i$ words used $x_i$ times in a writing, $S_1 = \sum_i x_i f_i$ and $S_2 = \sum_i x_i^2 f_i$.
In the future, we would like to investigate the relationship between the K-characteristic and the coefficients for word-appearance.
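As a rough illustration of the K-characteristic, the sketch below uses the commonly cited form of Yule's index, K = 10^4 (S2 - S1) / S1^2, which is algebraically the same as 10^4 (S2 / S1^2 - 1 / S1), with S1 the sum of x_i f_i and S2 the sum of x_i^2 f_i. The tokenizer is a hypothetical stand-in; how the original program splits text into words is not stated.

```python
import re
from collections import Counter


def tokenize(text):
    """Hypothetical tokenizer: upper-cased runs of letters and apostrophes."""
    return re.findall(r"[A-Z]+(?:'[A-Z]+)*", text.upper())


def yule_k(words):
    """Yule's K = 10^4 * (S2 - S1) / S1^2 for a list of word tokens."""
    if not words:
        raise ValueError("need at least one word")
    occurrences = Counter(words)              # word -> x (times it is used)
    spectrum = Counter(occurrences.values())  # x -> f (words used x times)
    s1 = sum(x * f for x, f in spectrum.items())      # total number of tokens
    s2 = sum(x * x * f for x, f in spectrum.items())
    return 1e4 * (s2 - s1) / (s1 * s1)


if __name__ == "__main__":
    print(yule_k(tokenize("Do you think so? What do you think?")))
```

A text in which every word occurs exactly once gives K = 0, and K grows as the same words are repeated more often, so a lower K indicates a richer vocabulary.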

Degree of Difficulty
To show how difficult the materials are for listeners, we derived the degree of difficulty of each material from the variety of words and their frequency [7]. That is, we defined two parameters to measure difficulty: one for word-type or word-sort ($D_{ws}$), and the other for the frequency or number of words ($D_{wn}$). Both parameters are calculated from the American basic vocabulary contained in each material [7]. The closer the value is to 1, the more difficult the material. When we previously analyzed English textbooks used in Japanese junior and senior high schools, $D_{ws}$ increased as the grade level went up. This supports the validity of using the variety and frequency of words in the American basic vocabulary as parameters for extracting difficulty [7]. According to Fig. 5, the difficulty of the interviews ranges from 0.722 (Material 2) to 0.782 (Material 6), which is almost identical to that of half of the news materials. The difficulties of the three inaugural addresses are high: 0.782 to 0.808. The most difficult interview (Material 6) is almost equal to the easiest of the inaugural addresses. As for $D_{wn}$, because the most frequently used words in each material, that is, THE, OF, TO, AND, IN, A, etc., are common to every material, and the characteristics of word-appearance are also similar among them, the range of values for $D_{wn}$ is assumed to be narrow.
Thus, we calculated the values of both $D_{ws}$ and $D_{wn}$ to show how difficult the materials are for listeners and how the level of English in each material compares with that of the others. To make the difficulty easier for the general public to judge, we derived a single difficulty parameter from $D_{ws}$ and $D_{wn}$ using the following principal component analysis:
$z = a_1 D_{ws} + a_2 D_{wn}$
where $a_1$ and $a_2$ are the weights used to combine $D_{ws}$ and $D_{wn}$. Using the variance-covariance matrix, the first principal component z was extracted as $z = 0.349 D_{ws} + 0.9374 D_{wn}$, from which we calculated the principal component scores. The results are shown in Fig. 6.
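For two variables, the weights of the first principal component of a variance-covariance matrix can be obtained in closed form. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the sample (n - 1) covariance is assumed, and the scores are computed without centering, following the form of the equation for z given in the text.

```python
import math


def first_principal_component(xs, ys):
    """Return unit weights (a1, a2) of the first principal component."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long samples with n >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Entries of the sample variance-covariance matrix (assumed n - 1 divisor).
    sxx = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    syy = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    # Angle of the direction of largest variance for a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return math.cos(theta), math.sin(theta)


def component_scores(xs, ys, weights):
    """Scores z = a1 * x + a2 * y (uncentered, as in the equation for z)."""
    a1, a2 = weights
    return [a1 * x + a2 * y for x, y in zip(xs, ys)]
```

Here `xs` would hold the $D_{ws}$ values and `ys` the $D_{wn}$ values of the materials. Centering the data before computing scores would shift every score by the same constant and would not change the ordering of the materials.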

Other Characteristics
Other metrical characteristics of each material were compared. The results for the "mean word length," the "number of words per sentence," etc. are shown together in Table 3.
Table 3. Metrical data for each material.
Although we counted the "frequency of relatives," the "frequency of modal auxiliaries," etc., some of the counted words may have been used as other parts of speech, because we did not check the meaning of each word. The results for the "mean word length" and the "number of words per sentence" of each material are shown in Fig. 7 and Fig. 8, respectively.
Mean Word Length. The "mean word length" is 5.129 (Material 5) to 5.546 letters (Material 8) for Materials 1 to 10, and 5.249 (Material 20) to 5.562 letters (Material 13) for Materials 11 to 20, which is low compared with the CNN news and the inaugural addresses. As many as 13 of the 20 CNN news materials have a longer mean word length than the interviews. Moreover, 4 interviews in which the interviewee was male have a shorter mean word length than the interviews in which the interviewee was female. Thus, we can see that when the interviewee is male, the male interviewer tends to use shorter words.
In this case, as many as 12 of the 20 CNN news materials are longer than Material 20. From this point of view as well, the interview materials seem to be easier to listen to than the CNN news and the inaugural addresses.
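The two metrics discussed above can be sketched as follows. The counting rules here are assumptions made for illustration (a word is a run of letters with optional internal apostrophes, only letters count toward length, and sentences end at ".", "!" or "?"), so the values will not necessarily match those produced by the original program.

```python
import re

# Assumed definition of a word: letters, optionally joined by apostrophes.
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")


def mean_word_length(text):
    """Average number of letters per word (apostrophes are not counted)."""
    words = WORD_RE.findall(text)
    if not words:
        return 0.0
    letters = sum(sum(ch.isalpha() for ch in word) for word in words)
    return letters / len(words)


def words_per_sentence(text):
    """Average number of words per sentence, splitting on ., ! and ?."""
    sentences = [s for s in re.split(r"[.!?]+", text) if WORD_RE.search(s)]
    if not sentences:
        return 0.0
    return sum(len(WORD_RE.findall(s)) for s in sentences) / len(sentences)
```

One known limitation of this simple splitter is that abbreviations such as "Mr." are treated as sentence ends, which would lower the number of words per sentence.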

Frequency of Auxiliaries. We also examined the "frequency of auxiliaries." Broadly speaking, there are two kinds of auxiliaries. One kind expresses tense and voice: BE, which forms the progressive and the passive; HAVE, which forms the perfect; and DO, which is used in interrogative and negative sentences. The other kind is the modal auxiliary, such as WILL or CAN, which expresses the mood or attitude of the speaker [8]. In this study, we targeted only modal auxiliaries. The "frequency of auxiliaries" is highest in the inaugural addresses, where the average of the 3 materials is 2.261%, and lowest in the interviews, where the average is 0.915% for Materials 11 to 20 and 0.922% for Materials 1 to 10. Therefore, it might be said that while the Presidents tend to communicate subtle thoughts and feelings with auxiliary verbs, Larry King's style of talking is more assertive.
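A word-list count of modal auxiliaries, expressed as a percentage of all words, might look like the sketch below. The list `MODALS` is an assumption: the text names only WILL and CAN, and the list actually used is not given. Like the original analysis, this matches word forms only and does not check the part of speech.

```python
import re

# Assumed word list; the source names only WILL and CAN as examples.
MODALS = frozenset(
    ["WILL", "WOULD", "CAN", "COULD", "MAY", "MIGHT", "SHALL", "SHOULD", "MUST"]
)


def modal_frequency(text, modals=MODALS):
    """Percentage of word tokens that appear in the modal word list."""
    words = re.findall(r"[A-Z]+(?:'[A-Z]+)*", text.upper())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in modals)
    return 100.0 * hits / len(words)
```

Contracted forms such as "I'll" or "can't" are kept as single tokens by this tokenizer and are therefore not counted, which is a simplification.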
Frequency of Personal Pronouns. As for the "frequency of personal pronouns," it is as high as 13.395% and 14.045% for Materials 1 to 10 and Materials 11 to 20 respectively. This is because the frequencies of YOU and I are rather high in the interviews, as was mentioned before.
Word-length Distribution of Nouns, Verbs, Adjectives, and Adverbs. We also examined the word-length distribution of "nouns," "verbs," "adjectives," and "adverbs." As examples, the results for nouns and adverbs are shown in Fig. 9 and Fig. 10, respectively. Fig. 9 shows that nouns tend to be shorter in the interviews than in the inaugural addresses. For adverbs, on the other hand, the frequency of 4-letter words is rather high in the interview materials: it is as much as 48.837% in Material 1.

Positioning of Each Material
We positioned all 43 materials by performing a principal component analysis on the extracted data using the correlation matrix. The results are shown in Fig. 11. We may assume that the first principal component expresses whether an utterance is directed at the public or at an individual, while the second principal component expresses whether an utterance is broadcast English or speech-style English.

Conclusions
We investigated some characteristics of character- and word-appearance in interviews from Larry King Live on CNN, comparing them with English news and the inaugural addresses of U.S. Presidents. In this analysis, we used an exponential approximation and characterized each material by the coefficients c and b of the equation. Moreover, we calculated the percentage of American basic vocabulary to obtain the difficulty level, as well as the K-characteristic. As a result, it was clearly shown that the interviews have the same tendency as English journalism in character-appearance. Moreover, we could show quantitatively that the interviews are a little easier to listen to than the CNN news.
In the future, we plan to apply these results to education. For example, we would like to measure the effectiveness of teaching some characteristics of English materials before listening or reading them.