Clinical Text Mining for Context Sequences Identification

Abstract. This paper presents an approach based on sequence mining for identifying context models of diseases as described by different medical specialists in clinical text. Clinical narratives contain rich medical terminology, specific abbreviations, and various numerical values, and raw clinical texts usually contain many typos. Because of the telegraphic style of the text and the incomplete sentences, general-purpose part-of-speech taggers and syntax parsers are not effective for processing non-English clinical text. The proposed approach is language independent and is therefore suitable for processing clinical texts in low-resource languages. The experiments are performed on pseudonymized outpatient records in Bulgarian, produced by four different specialists for the same cohort of patients suffering from similar disorders. The results show that the specialty of the physician can be identified from the clinical documents. Even though a similar vocabulary is used in the patient status descriptions, there are slight differences in the language used by different physicians. The depth and detail of the description make it possible to determine different aspects and to identify the focus of the text. The proposed data-driven approach supports automatic classification of clinical text by the specialty of the physician who wrote the document. The experimental results show high precision and recall in the classification task for all classes of specialists represented in the dataset. Compared with a bag-of-words method, the proposed method shows some improvement in the document classification task.


Motivation
Healthcare is a data-intensive domain. Large amounts of patient data are generated on a daily basis. However, more than 80% of this information is stored in unstructured format, as clinical text. Clinical narratives usually consist of sentences in telegraphic style with non-unified abbreviations, many typos, missing punctuation, concatenated words, etc. It is not straightforward to extract patient data in structured format from such messy data. Natural language processing (NLP) of non-English clinical text is a particularly challenging task because of the lack of resources and NLP tools [11]. For the majority of languages there are still no translations of SNOMED, Medical Subject Headings (MeSH) and the Unified Medical Language System (UMLS).
Clinical texts contain complex descriptions of events. Investigating the cumulative effect of all events on the patient status requires a more detailed study of the different ways in which they are described. All physicians use a common vocabulary and terminology to describe organs and systems when examining the human body, but they tend to use different descriptions depending on their specialty. Analyzing complex relations between clinical events will help to test different hypotheses in healthcare and to generate automatically context models for the patient status associated with diagnoses. This is very important in epidemiology and will help in monitoring the complications of some chronic diseases at different stages of their development. The chronic diseases with the highest prevalence are cardiovascular diseases, cancer, chronic respiratory diseases and diabetes. The complications of these chronic diseases develop over time, have a high socioeconomic impact, and are the main reason for more than 70% of mortality cases. This paper presents some results from processing data of patients with Diabetes Mellitus type 2 (T2DM), Schizophrenia (SCH) and Chronic Obstructive Pulmonary Disease (COPD).
We show that data mining and text mining are efficient techniques for identification of complex relations in clinical text.
The main goal of this research is to examine the differences and specificity in patient status descriptions produced by different medical specialists. The proposed data-driven approach is used for automatic generation of context models for the patient status associated with some chronic diseases. The approach is language independent. As an application, the context sequences are used for clinical text classification by the specialty of the physician who wrote the document.
The paper is structured as follows: Section 2 briefly overviews the research in the area; Section 3 describes the data collections of clinical text used in the experiments; Section 4 presents the theoretical background and a formal statement of the problem; Section 5 describes in detail the proposed data mining method for generating context models from clinical text; Section 6 shows experimental results and discusses the application of the method to clinical text classification; Section 7 contains the conclusion and sketches some plans for future work.

Related Work
Data mining methods are widely used in clinical data analysis, both for structured data and for free text [17]. There are two types of frequent pattern mining: frequent itemset pattern mining (FPM) and frequent sequence mining (FSM).
In the first approach the order of the items does not matter, whereas in the second one it does.
For the context modeling task there is some research in other domains. Ziembiński [19] proposes a method that initially generates context models from small collections of data and later summarizes them in more general models. Rabatel et al. [14] describe a method for mining sequential patterns in the marketing domain that takes into account not only the transactions that have been made but also various attributes associated with the customers, such as age and gender. They initially use a classical data mining method for structured data and later add context information by exploring the attributes with a hierarchical organization.
Context models in FPM are usually based on some ontologies. Huang et al. [7] present two semantics-driven FPM algorithms for the prevention and prediction of adverse drug effects by processing Electronic Health Records (EHR). The first algorithm is based on EHR domain ontologies and semantic data annotation with metadata. The second algorithm uses semantic hypergraph-based k-itemset generation. Jensen et al. [8] describe a method for processing free text in EHR in Norwegian. They use NOR-MeSH to estimate the disease trajectories of cancer patients.
One of the major problems with clinical data repositories is that they contain incomplete data about the patient history. Another problem is that the raw data are too noisy and need significant effort for preprocessing and cleaning. The timestamps of the events are uncertain, because the physicians do not know the exact time at which some events occurred. There can be a significant gap between the onset of some diseases and the first record of the diagnosis made by the physician in the EHR. For this reason an FPM method for dealing with temporal uncertainty was proposed by Ge et al. [4]. It is hard to select small representative collections of clinical narratives, because there is a huge diversity of patient status descriptions. Some approaches use FPM, treating the text as a bag of words and losing all grammatical information.
The majority of FSM and FPM applications in health informatics are for pattern identification in structured data. Wright et al. [16] present a method for predicting the next prescribed drug in patient treatment. They use the CSPADE algorithm for FSM of diabetes medication prescriptions. Patnaik et al. [12] present a mining system called EMRView for identifying and visualizing partial order information from EHR, more specifically from ICD-10 codes.
FSM has also been applied to textual data: Plantevit et al. [13] present an FSM method for the biomedical named entity recognition task.
A variety of techniques have been developed for solving the FPM and FSM tasks. Some of them are the temporal abstraction approach for the discovery of medical temporal patterns, one-sided convolutional nonnegative matrix factorization, and symbolic aggregate approximation [15].
Healthcare is considered a data-intensive domain and as such faces the challenges of big data processing. Krumholz [10] discusses the potential and importance of harnessing big data in healthcare for prediction, prevention and the improvement of healthcare decision making.
Many artificial intelligence (AI) approaches have been used successfully and with high accuracy in the classification task [9]: neural networks, naive Bayes classifiers, support vector machines, etc. The main reason for choosing an FSM method is that in healthcare data processing the most important feature of the method is that its results are explainable, i.e. so-called "Explainable AI" [5]. This makes the decision-making process more transparent.

Materials
The experiments use data collections of outpatient records (ORs) from the Bulgarian National Diabetes Register [2].
They are generated from a data repository of about 262 million pseudonymized ORs submitted to the Bulgarian National Health Insurance Fund (NHIF) in the period 2010-2016 for more than 5 million citizens yearly. For reimbursement purposes, the NHIF collects all ORs produced by General Practitioners and by Specialists from Ambulatory Care for every clinical visit of a patient. The collections used for the experiments contain ORs produced by the following specialists: Otolaryngology (S14), Pulmology (S19), Endocrinology (S05), and General Practitioners (S00).
ORs are stored in the repository as semi-structured files in a predefined XML format. The structured information describes the data necessary for health management, such as the visit date and time, pseudonymized personal data and visit-related information, demographic data (age, gender, and demographic region), etc. All diagnoses are given by ICD-10 codes and by their names according to the standard nomenclature. The most important information concerning the patient status and case history is provided as free text.
All experiments use raw ORs without any preprocessing, because of the lack of resources and annotated corpora. The style of the unstructured text is telegraphic, usually with no punctuation and a lot of noise (some words are concatenated; there are many typos, syntax errors, etc.). The Bulgarian ORs contain medical terminology in both Latin and Bulgarian. Some of the Latin terminology is also used in Cyrillic transcription.
ORs contain paragraphs of unstructured text provided as separate XML tags (see Table 1).

Theoretical Background
Let each sentence in a clinical text $e_i$ be split into a sequence of tokens $v = \langle v_1, v_2, \dots, v_m \rangle$ over the vocabulary $W$. The length of a sequence $v$ is $m$ (the number of tokens), denoted $len(v) = m$. We denote by $\emptyset$ the empty sequence (with length zero, i.e. $len(\emptyset) = 0$).
Let $D \subseteq P \times E$ be the set of all sequences in a collection, in the format $\langle pid, sequence \rangle$. We will call $D$ a database.
Let $p = \langle p_1, p_2, \dots, p_m \rangle$ and $q = \langle q_1, q_2, \dots, q_t \rangle$ be two sequences over $W$. We say that $q$ is a subsequence of $p$, denoted by $q \subseteq p$, if there exists a one-to-one mapping $\theta: [1, t] \to [1, m]$ such that $q_i = p_{\theta(i)}$ and, for any two positions $i$ and $j$ in $q$, $i < j \Rightarrow \theta(i) < \theta(j)$.
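The subsequence relation defined above can be illustrated with a minimal Python sketch. The function name `is_subsequence` and the English example tokens are illustrative assumptions, not part of the original method; the greedy left-to-right scan simply constructs one order-preserving mapping θ whenever one exists.

```python
def is_subsequence(q, p):
    """Return True if q is a subsequence of p (order kept, gaps allowed)."""
    remaining = iter(p)
    # `token in remaining` advances the iterator past the first match,
    # so every later token of q must be found strictly after it.
    return all(token in remaining for token in q)


# Illustrative tokens only; real input would be stemmed Bulgarian tokens.
p = ["dry", "cough", "STOP", "fever", "NUM"]
print(is_subsequence(["cough", "fever"], p))
print(is_subsequence(["fever", "cough"], p))
```

The first call succeeds because "cough" precedes "fever" in `p`; the second fails because the order is reversed.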
Each sequential pattern is a sequence. A sequence $A = \langle X_1, X_2, \dots, X_m \rangle$, where $X_1, X_2, \dots, X_m$ are itemsets, is said to occur in another sequence $B = \langle Y_1, Y_2, \dots, Y_t \rangle$, where $Y_1, Y_2, \dots, Y_t$ are itemsets, if and only if there exist integers $1 \le k_1 < k_2 < \dots < k_m \le t$ such that $X_1 \subseteq Y_{k_1}, X_2 \subseteq Y_{k_2}, \dots, X_m \subseteq Y_{k_m}$.
Let $D$ be a database and $Q \subseteq E$ be a sequential pattern. The support of a sequential pattern $Q$, denoted $support(Q)$, is the number of sequences in which the pattern occurs divided by the total number of sequences in the database.
We define the minimal support threshold $minsup$ as a real number in the range $[0, 1]$. A frequent sequential pattern is a sequential pattern whose support is no less than $minsup$.
In our task we look only for the frequent sequential patterns for a given $minsup$.
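The definitions of support and of a frequent pattern can be made concrete with a small sketch. It assumes the simplest case, in which every itemset holds a single token, and it represents the database as a list of (pid, sequence) pairs; the names `occurs_in`, `support` and `is_frequent` and the toy data are hypothetical.

```python
def occurs_in(pattern, sequence):
    """True if pattern occurs in sequence as an order-preserving subsequence."""
    remaining = iter(sequence)
    return all(token in remaining for token in pattern)


def support(pattern, database):
    """Fraction of database sequences in which the pattern occurs."""
    if not database:
        return 0.0
    hits = sum(1 for _pid, sequence in database if occurs_in(pattern, sequence))
    return hits / len(database)


def is_frequent(pattern, database, minsup):
    """A pattern is frequent if its support is no less than minsup."""
    return support(pattern, database) >= minsup


# Toy database of (pid, sequence) pairs; tokens are invented for illustration.
database = [
    ("p1", ["dry", "cough", "fever"]),
    ("p1", ["cough", "STOP", "fever", "NUM"]),
    ("p2", ["fever", "cough"]),
    ("p3", ["headache"]),
]
print(support(["cough", "fever"], database))
print(is_frequent(["cough", "fever"], database, minsup=0.1))
```

Here the pattern occurs in two of the four sequences, so its support is 0.5. A real miner such as CM-SPAM enumerates candidate patterns far more efficiently than this brute-force check.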

Method
Initially we generate collections $S_1, S_2, \dots, S_r$ of ORs from the repository, using the structured data about the specialists who wrote them. We define vocabularies $W_1, W_2, \dots, W_r$.
The processing of the collections is organized as a pipeline (see Fig.1). The first step splits each collection into two subsets: one that contains only the Anamnesis of the patients ($SA_i$) and another one ($SH_i$) for their Status. Each of these subsets is processed independently. For all collections we define vocabularies $WA_1, WA_2, \dots, WA_r$ and $WH_1, WH_2, \dots, WH_r$ for each of these subsets correspondingly. The next step converts the free text from the ORs into a database (see Fig.2). After tokenization, stemming is applied. All stop words are replaced by the terminal symbol STOP. ORs contain many numerical values, such as clinical test results and vitals (Body Mass Index, Riva-Rocci blood pressure), etc. Numerical values are replaced by the terminal symbol NUM. The sentences are mainly in telegraphic style, or the information is described as a sequence of phrases separated by semicolons. We consider those phrases as sentences. Sentence splitting is applied to construct sequences of itemsets for each document. In this process all additional punctuation is removed. The negative number -1 is used to separate the sentences, and -2 is used to denote the end of the text. The last stage of the preprocessing is hashing, whose purpose is to speed up the process of frequent sequence mining. In the hashing phase each word is replaced by a unique numerical ID.
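The preprocessing steps described above can be sketched as follows. This is a minimal illustration under several assumptions, not the authors' implementation: the stop-word list is hypothetical, `stem` is a placeholder that only lower-cases (a real pipeline would use a language-specific stemmer), and a -1 is written after every sentence, including the last one, before the closing -2.

```python
import re

# Hypothetical stop-word list; a real one would be language specific.
STOP_WORDS = {"with", "and", "of", "the"}

NUMBER = re.compile(r"\d+(?:[.,]\d+)?")
TOKEN = re.compile(r"\d+(?:[.,]\d+)?|\w+")
# Split on semicolons, or on . ! ? followed by whitespace or end of text,
# so that decimal numbers such as 31.5 stay in one piece.
SENTENCE_END = re.compile(r";|[.!?](?:\s+|$)")


def stem(token):
    """Placeholder for a real stemmer: only lower-cases the token."""
    return token.lower()


def normalise(token):
    """Map numbers to NUM, stop words to STOP, and stem everything else."""
    if NUMBER.fullmatch(token):
        return "NUM"
    if token.lower() in STOP_WORDS:
        return "STOP"
    return stem(token)


def encode(text, vocab):
    """Encode one text as integer IDs, with -1 after each sentence and -2 at the end.

    `vocab` maps each normalised token to its ID and grows as new tokens appear.
    """
    encoded = []
    for phrase in SENTENCE_END.split(text):
        tokens = [normalise(t) for t in TOKEN.findall(phrase)]
        if not tokens:
            continue  # skip empty fragments left by the splitter
        for token in tokens:
            encoded.append(vocab.setdefault(token, len(vocab) + 1))
        encoded.append(-1)
    encoded.append(-2)
    return encoded


vocab = {}
print(encode("Dyspnea with cough; RR 130/80. BMI 31.5", vocab))
print(vocab)
```

Keeping the vocabulary in a plain dictionary makes the later postprocessing step, which replaces the IDs by the original words, a simple reverse lookup.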
Frequent sequence mining is performed with the CM-SPAM algorithm [3], a more efficient variation of the SPAM algorithm [1], which is considered one of the fastest algorithms for sequential mining. CM-SPAM is even faster than SPAM, but more important is that CM-SPAM is more efficient for low values of $minsup$. This matters because some cases in clinical text are infrequent: the prevalence of the diseases is usually low in comparison with the frequencies encountered in other domains. The $minsup$ values for clinical data are usually in the range $[0.01, 0.1]$. The last step is the postprocessing phase (see Fig.3), which starts by replacing the hashed IDs with the original words. Then we identify the unique vocabulary for each collection. Let $FA_1, FA_2, \dots, FA_r$ and $FH_1, FH_2, \dots, FH_r$ be the frequent sequences generated at step 3. We filter out every sequence that occurs in some sequence of the other sets, or in which a frequent sequence from another collection occurs.
The frequent sequence sets filtered in this way, together with the unique words, form the specific terminology and sub-language used by the different specialists in the description of the patient disease history and status.
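The filtering step can be sketched as below. This is one possible reading of the description, assuming that each collection's frequent sequences are available as a set of token tuples keyed by specialty code; the function names and the example sequences are hypothetical.

```python
def is_subsequence(q, p):
    """True if q occurs in p as an order-preserving subsequence."""
    remaining = iter(p)
    return all(token in remaining for token in q)


def filter_specific(frequent):
    """Keep only the sequences that are specific to their own collection.

    A sequence is dropped if it occurs in a frequent sequence of another
    collection, or if a frequent sequence of another collection occurs in it.
    """
    filtered = {}
    for name, sequences in frequent.items():
        others = [t for other, ts in frequent.items() if other != name for t in ts]
        filtered[name] = {
            s
            for s in sequences
            if not any(is_subsequence(s, t) or is_subsequence(t, s) for t in others)
        }
    return filtered


# Invented example sequences for two specialties.
frequent = {
    "S05": {("thyroid", "gland"), ("blood", "sugar", "NUM")},
    "S19": {("vesicular", "breathing"), ("blood", "sugar")},
}
print(filter_specific(frequent))
```

In this toy input the two "blood sugar" sequences eliminate each other, because one occurs in the other, so each specialty keeps a single specific sequence. The pairwise comparison is quadratic in the number of sequences, which is acceptable for a sketch but would need indexing for large result sets.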

Experiments and Results
For a cohort of 300 patients suffering from T2DM and COPD, we extracted the ORs for all their clinical visits in a 3-year period (2012-2014) to different specialists: Otolaryngology (S14), Pulmology (S19), Endocrinology (S05), and General Practitioners (S00). After preprocessing of the ORs, the Anamnesis and Status descriptions for each patient are extracted separately in all collections (Table 2 and Table 3).
The minsup values were set as a relative minsup, a function of the ratio between the number of patients and the number of ORs. It is approximately 0.02% for the smallest set SA14, 0.03% for SA05 and SA19, and 0.1% for the largest set SA00. This is a rather small minsup value that guarantees coverage even of rarer cases, but with sufficient support. For the Status subsets the minsup values were set in a similar range: 0.05% for SH14 and SH19, 0.08% for SH05 and 0.09% for SH00.
All subsets are processed with CM-SPAM for frequent sequence mining. In addition, the dEclat algorithm [18] for frequent itemset mining was applied. The frequent itemsets were filtered with a method similar to the one used for the frequent sequences (see Table 2 and Table 3). The experiments use the Java implementations of the algorithms from SPMF (Open-Source Data Mining Library).
The datasets for Anamnesis are sparse, because they contain descriptions of different patient disease histories, complaints, and risk factors. The diversity of the explanations therefore leads to a lower number of generated frequent sequences and a larger unique vocabulary (see Fig.4 and Table 2). The unique vocabulary contains different complaints and many informal words used to explain them. Although the set SA00 is larger than the other sets, a lower number of frequent sequences is generated for it. This set corresponds to the ORs written by general practitioners, who usually observe a larger set of diseases than other specialists. The set SA05 contains more consistent information, about the T2DM complaints only.
In contrast, the datasets for Status are dense, because they contain descriptions of the status of a predefined set of organs and systems. The Status explanation usually contains phrases rather than sentences. Each phrase describes a single organ or system and its current condition. The similarity between the Status explanations leads to a significant growth in the number of generated frequent sequences and a smaller unique vocabulary (see Fig.5 and Table 3). Despite the higher number of generated frequent sequences, these sets shrink faster during the filtering process, because they contain similar subsequences. The unique vocabulary contains specific terminology for the organs and systems that are the main focus of interest for the physician who performs the medical examination. The resulting set of context sequences contains only the specific sub-language used by the specialists in their area.
The extracted frequent sequences and frequent itemsets are used for multi-class text classification. The experiments are performed using non-exhaustive cross-validation (5 iterations on sets in a 7:1 ratio of training to test). The obtained results are compared with a bag-of-words (BOW) method that applies the frequent itemsets generated by the dEclat algorithm.
The classification is based on the unique vocabulary used for the classes and on the filtered sequences and frequent itemsets from all classes that match the text. The specialty codes from the structured data of the ORs are used as the gold standard in the evaluation.
Six types of experiments are performed. In the first task, the Anamnesis subsets are used for all four specialty classes 00, 05, 14 and 19. The evaluation results (Table 4) for the F1 measure, $F_1 = 2 \cdot Precision \cdot Recall / (Precision + Recall)$, show that the context sequences method outperforms the BOW method for all classes except class 19 for the Anamnesis subsets. The evaluation for the Status section classification shows just the opposite (Table 6): the BOW method gives better results than context sequences. The main reason is that the Status section is written in telegraphic style, with phrases rather than full sentences. The Status section usually contains a sequence of attribute-value (A-V) pairs: an anatomical organ or system and its status or condition. General practitioners use in their ORs terminology and phrases that can be found in the ORs of all specialties. Thus the class 00 is not disjoint from classes 05, 14 and 19, and it is one of the main reasons for misclassification. Another experiment was performed with "pure" classes, including only 05, 14, and 19.
Finally, the classification of both sections, Anamnesis and Status, is used for classification of the outpatient record as a whole document. The evaluation results (Table 8) show that the results for context sequences drop and the BOW method performs better. After eliminating the noisy set S00, the results (Table 9) for the context sequences method improve significantly and outperform the BOW method for all three classes 05, 14 and 19.

Table 8. Evaluation of rules for ORs for S00, S05, S14 and S19

| | Context Sequences S05 | Context Sequences S14 | Context Sequences S19 | BOW S05 | BOW S14 | BOW S19 |
|---|---|---|---|---|---|---|
| Precision | 1.0000 | 1.0000 | 0.9981 | 0.9645 | 1.0000 | 1.0000 |
| Recall | 0.9987 | 1.0000 | 1.0000 | 1.0000 | 0.9872 | 0.9492 |
| F1 | 0.9994 | 1.0000 | 0.9991 | 0.9819 | 0.9935 | 0.9739 |

Table 9. Evaluation of rules for ORs for S05, S14 and S19
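As a quick consistency check, the F1 measure can be recomputed from the precision and recall values reported for the three-class experiment (Table 9). The sketch below is illustrative; the function name `f1_score` is an assumption. Because the published precision and recall are already rounded to four decimals, the recomputed values are expected to agree with the reported F1 only to within about 0.0001.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as given for S05, S14 and S19 in Table 9.
context_sequences = [(1.0000, 0.9987, 0.9994), (1.0000, 1.0000, 1.0000), (0.9981, 1.0000, 0.9991)]
bag_of_words = [(0.9645, 1.0000, 0.9819), (1.0000, 0.9872, 0.9935), (1.0000, 0.9492, 0.9739)]

for precision, recall, reported in context_sequences + bag_of_words:
    recomputed = f1_score(precision, recall)
    # Compare with a tolerance that allows for rounding of the inputs.
    print(f"{recomputed:.4f} vs {reported:.4f}", abs(recomputed - reported) < 1e-4)
```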

Conclusion and Further Work
The proposed data-driven method is based on data mining techniques for identifying context sequences in clinical text depending on the medical specialty of the doctor. The method is language independent and can be used for low-resource languages. The huge number of generated frequent sequences is reduced during the filtering process. The experimental results show that the context sequences method outperforms the BOW method for sparse datasets in the classification task.
Using a "human-in-the-loop" [6] approach, further analysis of the domain significance of the generated frequent sequences and of the misclassified documents will be beneficial. The space of clinical events is too complex. Thus "human-in-the-loop" can also be applied to the subclustering task, using the patient age, gender and demographic information. Reducing the dimensionality will help to determine different context sequences depending on the patient phenotype.
Further work also includes the task of measuring the similarity of context sequences, which can be used to identify synonyms and semantically close phrases.