Predicting Chronic Heart Failure Using Diagnoses Graphs

Predicting the onset of heart disease is of obvious importance as doctors try to improve the general health of their patients. If it were possible to identify high-risk patients before their heart failure diagnosis, doctors could use that information to implement preventative measures to keep a heart failure diagnosis from becoming a reality. Integration of Electronic Medical Records (EMRs) into clinical practice has enabled the use of computational techniques for personalized healthcare at scale. The larger goal of such modeling is to pivot from reactive medicine to preventative care and early detection of adverse conditions. In this paper, we present a trajectory-based disease progression model to detect chronic heart failure. We validate our work on a database of Medicare records of 1.1 million elderly US patients. Our supervised approach allows us to assign a likelihood of chronic heart failure to an unseen patient’s disease history and to identify key disease progression trajectories that intensify or diminish said likelihood. This information can help patients and doctors understand which diagnoses are most dangerous for those who are susceptible to heart failure. Using our model, we demonstrate some of the most common disease trajectories that eventually result in the development of heart failure.


Introduction
Today the healthcare industry finds itself at the precipice of a significant change, as the past decade has seen the adoption and integration of electronic medical records (EMRs) into clinical practice. Beyond the logistical benefits of maintaining and organizing patients' medical data, clinicians and researchers can perform novel research using these secondary data sources [1][2][3]. In fact, the ability of EMRs to provide a computationally accessible set of structured data representing the expansive healthcare feature space has fueled the emergence of a variety of informatics tools, ranging from early clinical decision support systems, to statistical analysis, to predictive analytics aimed at identifying patients at risk for readmission [4,5].
Building on the success of these tools, many researchers see healthcare informatics as the junction with a parallel line of clinical research: the shift from reactive to preventative medicine. Medical research is itself an evolving field, and has advanced in parallel with the emergence of EMRs. Clinicians have put forth a strong effort in advancing the care paradigm from reactive medicine, where clinicians treat the conditions currently afflicting a patient, to preventative care, where clinicians undertake courses of action "for the purpose of preventing disease or detecting it in an asymptomatic stage" [6]. As such, the early detection and treatment of adverse health conditions represents an exciting opportunity for the informatics community. Others have found that a combination of research areas, including, but not limited to, graph-based data mining, entropy-based data mining, and topological-based data mining, works best for knowledge discovery and towards an end goal of supplementing human learning with machine learning [7]. Eventually the goal is to have P4-medicine (predictive, preventative, participatory, personalized) available for all patients by using big data and the combined human-computer interaction and knowledge discovery/data mining approach [7].
A number of works have built on this foundation, focusing on predictive tasks ranging from disease prediction to the prediction of breast cancer survivability [8,9]. However, these tools suffer from a fundamental flaw: they identify patients' health conditions as isolated events, i.e. a disease will occur in a patient's future medical chart, or a patient will recover from early stage breast cancer. One must remember that an individual's health condition does not consist only of the moments when doctors measure it in a clinical environment. Although the rate of onset may vary, the progression of disease represents a highly fluid state. As such, it may be more valuable to view these patients' conditions as trajectories, rather than binary events.
While this seems like a significant shift in thinking, medical subfields have already established the concept of a disease trajectory, sometimes denoted as disease 'progression'. In particular, research on neurodegenerative disorders such as Parkinson's and Alzheimer's has established this concept quite well [10,11]. More recently, the trajectory concept has begun expanding into the general healthcare population. Many clinicians have long postulated that an underlying progression of related diagnoses may underlie diagnoses with which we do not explicitly associate a temporal aspect. Today, the data collected through expanding EMR systems allows researchers to examine such hypotheses in detail. Jensen et al. have provided perhaps one of the best examples to date: they successfully extracted diagnosis trajectories by analyzing millions of longitudinal patient records and utilized a novel way of describing biological disease progression [12].
In this work, we build on this concept and present a novel graph-based diagnosis trajectory model. While recent advances have taken what is effectively an "unsupervised" approach to trajectory discovery, we aim to provide a target-based "supervised" methodology. We begin with a discussion of the methodology used in constructing the underlying diagnosis graph. From here, we discuss utilizing the temporal relations extracted from the graph, showing that we can identify paths that significantly differentiate the occurrence of the target diagnosis. Finally, we provide a case study of the methodology in relation to patients with congestive heart failure.

Data Description
Electronic Medical Records (EMRs) log information on patients in the form of diagnosis codes for each of their visits. This log effectively narrates a patient's medical history as identified by medical practitioners and can help predict their future health outcomes. Here we describe our data source, the data structure, how chronic heart failure appears in these diagnosis logs, and how prevalent it is among our patients.
Provenance. Our data comes from the Medicare records of 1,145,541 elderly patients in the United States. The accuracy and completeness of these records make them invaluable to demographic and epidemiological research [9,13,14]. The data is completely anonymized, both in terms of the patients and the healthcare providers. For a given patient, we applied a threshold of a maximum of 5 in-patient visits, and each visit corresponds to a maximum of 10 diagnosis codes from the International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM). These ICD-9-CM codes are designed to convey an intrinsic hierarchy of diagnosis detail: the full 5-digit code represents the specific condition, location and/or severity, and its leading 3 digits represent the medical diagnosis family. This "code collapse" [9] helps us identify the family of patients who develop heart failure in our data.
Identifying heart failure. We observe ground truth evidence of the presence or absence of heart failure in the ICD-9-CM diagnoses of individual patients. The family of '428.xy' codes covers all diagnoses for heart failure. Specific examples of the 428 diagnosis family include Systolic heart failure (428.2), which breaks down into Systolic heart failure, unspecified (428.20), Acute systolic heart failure (428.21), Chronic systolic heart failure (428.22) and Acute on chronic systolic heart failure (428.23). We labeled as 'HF' (heart failure) all patients for whom we observed the '428' diagnosis family, and labeled the rest as 'NHF' (non-heart failure). Table 1 shows a sample patient's medical history. Here we see the chronological history of the patient through each successive visit expressed in terms of full ICD-9-CM codes. For each visit, the first code is the principal diagnosis, followed by any secondary diagnoses made during that visit. The data presents these diagnoses in their full ICD-9-CM form, where 733.00 represents Osteoporosis, unspecified. Some diagnoses, such as Pathologic fracture (733.1) and Nutritional marasmus (261), use fewer than the maximum 5 digits in ICD-9-CM. Using Table 1 as an example, we see that in visit #5, the patient was diagnosed with Congestive heart failure, unspecified (428.0) and therefore belongs to the class 'HF'.
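The labeling rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the list-of-visits layout, and the assumption that codes are stored with their decimal point (as in '428.22') are all hypothetical.

```python
HF_FAMILY = "428"  # ICD-9-CM family covering all heart failure diagnoses


def icd9_family(code: str) -> str:
    """Collapse a full code such as '428.22' to its family '428'.

    Assumes the decimal point is present in the stored code (hypothetical layout).
    """
    return code.strip().split(".")[0]


def label_patient(visits: list[list[str]]) -> str:
    """Return 'HF' if any visit contains a 428-family code, otherwise 'NHF'."""
    for visit in visits:
        if any(icd9_family(code) == HF_FAMILY for code in visit):
            return "HF"
    return "NHF"
```

For a made-up history such as `[["733.00", "261"], ["428.0"]]`, the second visit contains a 428-family code, so the patient would be labeled 'HF'.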

Summary Statistics.
The EMR data used in this study covers 1,145,541 elderly Medicare patients over the course of 5,727,705 total visits. Over the course of these visits, the patients registered a total of 12,396 unique ICD-9-CM diagnosis codes, which represent 1,064 families of 3-digit collapsed codes. This set of patients exhibits a heart failure rate of 46.6%, which is extremely high compared to the general United States population, in which about 5.7 million adults (2.2%) have heart failure [15]. However, some have observed the overall prevalence of heart failure in elderly patients in the United States to be as high as 10.6 to 13.5% (Chart 20-2 [15]). Since our study focuses on patients on Medicare, this number is further amplified.
Experiments. Based on this EMR data, we group our analysis into two distinct phases: (1) building a representational model for heart failure and (2) predicting heart failure outcomes for unseen patients. First, we infer the nature of disease progression for patients with and without observed heart failure. Based on the learned model, we identify trajectories, individual diagnoses, and edges that give the best indication of heart failure. We then use this model on previously unseen patients to predict whether they will develop heart failure, and validate these predictions against observed ground truth data about these test patients.

Building a Representational Predictive Model
Researchers have built contemporary disease progression models using patient data with already known target outcomes [10,11]. In contrast, our supervised approach helps contextualize disease progression trajectories against an in-situ control set of patients, i.e. our data contains trajectories followed by patients who eventually were diagnosed with heart failure and by those who were not. This approach highlights the diagnosis trajectories that intensify or diminish the likelihood of heart failure in patients. The identification of such divergence in diagnoses helps separate signals for heart failure from overall population trends. In this section, we describe how we restructure Medicare EMR data to obtain supervised disease progression trajectories. We then merge individual trajectories into a compact Directed Acyclic Graph to model class-aware, population-wide trends in diagnoses. Using this model, we identify key differentiating diagnoses and trajectories that help separate patients who are likely to develop heart failure from those who are not.
Preprocessing. We transform the data from raw medical histories shown in Table 1 to extract class-aware trajectories for patients using the following steps. We first collapse the diagnosis codes to their 3 digit counterparts, then eliminate duplicate families of diagnoses and then decouple the diagnosis history used for prediction from the observed outcome. Table 2 shows the result of applying this preprocessing to the example history from Table 1.

1. Removing patients who receive a heart disease diagnosis on their first visit: Out of the 46.6% of the patients in our dataset who develop heart failure, 18.0% receive a heart failure diagnosis in their very first visit. Since this study revolves around the concept of diagnoses leading up to heart failure, we consider these patients out of scope for our training and testing data. This removal of heart failure cases reduces the rate of observed heart failure in the rest of the patients down to 34.8% from the original 46.6%.
2. Decoupling input data and target labels: In patients with heart failure, we right-censor the diagnosis data when the first '428' code appears. This ensures that there is no "data leakage", i.e. we do not predict heart failure based on an observed diagnosis of heart failure since it is a chronic condition.
3. Pre-pruning diagnoses and pathways: To mitigate the impact of spurious/noisy disease trajectories in our analysis, we set a minimum support threshold of 100 for the nodes and edges in our graph. By imposing this threshold, we ensure that none of the diagnoses or the pathways between them draws conclusions from a set of fewer than 100 patients out of a total sample size of 1.1M patients.
4. Code Collapse: The original data contains 12,396 "Minor Category" diagnosis codes, whereas our analysis targets the "Major Category" outcome (heart failure). Collapsing the 5-digit diagnosis codes down to their respective 3-digit major categories helps reduce the complexity of the problem and matches the granularity of the observed outcome. As a result, we now use 1,064 diagnosis families to chart patient trajectories, which is 8.6% of the original complexity.
5. Removing duplicate diagnoses: We only consider new and previously unobserved diagnoses in our analyses. In Table 1, this means that we consider diagnoses for Other disorders of bone and cartilage (733) only for their first visit. We hope to address the trade-off of dropping duplicate diagnoses in Future Work.
6. Removing superfluous diagnoses: ICD-9-CM diagnosis codes starting with V (Supplementary classification of factors influencing health status and contact with health services) and E (External causes of injury) reveal little about the progression of disease and were taken out of the graph.
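The per-patient steps above (1, 2, 4, 5 and 6) can be sketched as a single pass over a patient's visits. This is a hedged illustration under several assumptions that the text does not settle: the function and variable names are hypothetical, codes are assumed to carry their decimal point, and the entire visit containing the first '428' code is assumed to be dropped when right-censoring. Step 3 (minimum support) applies to the merged graph and is not shown here.

```python
from typing import Optional

HF_FAMILY = "428"


def family(code: str) -> str:
    """Step 4: collapse a full ICD-9-CM code to its leading family, e.g. '733.00' -> '733'."""
    return code.strip().split(".")[0]


def preprocess(visits: list[list[str]]) -> Optional[tuple[list[list[str]], str]]:
    """Return (censored history, label), or None when the patient is out of scope."""
    seen: set[str] = set()
    history: list[list[str]] = []
    for index, visit in enumerate(visits):
        families = [family(code) for code in visit]
        if HF_FAMILY in families:
            if index == 0:
                return None           # step 1: heart failure on the very first visit
            return history, "HF"      # step 2: right-censor at the first 428 code
        kept = []
        for fam in families:
            if fam.startswith(("V", "E")):
                continue              # step 6: drop V and E supplementary codes
            if fam in seen:
                continue              # step 5: keep only first occurrences
            seen.add(fam)
            kept.append(fam)
        history.append(kept)
    return history, "NHF"
```

A visit whose diagnoses were all seen before becomes an empty list in this sketch; whether such visits should be kept as empty layers or skipped is a design choice the text leaves open.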

Disease Progression for Individual Patients.
For the example patient in Table 2, we can now create a disease progression history based on their diagnoses (Fig. 1). Each node represents a diagnosis and each edge e(i, j) represents a potential transition from diagnosis i to j across successive patient visits. Each of these edges is strictly directed from a diagnosis in visit (t − 1) to a diagnosis in visit t, and nodes in the same visit do not have edges between them. This makes the graph a Directed Acyclic Graph (DAG). Each patient in our training data has an outcome label associated with them, which we use as a label for each of these patient-centric DAGs (Fig. 2).
[...] outcome for an unseen patient. Within our patient population, we perform a 5-fold cross-validation, training on 80% of the patients and testing the model on the other 20% in each of the 5 folds. As shown in Fig. 2, we simply combine observed nodes and edges across all training set patients and create a composite DAG. The nodes at each level represent the superset of possible diagnoses at that visit, and the edges between levels represent the observed transitions between diagnoses.
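As an illustration of the edge rule described above, the following sketch enumerates the edges of one patient's DAG. The representation of a node as a `(family, visit_number)` tuple and the function name are assumptions made for this example only.

```python
from itertools import product


def patient_edges(history: list[list[str]]) -> list[tuple[tuple[str, int], tuple[str, int]]]:
    """Connect every diagnosis in visit t-1 to every diagnosis in visit t.

    Nodes are (family, visit_number) tuples with 1-based visit numbers.
    No edges are created between diagnoses of the same visit, so the
    result is acyclic by construction.
    """
    edges = []
    for t in range(1, len(history)):
        sources = [(code, t) for code in history[t - 1]]
        targets = [(code, t + 1) for code in history[t]]
        edges.extend(product(sources, targets))
    return edges
```

Because edges only ever point from one visit to the next, a patient with a single visit contributes nodes but no edges.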

A Class-Aware
The weight of each edge e(i, j) corresponds to the observed confidence of heart failure among patients who were diagnosed with i and then j. Similarly, we assign a weight to each node n representing the observed confidence of heart failure for that node. A patient can have multiple diagnoses within the same visit, each of which adds to the support of the corresponding nodes and edges. As a result, the total incoming support of a node is not guaranteed to equal its total outgoing support. Another salient feature of this model is that it can distinguish the same diagnosis code between visits. For example, if one observes code '261' in visit #1 and #2 for different patients, we create nodes labeled '261_1' and '261_2'.
Model Inference. The overall trained model is an interconnected representation that contains 1,974 nodes and 26,229 edges, with an average in-degree and out-degree of 13.29. The degree distribution of the trained diagnosis graph is given in Fig. 3. Relatively few nodes and edges carry a high likelihood of heart failure, as seen in Fig. 4. These high-confidence nodes and edges indicate underlying diagnoses and trajectories that lead to high rates of heart failure.
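The node and edge weights described above can be sketched as simple counts over the training patients. This is an assumption-laden illustration rather than the authors' code: the names are hypothetical, nodes are `(family, visit_number)` tuples, support is the number of patients observed on a node or edge, and confidence is the fraction of those patients labeled 'HF'. The `min_support` parameter mirrors the threshold of 100 used in pre-pruning.

```python
from collections import defaultdict
from itertools import product


def build_composite(patients, min_support=100):
    """Merge per-patient DAGs into {key: (support, hf_confidence)} maps.

    `patients` is an iterable of (history, label) pairs, where history is a
    list of visits and each visit is a list of de-duplicated diagnosis families.
    """
    node_counts = defaultdict(lambda: [0, 0])  # key -> [support, HF count]
    edge_counts = defaultdict(lambda: [0, 0])
    for history, label in patients:
        is_hf = int(label == "HF")
        layers = [[(code, t) for code in visit]
                  for t, visit in enumerate(history, start=1)]
        for layer in layers:
            for node in layer:
                node_counts[node][0] += 1
                node_counts[node][1] += is_hf
        for prev, curr in zip(layers, layers[1:]):
            for edge in product(prev, curr):
                edge_counts[edge][0] += 1
                edge_counts[edge][1] += is_hf

    def finalize(counts):
        # Keep only entries that meet the minimum support threshold.
        return {key: (n, hf / n) for key, (n, hf) in counts.items() if n >= min_support}

    return finalize(node_counts), finalize(edge_counts)
```

Keying nodes by visit number is what lets the same diagnosis family appear as distinct nodes in different layers.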
We identify nodes and edges with a high propensity for heart failure in Tables 3 and 4 respectively. These nodes and edges describe diagnostic pathways that indicate heart failure. In addition to extremely high likelihood of heart failure, we also identify diagnoses that effectively discern heart failure. For this, we use information gain (or InfoGain, for short) for successive diagnoses in patients. A higher information gain indicates a higher class polarization between heart failure and non-heart-failure. Information Gain (IG) is the reduction in entropy for a given edge in our model. Specifically, an edge e(i, j) with high information gain indicates that the diagnosis of j after i leads to a higher confidence of arriving at either class. We compute each node's entropy from its class distribution, and the information gain of an edge as the difference between the entropies of its endpoints:

\[ H(n) = -\sum_{k \in \{HF,\, NHF\}} p_k(n) \log p_k(n), \qquad IG(i, j) = H(i) - H(j) \]

where i and j are the respective source and destination diagnoses, k ∈ {HF, NHF}, and p_k(.) represents the probability of observing class k in a given node. Higher values of IG are more helpful in our search for markers that intensify the propensity for heart failure. The sheer abundance of pathways towards non-heart-failure outcomes eclipses the relatively low InfoGain of individual edges which intensify heart failure. These are the majority of the edges which form the positive side of Fig. 5a. However, we are primarily interested in edges which intensify the likelihood of heart failure. To achieve this, we artificially penalize InfoGain in edges which have a higher likelihood of non-heart failure by simply making them negative. This isolates and highlights heart-failure intensifying pathways in the network. Figure 5b shows how this transformation affects the edge attributes and isolates the relatively few edges which exhibit a high InfoGain favoring heart failure.

Fig. 5. Information Gain in Edges: We compute the information gain for each edge and use it as an edge attribute. By making information gain class aware, we treat non-heart-failure intensifying edges as negative information gains. This makes it easier to isolate signals for heart failure intensifying paths.
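The class-aware information gain can be sketched as follows. The sketch makes three assumptions that the text does not state: entropy is measured in bits (log base 2), an edge counts as favoring non-heart failure when its heart failure confidence is below 0.5, and "making them negative" is implemented as the negated absolute value. All names are hypothetical.

```python
import math


def entropy(p_hf: float) -> float:
    """Binary Shannon entropy, in bits, of a node whose P(HF) is p_hf."""
    total = 0.0
    for p in (p_hf, 1.0 - p_hf):
        if p > 0.0:  # the limit of p * log(p) as p -> 0 is 0
            total -= p * math.log2(p)
    return total


def class_aware_info_gain(p_source: float, p_target: float, edge_confidence: float) -> float:
    """IG(i, j) = H(i) - H(j), forced negative for edges that favor non-heart failure.

    p_source and p_target are P(HF) at the source and destination nodes;
    edge_confidence is the observed HF confidence of the edge itself.
    """
    gain = entropy(p_source) - entropy(p_target)
    if edge_confidence < 0.5:  # assumed criterion for a non-heart-failure edge
        return -abs(gain)
    return gain
```

With this convention, an edge that moves from a maximally uncertain node (P(HF) = 0.5) to a pure heart failure node scores +1 bit, while the mirror-image edge towards a pure non-heart-failure node scores -1 bit.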
The above steps outline how we process raw patient records into a supervised representational model. This graphical model not only amalgamates patient disease trajectories, but it also highlights key pathways leading to heart failure.

Predicting Heart Failure for an Unseen Patient
Now that we have a representational and interpretable model to predict heart failure, we use it to predict outcomes for our held-out test dataset. We describe how we convert a new patient's diagnosis history into predicted probabilities and how we evaluate these predicted outcomes.

How to Predict.
Given a test patient's diagnosis history, we replicate the steps in the training section to arrive at a graph similar to Fig. 1. Here, we make an important assumption about the nature of our model: we assume that the probabilities at each stage obey a Bayesian model. Using the probabilities from our trained model, we can predict the relative odds of heart failure and non-heart failure by simply multiplying the class-wise probabilities for each edge and normalizing them. Given a test patient with a disease progression graph G_test, the unnormalized value is

\[ P^{*}(Y = HF) = \prod_{e(i, j) \in G_{test}} p(e(i, j)) \]

where each p(e(i, j)) is taken from the trained graph G_trained. We then similarly compute the unnormalized value P^{*}(Y = NHF) and finally output the normalized value of P(Y = HF).
In this work, we assume a Markovian model when using the Bayesian network structure to model disease progression. This means that dependencies and graph attributes (for instance, support and confidence) do not extend beyond immediate descendants directly in our model, i.e. A|B and B|C can model disease progression, but not A|B, C directly. In future work, the model can extend to include higher-order dependencies [16]. This would enable us to model dependencies of the form A|B, C and beyond.
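The prediction step can be sketched as a product of per-edge class probabilities followed by normalization. Several details here are assumptions rather than statements from the text: the non-heart-failure probability of an edge is taken to be one minus its heart failure confidence, edges absent from the trained graph fall back to a neutral 0.5, probabilities are clipped away from 0 and 1, and the product is accumulated in log space for numerical stability. The function and parameter names are hypothetical.

```python
import math


def predict_hf(test_edges, edge_confidence, default=0.5, eps=1e-6):
    """Return the normalized P(Y = HF) for one test patient.

    test_edges: iterable of edge keys from the patient's progression graph.
    edge_confidence: mapping from edge key to trained HF confidence.
    """
    log_hf = 0.0
    log_nhf = 0.0
    for edge in test_edges:
        p = edge_confidence.get(edge, default)  # assumed fallback for unseen edges
        p = min(max(p, eps), 1.0 - eps)         # avoid log(0)
        log_hf += math.log(p)
        log_nhf += math.log(1.0 - p)
    # Normalize the two unnormalized products without underflow.
    shift = max(log_hf, log_nhf)
    hf = math.exp(log_hf - shift)
    nhf = math.exp(log_nhf - shift)
    return hf / (hf + nhf)
```

Under the first-order (Markovian) assumption, each edge contributes independently, so a patient with no edges at all receives the neutral value 0.5 in this sketch.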
Evaluation and Baselines. We compute the above probabilities for all patients in the test set and evaluate them against their true observed outcomes. We then compute the Receiver Operating Characteristic (ROC) curve in terms of False Positive Rate and True Positive Rate for these predictions. Our key prediction metric is the area under the ROC curve (AUROC); a higher AUROC indicates superior predictive performance.
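For reference, AUROC can be computed directly from its rank interpretation: it equals the probability that a randomly chosen heart failure patient receives a higher score than a randomly chosen non-heart-failure patient, with ties counted as one half. The sketch below is a generic, dependency-free illustration of that definition and is not taken from the paper; the pairwise loop is quadratic, so a sorting-based routine or an established library would be preferable at the scale of this dataset.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels: sequence of 1 (HF) or 0 (NHF); scores: predicted P(Y = HF).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))
```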
How soon can we predict Heart Failure? Each visit in our trained model is represented by a layer of nodes in the DAG. A prediction made using the first i layers of the DAG corresponds to a prediction made on i visits of an unseen test patient. Deliberately pruning the number of layers in the trained DAG is equivalent to reducing the complexity of our trained model and being able to predict our target outcome earlier. In order to see if this trade-off negatively influences predictive performance, we test the unseen patient histories on DAGs pruned to predict heart failure from 1, 2, 3 and 4 (maximum number of test visits in our data) visits and evaluate their area under the ROC curve.

Results
The techniques described above cover three key aspects of our research. First, we train a class-aware model of heart failure from patient history data. Second, we interpret the model to identify key diagnoses and disease progression pathways which intensify or mitigate the chances of a given patient developing heart failure. Third, we show how this model performs when predicting heart failure outcomes for a completely unseen set of patients.

Model Inference.
Looking at the highest confidence nodes for heart failure in Table 3, several common themes appear in diagnoses that tend to precede heart failure, namely rheumatic fever, pulmonary congestion, cardiomyopathy, blood poisoning, kidney disease, hypertension, and aortic and mitral valve disease. For these diagnoses, it does not appear to matter much whether a patient is diagnosed at visit 1 or 2: the progression to heart disease seems to occur at about the same confidence levels. By absolute numbers in this data set, the diagnoses that lead to heart failure the most are cardiomyopathy and aortic and mitral valve diseases. Rheumatic heart disease and pulmonary congestion patients appear less often than the former three in the data set, but have a higher probability of progressing to heart failure. Referring to the highest confidence edges for heart failure given in Table 4, we can see that many of the destination diagnoses match the diagnoses given in the high confidence nodes in Table 3. The edges give us some idea about which earlier diagnoses may lead to heart disease given a later diagnosis. For instance, the high confidence nodes in Table 3 told us that diagnoses such as rheumatic fever, pulmonary congestion, and cardiomyopathy lead to heart failure. 14 out of the 20 top confidence edges involve cardiomyopathy, which indicates that cardiomyopathy is a strong component in leading to heart failure. Cardiac dysrhythmia is a diagnosis that is particularly deadly when combined with further diagnoses. The findings of this graph seem to confirm the results of other studies. Others have identified cardiomyopathy and valve dysfunction as precursors for heart disease [17][18][19]. The American Heart Association has recommended that patients who have chronic kidney disease be considered in the highest risk group for development of cardiovascular disease [18]. Researchers have associated nephrotic syndrome with cardiovascular disease [18].
Pulmonary congestion is very common in patients with heart failure due to its relation to high pressure in the left ventricle of the heart. Many patients who have heart failure are also found to have fluid overload which is a common result of pulmonary congestion. Detection and treatment of pulmonary congestion can help prevent progression of heart failure [20]. The result of blood poisoning and sepsis is often multiple organ failure, including septic cardiomyopathy, which can lead to heart failure [21]. Other studies have found that chronic pulmonary heart disease is a predictor of chronic heart failure in China [22]. For many years, doctors have known that rheumatic fever can contribute to heart failure occurring later in life [23]. Hypertension is also a major contributing factor in congestive heart failure [24].
Almost all the top confidence edges involved rheumatic heart disease, pulmonary congestion and hypostasis, mitral and aortic valve disease, or cardiomyopathy, which accentuates the importance of those diseases in the diagnosis of heart failure. The high confidence edges given in Table 4 let us know that diseases such as cardiac dysrhythmia, diabetes, ischemic heart disease, and lung diseases, diagnosed beforehand, can ultimately result in heart failure. Table 5 shows us that the highest information gains tend to come from source diagnoses that are mental or noncardiac in nature (affective psychoses, cerebral degeneration, malignant neoplasm of bladder, Parkinson's disease, etc.) followed by an acute myocardial infarction, endocardium diseases, or cardiomyopathy. This seems to suggest that these diagnoses are the first cardiac problems that might occur in patients with other mental or noncardiac issues. This model of information gain suggests screening for those three diseases, since they appear as some of the first cardiac diagnoses on a trajectory that leads to heart failure.
Heart Failure Prediction. Our model can predict heart failure in patients from the diagnoses of their second visit (i.e., their first disease progression), as seen in Fig. 6. Adding diagnoses from subsequent visits makes the predictive performance plateau in comparison to the second visit. As discussed in our Data Preprocessing stage in Sect. 3, we only have a maximum of 4 patient visits for our prediction task.

Discussion
Which disease progression trajectories lead to heart failure? A diagnosis of cardiomyopathy is a very common theme that appears in many high confidence edges and high information gains that lead to heart failure. Cardiomyopathy appears in 14 out of the 20 high confidence edges and 5 of the top 20 high information gains. We can therefore conclude that cardiomyopathy is an important factor in the progression of a heart failure trajectory. Monitoring patients for cardiomyopathy and intervening early is therefore important in limiting heart disease.
Besides cardiomyopathy, most of the other high information gain edges had a destination node of acute myocardial infarction or endocardium diseases. These three appear as "gateway" diagnoses that eventually result in heart failure later in the medical record for patients who do not currently have a diagnosis of heart disease.
While cardiomyopathy is very common among the high confidence edges, it does not appear in the top four high confidence edges. Cardiac dysrhythmia appears as a source diagnosis in the two top confidence edges, indicating that those with cardiac dysrhythmia should watch out for rheumatic heart disease or pulmonary congestion. Additionally, pulmonary congestion appears as a destination node for three out of the top four confidence edges, indicating that pulmonary congestion is a complication that, given other diagnoses such as cardiac dysrhythmia, diabetes, or chronic ischemic heart disease, could eventually result in heart failure.
Can we predict heart failure? Using this model, we observe that we can predict heart failure from the patterns found in this data. The ROC curve given in Fig. 6 indicates that diagnoses given in the first visit contain most of the information that leads to heart failure. We receive diminishing returns from subsequent diagnoses after that first visit. One reason could be that most of the diseases that eventually result in heart failure have already appeared by the first visit to a doctor; patients rarely show no diseases indicative of heart failure at their first visit and then go on to develop heart failure later. Table 5 gives some examples of patients that have the highest jump in the probability of developing heart failure after their first visit. Certain cardiac events put those who were originally being treated for mental diagnoses in particular (affective psychoses, cerebral degenerations, Parkinson's disease, neurotic disorder) on a path to heart failure beginning in Visit 2. In general, the data tells us that the disease progression from Visit 1 to Visit 2 gives the strongest indication that a patient will eventually become a heart failure patient.

Conclusion
By constructing a DAG of Medicare patients and their visits, we found trends in diseases that result in an ultimate diagnosis of heart failure. We conclude that cardiomyopathy is a condition that is commonly associated with heart failure such that screening for cardiomyopathy should be a common part of preventative treatment. Additionally, we know that many patients' first diagnoses on a heart failure path are acute myocardial infarctions, endocardium diseases, and cardiomyopathy. Doctors who see patients for other medical issues, especially mental issues as observed, should know of these complications since they are often the first that show up in diagnoses that do not otherwise lead to heart failure. We also found that rheumatic heart disease, pulmonary congestion and hypostasis, cardiomyopathy, blood poisoning, and valve and aortic diseases are common comorbidities that occur before doctors diagnose patients with heart failure. Because the highest information gains in our DAG are on paths that concern mental disorders such as psychosis, cerebral degeneration, and Parkinson's, the conclusion can be made that patients being seen for these disorders should also be monitored for heart disease.
The ultimate goal of such a system is to be able to effectively predict likelihood of heart failure, which we demonstrate using our trained DAG. We show that the most indicative diagnoses belong to the first disease progression in terms of their information gain and area under the ROC curve. This underscores the usefulness of our model in extracting signals which can be used for early detection of heart failure.