
Intelligible Models for HealthCare: Predicting Pneumonia
Risk and Hospital 30-day Readmission
Microsoft Research · LinkedIn Corporation · Microsoft Research · Columbia University

ABSTRACT
In machine learning often a tradeoff must be made between accuracy and intelligibility. More accurate models such as boosted trees, random forests, and neural nets usually are not intelligible, but more intelligible models such as logistic regression, naive-Bayes, and single decision trees often have significantly worse accuracy. This tradeoff sometimes limits the accuracy of models that can be applied in mission-critical applications such as healthcare, where being able to understand, validate, edit, and trust a learned model is important. We present two case studies where high-performance generalized additive models with pairwise interactions (GA2Ms) are applied to real healthcare problems, yielding intelligible models with state-of-the-art accuracy. In the pneumonia risk prediction case study, the intelligible model uncovers surprising patterns in the data that previously had prevented complex learned models from being fielded in this domain, but because it is intelligible and modular it allows these patterns to be recognized and removed. In the 30-day hospital readmission case study, we show that the same methods scale to large datasets containing hundreds of thousands of patients and thousands of attributes while remaining intelligible and providing accuracy comparable to the best (unintelligible) machine learning methods.

Categories and Subject Descriptors
I.2.6 [Computing Methodologies]: Learning—Induction

Keywords
intelligibility; classification; interaction detection; additive models; logistic regression; healthcare; risk prediction

INTRODUCTION
In the mid 90's, a large multi-institutional project was funded by Cost-Effective HealthCare (CEHC) to evaluate the application of machine learning to important problems in healthcare such as predicting pneumonia risk. In the study, the goal was to predict the probability of death (POD) for patients with pneumonia so that high-risk patients could be admitted to the hospital while low-risk patients were treated as outpatients. The most accurate models that could be trained were multitask neural nets. On one dataset the neural nets outperformed traditional methods such as logistic regression by a wide margin (the neural net had AUC=0.86 compared to 0.77 for logistic regression), and on the other dataset used in this paper they outperformed logistic regression by about 0.02 (see Table 2). Although the neural nets were the most accurate models, after careful consideration they were considered too risky for use on real patients, and logistic regression was used instead. Why?

One of the methods being evaluated was rule-based learning. (Footnote: SVMs and boosted trees were not in common use yet, and Random Forests had not yet been invented.) Although models based on rules were not as accurate as the neural net models, they were intelligible, i.e., interpretable by humans. On one of the pneumonia datasets, the rule-based system learned the rule "HasAsthma(x) ⇒ LowerRisk(x)", i.e., that patients with pneumonia who have a history of asthma have lower risk of dying from pneumonia than the general population. Needless to say, this rule is counterintuitive. But it reflected a true pattern in the training data: patients with a history of asthma who presented with pneumonia usually were admitted not only to the hospital but directly to the ICU (Intensive Care Unit). The good news is that the aggressive care received by asthmatic pneumonia patients was so effective that it lowered their risk of dying from pneumonia compared to the general population. The bad news is that because the prognosis for these patients is better than average, models trained on the data incorrectly learn that asthma lowers risk, when in fact asthmatics have much higher risk (if not hospitalized).

One of the goals of the study was to perform a clinical trial to determine if machine learning could be used to predict risk prior to hospitalization, so that a more informed decision about hospitalization could be made. The ultimate goal was to reduce healthcare cost by reducing hospital admissions, while maintaining (or even improving) outcomes by more accurately identifying patients that need hospitalization. As the most accurate models, neural nets were a strong candidate for clinical trial. Deploying neural net models that could not be understood, however, was deemed too risky.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
KDD'15, August 10-13, 2015, Sydney, NSW, Australia.
Copyright is held by the owner/author(s). Publication rights licensed to ACM.
ACM 978-1-4503-3664-2/15/08 $15.00.
If the rule-based system had learned that asthma lowers risk, certainly the neural nets had learned it, too. The rule-based system was intelligible and modular, making it easy to recognize and remove dangerous rules like the asthma rule. There are methods for repairing the neural nets so they do not incorrectly predict that asthmatics are at lower risk and thus less likely to need hospitalization, e.g., re-train without asthmatics in the population, remove the asthma feature, or modify the targets for asthmatics to "1" in the data to reflect the care they received (unfortunately confounding care with death). Nevertheless, the decision was made to not use the neural nets, not because the asthma problem could not be solved, but because the lack of intelligibility made it difficult to know what other problems might also need fixing. Because the neural nets were more accurate than the rules, it was possible that the neural nets had learned other patterns that could put some kinds of patients at risk if used in a clinical trial. For example, perhaps pregnant women with pneumonia also receive aggressive treatment that lowers their risk compared to the general population. The neural net might learn that pregnancy lowers risk, and thus recommend not admitting pregnant women, putting them at increased risk. In an effort to "do no harm", the decision was made to go forward only with models that were intelligible, such as logistic regression, even if they had lower AUC than other, unintelligible models. The logistic regression model also learned that having asthma lowered risk, but this could easily be corrected by changing the weight on the asthma feature from negative to positive (or to zero).

Jumping two decades forward to the present, we now have a number of new learning methods that are very accurate, but unfortunately also relatively unintelligible, such as boosted trees, random forests, bagged trees, kernelized SVMs, neural nets, deep neural nets, and ensembles of these methods. Applying any of these methods to mission-critical problems such as healthcare remains problematic, in part because usually it is not ethical to modify (or randomize) the care delivered to patients to collect datasets that will not suffer from the kinds of bias described above. Learning must be done with the data that is available, not the data one would want. But it is critical that models trained on real-world data be validated prior to use lest some patients be put at risk, which makes using the most accurate learning methods difficult to justify.

In this paper we describe the application of a learning method based on high-performance generalized additive models to the pneumonia problem described above, and to a modern, much larger problem: predicting 30-day hospital readmission. On both of these problems our GA2M models yield state-of-the-art accuracy while remaining intelligible, modular, and editable. We believe this class of models represents a significant step forward in training models with high accuracy that are also intelligible. The main contributions of this paper are that it: shows that GA2Ms yield competitive accuracy on real problems; demonstrates that the learned models are intelligible; demonstrates that the predictions made by the model for individual cases (patients) also are intelligible; and demonstrates how, because the models are modular, they can be edited by experts.

The remainder of the paper is organized as follows. The next section provides a brief introduction to GAM and GA2M. The two sections after that present our case studies of training intelligible GA2M models on the pneumonia and the 30-day readmission data, respectively. The final sections discuss a wide range of issues that arise when learning with intelligible models and our general lessons for the research community.

INTELLIGIBLE MODELS
Let D = {(x_i, y_i)}_1^N denote a training dataset of size N, where x_i = (x_i1, ..., x_ip) is a feature vector with p features and y_i is the target (response). We use x_j to denote the jth variable in the feature space.

Generalized additive models (GAMs) are the gold standard for intelligibility when low-dimensional terms are considered. Standard GAMs have the form

    g(E[y]) = β_0 + Σ_j f_j(x_j),

where g is the link function and, for each term f_j, E[f_j] = 0. Generalized linear models (GLMs), such as logistic regression, are a special form of GAMs where each f_j is restricted to be linear. Since the contribution of a single feature to the final prediction can be easily understood by examining f_j, such models are considered intelligible.

To improve accuracy, pairwise interactions can be added to standard GAMs, leading to a model called GA2M:

    g(E[y]) = β_0 + Σ_j f_j(x_j) + Σ_{i≠j} f_ij(x_i, x_j).

Note that pairwise interactions are intelligible because they can be visualized as a heat map. GA2M builds the best GAM first, and then detects and ranks all possible pairs of interactions in the residuals. The top k pairs are then included in the model (k is determined by cross-validation).

There are various methods to train GAMs and GA2Ms. Each component can be represented using splines, leading to an optimization problem which balances the smoothness of the splines against empirical error. Other representations include regression trees on a single feature or a pair of features. Empirical study has shown that gradient boosting with bagging of shallow regression trees as components yields very good accuracy. Interested readers are referred to the literature for details.

CASE STUDY: PNEUMONIA RISK
In this case study we use one of the pneumonia datasets discussed earlier in the motivation. The dataset contains 14,199 pneumonia patients. To facilitate comparison with prior work, we use the same train and test set folds from the earlier study: the train set contains 9,847 patients and the test set has 4,352 patients (a 70:30 train:test split). There are 46 features describing each patient. These range from history features such as age and gender, to simple measurements taken at a routine physical such as heart rate, blood pressure, and respiration rate, to lab tests such as White Blood Cell count (WBC) and Blood Urea Nitrogen (BUN), to features read from a chest x-ray such as lung collapse or pleural effusion. See Table 1 for a complete list. (Footnote: Code is available online.)

As discussed earlier, the goal is to predict probability of death (POD) so that patients at high risk can be admitted to the hospital, while patients at low risk are treated as outpatients. (Footnote: Hospitals are dangerous places, particularly for patients with impaired immune systems. Treating low-risk patients as outpatients not only saves money, but is actually safer.) 10.86% of the patients in the dataset (1,542 patients) died from pneumonia.
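The training procedure described above (gradient boosting with shallow regression trees, one tree per term) can be made concrete with a short sketch. This is an illustrative toy, not the authors' implementation: the round-robin schedule over features, the learning rate, the tree size, and the synthetic data in the usage example are all assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gam(X, y, n_rounds=20, lr=0.1):
    """Cyclic gradient boosting for a binary-outcome GAM: in each round,
    fit one shallow regression tree per feature to the log-loss residual.
    trees[j] realizes the shape function f_j for feature j."""
    n, p = X.shape
    trees = [[] for _ in range(p)]
    score = np.zeros(n)                      # additive log-odds
    for _ in range(n_rounds):
        for j in range(p):
            prob = 1.0 / (1.0 + np.exp(-score))
            resid = y - prob                 # negative gradient of log-loss
            t = DecisionTreeRegressor(max_depth=2, max_leaf_nodes=3)
            t.fit(X[:, j:j + 1], resid)      # tree sees only feature j
            trees[j].append(t)
            score += lr * t.predict(X[:, j:j + 1])
    return trees

def term_score(trees_j, xj, lr=0.1):
    """Risk-score contribution of one term f_j at feature values xj;
    plotting this over a grid of xj gives the term's shape graph."""
    xj = np.asarray(xj, dtype=float).reshape(-1, 1)
    return lr * sum(t.predict(xj) for t in trees_j)
```

Because the model is just a sum of per-feature terms, summing `term_score` over all features reproduces the model's total risk score, which is what makes each term individually plottable and editable. A GA2M-style extension would add a second pass fitting trees on selected feature pairs to the residuals of this model.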
Table 1: Features in the pneumonia dataset. Continuous features that will be shaped by the GAM/GA2M models are marked with a "C" in the original table. [Only part of the table survived extraction; recovered entries are listed below.]
  Patient-history findings: re-admission to hospital; admitted through ER; admitted from nursing home; chronic lung disease; diabetes mellitus; congestive heart failure; ischemic heart disease; number of diseases; cerebrovascular disease; history of seizures; chronic liver disease; history of chest pain.
  Physical examination findings: diastolic blood pressure; altered mental status.
  Laboratory findings: liver function tests.
  Chest X-ray findings: positive chest x-ray; lobe or lung collapse.

Table 2: AUC for different learning methods (Logistic Regression, etc.) on the pneumonia and 30-day readmission tasks. [Numeric entries did not survive extraction; the pneumonia results are repeated in the text below.]

The GAM/GA2M models are trained on this data using 100 rounds of bagging. Bagging is done to reduce overfitting, and to provide pseudo-confidence intervals for the graphs in the intelligible model.

The AUCs for the different models trained on this data are shown in Table 2. On this dataset logistic regression achieves AUC = 0.843, Random Forests achieve 0.846, LogitBoost 0.849, GAM 0.854, and GA2M is best with AUC = 0.857. The difference in AUC between the methods is not huge (less than 0.02), but it is reassuring to see the GAM/GA2M methods achieve the best accuracy on this problem. The important question is whether the GAM/GA2M models are able to achieve this accuracy while remaining intelligible.

Figure 1 shows 28 of the 56 terms in the GA2M model for pneumonia. (Footnote: The GA2M model uses 10 of the 46·45/2 = 1035 possible pairwise interaction terms, with k chosen by cross-validation.) Unfortunately, the compact representation necessary for the paper reduces intelligibility. For models like this with fewer than 100 terms we would prefer to present all terms, possibly sorted by their importance to the model. In the actual deployment, for each term we would also show a histogram of data density for different values of the feature, descriptive statistics about the feature, several different measures of term importance in the model, and links to online resources that provide information about the term, e.g., links to a hospital database, or Wikipedia or WebMD pages that describe features, how they are measured, what the normal ranges are, and what abnormal values indicate. Because of space limitations we have suppressed all of this auxiliary information (including some axis labels!) and just present shape plots for some of the more interesting terms. Presenting the terms in multicolumn format without the auxiliary information further hinders intelligibility — the models are more readable when presented in sorted order as a scrollable list of graphs plus the auxiliary information.

The 1st term in the model is for age. Age (in years) on the x-axis ranges from 18 to 106 (the pneumonia dataset contains only adults). The vertical axis is the risk score predicted by the model for patients as a function of age. The risk score for this term varies from -0.25 for patients with age less than 50, to a high of about 0.35 for patients age 85 and above. The green errorbars are pseudo-errorbars of the risk score predicted for each age: each errorbar is ±1 standard deviation of the variation in the risk score measured by the 100 rounds of bagging. We use ±1 standard deviation instead of the standard error of the mean because it is well known that bagging underestimates the variance of predictions from complex models. We believe it is safer to be conservative than to present unrealistically narrow confidence intervals. (See the enlarged version of this graph, and the more detailed analysis of the age feature, later in the paper.)

The 2nd term in the model, asthma, is the one that caused trouble in the CEHC study in the mid-90's and prevented clinical trials with the very accurate neural net model. The GA2M model has found the same pattern discovered back then: that having asthma lowers the risk of dying from pneumonia. As with the logistic regression and rule-based models trained then, but unlike with the neural net models, this term is easy to recognize and fix in the GA2M model. We can "repair" the model by eliminating this term (effectively setting the weight on this graph to zero), or by using human expertise to redraw the graph so that the risk score for asthma=1 is positive, not negative. Because asthma is boolean, it is not necessary to use a graph, and we could present a weight and offset (RiskScore = w*hasAsthma + b) instead. We prefer to use graphs for boolean terms like asthma for three reasons: 1) it is necessary to show graphs for features with multiple or continuous values such as age, as well as for interactions between features, and it is awkward for the user to jump from terms presented as graphs to terms presented as equations; 2) we find graphs provide an intuitive display of risk where up implies higher risk, down implies lower risk, and the magnitude of the change in risk is captured by the distance moved; and 3) some users are not as comfortable with numbers as they are with graphs, and it is important that the model is intelligible to real users, whatever their background.

The 3rd term in the model is BUN (Blood Urea Nitrogen) level. Most patients have BUN=0 because, as in many medical datasets, if the variable is not measured or is assumed normal it is coded as 0. The model says risk is reduced for patients where BUN was not measured, suggesting that this test typically is not ordered for patients who appear to be healthy.

Figure 1: 28 (of 56 total) components for the GA2M model trained on the pneumonia data. The line graphs are terms that contain single features. The heat maps at the bottom are pairwise interaction terms (including age vs. respiration rate and respiration rate vs. BUN). The vertical scale on all line graphs is the same to facilitate rapid scanning of the relative contribution of each term. The green errorbars are pseudo-errorbars from bagging. Boolean features such as asthma are presented as graphs because this aids interpretation among other features that must be presented as graphs. [The plots themselves did not survive extraction.]
BUN levels below 30 appear to be low risk, while levels from 50-200 indicate higher risk. This is consistent with medical knowledge, which suggests that normal, healthy BUN is 10-20, and that elevated levels above 30 may indicate kidney damage, congestive heart failure, or bleeding in the gastrointestinal tract.

The cancer term in the model is clear: having cancer significantly increases the risk of dying from pneumonia, probably because it explains why the patient has pneumonia (either from lung cancer, from immunosuppressive drugs used to treat cancer, or from hospitalization associated with cancer) and/or because it explains the stage of cancer (terminal stages of cancer frequently lead to failing health and being bed-ridden, both of which can lead to pneumonia).

The next term in the model, chronic lung disease, and the history of chest pain term that occurs later, are interesting because the model suggests that chronic lung disease and a history of chest pain both decrease POD. We suspect that this may be a similar problem as asthma: patients with lung disease and chest pain may receive care earlier, and may receive more aggressive care. If further investigation suggests this to be the case, both terms would be removed from the model, or edited, similar to the asthma term.

The # of diseases (# of comorbid conditions) term is a general measure of illness. The graph suggests that having no diseases other than pneumonia lowers risk, that risk increases slowly as the number of comorbid conditions increases from 1-3, then is flat or decreases until it rises dramatically above 6; but the errorbars are large enough to be consistent with risk being somewhat flat for 3-8 comorbidities.

Heart rate is an unusual looking graph. Many patients have rate=0, indicating it was not measured or was assumed normal. Risk is high for very low heart rates (10-30), and for very high rates (125-200), but the model does not appear to discriminate between patients with heart rates 40-120. On further inspection, there are no patients with heart rate recorded between 40-120! Apparently all patients in this range were considered "normal" and coded as 0. (Normal heart rate in adults is about 60-100, 40-60 for athletes, and somewhat higher than 100 for patients with "White Coat" Syndrome.) The unusual shape of the graph for heart rate has led us to discover a surprising aspect of the data, though it is not clear what risk we would want the model to predict for rates between 40-120 where there is no data.

The respiration rate term is very clear: rate=0 (missing) or 20-28 is low risk, and risk rises rapidly as breathing rate climbs from 28-60. Normal respiration rate for adults is 16-20. In the body temperature term, temperatures from 36°C-40°C are low risk (normal is 37°C), risk is somewhat elevated at low body temperatures (32°C-36°C), and greatly elevated for temperatures above 40.5°C (a fever this high often is a sign of serious infection). Having a fever above 41.5°C increases the risk score by a full point or more. (Footnote: An increase in risk of 1 point more than doubles the odds of dying; see the section on interpreting risk scores.) Diastolic blood pressure also can dramatically increase risk: low diastolic in the range 20-50 (normal is 60-80) increases risk by as much as a full point. % bands is also a strong term (bands in a blood test are a sign of bacterial infection—bacterial pneumonia is more deadly than viral pneumonia): bands above 40% indicate elevated risk, with bands above 80% indicating very elevated risk.

Before leaving pneumonia, let us examine one of the interaction terms. In the age vs. cancer term, we see that risk is highest for the youngest patients (probably cancers acquired in childhood but not cured when the patient reaches age 18), and declines for patients who acquire cancer later in life, but for patients without cancer risk rises as expected with age. This is a classic interaction effect that likely results from the difference between childhood and adult cancers.

Space prevents us from discussing each term individually, or from discussing terms in great detail. (See the discussion later in the paper for a deeper dive on the age term.) To summarize this section: the GA2M model discovered the same asthma pattern that created problems in the CEHC study, provides a simple mechanism to correct this problem, and uncovered other similar problems (chronic lung disease and history of chest pain) that were not recognized in the CEHC study but which warrant further investigation and probably repair. As hoped, the GA2M model is accurate, intelligible, and repairable.

CASE STUDY: 30-DAY READMISSION
In this section we apply GA2M to a modern and much larger dataset for 30-day hospital readmission. The data comes from a collaboration with a large hospital. There are 195,901 patients in the train set (2011-2012), 100,823 patients in the test set (2013), and 3,956 features for each patient. Features include lab test results, summaries of doctor notes, and details of previous hospitalizations. In this problem, the goal is to predict which patients are likely to be readmitted to the hospital within 30 days after being released from the hospital. All patients in this dataset have already been hospitalized at least once, and the goal is to predict if they will need to return to the hospital unusually quickly (within 30 days). Hospitals with abnormally high 30-day readmission rates are penalized financially, because a high rate suggests the hospital did not provide adequate care on the earlier admission, or may have released the patient prematurely, or did not provide adequate instructions to the patient when they were released, or did not perform adequate follow-up after release. In the data, 8.91% of patients are readmitted within 30 days. For this problem we use 10 iterations of bagging. Training the 10 models takes 2-3 days on a small cluster of 10 general-purpose computers. Table 2 shows the AUC for different models on this data.

Earlier we examined the GA2M model for the pneumonia problem term by term. Unfortunately, the readmission dataset contains almost 100 times as many features. Instead of trying to examine the full model, we instead examine the predictions made by the model for three patients. Two of these patients have very high predicted probability of readmission (p=0.9326 and p=0.9264), and one of the patients has a typical readmission probability (p=0.0873). This allows us to demonstrate that the models are intelligible not only taken as a whole, but that the predictions GA2M models make for individual patients also are intelligible.

In Figure 2 each of the three columns is a patient, and each row is a term in the model. Terms are sorted for each patient (in each column) by the risk they contribute to that patient for 30-day readmission. Space limits us to showing the top 6 terms for each patient that contributed most to risk. Patient #1 has a very high probability of readmission within 30 days: p=0.9326. The four terms that contribute most to their high probability of readmission are: their total number of visits to the hospital is 40, they have been an inpatient in the hospital 19 times in the last 12 months, they have been in the hospital 10 times in the last 6 months, and 4 times in the last 3 months.

Figure 2: Top 6 terms (of 4456) in the GA2M for three patients. The patients on the left have high risk of readmission (Patient 1: 0.9326; Patient 2: 0.9264). The patient on the right has moderate risk (Patient 3: 0.0873). Terms are sorted by their contribution to risk. Blue lines highlight feature values and corresponding risk scores. Six terms cannot tell the full story for these patients, but even these few terms provide insight into the patients and their risk of readmission. [Recovered term labels include: # inpatient visits ever; # inpatient visits last 12/6/3 months; # outpatient visits ever; amoxicillin, verapamil, prednisone, etoposide, mesna, doxorubicin, clonazepam, and ondansetron hydrochloride preparations; endometrial carcinoma; malignant adenomatous neoplasm; intraductal carcinoma of breast; whole blood hematocrit tests. The plots themselves did not survive extraction.]
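The per-patient columns of Figure 2 are built by evaluating every term at one patient's feature values and sorting by contribution. A minimal sketch of that logic follows; the term functions, feature names, and values in the usage example are hypothetical, chosen only to echo the kinds of terms discussed in the text.

```python
def explain_patient(terms, patient, top_k=6):
    """Evaluate each term's shape function at this patient's feature
    values and return the top_k (term, risk score) pairs, sorted from
    the term that increases risk most downward."""
    contributions = {name: float(f(patient[name])) for name, f in terms.items()}
    return sorted(contributions.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# Hypothetical toy terms and patient, loosely modeled on the discussion:
toy_terms = {
    "# inpatient visits last 12 months": lambda v: 0.08 * min(v, 20),
    "age": lambda v: 0.01 * (v - 50),
    "hematocrit": lambda v: -0.10,   # a typical (low-risk) lab result
}
toy_patient = {"# inpatient visits last 12 months": 19, "age": 60, "hematocrit": 40}
```

With these toy numbers, the inpatient-visits term dominates the explanation, mirroring the paper's observation that recent inpatient visits are among the most predictive terms for readmission.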
predictive terms in the 30-day readmission model measure the number of visits patients have made in the last 12 month, 6 months, and 3 months to the ER, as an outpatient, and as an inpatient. As we see with this patient, a large number of recent inpatient visits (admissions) is associated with a high probability of readmission The next two terms suggest why patient #1 may have been in the hospital often: this patient has received large doses of amoxicillin (an antibioticused to treat infections like strep and pneumonia) and ver-apamil (a treatment for hypertension and angina), i.e., they Table 3: Risk scores (log odds) and the correspond- have an ongoing infection that may not be responding to ing probabilities.
antibiotics, and also probably have heart disease. The mainreason this patient is predicted to be likely to return is be-cause they have been in the hospital often in the last year, for each patient present a comprehensive picture of the risk but the first few terms in the model also give us a hint of factors that contribute to the probability of readmission pre- the medical conditions that put them at elevated risk.
dicted for a patient. The model is not causal — it does not The terms that are most important for patient #2 (also say that because the patient has X, they received treatments high risk: p=0.9364) are different from the terms that were A, B, and C, and we can see from the amount of A, B, and important for patient #1.
The most important 6 terms C they received that they are not responding well. Instead, are preparations that the patient received during their last it learns that high doses of A, B, and C are associated with visit: prednisone is a corticosteroid used as an imummo- high risk or readmission, and it is up to the human experts suppressant, etoposide in an anticancer drug, mesna is a to infer the underlying causal reasons for the feature values cancer chemotherapy drug, doxorubicin is a treatment for and the risk they predict. Nevertheless, compared to an un- blood and skin cancers, dexamethosone is another immuno- intelligible model such as an ensemble of 1000 boosted trees suppressant steroid, and ondansetron is a drug used to treat or a complex neural net, the model is fairly transparent, and nausea from chemotherapy. Patient #2 has received doses the predictions it makes can be fully "understood", both at of each of these preparations that suggest cancer may not be the per-patient level, and at the macro-model level.
responding well to treatment and that they are receiving ag-gressive chemotherapy. The contribution to risk from these 6 terms alone is greater than +1.5, i.e., these 6 terms alonetriple the odds of their being readmitted within 30-days.
Patient #3 has moderate risk: p = 0.0873 (baseline rate is 8.91%). The 6 terms that increase this patient's readmission risk the most are: 1) the patient has endometrial carcinoma (a cancer common in post-menopausal women that often can be treated effectively by hysterectomy without radiation- or chemo-therapy); 2) a benign abdominal tumor (malignant adenomatous neoplasm = 3); 3) a relaxant typically prescribed to calm patients or reduce spasms; 4) a fairly typical (i.e., low-risk) hematocrit test result; 5) a pre-cancerous non-invasive lesion in the breast; and 6) a small number of outpatient visits, suggesting they have been receiving treatment as an outpatient without needing to be hospitalized (the inpatient and ER risk factors for this patient are all low).6 Patient #3 is a typical patient as far as 30-day readmission is concerned. They are post-menopausal, have cancers that respond well to treatment if caught early, the treatments themselves are relatively low-risk, and they have not needed unusual medications or to be hospitalized often in the last year.

6 A large number of visits to the ER also is associated with increased chance of readmission, but outpatient visits are more interesting: a small number of recent outpatient visits increases risk of readmission, but a very large number of outpatient visits (100-200 in the last year) indicates lower risk of readmission because the patient is receiving primary care as an outpatient—many of these patients are dialysis patients who visit the hospital 1-2 times per week.

The patients above provide a small glimpse of what the GA2M model learned from a 200,000-patient training set with 4,000 features: we have only been able to examine three patients, and have only looked at the top 6 terms for each of these patients. To a medical expert, the sorted terms provide a concise summary of the factors that most affect the model's prediction for each patient.

How To Interpret Risk Scores

Each term in the intelligible model returns a risk score (log odds) that is added to the patient's aggregate predicted risk. Terms with risk scores above zero increase risk; terms with scores below zero decrease risk. The term risk scores are added to a baseline risk, and the sum is converted to a probability. Both pneumonia and 30-day readmission have baseline rates near 0.1, which corresponds to TotalRiskScore = -2.197. So patients with aggregate risk scores above -2.2 have higher than average risk, and patients with total risk scores below -2.2 have lower than average risk. A patient with TotalRiskScore = 0 (including the baseline offset) has quite high risk: p = 1/(1 + exp(-TotalRiskScore)) = 1/(1 + exp(0)) = 0.5. Table shows a sample of total risk scores and the corresponding probabilities.

In the intelligible models discussed in this paper, the average risk score for each graph (i.e., each term: each feature or pair of features), averaged across the training set, is set to zero by subtracting the mean score. A single bias term is then added to the model so that the average predicted probability across all patients equals the observed baseline rate. This is done to make models identifiable and modular. Because of this property, each graph can be removed from the model (zeroed out) without introducing bias to the predictions. If all terms were removed from the model, the only remaining term would be the bias term, and the probability predicted for all patients would be the observed baseline rate in the training set. Adding terms (graphs) to the model increases the model's discriminativeness without altering the prior. This is important because it increases modularity and makes it easier to interpret the contribution of each term: negative scores decrease risk, and positive scores increase risk compared to the baseline risk.
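The arithmetic above can be sketched in a few lines of Python. This is an illustrative snippet, not code from the study; the term names and scores are hypothetical, and only the baseline offset of -2.197 comes from the text:

```python
import math

def predicted_probability(term_scores, bias=-2.197):
    """Sum per-term risk scores (log odds), add the bias term that encodes
    the baseline rate, and convert the total to a probability."""
    total = bias + sum(term_scores.values())
    return 1.0 / (1.0 + math.exp(-total))

# Hypothetical patient: each term contributes a (possibly negative) score.
scores = {"age": 0.20, "asthma": -0.15, "heart_rate": 0.05}

p = predicted_probability(scores)     # full model
baseline = predicted_probability({})  # bias only: about 0.1, the baseline rate
# Modularity: because terms are mean-centered, zeroing one term out
# leaves an unbiased model over the remaining terms.
p_edited = predicted_probability({k: v for k, v in scores.items() if k != "asthma"})
```

Editing the model by deleting the "HasAsthma" graph, as described for the pneumonia case study, is exactly this kind of term removal.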
Sorting Terms by Importance

If a model contains a modest number of terms (e.g., fewer than 50), it is best to show the terms to experts in the order the experts are most familiar with. Because experts are often used to seeing features in logical groupings, interpretation is aided by preserving these groupings when the model is presented. However, when the number of terms grows large, it becomes infeasible for experts to examine all terms carefully. Term importance often follows a power-law distribution, with a few terms being very important, a modest number of terms being somewhat important, and many terms being of little importance. When this is the case, intelligibility can be improved by sorting terms by a measure of importance such as the drop in AUC when the term is removed, the skill of the term measured in isolation, or the maximum contribution (positive or negative) that the term can make for any patient. No one measure is correct or best, and we find that a sort that reflects a combination of these metrics seems to work well.
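One of these measures — the maximum contribution a term can make for any patient — can be computed directly from the fitted term graphs. A hypothetical sketch, with term graphs represented as simple value-to-score lookup tables (the age scores echo figures quoted later in this section; the other graphs are invented):

```python
def max_contribution(term_graph):
    """Importance of a term: the largest-magnitude risk score
    (positive or negative) the term can contribute for any patient."""
    return max(abs(s) for s in term_graph.values())

# Hypothetical fitted term graphs: feature value -> risk score (log odds).
graphs = {
    "age":        {30: -0.27, 70: 0.0, 82: 0.20, 86: 0.35},
    "asthma":     {0: 0.0, 1: -0.15},
    "heart_rate": {60: -0.05, 120: 0.11},
}
ranked = sorted(graphs, key=lambda t: max_contribution(graphs[t]), reverse=True)
# "age" ranks first: its 0.35 score at age 86 is the largest contribution.
```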
It is much easier to sort terms by importance when making a prediction for a single patient: because each term yields a single risk score for each patient at the point where that patient's feature value lies on the term graph, it is possible to sort terms by how much they increase or decrease risk for the patient. This provides a well-defined ordering of the terms for a patient, from the terms that increase risk most to the terms that decrease risk most. Often this ordering quickly identifies the key patient characteristics that best explain the model's prediction, and which help experts quickly understand the patient's condition. This is the method we used to describe the predictions made by the 30-day readmission model — although that model contains more than 4000 terms, the number of terms that are relevant for each patient is, in practice, often quite small (e.g., fewer than 100).
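The per-patient ordering is a one-line sort over the term contributions. Again an illustrative sketch — the term names and scores below are invented, not taken from the readmission model:

```python
def explain_patient(term_scores, top_k=6):
    """Order a patient's term contributions from the terms that increase
    risk most to the terms that decrease risk most; return the top_k."""
    ranked = sorted(term_scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]

# Invented contributions (log-odds risk scores) for one patient.
patient = {"endometrial_carcinoma": 0.31, "hematocrit": 0.12,
           "outpatient_visits": 0.09, "age": -0.05, "inpatient_visits": -0.22}
top = explain_patient(patient, top_k=3)
```

In a model with over 4000 terms, most contributions are near zero for any one patient, so a short prefix of this ordering usually tells the whole story.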
Figure 3: Risk as a function of Age for the Pneumonia and 30-day Readmission problems. (Panels: (a) Pneumonia; (b) 30-day Readmission; (c) 30-day Readmission, zoomed in.)

Feature Shaping vs. Expert Discretization

Significant effort was made in the CEHC pneumonia study to train accurate models with logistic regression and other methods that could not handle continuous attributes. Medical experts carefully discretized each continuous attribute into clinically meaningful ranges used to define boolean variables. For example, the intervals for age were 18-39, 40-54, 55-63, 64-68, 69-72, 73-75, 76-79, 80-83, 84-88, and 89+. We used these expert-defined intervals for the logistic regression model reported in the accuracy table. We also trained a GA2M model with these discretized features, and observed a drop in AUC of about 0.01 on the test set compared to the GA2M trained with the continuous features, suggesting that the GA2M model gains some of its accuracy by shaping continuous features more accurately than expert discretization.
Deep Dive: Risk as a Function of Age

In this section we drill down on how the feature "Age" is shaped by the pneumonia and 30-day readmission models. Age is present in both data sets and measured in years. But the relevance of age to the two prediction tasks is very different. In pneumonia, age is a critical factor that can explain why a patient has acquired pneumonia, and what the outcome is likely to be.7 In 30-day all-cause readmission,

7 Pneumonia is sometimes called "The Old Man's Best Friend", not because pneumonia is good for elderly patients, but because it often results in rapid death for patients that otherwise could linger for months or years before their primary illness causes death.
Figure 4: Selected splines in pneumonia dataset. (Three panels; y-axis: Pneumonia Risk Score.)
however, age is just one of thousands of factors that affect a patient's health and course of illness. Moreover, because the prediction task is hospital readmission, not probability of death, age represents a weaker, more generic characterization of patient health and their likelihood of needing additional hospitalization within 30 days. If the patient is elderly, but just had a successful hip replacement or kidney stone removed, they are not likely to need to return to the hospital within 30 days for this condition. Similarly, an elderly patient who was admitted to the hospital because of pneumonia, but who is now being released because they responded to treatment, is unlikely to need further care for pneumonia within 30 days if they take proper medications. All-cause readmission is very different from probability of death for a specific condition such as pneumonia.

Figure a) shows the risk profile for age in the pneumonia model, and the distribution of age in the pneumonia data. The majority of pneumonia patients are age 60-90. Qualitatively, the risk of dying from pneumonia is low and constant from age 18-50, rises slowly from age 50-66, then rises quickly from age 66-90, and then levels off at very high risk above age 90. The low-risk region to the left of age 50 is remarkably flat, suggesting that the underlying trees rarely if ever found it useful to split this region into subregions. Note that the risk score for this region is -0.27, suggesting that being young significantly reduces the risk of dying from pneumonia. But risk slowly increases as age increases above 50, though the contribution to risk does not become positive until about 70 years. Beyond 70 years old, the contribution to risk rises rapidly from 0.0 at 70 to +0.20 at age 82 and +0.35 at age 86. According to the model, the increase in risk of going from 70 to 86 is larger than the decrease in risk of going from 70 down to 50 or less.

Beyond the risk vs. age profile described above, there are intriguing details in the graph.8 1) There is a small jump in risk at age 67, and again at age 86. The error bars are reasonably tight around age 65-70, suggesting that the jump in risk at 67 may be real. One possible explanation for this is that in a dataset from the 90's, many patients would have retired at around age 65, and that this may yield differences in activity levels, health insurance, and willingness to get access to healthcare early enough to improve outcomes — pneumonia responds well to treatment with antibiotics, but can be life threatening if not treated. The 2nd jump in risk around age 86 is harder to explain. It may be that practitioners, either consciously or subconsciously, treat patients older than 85 differently, and that this ultimately increases their POD. Or the jump at 86 may be an artifact of the model — the error bars at age 86 and above are larger. One approach to investigating this issue further would be to train on another sample of data (or on different subsamples) to see if the rise at age 86 persists. 2) There is an apparent drop in risk above age 100. We suspect that this drop probably is not real and may be due to mild overfitting — there are very few patients age 95 and older, and the error bars from age 90 to 106 are large and consistent with risk being constant in this region.9 3) Surprisingly, there is no evidence that risk, although very high, increases above age 85. Either medical treatments are equally effective for patients older than 85, or other medical conditions are more likely to be responsible for death at this age than pneumonia, or risk does increase above 85 and the model has failed to learn it.

Figure b) shows the age term and density for 30-day readmission. One of the key differences between the pneumonia and 30-day readmission datasets is that the pneumonia dataset contains only adult patients age 18 and older, but the readmission dataset contains patients of all ages, including newborn infants. The importance of age to 30-day readmission is very different. Age has little effect on readmission between age 2 and 50, risk slowly increases from age 50 to 80, and then increases a little more above age 80. The largest increase in score is +0.03 at age 90 and above. There are many reasons why age is less important for readmission than for pneumonia: most patients, independent of age, would not be released if the hospital thought they were likely to need to be readmitted in less than a month; in this dataset there are thousands of other more specific variables that can better explain variance in the risk of readmission (the model is more illness specific) than age; and some patients who are very elderly will die at home (either unexpectedly or by choice) and thus will not be readmitted.

8 It is only because the model is so intelligible that we are able to recognize and question such fine detail in the risk vs. age profile. We assume that similar jumps in predicted risk occur in other accurate models such as boosted trees as well, but because those models are less intelligible the jumps are not recognized or investigated.

9 Or it might be due to successful agers, a rare but genetically identifiable class of people with traits that better enable them to survive into old age.

An interesting feature of the model for 30-day readmission is highlighted in Figure c) where the x axis has been
expanded to show age 0-2. In this dataset newborn infants are born into the hospital, and thus will be treated as readmitted if they need to be hospitalized within 30 days after going home. In part because newborns would not be released if they were at risk, the risk score for newborns age 0-2 months is -0.04—this is a larger negative risk score than the increase in risk for elderly patients. This suggests that most newborns tend to be healthy when they are released from the hospital and are less likely to need to be readmitted within 30 days. But this reduction in risk from being newborn diminishes after 2-3 months, and the model suggests that infants age 3-15 months have slightly higher positive risk of being readmitted to the hospital. Thus infants age 3-15 months have higher risk of readmission than infants that are younger or older, and it is not until age 45 that the risk of readmission rises to this level again.

Shaping with Splines

Generalized additive models are often fit with splines. Splines allow GAMs to be trained with careful control over regularization and provide more principled error bars. Unfortunately, the spline methods tend to over-regularize, yield less accuracy than GA2M models, and yield risk profiles that sometimes miss detail discovered by GA2M models. Figure 4 shows three terms from a spline GAM model trained on the pneumonia data. The 1st term is age, the 2nd is pH, and the 3rd is temperature. Although the splines capture the basic trends (e.g., risk increases with age, pH risk is least around 7.6, and fever risk rises above 40°C), the splines miss detail learned by GA2M. For example, the GA2M model for age is much more nuanced, and the spline model may not properly model temperature in the normal range 36°C-38°C. The spline GAM model has accuracy closer to logistic regression than GA2M, so the extra detail learned by GA2M increases accuracy and probably reflects genuine structure.

Correlation Does Not Imply Causation

Because the models in this paper are intelligible, it is tempting to interpret them causally. Although the models accurately explain the predictions they make, they are still based on correlation. If features were added or subtracted and the model retrained, the graphs for some terms that had remained in the model would change because of correlation with the features added or subtracted. Although details of some of the shape plots are suggestive (e.g., does pneumonia risk truly jump as age increases above 65, and again above 85?), it is not (yet) clear if details like this are due to a) overfitting; b) correlation with other variables; c) interaction with other variables; d) correlation or interaction with unmeasured variables; or e) true underlying phenomena such as retirement and change in insurance provider.

Perhaps the strongest statement we can make right now is that the models are intelligible enough to provide a window into the data and prediction problem that is missing with many other learning methods, and that this window allows questions to be raised that will require investigation and further data analysis to answer. In future versions of these models we hope to automate some of these analyses so that it is clearer which features in the intelligible model are "real" or due to random factors such as overfitting and spurious correlation. Adding causal analysis to the models would be tremendously useful, but is, of course, difficult.

Conclusions

We present two case studies on real medical data where GA2Ms achieve state-of-the-art accuracy while remaining intelligible. On the pneumonia case study the GA2M model learns patterns that previously prevented complex machine learning models from being deployed, but because GA2M is intelligible and modular it is possible to edit the model to reduce deployment risk. On the larger, more complex 30-day hospital readmission task the GA2M model achieves excellent accuracy while yielding a manageable, surprisingly intelligible model despite incorporating over 4000 terms. Using this problem we demonstrate how GA2Ms can be used to explain the predictions the model makes for individual patients in a concise way that places focus on the most important/relevant terms for each patient. We believe GA2Ms represent a significant step forward in the tradeoff between model accuracy and intelligibility that should make it easier to deploy high-accuracy learned models in applications such as healthcare where model verification and debuggability are as important as accuracy.

Acknowledgements. We thank Michael Fine, MD, University of Pittsburgh School of Medicine, and Greg Cooper, MD, PhD, University of Pittsburgh for help with the Pneumonia data and model. We thank Eric Horvitz, MD, PhD, Microsoft Research Redmond for help with the 30-day hospital readmission data and model. The 30-day hospital readmission experiment was reviewed and approved by the institutional review board at Columbia University Medical Center.

References

[1] R. Ambrosino, B. Buchanan, G. Cooper, and M. Fine. The use of misclassification costs to learn rule-based decision support models for cost-effective hospital admission strategies. In Proceedings of the Annual Symp. on Comp. Application in Medical Care, 1995.
[2] G. Cooper, V. Abraham, C. Aliferis, J. Aronis, B. Buchanan, R. Caruana, M. Fine, J. Janosky, G. Livingston, T. Mitchell, S. Montik, and P. Spirtes. Predicting dire outcomes of patients with community acquired pneumonia. Journal of Biomedical Informatics, 38(5):347–366, 2005.
[3] G. Cooper, C. Aliferis, R. Ambrosino, J. Aronis, B. Buchanan, R. Caruana, M. Fine, C. Glymour, G. Gordon, B. Hanusa, J. Janosky, C. Meek, T. Mitchell, T. Richardson, and P. Spirtes. An evaluation of machine-learning methods for predicting pneumonia mortality. Artificial Intelligence in Medicine, 9(2):107–138, 1997.
[4] T. Hastie and R. Tibshirani. Generalized additive models. Chapman & Hall/CRC, 1990.
[5] Y. Lou, R. Caruana, and J. Gehrke. Intelligible models for classification and regression. In KDD, 2012.
[6] Y. Lou, R. Caruana, J. Gehrke, and G. Hooker. Accurate intelligible models with pairwise interactions. In KDD, 2013.
[7] S. Wood. Generalized additive models: an introduction with R. CRC Press, 2006.

