A total of 963 patients with E. coli UTI from NCKUH were included, of whom 14.2% had E. coli RUTI. All 137 RUTI patients included in this study had RUTI caused by E. coli: 74 patients (54%) had 2 episodes of UTI within 6 months and 63 patients (46%) had 3 episodes of UTI within 12 months. All episodes of E. coli-related RUTI in this study were reinfections (recurrence of UTI with the same organism after more than 2 weeks). The duration of antibiotic treatment varied from 3 to 14 days, and the antibiotic regimens included empirical antibiotic therapy and definitive antibiotic therapy according to the antimicrobial susceptibility test. The patient characteristics related to UTI and RUTI caused by E. coli are shown in Table 1. The median age was 67 and 75 years for patients with UTI and RUTI, respectively. Compared to the UTI group, patients with RUTI were older, had a greater prevalence of diabetes mellitus, liver cirrhosis, indwelling Foley catheter, and neurogenic bladder, had more frequent hospitalization/emergency department (ED) visit/UTI within 2 years and any UTI symptom, and had worse renal function (Table 1).
The bacterial characteristic factors (phylogenicity, virulence genes, and antimicrobial susceptibility) related to UTI and RUTI are shown in Tables 2 and 3, respectively. Compared to those in the UTI group, E. coli isolates derived from the RUTI group had a lower prevalence of the papG II, usp, ompT, and sat genes, and a higher prevalence of resistance to several antibiotics (including cefazolin, cefuroxime, cefixime, and levofloxacin).
The analysis results suggested that the RF model performed better than the LR and DT models for RUTI prediction in the clinical visit. The 32 factors considered in the models for the first stage were age, gender, comorbidities (Dis1~Dis12), UTI symptoms (UTI_symptom1~UTI_symptom8), serum creatinine, frequency of hospitalization/emergency department (ED) visit/UTI within 2 years (Pre_hos_2y, Pre_UTI_ER_2y, Pre_UTI_hos_2y), urinary red blood cell (RBC)/high power field (HPF) (URBC_level), urinary white blood cell (WBC)/HPF (UWBC_level), urinary bacterial count (UBact), peak blood WBC count (BloodWBC), place (outpatient or ED) of urine sample collection (Place_of_collection), and disease group (four_disease_group). These factors are labeled in Table 1.
URBC_level and UWBC_level represent the rescaled levels of URBC and UWBC, with values from 0 to 4 and from 1 to 4, respectively. The values 0, 1, 2, 3, and 4 of URBC_level and UWBC_level correspond to the ranges 0, 1~10, 11~100, 101~1000, and greater than 1000 per HPF, respectively. Place_of_collection indicates the place of urine sample collection, either outpatient clinic or ED. A new factor called four_disease_group, with value 0 or 1, was defined for RUTI prediction. We set four_disease_group to 1 when at least one of the following diseases involving an anatomical or functional defect of the urinary tract is present: indwelling Foley catheter (Dis5), obstructive uropathy (Dis6), urolithiasis (Dis7), and neurogenic bladder (Dis9). This factor was introduced to examine the relation of four_disease_group with RUTI.
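The rescaling and the composite factor described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `hpf_level` and `four_disease_group` (as a function), the dictionary layout of a patient record, and the key `URBC` for the raw count are assumptions. Only the cut-points and the factor names Dis5, Dis6, Dis7, and Dis9 come from the text. How fractional counts between 0 and 1 were handled is not stated; the sketch assigns them to level 1.

```python
def hpf_level(count_per_hpf: float) -> int:
    """Map a raw count per HPF to the 0-4 level described in the text."""
    if count_per_hpf <= 0:
        return 0
    if count_per_hpf <= 10:
        return 1
    if count_per_hpf <= 100:
        return 2
    if count_per_hpf <= 1000:
        return 3
    return 4


def four_disease_group(patient: dict) -> int:
    """Return 1 if any of Dis5, Dis6, Dis7, Dis9 is present, else 0."""
    keys = ("Dis5", "Dis6", "Dis7", "Dis9")
    return int(any(patient.get(k, 0) == 1 for k in keys))


# Hypothetical patient record: urolithiasis (Dis7) and 250 RBC per HPF.
example = {"Dis5": 0, "Dis6": 0, "Dis7": 1, "Dis9": 0, "URBC": 250}
print(hpf_level(example["URBC"]), four_disease_group(example))
```

Collapsing the four conditions into a single binary flag gives the models one factor that captures any anatomical or functional urinary tract defect, at the cost of no longer distinguishing which defect is present.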
Regarding the validation results of the fitted models to predict the development of RUTI in the clinical visit, Table 4 shows that the mean validation accuracy of RF is 0.700, which is higher than the results of LR and DT. The mean validation sensitivity and specificity of RF are 0.626 and 0.712, respectively. The standard deviations of the estimated validation accuracy, sensitivity, and specificity are 0.039, 0.131, and 0.046, respectively, which support the stability of the RF model prediction. Note that the RUTI rate is only 136/963=0.141, which is relatively low for the observed samples. A naive model that predicts none of the patients to have RUTI would reach a high accuracy of 827/963=0.859. However, such a prediction leads to a sensitivity of 0. The RF model avoided this serious bias and provided balanced prediction capability in both sensitivity and specificity. The key technique in the RF model training is the use of upsampling.
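The effect of class imbalance on a naive classifier, and the idea behind upsampling, can be illustrated with a short sketch. It uses only the counts quoted above (136 RUTI cases among 963 patients). The helper names `confusion_metrics` and `upsample_minority` are hypothetical, and the text does not specify the exact resampling procedure the authors used, so random oversampling with replacement is an assumption here.

```python
import random


def confusion_metrics(y_true, y_pred):
    """Return (accuracy, sensitivity, specificity) for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(y_true)
    neg = len(y_true) - pos
    return (tp + tn) / len(y_true), tp / pos, tn / neg


def upsample_minority(rows, labels, seed=0):
    """Resample the minority class with replacement until classes are balanced."""
    rng = random.Random(seed)
    minority = 1 if sum(labels) * 2 < len(labels) else 0
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    deficit = len(labels) - 2 * len(minority_idx)  # majority minus minority
    extra = [rng.choice(minority_idx) for _ in range(deficit)]
    idx = list(range(len(labels))) + extra
    return [rows[i] for i in idx], [labels[i] for i in idx]


# Counts quoted in the text: 136 RUTI cases among 963 patients.
labels = [1] * 136 + [0] * 827
naive_pred = [0] * len(labels)  # predict "no RUTI" for everyone
acc, sens, spec = confusion_metrics(labels, naive_pred)
print(f"naive accuracy={acc:.3f}, sensitivity={sens:.3f}, specificity={spec:.3f}")

rows = list(range(len(labels)))  # placeholder feature rows (assumption)
_, balanced = upsample_minority(rows, labels)
print(sum(balanced), len(balanced) - sum(balanced))  # class counts after upsampling
```

In practice, upsampling would be applied to the training portion only, so that validation accuracy, sensitivity, and specificity are still estimated on data with the original class balance.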
Variable importance in RF is evaluated by the mean decrease in accuracy of predictions on the out-of-bag samples when a given variable is excluded from the model. For example, if age is taken away, the accuracy of the model prediction decreases by 11.9%. Figure 1 is the variable importance plot of the RF analysis and shows that age, cirrhosis (Dis4), diabetes mellitus (Dis1), and disease group (four_disease_group) are the most important factors to predict recurrence of UTI in the clinical visit. Each of the 4 factors contributed around 10% of prediction accuracy in the RF model.
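The mean-decrease-in-accuracy idea can be sketched in a few lines. Common random forest implementations estimate it by permuting a variable's values rather than refitting the model without that variable; whether the authors' software does so is not stated in the text, so the permutation approach below is an assumption. The sketch also uses a toy threshold rule on a plain held-out set in place of a fitted forest and its out-of-bag samples, and all names (`permutation_importance`, `toy_predict`, the `noise` column) are hypothetical.

```python
import random


def accuracy(predict, X, y):
    """Fraction of rows in X whose prediction matches y."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, column, n_repeats=20, seed=0):
    """Mean drop in accuracy after shuffling one column of X."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    drops = []
    for _ in range(n_repeats):
        shuffled = [row[column] for row in X]
        rng.shuffle(shuffled)
        # Rebuild each row with the shuffled value in the chosen column.
        X_perm = [dict(row, **{column: v}) for row, v in zip(X, shuffled)]
        drops.append(baseline - accuracy(predict, X_perm, y))
    return sum(drops) / n_repeats


# Toy stand-in for a fitted model: flags patients aged 75 or older.
def toy_predict(row):
    return int(row["age"] >= 75)


rng = random.Random(1)
X = [{"age": rng.randint(20, 95), "noise": rng.random()} for _ in range(200)]
y = [toy_predict(row) for row in X]  # labels the toy model fits perfectly

age_importance = permutation_importance(toy_predict, X, y, "age")
noise_importance = permutation_importance(toy_predict, X, y, "noise")
print(age_importance, noise_importance)
```

A variable the model relies on loses accuracy when its values are scrambled, while a variable the model ignores shows no drop, which is the ordering a variable importance plot displays.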
Figure 1. Variable importance plot of the first stage RF analysis, in percentage of mean decrease in accuracy for the factors. It shows that age, cirrhosis (Dis4), diabetes mellitus (Dis1), and disease group (four_disease_group) are the 4 most important factors to predict recurrence in the clinical visit (sample size = 963).
A DT model constructs the decision rules for RUTI classification and provides the order of importance of the factors at the same time. Table 4 shows that the mean validation accuracy, sensitivity, and specificity of the DT model are 0.654, 0.618, and 0.660, respectively. Although the validation accuracy of the DT model is lower than that of the RF model, the DT model has its own advantage in decision rule construction.
To obtain more insight into the RUTI factors in the clinical visit, Fig. 2 shows the decision rules of the DT model built from all 963 patients. The purpose of building a DT model with all collected data is to construct the decision rules for RUTI classification. In a DT model, patients who satisfy a node's condition are allocated to the left path of the node; otherwise they are allocated to the right path. The classification accuracy of this tree is 0.88, and the sensitivity and specificity are 0.26 and 0.98, respectively. Although the sensitivity is low due to the unbalanced rates of RUTI and UTI in the DT model, there are several valuable rules for RUTI classification. The 2 green boxes and 1 red box in Fig. 2 indicate the nodes of the decision rules with an accuracy rate higher than 0.85 and 0.70 for non-RUTI and RUTI classification, respectively. The three decision rules are:
When a patient has no neurogenic bladder (Dis9=0) and no hospitalization within 2 years (Pre_hos_2y<1), the rule classifies the patient as non-RUTI, with classification accuracy 439/(439+34)=0.92.
When a patient has no neurogenic bladder (Dis9=0), at least one previous hospitalization within 2 years (Pre_hos_2y>=1), serum creatinine less than 0.93 mg/dL (creatinine<0.93), no cirrhosis (Dis4=0), and fewer than two previous ER visits for UTI within 2 years (Pre_UTI_ER_2y<2), the rule classifies the patient as non-RUTI, with classification accuracy 296/(296+46)=0.86.
When a patient has no neurogenic bladder (Dis9=0), at least one previous hospitalization within 2 years (Pre_hos_2y>=1), and serum creatinine in the range between 0.74 and 3.9 mg/dL (0.74
Figure 2. The decision rules of the DT analysis for development of RUTI in the clinical visit (sample size = 963). The 2 green boxes and 1 red box indicate the nodes of the decision rules with an accuracy rate higher than 0.85 and 0.70 for non-RUTI and RUTI classification, respectively.
The analysis results suggested that the RF model performed better than the LR and DT models for RUTI prediction after hospitalization. The 62 factors considered in the models for the second stage not only contain the 32 factors used in the first stage analysis, but also include phylogenicity, 16 virulence genes, 11 antimicrobial susceptibilities, Bacterial_Name, UTI_pos, Hospital_day, and Place_of_collection. The genes and antimicrobial agents are labeled in Table 2. Bacterial_name indicates Escherichia coli with or without extended-spectrum β-lactamase (ESBL). UTI_pos represents the location of the urinary tract infection. Hospital_day gives the length (in days) of hospital stay. Place_of_collection records the place of sample collection: ER, hospital, or outpatient clinic.
Regarding the validation results of the refitted models to predict the development of RUTI after hospitalization, Table 5 shows that the mean validation accuracy of RF is 0.709, which is higher than the results of LR and DT. The mean validation sensitivity and specificity of RF are 0.620 and 0.722, respectively. The standard deviations of the estimated validation accuracy, sensitivity, and specificity are 0.047, 0.057, and 0.058, respectively, which support the stability of the RF model prediction. Note that the RUTI rate is only 112/809=0.138, which is relatively low for the observed samples. A naive model that predicts none of the patients to have RUTI would reach a high accuracy of 697/809=0.862.
However, such a prediction leads to a sensitivity of 0. The RF model avoided this serious bias and provided balanced prediction capability in both sensitivity and specificity.
Variable importance is based on the mean decrease in accuracy of predictions on the out-of-bag samples when a given variable is excluded from the model. For example, if cefixime (Anti7) is taken away, the accuracy of the model prediction decreases by 9.14%. Figure 3 is the variable importance plot of the RF analysis and shows that cefixime (Anti7), afa (Gene11), usp (Gene8), and cefazolin (Anti5) are important factors to predict recurrence after hospitalization. Each of the 4 factors contributed more than 8% of prediction accuracy in the RF model.
Figure 3. Variable importance plot of the second stage RF analysis, in percentage of mean decrease in accuracy for the factors. It shows that cefixime (Anti7), afa (Gene11), usp (Gene8), and cefazolin (Anti5) are important factors to predict recurrence after hospitalization (sample size = 809).
To obtain more insight into the RUTI factors after hospitalization, Fig. 4 shows the decision rules of the DT model built from all 809 patients. The classification accuracy of this tree is 0.89, and the sensitivity and specificity are 0.27 and 0.99, respectively. Although the sensitivity is low due to the unbalanced rates of RUTI and UTI in the DT model, there are several valuable rules for RUTI classification. The 4 green boxes and 3 red boxes in Fig. 4 indicate the nodes of the decision rules with an accuracy rate higher than 0.85 and 0.70 for non-RUTI and RUTI classification, respectively. The 7 decision rules are:
When the isolate belongs to bacterial phylogenetic group B2 (Gene17=3) and the patient is younger than 76 years (Age<76), the rule classifies the patient as non-RUTI, with classification accuracy 322/(322+18)=0.94.
When the isolate belongs to bacterial phylogenetic group B2 (Gene17=3), the patient is 76 years or older (Age≥76), and serum creatinine is less than 3.5 mg/dL (creatinine<3.5), the rule classifies the patient as non-RUTI, with classification accuracy 148/(148+21)=0.87.
When the isolate belongs to bacterial phylogenetic group B2 (Gene17=3), the patient is 76 years or older (Age≥76), serum creatinine is at least 3.5 mg/dL (creatinine≥3.5), and the hospital stay is 19 days or longer (Hospital_day≥19), the rule classifies the patient as RUTI, with classification accuracy 8/(3+8)=0.72.
When the isolate is non-group B2 in bacterial phylogenicity (Gene17≠3) and S or I type in levofloxacin susceptibility (Anti25=1, 2), the rule classifies the patient as non-RUTI, with classification accuracy 137/(137+22)=0.86.
When the isolate is non-group B2 in bacterial phylogenicity (Gene17≠3) and R type in levofloxacin susceptibility (Anti25=3), blood WBC is at least 7.8 (bloodWBC≥7.8), and the isolate is group A or B1 in bacterial phylogenicity (Gene17=1, 2), the rule classifies the patient as non-RUTI, with classification accuracy 42/(42+5)=0.89.
When the isolate is non-group B2 in bacterial phylogenicity (Gene17≠3) and R type in levofloxacin susceptibility (Anti25=3), blood WBC is at least 7.8 (bloodWBC≥7.8), the isolate is group D in phylogenicity (Gene17=4), and the hospital stay is 57 days or longer (Hospital_day≥57), the rule classifies the patient as RUTI, with classification accuracy 6/(6+1)=0.85.
When the isolate is non-group B2 in bacterial phylogenicity (Gene17≠3) and R type in levofloxacin susceptibility (Anti25=3), blood WBC is less than 7.8 (bloodWBC<7.8), and the value of UWBC is more than 10 (UWBC_level≠1), the rule classifies the patient as RUTI, with classification accuracy 16/(6+16)=0.72.
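The per-rule accuracies quoted for the second stage can be recomputed from the leaf counts as a quick consistency check. The sketch below is illustrative only: the helper names `rule_accuracy` and `truncate2` are hypothetical, and the use of truncation rather than rounding is an inference from the quoted values (for example, 148/169 is about 0.876 but is reported as 0.87), not something the text states.

```python
import math


def rule_accuracy(n_correct: int, n_wrong: int) -> float:
    """Share of patients in a leaf that the rule classifies correctly."""
    return n_correct / (n_correct + n_wrong)


def truncate2(x: float) -> float:
    """Truncate (not round) to two decimals, matching the quoted values."""
    return math.floor(x * 100) / 100


# (predicted class, correctly classified, misclassified), in the order the
# 7 second-stage rules are listed in the text.
leaves = [
    ("no RUTI", 322, 18),
    ("no RUTI", 148, 21),
    ("RUTI", 8, 3),
    ("no RUTI", 137, 22),
    ("no RUTI", 42, 5),
    ("RUTI", 6, 1),
    ("RUTI", 16, 6),
]

for label, ok, wrong in leaves:
    acc = rule_accuracy(ok, wrong)
    # Thresholds from the text: 0.85 for non-RUTI leaves, 0.70 for RUTI leaves.
    threshold = 0.70 if label == "RUTI" else 0.85
    print(label, ok + wrong, truncate2(acc), acc > threshold)
```

Printing the leaf size next to the accuracy is a deliberate choice: the RUTI rules rest on small leaves (7 to 22 patients), so their accuracies carry more uncertainty than those of the non-RUTI rules.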
Figure 4. The decision rules of the DT analysis for development of RUTI after hospitalization. The 4 green boxes and 3 red boxes indicate the nodes of the decision rules with an accuracy rate higher than 0.85 and 0.70 for non-RUTI and RUTI classification, respectively (sample size = 809).
Source: Machine learning to predict the development of recurrent urinary tract infection related to single uropathogen, Escherichia coli | Scientific Reports