Logistic Regression Analysis

14,000,000 Leading Edge Experts on the ideXlab platform

The Experts below are selected from a list of 1038078 Experts worldwide ranked by ideXlab platform

Ewout W Steyerberg - One of the best experts on this subject based on the ideXlab platform.

  • polytomous Logistic Regression Analysis could be applied more often in diagnostic research
    Journal of Clinical Epidemiology, 2008
    Co-Authors: Ewout W Steyerberg, Cornelis J Biesheuvel, Yvonne Vergouwe, Diederick E Grobbee, Karel G M Moons
    Abstract:

    Objective: Physicians commonly consider the presence of all differential diagnoses simultaneously. Polytomous Logistic Regression modeling allows for simultaneous estimation of the probability of multiple diagnoses. We discuss and empirically illustrate the value of this method for diagnostic research.

    Study Design and Setting: We used data from a study on the diagnosis of residual retroperitoneal mass histology in patients presenting with nonseminomatous testicular germ cell tumor. The differential diagnoses include benign tissue, mature teratoma, and viable cancer. Probabilities of each diagnosis were estimated with a polytomous Logistic Regression model and compared with the probabilities estimated from two consecutive dichotomous Logistic Regression models.

    Results: We provide interpretations of the odds ratios derived from the polytomous Regression model and present a simple score chart to facilitate calculation of predicted probabilities from the polytomous model. For both modeling methods, we show the calibration plots and receiver operating characteristic (ROC) curve areas comparing each diagnostic outcome category with the other two. The ROC areas for benign tissue, mature teratoma, and viable cancer were similar for both modeling methods: 0.83 (95% confidence interval [CI] = 0.80-0.85) vs. 0.83 (95% CI = 0.80-0.85), 0.78 (95% CI = 0.75-0.81) vs. 0.78 (95% CI = 0.75-0.81), and 0.66 (95% CI = 0.61-0.71) vs. 0.64 (95% CI = 0.59-0.69) for the polytomous and dichotomous Regression models, respectively.

    Conclusion: Polytomous Logistic Regression is a useful technique to simultaneously model predicted probabilities of multiple diagnostic outcome categories. The performance of a polytomous prediction model can be assessed similarly to that of a dichotomous Logistic Regression model, and predictions from a polytomous model can be made with a user-friendly method. Because the simultaneous consideration of the presence of multiple (differential) conditions serves clinical practice better than consideration of the presence of only one target condition, polytomous Logistic Regression could be applied more often in diagnostic research.
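
As a concrete illustration of the modeling idea in this abstract, here is a minimal sketch of fitting a polytomous (multinomial) logistic regression on synthetic data with three outcome categories. The predictors, coefficients, and sample size are invented for illustration and are not the study's data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 600
X = rng.normal(size=(n, 3))                      # three hypothetical predictors

# True class logits, with the last category as the reference (logit 0)
B = np.array([[1.0, -0.5, 0.2],
              [0.3,  0.8, -0.4],
              [0.0,  0.0,  0.0]])                # rows = outcome categories
logits = X @ B.T
p = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
y = np.array([rng.choice(3, p=row) for row in p])

# A single polytomous model estimates all category probabilities at once,
# instead of chaining two consecutive dichotomous models.
model = LogisticRegression(max_iter=1000).fit(X, y)
probs = model.predict_proba(X)                   # one column per diagnosis
assert np.allclose(probs.sum(axis=1), 1.0)       # probabilities sum to 1 per patient
```

With three categories the fitted model yields two sets of coefficients relative to the reference category, which is how the odds ratios discussed in the abstract are interpreted.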

  • internal validation of predictive models efficiency of some procedures for Logistic Regression Analysis
    Journal of Clinical Epidemiology, 2001
    Co-Authors: Ewout W Steyerberg, Frank E. Harrell, Marinus J C Eijkemans, Yvonne Vergouwe, Gerard J J M Borsboom, Dik J F Habbema
    Abstract:

    The performance of a predictive model is overestimated when simply determined on the sample of subjects that was used to construct the model. Several internal validation methods are available that aim to provide a more accurate estimate of model performance in new subjects. We evaluated several variants of split-sample, cross-validation and bootstrapping methods with a Logistic Regression model that included eight predictors for 30-day mortality after an acute myocardial infarction. Random samples with a size between n = 572 and n = 9165 were drawn from a large data set (GUSTO-I; n = 40,830; 2851 deaths) to reflect modeling in data sets with between 5 and 80 events per variable. Independent performance was determined on the remaining subjects. Performance measures included discriminative ability, calibration and overall accuracy. We found that split-sample analyses gave overly pessimistic estimates of performance, with large variability. Cross-validation on 10% of the sample had low bias and low variability, but was not suitable for all performance measures. Internal validity could best be estimated with bootstrapping, which provided stable estimates with low bias. We conclude that split-sample validation is inefficient, and recommend bootstrapping for estimation of internal validity of a predictive Logistic Regression model.
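
The bootstrap procedure this abstract recommends can be sketched as optimism correction: refit the model on bootstrap resamples and subtract the average gap between resample performance and performance on the original data. Everything below (the data, eight predictors, 100 replicates) is a synthetic assumption, not the GUSTO-I analysis itself.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n, k = 500, 8                                    # e.g. eight predictors, as in the study
X = rng.normal(size=(n, k))
beta = rng.normal(scale=0.5, size=k)
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ beta - 1.5))))

def auc(model, X, y):
    return roc_auc_score(y, model.predict_proba(X)[:, 1])

apparent = auc(LogisticRegression(max_iter=1000).fit(X, y), X, y)

# Refit on each bootstrap resample; optimism = performance on the resample
# minus performance of that refitted model on the original data.
optimisms = []
for _ in range(100):
    idx = rng.integers(0, n, n)
    m = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
    optimisms.append(auc(m, X[idx], y[idx]) - auc(m, X, y))

corrected = apparent - np.mean(optimisms)        # optimism-corrected AUC
```

The corrected estimate is the stable, low-bias internal-validity figure the abstract argues for, at the cost of refitting the model once per resample.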

  • Prognostic modeling with Logistic Regression Analysis: in search of a sensible strategy in small data sets.
    Medical decision making, 2001
    Co-Authors: Ewout W Steyerberg, Frank E. Harrell, Marinus J C Eijkemans, J. Dik F. Habbema
    Abstract:

    Clinical decision making often requires estimates of the likelihood of a dichotomous outcome in individual patients. When empirical data are available, these estimates may well be obtained from a Logistic Regression model. Several strategies may be followed in the development of such a model. In this study, the authors compare alternative strategies in 23 small subsamples from a large data set of patients with an acute myocardial infarction, where they developed predictive models for 30-day mortality. Evaluations were performed in an independent part of the data set. Specifically, the authors studied the effect of coding of covariables and stepwise selection on discriminative ability of the resulting model, and the effect of statistical "shrinkage" techniques on calibration. As expected, dichotomization of continuous covariables implied a loss of information. Remarkably, stepwise selection resulted in less discriminating models compared to full models including all available covariables, even when more than half of these were randomly associated with the outcome. Using qualitative information on the sign of the effect of predictors slightly improved the predictive ability. Calibration improved when shrinkage was applied on the standard maximum likelihood estimates of the Regression coefficients. In conclusion, a sensible strategy in small data sets is to apply shrinkage methods in full models that include well-coded predictors that are selected based on external information.

Marinus J C Eijkemans - One of the best experts on this subject based on the ideXlab platform.

  • no rationale for 1 variable per 10 events criterion for binary Logistic Regression Analysis
    BMC Medical Research Methodology, 2016
    Co-Authors: Maarten van Smeden, Marinus J C Eijkemans, Joris A H de Groot, Karel G M Moons, Gary S Collins, Douglas G Altman, Johannes Reitsma
    Abstract:

    Ten events per variable (EPV) is a widely advocated minimal criterion for sample size considerations in Logistic Regression Analysis. Of three previous simulation studies that examined this minimal EPV criterion only one supports the use of a minimum of 10 EPV. In this paper, we examine the reasons for substantial differences between these extensive simulation studies. The current study uses Monte Carlo simulations to evaluate small sample bias, coverage of confidence intervals and mean square error of logit coefficients. Logistic Regression models fitted by maximum likelihood and a modified estimation procedure, known as Firth’s correction, are compared. The results show that besides EPV, the problems associated with low EPV depend on other factors such as the total sample size. It is also demonstrated that simulation results can be dominated by even a few simulated data sets for which the prediction of the outcome by the covariates is perfect (‘separation’). We reveal that different approaches for identifying and handling separation leads to substantially different simulation results. We further show that Firth’s correction can be used to improve the accuracy of Regression coefficients and alleviate the problems associated with separation. The current evidence supporting EPV rules for binary Logistic Regression is weak. Given our findings, there is an urgent need for new research to provide guidance for supporting sample size considerations for binary Logistic Regression Analysis.
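
A toy Monte Carlo check of the EPV idea can be sketched as follows: simulate logistic data, fit near-unpenalized maximum likelihood, and compare the average estimate of one true coefficient at a low versus an ample sample size. All settings (coefficients, event rate, replicate counts) are illustrative assumptions, not the paper's simulation design, and `C=1e6` merely approximates unpenalized ML in scikit-learn.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
true_beta = 0.7

def mean_estimate(n, reps=200):
    """Average ML estimate of the first coefficient over simulated data sets."""
    ests = []
    for _ in range(reps):
        x = rng.normal(size=(n, 4))              # four covariables, one with a true effect
        p = 1 / (1 + np.exp(-(true_beta * x[:, 0] - 2.0)))
        y = rng.binomial(1, p)
        if y.sum() < 2 or y.sum() > n - 2:       # skip draws prone to separation
            continue
        m = LogisticRegression(C=1e6, max_iter=2000).fit(x, y)  # ~unpenalized MLE
        ests.append(m.coef_[0, 0])
    return float(np.mean(ests))

# The event rate here is roughly 13-14%, so n = 150 gives on the order of
# 20 events for 4 covariables (EPV ~ 5), while n = 3000 gives ample EPV.
low_epv = mean_estimate(150)
high_epv = mean_estimate(3000, reps=100)
```

With few events per variable the average estimate tends to drift away from the true value; this small-sample bias is what Firth's correction targets, and handling of separated data sets (skipped crudely above) is exactly the point on which the abstract says simulation studies diverge.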

  • internal validation of predictive models efficiency of some procedures for Logistic Regression Analysis
    Journal of Clinical Epidemiology, 2001
    Co-Authors: Ewout W Steyerberg, Frank E. Harrell, Marinus J C Eijkemans, Yvonne Vergouwe, Gerard J J M Borsboom, Dik J F Habbema
    Abstract:

    The performance of a predictive model is overestimated when simply determined on the sample of subjects that was used to construct the model. Several internal validation methods are available that aim to provide a more accurate estimate of model performance in new subjects. We evaluated several variants of split-sample, cross-validation and bootstrapping methods with a Logistic Regression model that included eight predictors for 30-day mortality after an acute myocardial infarction. Random samples with a size between n = 572 and n = 9165 were drawn from a large data set (GUSTO-I; n = 40,830; 2851 deaths) to reflect modeling in data sets with between 5 and 80 events per variable. Independent performance was determined on the remaining subjects. Performance measures included discriminative ability, calibration and overall accuracy. We found that split-sample analyses gave overly pessimistic estimates of performance, with large variability. Cross-validation on 10% of the sample had low bias and low variability, but was not suitable for all performance measures. Internal validity could best be estimated with bootstrapping, which provided stable estimates with low bias. We conclude that split-sample validation is inefficient, and recommend bootstrapping for estimation of internal validity of a predictive Logistic Regression model.

  • Prognostic modeling with Logistic Regression Analysis: in search of a sensible strategy in small data sets.
    Medical decision making, 2001
    Co-Authors: Ewout W Steyerberg, Frank E. Harrell, Marinus J C Eijkemans, J. Dik F. Habbema
    Abstract:

    Clinical decision making often requires estimates of the likelihood of a dichotomous outcome in individual patients. When empirical data are available, these estimates may well be obtained from a Logistic Regression model. Several strategies may be followed in the development of such a model. In this study, the authors compare alternative strategies in 23 small subsamples from a large data set of patients with an acute myocardial infarction, where they developed predictive models for 30-day mortality. Evaluations were performed in an independent part of the data set. Specifically, the authors studied the effect of coding of covariables and stepwise selection on discriminative ability of the resulting model, and the effect of statistical "shrinkage" techniques on calibration. As expected, dichotomization of continuous covariables implied a loss of information. Remarkably, stepwise selection resulted in less discriminating models compared to full models including all available covariables, even when more than half of these were randomly associated with the outcome. Using qualitative information on the sign of the effect of predictors slightly improved the predictive ability. Calibration improved when shrinkage was applied on the standard maximum likelihood estimates of the Regression coefficients. In conclusion, a sensible strategy in small data sets is to apply shrinkage methods in full models that include well-coded predictors that are selected based on external information.

Dik J F Habbema - One of the best experts on this subject based on the ideXlab platform.

  • internal validation of predictive models efficiency of some procedures for Logistic Regression Analysis
    Journal of Clinical Epidemiology, 2001
    Co-Authors: Ewout W Steyerberg, Frank E. Harrell, Marinus J C Eijkemans, Yvonne Vergouwe, Gerard J J M Borsboom, Dik J F Habbema
    Abstract:

    The performance of a predictive model is overestimated when simply determined on the sample of subjects that was used to construct the model. Several internal validation methods are available that aim to provide a more accurate estimate of model performance in new subjects. We evaluated several variants of split-sample, cross-validation and bootstrapping methods with a Logistic Regression model that included eight predictors for 30-day mortality after an acute myocardial infarction. Random samples with a size between n = 572 and n = 9165 were drawn from a large data set (GUSTO-I; n = 40,830; 2851 deaths) to reflect modeling in data sets with between 5 and 80 events per variable. Independent performance was determined on the remaining subjects. Performance measures included discriminative ability, calibration and overall accuracy. We found that split-sample analyses gave overly pessimistic estimates of performance, with large variability. Cross-validation on 10% of the sample had low bias and low variability, but was not suitable for all performance measures. Internal validity could best be estimated with bootstrapping, which provided stable estimates with low bias. We conclude that split-sample validation is inefficient, and recommend bootstrapping for estimation of internal validity of a predictive Logistic Regression model.

  • prognostic modeling with Logistic Regression Analysis in search of a sensible strategy in small data sets
    Medical Decision Making, 2001
    Co-Authors: Ewout W Steyerberg, Frank E. Harrell, Marinus J C Eijkemans, Dik J F Habbema
    Abstract:

    Clinical decision making often requires estimates of the likelihood of a dichotomous outcome in individual patients. When empirical data are available, these estimates may well be obtained from a Logistic Regression model. Several strategies may be followed in the development of such a model. In this study, the authors compare alternative strategies in 23 small subsamples from a large data set of patients with an acute myocardial infarction, where they developed predictive models for 30-day mortality. Evaluations were performed in an independent part of the data set. Specifically, the authors studied the effect of coding of covariables and stepwise selection on discriminative ability of the resulting model, and the effect of statistical “shrinkage” techniques on calibration. As expected, dichotomization of continuous covariables implied a loss of information. Remarkably, stepwise selection resulted in less discriminating models compared to full models including all available covariables, even when more than half of these were randomly associated with the outcome. Using qualitative information on the sign of the effect of predictors slightly improved the predictive ability. Calibration improved when shrinkage was applied on the standard maximum likelihood estimates of the Regression coefficients. In conclusion, a sensible strategy in small data sets is to apply shrinkage methods in full models that include well-coded predictors that are selected based on external information.

  • prognostic modelling with Logistic Regression Analysis a comparison of selection and estimation methods in small data sets
    Statistics in Medicine, 2000
    Co-Authors: Ewout W Steyerberg, Frank E. Harrell, Marinus J C Eijkemans, Dik J F Habbema
    Abstract:

    Logistic Regression Analysis may well be used to develop a prognostic model for a dichotomous outcome. Especially when limited data are available, it is difficult to determine an appropriate selection of covariables for inclusion in such models. Also, predictions may be improved by applying some sort of shrinkage in the estimation of Regression coefficients. In this study we compare the performance of several selection and shrinkage methods in small data sets of patients with acute myocardial infarction, where we aim to predict 30-day mortality. Selection methods included backward stepwise selection with significance levels α of 0.01, 0.05, 0.157 (the AIC criterion) or 0.50, and the use of qualitative external information on the sign of Regression coefficients in the model. Estimation methods included standard maximum likelihood, the use of a linear shrinkage factor, penalized maximum likelihood, the Lasso, or quantitative external information on univariable Regression coefficients. We found that stepwise selection with a low α (for example, 0.05) led to a relatively poor model performance, when evaluated on independent data. Substantially better performance was obtained with full models with a limited number of important predictors, where Regression coefficients were reduced with any of the shrinkage methods. Incorporation of external information for selection and estimation improved the stability and quality of the prognostic models. We therefore recommend shrinkage methods in full models including prespecified predictors and incorporation of external information, when prognostic models are constructed in small data sets. Copyright © 2000 John Wiley & Sons, Ltd.
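
One of the shrinkage approaches mentioned above, a single linear shrinkage factor applied to the maximum-likelihood coefficients, can be sketched with the heuristic factor s = (model chi-square minus its degrees of freedom) / model chi-square. The data below are synthetic, `C=1e6` approximates unpenalized ML, and re-estimation of the intercept after shrinkage is omitted for brevity.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n, k = 120, 8                                    # small data set, eight predictors
X = rng.normal(size=(n, k))
beta = np.concatenate([[0.8, -0.6], np.zeros(k - 2)])   # mostly noise predictors
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ beta))))

m = LogisticRegression(C=1e6, max_iter=2000).fit(X, y)  # ~unpenalized MLE
p_hat = m.predict_proba(X)[:, 1]

# Model chi-square: twice the log-likelihood gain over the intercept-only model
ll_model = np.sum(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))
p0 = y.mean()
ll_null = np.sum(y * np.log(p0) + (1 - y) * np.log(1 - p0))
chi2 = 2 * (ll_model - ll_null)

s = max((chi2 - k) / chi2, 0.0)                  # heuristic linear shrinkage factor
shrunk = s * m.coef_.ravel()                     # coefficients pulled toward zero
```

Penalized maximum likelihood and the Lasso, the other estimation methods compared in the abstract, shrink each coefficient individually rather than by one common factor, but the intent is the same: trading a little apparent fit for better calibration in new data.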

J. Dik F. Habbema - One of the best experts on this subject based on the ideXlab platform.

  • Prognostic modeling with Logistic Regression Analysis: in search of a sensible strategy in small data sets.
    Medical decision making, 2001
    Co-Authors: Ewout W Steyerberg, Frank E. Harrell, Marinus J C Eijkemans, J. Dik F. Habbema
    Abstract:

    Clinical decision making often requires estimates of the likelihood of a dichotomous outcome in individual patients. When empirical data are available, these estimates may well be obtained from a Logistic Regression model. Several strategies may be followed in the development of such a model. In this study, the authors compare alternative strategies in 23 small subsamples from a large data set of patients with an acute myocardial infarction, where they developed predictive models for 30-day mortality. Evaluations were performed in an independent part of the data set. Specifically, the authors studied the effect of coding of covariables and stepwise selection on discriminative ability of the resulting model, and the effect of statistical "shrinkage" techniques on calibration. As expected, dichotomization of continuous covariables implied a loss of information. Remarkably, stepwise selection resulted in less discriminating models compared to full models including all available covariables, even when more than half of these were randomly associated with the outcome. Using qualitative information on the sign of the effect of predictors slightly improved the predictive ability. Calibration improved when shrinkage was applied on the standard maximum likelihood estimates of the Regression coefficients. In conclusion, a sensible strategy in small data sets is to apply shrinkage methods in full models that include well-coded predictors that are selected based on external information.

  • prognostic models based on literature and individual patient data in Logistic Regression Analysis
    Statistics in Medicine, 2000
    Co-Authors: Ewout W Steyerberg, Marinus J C Eijkemans, Kerry L Lee, J C van Houwelingen, J. Dik F. Habbema
    Abstract:

    Prognostic models can be developed with multiple Regression Analysis of a data set containing individual patient data. Often this data set is relatively small, while previously published studies present results for larger numbers of patients. We describe a method to combine univariable Regression results from the medical literature with univariable and multivariable results from the data set containing individual patient data. This ‘adaptation method’ exploits the generally strong correlation between univariable and multivariable Regression coefficients. The method is illustrated with several Logistic Regression models to predict 30-day mortality in patients with acute myocardial infarction. The Regression coefficients showed considerably less variability when estimated with the adaptation method, compared to standard maximum likelihood estimates. Also, model performance, as distinguished in calibration and discrimination, improved clearly when compared to models including shrunk or penalized estimates. We conclude that prognostic models may benefit substantially from explicit incorporation of literature data. Copyright © 2000 John Wiley & Sons, Ltd.
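
The adaptation idea can be sketched with made-up numbers: adjust each multivariable coefficient from one's own (small) data set by the difference between the literature univariable coefficient and one's own univariable coefficient. The numbers and the adaptation factor of 1.0 below are placeholders; the actual method estimates the factor from the correlation between univariable and multivariable coefficients rather than fixing it.

```python
import numpy as np

# Illustrative numbers only, not estimates from the study:
own_multi = np.array([0.55, -0.30])   # multivariable coefficients, own data
own_uni = np.array([0.70, -0.45])     # univariable coefficients, own data
lit_uni = np.array([0.62, -0.50])     # pooled univariable coefficients, literature

adaptation_factor = 1.0               # placeholder; estimated in the actual method
adapted = own_multi + adaptation_factor * (lit_uni - own_uni)
# adapted is approximately [0.47, -0.35]: the literature univariable evidence
# shifts the own-data multivariable estimates
```

Because the literature typically summarizes far more patients than the local data set, this borrowing of univariable information is what reduces the variability of the coefficients, as the abstract reports.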

  • Stepwise selection in small data sets: a simulation study of bias in Logistic Regression Analysis
    Journal of Clinical Epidemiology, 1999
    Co-Authors: Ewout W Steyerberg, Marinus J C Eijkemans, J. Dik F. Habbema
    Abstract:

    Stepwise selection methods are widely applied to identify covariables for inclusion in Regression models. One of the problems of stepwise selection is biased estimation of the Regression coefficients. We illustrate this "selection bias" with Logistic Regression in the GUSTO-I trial (40,830 patients with an acute myocardial infarction). Random samples were drawn that included 3, 5, 10, 20, or 40 events per variable (EPV). Backward stepwise selection was applied in models containing 8 or 16 pre-specified predictors of 30-day mortality. We found a considerable overestimation of Regression coefficients of selected covariables. The selection bias decreased with increasing EPV. For EPV 3, 10, or 40, the bias exceeded 25% for 7, 3, and 1 of the covariables in the 8-predictor model, respectively, when a conventional selection criterion was used (alpha = 0.05). For these EPV values, the bias was less than 20% for all covariables when no selection was applied. We conclude that stepwise selection may result in a substantial bias of estimated Regression coefficients.
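
The selection-bias mechanism can be sketched in miniature: simulate data sets, keep only the fits in which a variable passes the conventional significance criterion, and compare the average retained coefficient with its true value. The single-predictor setup, sample size, and Wald-test construction below are illustrative assumptions, far smaller than the GUSTO-I analysis.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
true_beta = 0.5

def fit_with_wald(X, y):
    """~Unpenalized logistic fit plus Wald z-statistics from the Fisher information."""
    m = LogisticRegression(C=1e6, max_iter=2000).fit(X, y)
    Xd = np.column_stack([np.ones(len(y)), X])   # add intercept column
    coefs = np.concatenate([m.intercept_, m.coef_.ravel()])
    p = 1 / (1 + np.exp(-(Xd @ coefs)))
    w = p * (1 - p)
    cov = np.linalg.inv(Xd.T @ (Xd * w[:, None]))
    se = np.sqrt(np.diag(cov))[1:]               # drop the intercept SE
    return m.coef_.ravel(), m.coef_.ravel() / se

selected = []
for _ in range(300):
    X = rng.normal(size=(100, 1))
    y = rng.binomial(1, 1 / (1 + np.exp(-true_beta * X[:, 0])))
    b, z = fit_with_wald(X, y)
    if abs(z[0]) > 1.96:                         # "selected" at alpha = 0.05
        selected.append(b[0])

# Conditioning on statistical significance tends to inflate the estimate,
# which is the selection bias described in the abstract.
mean_selected = float(np.mean(selected))
```

Large estimates clear the significance threshold more easily than small ones, so averaging over only the selected fits overstates the effect; with more events per variable the estimates are more precise and the inflation shrinks, matching the EPV pattern the abstract reports.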

Frank E. Harrell - One of the best experts on this subject based on the ideXlab platform.

  • internal validation of predictive models efficiency of some procedures for Logistic Regression Analysis
    Journal of Clinical Epidemiology, 2001
    Co-Authors: Ewout W Steyerberg, Frank E. Harrell, Marinus J C Eijkemans, Yvonne Vergouwe, Gerard J J M Borsboom, Dik J F Habbema
    Abstract:

    The performance of a predictive model is overestimated when simply determined on the sample of subjects that was used to construct the model. Several internal validation methods are available that aim to provide a more accurate estimate of model performance in new subjects. We evaluated several variants of split-sample, cross-validation and bootstrapping methods with a Logistic Regression model that included eight predictors for 30-day mortality after an acute myocardial infarction. Random samples with a size between n = 572 and n = 9165 were drawn from a large data set (GUSTO-I; n = 40,830; 2851 deaths) to reflect modeling in data sets with between 5 and 80 events per variable. Independent performance was determined on the remaining subjects. Performance measures included discriminative ability, calibration and overall accuracy. We found that split-sample analyses gave overly pessimistic estimates of performance, with large variability. Cross-validation on 10% of the sample had low bias and low variability, but was not suitable for all performance measures. Internal validity could best be estimated with bootstrapping, which provided stable estimates with low bias. We conclude that split-sample validation is inefficient, and recommend bootstrapping for estimation of internal validity of a predictive Logistic Regression model.

  • Prognostic modeling with Logistic Regression Analysis: in search of a sensible strategy in small data sets.
    Medical Decision Making, 2001
    Co-Authors: Ewout W Steyerberg, Frank E. Harrell, Marinus J C Eijkemans, J. Dik F. Habbema
    Abstract:

    Clinical decision making often requires estimates of the likelihood of a dichotomous outcome in individual patients. When empirical data are available, these estimates may well be obtained from a Logistic Regression model. Several strategies may be followed in the development of such a model. In this study, the authors compare alternative strategies in 23 small subsamples from a large data set of patients with an acute myocardial infarction, where they developed predictive models for 30-day mortality. Evaluations were performed in an independent part of the data set. Specifically, the authors studied the effect of coding of covariables and stepwise selection on discriminative ability of the resulting model, and the effect of statistical "shrinkage" techniques on calibration. As expected, dichotomization of continuous covariables implied a loss of information. Remarkably, stepwise selection resulted in less discriminating models compared to full models including all available covariables, even when more than half of these were randomly associated with the outcome. Using qualitative information on the sign of the effect of predictors slightly improved the predictive ability. Calibration improved when shrinkage was applied on the standard maximum likelihood estimates of the Regression coefficients. In conclusion, a sensible strategy in small data sets is to apply shrinkage methods in full models that include well-coded predictors that are selected based on external information.
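One simple way to apply shrinkage to the standard maximum likelihood estimates, in the spirit of this abstract, is a heuristic linear shrinkage factor s = (model chi-square − df) / model chi-square (the van Houwelingen-Le Cessie heuristic): multiply the slopes by s, then re-estimate the intercept so the average predicted probability matches the observed event rate. The sketch below uses simulated data and illustrates the general technique, not the authors' exact procedure.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Maximum-likelihood Logistic Regression via Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        H = X.T @ (X * (p * (1 - p))[:, None])
        beta = beta + np.linalg.solve(H, X.T @ (y - p))
    return beta

def log_lik(y, p):
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(7)
n = 150                                          # illustrative sample size
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
lin = -1.0 + X[:, 1:] @ np.array([0.8, 0.5, 0.3])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-lin)))

beta = fit_logistic(X, y)
p_hat = 1.0 / (1.0 + np.exp(-X @ beta))
chi2 = 2.0 * (log_lik(y, p_hat) - log_lik(y, np.full(n, y.mean())))
df = X.shape[1] - 1
s = (chi2 - df) / chi2                           # heuristic shrinkage factor

slopes = s * beta[1:]                            # shrunken slopes
offset = X[:, 1:] @ slopes
a0 = 0.0                                         # re-estimate the intercept by Newton
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-(a0 + offset)))
    a0 += np.sum(y - p) / np.sum(p * (1 - p))
p_shrunk = 1.0 / (1.0 + np.exp(-(a0 + offset)))
print(f"shrinkage factor s = {s:.3f}")
```

Because the intercept is refit after shrinking the slopes, the mean predicted probability again equals the observed event rate, which is the calibration improvement the abstract reports.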

  • Prognostic modelling with Logistic Regression Analysis: a comparison of selection and estimation methods in small data sets
    Statistics in Medicine, 2000
    Co-Authors: Ewout W Steyerberg, Frank E. Harrell, Marinus J C Eijkemans, Dik J F Habbema
    Abstract:

    Logistic Regression Analysis may well be used to develop a prognostic model for a dichotomous outcome. Especially when limited data are available, it is difficult to determine an appropriate selection of covariables for inclusion in such models. Also, predictions may be improved by applying some sort of shrinkage in the estimation of Regression coefficients. In this study we compare the performance of several selection and shrinkage methods in small data sets of patients with acute myocardial infarction, where we aim to predict 30-day mortality. Selection methods included backward stepwise selection with significance levels α of 0.01, 0.05, 0.157 (the AIC criterion) or 0.50, and the use of qualitative external information on the sign of Regression coefficients in the model. Estimation methods included standard maximum likelihood, the use of a linear shrinkage factor, penalized maximum likelihood, the Lasso, or quantitative external information on univariable Regression coefficients. We found that stepwise selection with a low α (for example, 0.05) led to a relatively poor model performance, when evaluated on independent data. Substantially better performance was obtained with full models with a limited number of important predictors, where Regression coefficients were reduced with any of the shrinkage methods. Incorporation of external information for selection and estimation improved the stability and quality of the prognostic models. We therefore recommend shrinkage methods in full models including prespecified predictors and incorporation of external information, when prognostic models are constructed in small data sets. Copyright © 2000 John Wiley & Sons, Ltd.
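Penalized maximum likelihood, one of the estimation methods compared above, can be sketched by adding a ridge penalty lam * sum(beta_j^2) on the non-intercept coefficients to the Newton-Raphson update; larger lam pulls the slopes toward zero. This is a minimal illustration on simulated data with an arbitrary penalty weight, not the paper's tuned procedure, and the Lasso's absolute-value penalty would need a different optimizer.

```python
import numpy as np

def fit_ridge_logistic(X, y, lam=0.0, n_iter=50):
    """Logistic Regression by Newton-Raphson with ridge penalty lam * sum(beta_j^2)
    on the non-intercept coefficients (column 0 is the intercept)."""
    n, k = X.shape
    P = np.eye(k)
    P[0, 0] = 0.0                                # leave the intercept unpenalized
    beta = np.zeros(k)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (y - p) - 2.0 * lam * (P @ beta)
        H = X.T @ (X * (p * (1 - p))[:, None]) + 2.0 * lam * P
        beta = beta + np.linalg.solve(H, grad)
    return beta

rng = np.random.default_rng(3)
n = 150                                          # modest sample: shrinkage matters
X = np.column_stack([np.ones(n), rng.normal(size=(n, 5))])
lin = -1.0 + X[:, 1:] @ np.array([0.8, 0.5, 0.0, 0.0, 0.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-lin)))

beta_ml = fit_ridge_logistic(X, y, lam=0.0)      # standard maximum likelihood
beta_pen = fit_ridge_logistic(X, y, lam=5.0)     # penalized: slopes pulled inward
print("ML slopes:       ", np.round(beta_ml[1:], 2))
print("penalized slopes:", np.round(beta_pen[1:], 2))
```

The penalized slopes are uniformly closer to zero in norm than the maximum likelihood slopes, which is the reduction of Regression coefficients that the abstract credits with better performance on independent data.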