SCCM

Using Multiple Imputation to Avoid Bias From Missing Data in Critical Care Research

Todd A. Miano, PharmD, MSCE

Missing data are a common, yet often overlooked, source of bias in critical care studies.1,2 A recent survey showed that 50% of studies published in three major critical care journals had missing data, yet only two studies employed appropriate missing data methods.1 The key concern with missing data is informative missingness, meaning that data are missing for reasons that are related to the study outcome.2 Informative missingness can create selection bias when using complete case analysis (i.e., restricting analysis to subjects with complete data) because the outcome patterns in subjects with complete data may differ from those with missing data.2,3 The extent of bias can be substantial, especially if there are large amounts of missing data.2,3 Complete case analysis can also result in a significant loss of sample size.2,3

Multiple imputation (MI) is a powerful alternative to complete case analysis that has several advantages.2–5 MI uses the entire data set, can be applied to any variable type (binary, continuous, etc.), and can substantially reduce missing data bias.2–5 It does so by using information from the observed covariate and outcome data to predict missing values. This article provides a brief introduction to the framework for missing data analysis, some insight into how MI works, and guidance for implementation.

Missing Data Mechanisms and Bias
Understanding missing data and the MI approach requires a basic understanding of the mechanisms of missingness. The framework for understanding missing data was established by Rubin in a seminal 1976 paper.6 Rubin starts with three plausible assumptions about the mechanisms leading to missing data.

The first assumption is called missing completely at random (MCAR). Missing data variable X is MCAR if the probability of being missing is unrelated to the value of X itself or the value of any other variables in the data set.3,6 When MCAR holds, the sample with complete data can be viewed as a simple random sample of the full data set.3,6 This assumption might be plausible when data are missing accidentally (e.g., when data collection forms are lost or measurement devices malfunction) or when the data are missing by design (e.g., planned genetic analysis only in a randomly selected subset of the population).3,6 Unfortunately, the MCAR assumption is usually implausible outside of these specific examples.

More commonly, data are missing through mechanisms related to patient characteristics. In this case, a plausible assumption may be that the data are missing at random (MAR). Missing data variable X is MAR if the likelihood of being missing is unrelated to the value of X after controlling for observed covariates available in the data set.3,6 This implies that observed values might differ systematically from missing values. However, controlling for observed covariate data corrects the differences.3,6 There are many instances where this assumption might be valid. For example, in a study of acute kidney injury, missing baseline creatinine values may be lower than measured creatinine values only because younger patients without kidney disease are more likely to have missing creatinine measurements. In this scenario, knowing a patient’s age and kidney disease status provides information on what the creatinine value would be if it were observed. This information can be used to impute the missing value (i.e., make an educated guess). This assumption underlies the logic of MI analysis.

The worst-case scenario is when missing data are systematically different from observed values for unknown reasons, called missing not at random (MNAR). Missing data variable X is MNAR if the probability of being missing is related to the value of X, even after controlling for other variables in the analysis.3,6 For example, in a study of hypertension, missing blood pressure measurements would be MNAR if they were missing because they were abnormally high, leading to symptoms (e.g., headaches) that caused the patients to miss clinic visits.4 In this scenario, the reason the values are missing (because they are abnormally high) is also unobserved. Thus, there is no way to utilize the information to impute the missing value. Bias from this mechanism cannot be controlled with MI analysis.3,6
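The three mechanisms can be made concrete with a small simulation. The sketch below is illustrative only: the variable names, the relationship between age and creatinine, and the missingness probabilities are all assumptions chosen for demonstration, not estimates from any study. It compares the mean creatinine among subjects with observed values (the complete case mean) against the mean in the full simulated sample.

```python
import math
import random
import statistics


def complete_case_means(n=20000, seed=0):
    """Simulate creatinine under three missingness mechanisms.

    Returns the full-sample mean and the mean among subjects whose value
    is observed under MCAR, MAR, and MNAR. Every numeric setting here is
    an illustrative assumption, not an estimate from real data.
    """
    rng = random.Random(seed)
    full, mcar, mar, mnar = [], [], [], []
    for _ in range(n):
        age = rng.gauss(60, 12)                       # observed covariate
        creat = 0.6 + 0.01 * age + rng.gauss(0, 0.2)  # variable with missing data
        full.append(creat)

        # MCAR: every subject has the same 30% chance of a missing value.
        if rng.random() >= 0.30:
            mcar.append(creat)

        # MAR: younger subjects are more likely to be missing. The
        # probability depends only on age, which is always observed.
        p_miss = 1 / (1 + math.exp((age - 55) / 5))
        if rng.random() >= p_miss:
            mar.append(creat)

        # MNAR: high values are more likely to be missing. The
        # probability depends on the unobserved value itself.
        p_miss = 1 / (1 + math.exp(-(creat - 1.3) / 0.1))
        if rng.random() >= p_miss:
            mnar.append(creat)

    return {
        "full": statistics.fmean(full),
        "mcar": statistics.fmean(mcar),
        "mar": statistics.fmean(mar),
        "mnar": statistics.fmean(mnar),
    }


if __name__ == "__main__":
    for mechanism, mean in complete_case_means().items():
        print(f"{mechanism:>4}: {mean:.3f}")
```

Under these assumptions, the complete case mean should track the full-sample mean under MCAR but drift away from it under MAR and MNAR. The difference between the last two is that the MAR drift is explained by age, which is available for every subject, whereas the MNAR drift is driven by the missing value itself.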

How Multiple Imputation Works
The framework of MI is depicted in Figure 1. Two steps characterize the method: 1) imputation of the missing values, and 2) estimation of the outcome parameters.3,7 The imputation step begins with the subset of subjects with complete data. A regression model fit in this subset is used to predict (i.e., impute) the missing values from the other covariates in the data set. This single imputation produces a full data set with no missing values. If the MAR assumption holds, the single imputation procedure will provide unbiased estimates of the missing data values.3,7 However, single imputation assumes that there is no uncertainty in the imputed value, which produces erroneously narrow confidence intervals and potentially exaggerated statistical significance.3,7 Repeating the imputation many times solves this problem, hence the term multiple imputation. Each time the imputation cycle repeats, a slightly different value is produced in a random fashion. This additional random variability simulates the uncertainty that naturally exists in any measurement.3,7 Once multiple copies of the original data set have been produced, the next step is to estimate the primary outcome parameter (e.g., a relative risk or risk difference).
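The imputation step can be sketched in a few lines of code. The example below is a simplified illustration, not the procedure used by any particular statistical package: it assumes a single incomplete variable (creatinine) predicted from a single fully observed covariate (age) with a straight-line regression. The function names `fit_line` and `multiply_impute` are hypothetical. To make the regression coefficients vary between imputations, the sketch refits the model on a bootstrap resample of the complete cases each time, which is one of several possible approaches.

```python
import math
import random
import statistics


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b, residual_sd)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, math.sqrt(sse / (len(xs) - 2))


def multiply_impute(ages, creat, m=20, seed=0):
    """Return m completed copies of creat (missing entries are None).

    Observed values are kept as-is. Each missing value is replaced by a
    regression prediction plus random noise, so the copies differ.
    """
    rng = random.Random(seed)
    complete = [(a, c) for a, c in zip(ages, creat) if c is not None]
    completed_sets = []
    for _ in range(m):
        # Resample the complete cases so the fitted line itself varies
        # from one imputation to the next (an assumption of this sketch).
        boot = [rng.choice(complete) for _ in complete]
        a0, b0, sd = fit_line([a for a, _ in boot], [c for _, c in boot])
        # Prediction plus a random residual for each missing value.
        completed_sets.append([
            c if c is not None else a0 + b0 * a + rng.gauss(0, sd)
            for a, c in zip(ages, creat)
        ])
    return completed_sets
```

Calling `multiply_impute(ages, creat, m=20)` on two equal-length lists returns 20 completed lists. The outcome analysis would then be run once on each of them.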

In the estimation step, a single estimate is obtained from the multiple copies of the data set. This begins by first conducting the outcome analysis in each data set separately, resulting in multiple versions of the outcome model. The individual parameter estimates from each data set are then averaged to obtain the final parameter estimate, and the corresponding standard errors are combined in a way that accounts for both the variability within each data set and the variability between data sets.3,7 These steps are implemented automatically in modern statistical packages8 and can be applied to any statistical model (e.g., logistic regression, Cox regression).
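The combining rules commonly attributed to Rubin7 can be written compactly. In the sketch below, the function name `pool_rubin` is hypothetical, and the inputs are assumed to be the parameter estimate and standard error obtained from each imputed data set. The pooled estimate is the simple average. The pooled variance adds the average within-imputation variance to the between-imputation variance, inflated by a factor of (1 + 1/m).

```python
import math
import statistics


def pool_rubin(estimates, std_errors):
    """Combine per-imputation results into one estimate and standard error.

    estimates  : parameter estimate from each imputed data set
    std_errors : matching standard errors
    Requires at least two imputations.
    """
    m = len(estimates)
    pooled = statistics.fmean(estimates)
    # Average sampling variance within each imputed data set.
    within = statistics.fmean([se ** 2 for se in std_errors])
    # Variance of the estimates across imputed data sets (divides by m - 1).
    between = statistics.variance(estimates)
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)
```

Because the between-imputation term is added, the pooled standard error is never smaller than the root of the average within-imputation variance. This is how MI carries the uncertainty about the imputed values into the final confidence interval.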

Practical Recommendations
MI analysis is valid only if the MAR assumption is correct, meaning that the missing values can be accurately predicted based on the observed covariate data. Thus, careful consideration of the mechanisms leading to missing data is essential. In the planning stages, researchers should consider likely mechanisms leading to missing data and plan to collect covariate information that would be predictive of the missing values. For example, in a study of acute kidney injury, if patients admitted from an outside hospital have a higher likelihood of missing baseline creatinine values, data on the type of admission should be collected for all patients. Another important decision is the number of imputations. Imputing five data sets is adequate in most situations,3,7 although the optimal number increases with an increasing amount of missing data. The only downside to using a larger number is computation time, which is less of a concern with modern computers. For this reason, some software packages recommend using at least 20 imputations.8 For more information on MI implementation, see references 2–8.
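One way to see why the number of imputations matters more as missing data increase is the relative efficiency formula given by Rubin.7 It compares an estimate based on m imputations with one based on an infinite number. The sketch below assumes the fraction of missing information is known. That quantity is not discussed above; it is related to, but not the same as, the proportion of missing values. The function name is hypothetical.

```python
def relative_efficiency(fraction_missing_info, m):
    """Efficiency of m imputations relative to infinitely many.

    fraction_missing_info : value between 0 and 1 (assumed known here)
    m                     : number of imputed data sets
    """
    return 1 / (1 + fraction_missing_info / m)


if __name__ == "__main__":
    for gamma in (0.1, 0.3, 0.5):
        for m in (5, 20):
            eff = relative_efficiency(gamma, m)
            print(f"missing info {gamma:.1f}, m={m:>2}: {eff:.3f}")
```

When little information is missing, five imputations already come close to full efficiency. As the fraction of missing information grows, the gap between 5 and 20 imputations widens.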

Summary
Missing data are a key threat to the validity of results from critical care research. Standard methods (e.g., complete case analysis) rest on often implausible assumptions and are frequently biased. MI is a powerful method that can minimize bias in many instances. MI is available in most common statistical analysis packages and should enjoy more widespread use.

References

  1. Vesin A, Azoulay E, Ruckly S, et al. Reporting and handling missing values in clinical studies in intensive care units. Intensive Care Med. 2013 Aug;39(8):1396-1404.
  2. Perkins NJ, Cole SR, Harel O, et al. Principled approaches to missing data in epidemiologic studies. Am J Epidemiol. 2018 Mar 1;187(3):568-575.
  3. Allison PD. Missing Data. Thousand Oaks, CA: SAGE Publications; 2002.
  4. Sterne JA, White IR, Carlin JB, et al. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ. 2009 Jun 29;338:b2393.
  5. Harel O, Mitchell EM, Perkins NJ, et al. Multiple imputation for incomplete data in epidemiologic studies. Am J Epidemiol. 2018 Mar 1;187(3):576-584.
  6. Rubin DB. Inference and missing data. Biometrika. 1976 Dec;63(3):581-592.
  7. Rubin DB. Multiple Imputation for Nonresponse in Surveys. New York, NY: Wiley; 1987.
  8. StataCorp. Multiple Imputation Reference Manual. Release 11. College Station, TX: Stata Press; 2009.