Introduction
Many real-world datasets contain missing values for various reasons. We usually have a few options to deal with these missing values. The easiest solution is to remove all rows from the data set in which one or more variables are missing. However, if values are not missing completely at random, this will likely introduce bias into our analysis. Accordingly, we usually want to impute missing values in one way or another. Here, we will consider two very general approaches using brms: (1) impute missing values before model fitting with multiple imputation, and (2) impute missing values on the fly during model fitting. As a simple example, we will use the nhanes dataset, which contains information on the age, bmi (body mass index), hyp (hypertension), and chl (total serum cholesterol) of the participants. For the purposes of this vignette, we are primarily interested in predicting bmi by age and chl.
data("nhanes",package = "rats")head(nhanes)
  age  bmi hyp chl
1   1   NA  NA  NA
2   2 22.7   1 187
3   1   NA   1 187
4   3   NA  NA  NA
5   1 20.4   1 113
6   3   NA  NA 184
Imputation before model fitting
There are many approaches that allow us to impute missing data before the actual model fitting takes place. From a statistical perspective, multiple imputation is one of the best solutions. Each missing value is not imputed once but M times, leading to a total of M fully imputed data sets. The model can then be fitted to each of these data sets separately and the results pooled across models afterwards. One widely applied package for multiple imputation is mice (Buuren & Groothuis-Oudshoorn, 2010), and we will use it here in combination with brms. We apply the default settings of mice, which means that all variables will be used to impute missing values in all other variables, and that the imputation functions are automatically chosen based on the variables' characteristics.
library(mice)
imp <- mice(nhanes, m = 5, print = FALSE)
Now we have m = 5 imputed data sets stored in the imp object. In practice, we will likely need more than 5 of them to accurately account for the uncertainty induced by the missingness, perhaps even in the area of 100 imputed data sets (Zhou & Reiter, 2010). Of course, this increases the computational burden by a lot, so we stick to m = 5 for the purposes of this vignette. Regardless of the value of M, we can either extract those data sets and pass them to the actual model fitting function as a list of data frames, or pass imp directly. The latter works because brms offers special support for data imputed by mice. We will go with the latter approach, since it requires less typing. Fitting the model of interest with brms to the multiple imputed data sets is straightforward.
fit_imp1 <- brm_multiple(bmi ~ age * chl, data = imp, chains = 2)
The returned fitted model is an ordinary brmsfit object containing the posterior draws of all M submodels. While pooling across models is not quite straightforward in classical statistics, it is trivial in a Bayesian framework. Here, pooling the results of the different imputed data sets is achieved by simply combining the posterior draws of the submodels. Accordingly, all post-processing methods can be used out of the box, without having to worry about pooling at all.
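As a minimal sketch of this, the pooled draws can be inspected like those of any other brmsfit object; the draw count in the comment assumes the settings used above (5 imputed data sets, 2 chains each, 1000 post-warmup iterations per chain).
# Sketch: extract the combined posterior draws of all submodels.
# Assumes fit_imp1 from above; as_draws_df() is provided via the posterior package.
library(posterior)
draws <- as_draws_df(fit_imp1)
nrow(draws)  # 5 data sets x 2 chains x 1000 draws = 10000 pooled draws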
summary(fit_imp1)
 Family: gaussian
  Links: mu = identity; sigma = identity
Formula: bmi ~ age * chl
   Data: imp (Number of observations: 25)
  Draws: 10 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 10000

Population-Level Effects:
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept    13.40      8.16    -3.35    29.28 1.08       87      314
age            ...       ...      ...      ... 1.69      ...      ...
chl           0.10      0.04     0.01     0.19 1.09       72      253
age:chl      -0.02      0.02    -0.07     0.02 1.10       67      198

Family Specific Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma      ...      0.08      ...     4.60 1.26       28       89

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
In the summary output, we notice that some Rhat values are higher than \(1.1\), indicating potential convergence problems. For models based on multiple imputed data sets, this is often a false positive: chains of different submodels may not overlap exactly, since they were fitted to different data. We can see the chains on the right-hand side of the following plot.
plot(fit_imp1,variable = "^b",regex = TRUE)
Such non-overlapping chains imply high Rhat values without there actually being any convergence problem. Accordingly, we have to investigate the convergence of the submodels separately, which we can do by looking at
round(fit_imp1$rhats, 2)
  b_Intercept b_age b_chl b_age.chl sigma lprior lp__
1        1.00  1.00  1.00      1.00  1.00   1.00    1
2        1.00  1.00  1.00      1.00  1.00   1.00    1
3        1.01  1.01  1.01      1.00  1.00   1.00    1
4        1.00  1.00  1.00      1.00  1.01   1.01    1
5        1.00  1.00  1.00      1.00  1.00   1.00    1
The convergence of each of the submodels looks good. Accordingly, we can proceed with further post-processing and interpretation of the results. For instance, we could investigate the combined effect of age and chl.
conditional_effects(fit_imp1, "age:chl")
To summarize, the advantages of multiple imputation are obvious: you can apply it to all kinds of models, since the model fitting functions do not need to know that the data sets were imputed beforehand. Also, we do not need to worry about pooling across submodels when using fully Bayesian methods. The only drawback is the amount of time required for model fitting: estimating Bayesian models is already quite slow with just a single data set, and it only gets worse when working with multiple imputation.
Compatibility with other multiple imputation packages
brms offers built-in support for mice mainly because I use the latter in some of my own research projects. Nevertheless, brm_multiple supports all kinds of multiple imputation packages, as it also accepts a list of data frames as input for its data argument. Thus, you just need to extract the imputed data frames in the form of a list, which can then be passed to brm_multiple. Most multiple imputation packages have some built-in functionality for this task. When using the mi package, for instance, you simply need to call the mi::complete function to get the desired output.
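As an illustrative sketch, the same can be done with mice, which is already loaded above: complete() with action = "all" returns the imputed data sets as a list that brm_multiple accepts. The object names imp_list and fit_imp1b are just placeholders for this example.
# Sketch: extract the m imputed data sets from the mids object 'imp'
# as a list of data frames and pass that list to brm_multiple().
imp_list <- mice::complete(imp, action = "all")
fit_imp1b <- brm_multiple(bmi ~ age * chl, data = imp_list, chains = 2)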
Imputation during model fitting
Imputation during model fitting is generally thought to be more complex than imputation before model fitting, because everything has to be taken care of within one step. This remains true when imputing missing values with brms, but possibly to a somewhat smaller degree. Consider again the nhanes data with the goal of predicting bmi by age and chl. Since age contains no missing values, we only have to take special care of bmi and chl. We need to tell the model two things: (1) which variables contain missing values and how they should be predicted, and (2) which of these imputed variables should be used as predictors. In brms we can do this as follows:
bform <- bf(bmi | mi() ~ age * mi(chl)) +
  bf(chl | mi() ~ age) +
  set_rescor(FALSE)
fit_imp2 <- brm(bform, data = nhanes)
The model has become multivariate, as we not only predict bmi but also chl (see vignette("brms_multivariate") for details about the multivariate syntax of brms). We ensure that missings in both variables will be modeled rather than excluded by adding | mi() on the left-hand side of the formulas. We write mi(chl) on the right-hand side of the formula for bmi to ensure that the estimated missing values of chl will be used in the prediction of bmi. The summary is a bit more cluttered, as we get coefficients for both response variables, but apart from that we can interpret the coefficients in the usual way.
summary(fit_imp2)
 Family: MV(gaussian, gaussian)
  Links: mu = identity; sigma = identity
         mu = identity; sigma = identity
Formula: bmi | mi() ~ age * mi(chl)
         chl | mi() ~ age
   Data: nhanes (Number of observations: 25)
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Population-Level Effects:
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
bmi_Intercept    13.90      8.81    -3.24    31.08 1.00      ...      ...
chl_Intercept      ...       ...      ...      ... 1.00     2937     2826
bmi_age           2.76      5.61    -8.42    13.59 1.00     1510     1614
chl_age          28.51     13.58     1.95    56.23 1.00     2846     2473
bmi_michl         0.10      0.05     0.01     0.19 1.00      ...      ...
bmi_michl:age      ...       ...    -0.08     0.02 1.00     1430     1815

Family Specific Parameters:
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma_bmi     3.40      0.81     2.21     5.30 1.00     1510     2211
sigma_chl    40.57      7.76    28.83    58.99 1.00     2221     2780

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
conditional_effects(fit_imp2, "age:chl", resp = "bmi")
The results look pretty similar to those obtained from multiple imputation, but be aware that this may not be generally the case. In multiple imputation, the default is to impute all variables based on all other variables, while in the "one-step" approach, we have to explicitly specify the variables used in the imputation. Thus, arguably, multiple imputation is easier to apply. An obvious advantage of the "one-step" approach is that the model needs to be fitted only once instead of M times. Also, within the brms framework, we can use multilevel structure and complex non-linear relationships for the imputation of missing values, which is not achieved as easily in standard multiple imputation software. On the downside, it is currently not possible to impute discrete variables, because Stan (the engine behind brms) does not allow estimating discrete parameters.
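To make the point about more flexible imputation models concrete, here is a purely hypothetical sketch: the grouping variable region is not part of the nhanes data, and the names bform_ml, fit_ml, and some_data_with_region are invented for illustration. The formula that imputes chl can itself contain group-level or smooth terms.
# Hypothetical sketch: impute chl with varying intercepts over a grouping
# factor 'region' (not in nhanes; illustration only) and model bmi with a
# smooth effect of age.
bform_ml <- bf(bmi | mi() ~ s(age) + mi(chl)) +
  bf(chl | mi() ~ age + (1 | region)) +
  set_rescor(FALSE)
# fit_ml <- brm(bform_ml, data = some_data_with_region)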
Combining measurement error and missing values
Missing value terms in brms cannot only handle missing values, but also measurement error, or arbitrary combinations of the two. In fact, we can think of a missing value as a value with infinite measurement error. Accordingly, mi terms are a natural (and somewhat more verbose) generalization of the now soft-deprecated me terms. Suppose we had measured the variable chl with some known error:
nhanes$se <- rexp(nrow(nhanes), 2)
We can then go ahead and include this information in the model as follows:
bform <- bf(bmi | mi() ~ age * mi(chl)) +
  bf(chl | mi(se) ~ age) +
  set_rescor(FALSE)
fit_imp3 <- brm(bform, data = nhanes)
Model summarization and post-processing continue to work normally.
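For comparison, here is a rough sketch of the older, soft-deprecated me() syntax mentioned above; the object name fit_me is illustrative only. With me(), chl enters solely as a noisy predictor, so rows where chl (or bmi) is missing would simply be dropped rather than imputed.
# Sketch of the soft-deprecated me() syntax: chl is treated as a predictor
# measured with known standard error 'se'; missing values are not imputed.
fit_me <- brm(bmi ~ age + me(chl, se), data = nhanes)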
References
Buuren, S. van, & Groothuis-Oudshoorn, K. (2010). mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 1-68. doi.org/10.18637/jss.v045.i03
Zhou, X., & Reiter, J. P. (2010). A note on Bayesian inference after multiple imputation. The American Statistician, 64(2), 159-163. doi.org/10.1198/tast.2010.09109