## Introduction

Many real-world datasets contain missing values for various reasons. We usually have a few options to deal with these missing values. The easiest solution is to remove all rows from the dataset in which one or more variables are missing. However, if values are not missing completely at random, this will likely introduce bias into our analysis. Accordingly, we usually want to impute missing values in one way or another. Here, we will consider two very general approaches using **brms**: (1) impute missing values *before* model fitting with multiple imputation, and (2) impute missing values on the fly *during* model fitting. As a simple example, we will use the `nhanes` dataset, which contains information on participants' `age`, `bmi` (body mass index), `hyp` (hypertension) and `chl` (total serum cholesterol). For the purpose of this vignette, we are primarily interested in predicting `bmi` by `age` and `chl`.

```r
data("nhanes", package = "mice")
head(nhanes)
```

```
  age  bmi hyp chl
1   1   NA  NA  NA
2   2 22.7   1 187
3   1   NA   1 187
4   3   NA  NA  NA
5   1 20.4   1 113
6   3   NA  NA 184
```
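To see why listwise deletion can be wasteful, here is a minimal base-R sketch (with hypothetical values mimicking the structure above) showing how few rows survive complete-case analysis:

```r
# Toy data with the same structure as nhanes (hypothetical values)
d <- data.frame(
  age = c(1, 2, 1, 3),
  bmi = c(NA, 22.7, NA, 27.5),
  chl = c(NA, 187, 187, NA)
)
# Listwise deletion keeps only fully observed rows
complete_cases <- na.omit(d)
nrow(d)               # 4 rows in total
nrow(complete_cases)  # only 1 fully observed row survives
```

Three of the four rows carry partial information that complete-case analysis simply throws away, which is what imputation tries to avoid.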

## Imputation before fitting the model

There are many approaches allowing us to impute missing data before the actual model fitting takes place. From a statistical perspective, multiple imputation is one of the best solutions. Each missing value is not imputed once but `M` times, leading to a total of `M` fully imputed datasets. The model can then be fitted to each of these datasets separately and the results pooled across models afterwards. One widely applied package for multiple imputation is **mice** (Buuren & Groothuis-Oudshoorn, 2010), and we will use it in the following in combination with **brms**. Here, we apply the default settings of **mice**, which means that all variables will be used to impute missing values in all other variables, and that the imputation functions are chosen automatically based on the variables' characteristics.

```r
library(mice)
imp <- mice(nhanes, m = 5, print = FALSE)
```

Now we have `m = 5` imputed datasets stored within the `imp` object. In practice, we will likely need more than `5` of them to accurately account for the uncertainty induced by the missingness, perhaps even in the area of `100` imputed datasets (Zhou & Reiter, 2010). Of course, this increases the computational burden considerably, so we stick to `m = 5` for the purpose of this vignette. Regardless of the value of `M`, we can either extract those datasets and pass them to the actual model fitting function as a list of data frames, or pass `imp` directly. The latter works because **brms** offers special support for data imputed by **mice**. We will go with the latter approach, since it requires less typing. Fitting our model of interest with **brms** to the multiple imputed datasets is straightforward.

```r
fit_imp1 <- brm_multiple(bmi ~ age * chl, data = imp, chains = 2)
```

The returned fitted model is an ordinary `brmsfit` object containing the posterior draws of all `M` submodels. While pooling across models is not necessarily straightforward in classical statistics, it is trivial in a Bayesian framework. Here, pooling the results of the different imputed datasets is achieved by simply combining the posterior draws of the submodels. Accordingly, all post-processing methods can be used out of the box, without having to worry about pooling at all.
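Conceptually, this pooling amounts to nothing more than stacking the submodels' draws. A base-R sketch with hypothetical posterior draws for a single coefficient from three submodels:

```r
# Hypothetical posterior draws of one coefficient from M = 3 submodels,
# each fitted to a different imputed dataset
set.seed(1)
draws_sub <- list(
  rnorm(1000, mean = 0.10, sd = 0.04),
  rnorm(1000, mean = 0.12, sd = 0.04),
  rnorm(1000, mean = 0.08, sd = 0.05)
)
# Pooling = concatenating the draws into one posterior sample
pooled <- unlist(draws_sub)
length(pooled)                               # 3000 pooled draws
round(quantile(pooled, c(0.025, 0.975)), 2)  # pooled 95% credible interval
```

The pooled interval is automatically wider than any single submodel's interval whenever the submodels disagree, which is exactly how between-imputation uncertainty enters the result.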

```r
summary(fit_imp1)
```

```
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: bmi ~ age * chl 
   Data: imp (Number of observations: 25) 
  Draws: 10 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 10000

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept    13.40      8.16    -3.35    29.28 1.08       87      314
age           1.69      4.14    -6.43     9.87 1.08       87      243
chl           0.10      0.04     0.01     0.19 1.09       72      253
age:chl      -0.02      0.02    -0.07     0.02 1.10       67      198

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     3.73      0.48     3.08     4.60 1.26       28       89

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
```

In the summary output, we notice that some `Rhat` values are greater than \(1.1\), indicating potential convergence problems. For models based on multiple imputed datasets, this is often a **false positive**: chains of different submodels may not overlap exactly, since they were fitted to different data. We can see the chains on the right-hand side of the following plot.

```r
plot(fit_imp1, variable = "^b", regex = TRUE)
```

Such non-overlapping chains imply high `Rhat` values without there actually being a convergence problem. Accordingly, we have to investigate the convergence of the submodels separately, which we can do by looking at

```r
round(fit_imp1$rhats, 2)
```

```
  b_Intercept b_age b_chl b_age:chl sigma lprior lp__
1        1.00  1.00  1.00      1.00  1.00   1.00 1.00
2        1.00  1.00  1.00      1.00  1.00   1.00 1.00
3        1.01  1.01  1.01      1.00  1.00   1.00 1.00
4        1.00  1.00  1.00      1.00  1.01   1.01 1.00
5        1.00  1.00  1.00      1.00  1.00   1.00 1.00
```

The convergence of each of the submodels looks good. Accordingly, we can proceed with post-processing and interpreting the results. For instance, we could investigate the combined effect of `age` and `chl`.

```r
conditional_effects(fit_imp1, "age:chl")
```

To summarize, the advantages of multiple imputation are obvious: one can apply it to all kinds of models, since the model fitting functions do not need to know that the datasets were imputed beforehand. Also, we do not need to worry about pooling across submodels when using fully Bayesian methods. The only drawback is the amount of time required for model fitting: estimation of Bayesian models is already quite slow with a single dataset, and only gets worse when working with multiple imputation.
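As an aside, the `Rhat` inflation caused by non-overlapping chains noted above can be reproduced with a short base-R computation of the (non-split) potential scale reduction factor on two artificial chains whose means differ, as they would when each chain targets a different imputed dataset:

```r
set.seed(42)
n <- 1000
# Two well-mixed chains centered at different values
chains <- cbind(rnorm(n, mean = 0.0, sd = 1), rnorm(n, mean = 0.5, sd = 1))
W <- mean(apply(chains, 2, var))    # average within-chain variance
B <- n * var(colMeans(chains))      # between-chain variance
var_hat <- (n - 1) / n * W + B / n  # pooled variance estimate
rhat <- sqrt(var_hat / W)
rhat  # clearly above 1, although each chain on its own mixes fine
```

The high value comes entirely from the between-chain term `B`, not from poor mixing, which is exactly why `fit_imp1$rhats` (computed per submodel) is the right diagnostic here.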

### Compatibility with other multiple imputation packages

**brms** offers built-in support for **mice** mainly because I use the latter in some of my own research projects. However, `brm_multiple` supports all kinds of multiple imputation packages, as it also accepts a *list* of data frames as input for its `data` argument. Thus, you just need to extract the imputed data frames in the form of a list, which can then be passed to `brm_multiple`. Most multiple imputation packages have some built-in functionality for this task. When using the **mi** package, for instance, you simply need to call the `mi::complete` function to get the desired output.
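Such a list of completed datasets can also be built by hand (with **mice**, `complete(imp, action = "all")` returns one directly). A minimal hand-made stand-in with two hypothetical "imputed" versions of a tiny dataset shows the expected input format:

```r
# Two completed versions of the same data, differing only in the
# values that were originally missing (hypothetical numbers)
imp_list <- list(
  data.frame(bmi = c(22.7, 25.0), age = c(2, 1), chl = c(187, 190)),
  data.frame(bmi = c(22.7, 26.1), age = c(2, 1), chl = c(187, 185))
)
length(imp_list)  # M = 2 datasets, usable as brm_multiple(..., data = imp_list)
```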

## Imputation during model fitting

Imputation during model fitting is generally considered more complex than imputation before model fitting, because everything has to be taken care of within one step. This remains true when imputing missing values with **brms**, but possibly to a somewhat smaller degree. Consider again the `nhanes` data with the goal of predicting `bmi` by `age` and `chl`. Since `age` contains no missing values, we only have to take special care of `bmi` and `chl`. We need to tell the model two things: (1) which variables contain missing values and how they should be predicted, and (2) which of these imputed variables should be used as predictors. In **brms** we can do this as follows:

```r
bform <- bf(bmi | mi() ~ age * mi(chl)) +
  bf(chl | mi() ~ age) + set_rescor(FALSE)
fit_imp2 <- brm(bform, data = nhanes)
```

The model has become multivariate, as we not only predict `bmi` but also `chl` (see `vignette("brms_multivariate")` for details about the multivariate syntax of **brms**). We ensure that missings in both variables will be modeled rather than excluded by adding `| mi()` on the left-hand side of the formulas. We write `mi(chl)` on the right-hand side of the formula for `bmi` to ensure that the estimated missing values of `chl` will be used in the prediction of `bmi`. The summary is a bit more cluttered, as we get coefficients for both response variables, but apart from that we can interpret the coefficients in the usual way.

```r
summary(fit_imp2)
```

```
 Family: MV(gaussian, gaussian) 
  Links: mu = identity; sigma = identity
         mu = identity; sigma = identity 
Formula: bmi | mi() ~ age * mi(chl) 
         chl | mi() ~ age 
   Data: nhanes (Number of observations: 25) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Population-Level Effects: 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
bmi_Intercept    13.90      8.81    -3.24    31.08 1.00     1741     2019
chl_Intercept   141.31     25.19    90.26   190.26 1.00     2937     2826
bmi_age           2.76      5.61    -8.42    13.59 1.00     1510     1614
chl_age          28.51     13.58     1.95    56.23 1.00     2846     2473
bmi_michl         0.10      0.05     0.01     0.19 1.00     1403     1815
bmi_michl:age    -0.03      0.02    -0.08     0.02 1.00     1430     1815

Family Specific Parameters: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma_bmi     3.40      0.81     2.21     5.30 1.00     1510     2211
sigma_chl    40.57      7.76    28.83    58.99 1.00     2221     2780

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
```

```r
conditional_effects(fit_imp2, "age:chl", resp = "bmi")
```

The results look pretty similar to those obtained from multiple imputation, but be aware that this may not be generally the case. In multiple imputation, the default is to impute all variables based on all other variables, while in the one-step approach we have to explicitly specify the variables used in the imputation. Thus, arguably, multiple imputation is easier to apply. An obvious advantage of the one-step approach is that the model needs to be fitted only once instead of `M` times. Also, within the **brms** framework, we can use multilevel structure and complex non-linear relationships for the imputation of missing values, which is not achieved as easily in standard multiple imputation software. On the downside, it is currently not possible to impute discrete variables, because **Stan** (the engine behind **brms**) does not allow estimating discrete parameters.

### Combining measurement error and missing values

Missing value terms in **brms** cannot only handle missing values, but also measurement error, or arbitrary combinations of the two. In fact, we can think of a missing value as a value with infinite measurement error. Accordingly, `mi` terms are a natural (and somewhat more verbose) generalization of the now soft-deprecated `me` terms. Suppose we had measured the variable `chl` with some known error:

```r
nhanes$se <- rexp(nrow(nhanes), 2)
```

We can then go ahead and include this information in the model as follows:

```r
bform <- bf(bmi | mi() ~ age * mi(chl)) +
  bf(chl | mi(se) ~ age) + set_rescor(FALSE)
fit_imp3 <- brm(bform, data = nhanes)
```

Model summaries and post-processing continue to work as usual.
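The generative assumption behind `chl | mi(se)` is that each observed value is normally distributed around the latent true value with known standard deviation `se`. A quick base-R simulation of that assumption (with hypothetical true values):

```r
set.seed(7)
chl_true <- c(187, 187, 113, 184)   # latent true values (hypothetical)
se <- rexp(length(chl_true), 2)     # known per-observation error SDs
# Observed values are the true values plus noise with known SD
chl_obs <- rnorm(length(chl_true), mean = chl_true, sd = se)
round(data.frame(chl_true, se, chl_obs), 2)
```

A missing value then corresponds to the limiting case where `se` is infinite, i.e., the observation carries no information and the latent value is inferred from the model alone.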

## References

Buuren, S. V., & Groothuis-Oudshoorn, K. (2010). mice: Multivariate Imputation by Chained Equations in R. *Journal of Statistical Software*, 45(3), 1-68. doi.org/10.18637/jss.v045.i03

Zhou, X., & Reiter, J. P. (2010). A note on Bayesian inference after multiple imputation. *The American Statistician*, 64(2), 159-163. doi.org/10.1198/tast.2010.09109
