Struggling with missing data? We were too! We have had our fair share of encountering missing data (on homework assignments and practicum project). As my team worked on our practicum project, missing values seemed to be a larger part of the data exploration phase of the model building process than we initially thought. We’ve learned that this data exploration phase is arguably one of the most important parts of the modeling lifecycle. So, if you are struggling with missing data, learn from us! Here are three ways that our team handled missing values.
#1 Drop the variable(s) with missing data and/or create a binary variable.
If a variable has more than 50% missing values, drop the variable because you don’t have enough information to ascertain anything useful. The analyst could create a binary variable to capture the missing or non-missing values. This could be helpful and provide insights into the data. In fact, regardless of how many values are missing, using a binary variable to indicate if a missing value is present is good practice. It allows you to determine if the absence of a value is actually an important indicator that the values are not missing randomly or if a missing value actually has a meaning or broader implication on the target.
# 2 Impute the values of variable(s) with missing data.
If a variable has less than 50% missing values, imputation might be a viable option. There are several different ways to impute values of variables. One common method is mean imputation. Mean imputation is simply taking the mean of the existing values and replacing the missing values with the mean for the variable. The mean imputation can be effective when time is short due to the limited amount of coding needed. However, proceed with caution with imputation! Some variables, such as credit scores, should not be imputed due to the sensitivity of the variable. Variables that pertain to sensitive information should be left as is because it is very difficult to predict the variable since so many factors contribute to the sensitivity of the variable.
# 3 Create a predictive model for the variable’s missing values.
If a variable has fewer than 50% missing values, prediction is another viable option. However, the type of predictive model to predict the missing values is dependent on the variable. Regardless, an analyst can follow the model building process to determine and build the most effective model to predict the missing values. These predictions may help determine if the variable is important to the dataset in the grand scheme of things. Although predictive modeling is a viable option, creating a predictive model for missing data within a variable can be a time consuming task. Additionally, using predictor variables to predict the values of a variable as well as the target variable adds multicollinearity into the model, so any variables used to predict missing values cannot be used to predict the target.
Following these three practices will help you appropriately handle the missing values dilemma.
Columnist: Timothy Hetherington