One of the most important stages of data mining is preprocessing, where we prepare the data for mining. Real-world data tends to be incomplete, noisy, and inconsistent, so an important preprocessing task is to fill in missing values, smooth out noise, and correct inconsistencies.
1. Ignore the data row
This is usually done when the class label is missing (assuming your data mining goal is classification), or when many attributes are missing from the row, not just one. However, you'll obviously get poor performance if the percentage of such rows is high.
For example, let's say we have a database of student enrollment data (age, SAT score, state of residence, etc.) and a column classifying each student's success in college as "Low", "Medium", or "High". Let's say our goal is to build a model predicting a student's success in college. Data rows that are missing the success column are not useful for predicting success, so they can safely be ignored and removed before running the algorithm.
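As a minimal sketch (using pandas on a made-up enrollment table; the column names and values are hypothetical), dropping the rows whose class label is missing might look like this:

```python
import pandas as pd
import numpy as np

# Hypothetical student enrollment data; "success" is the class label.
df = pd.DataFrame({
    "age": [18, 19, 18, 20],
    "sat_score": [1200, 1350, 1100, 1400],
    "success": ["High", np.nan, "Medium", np.nan],
})

# Rows with no class label cannot help train a classifier, so drop them.
labeled = df.dropna(subset=["success"])
```

Here `labeled` keeps only the two fully labeled rows; the unlabeled ones are excluded before any model is trained.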
2. Use a global constant to fill in for missing values
Decide on a new global constant value, like "unknown", "N/A", or minus infinity, and use it to fill in all the missing values.
This technique is used because sometimes it simply doesn't make sense to try to predict the missing value.
For example, let's look at the student enrollment database again. Assume the state of residence is missing for some students. Filling it in with some arbitrary state doesn't really make sense, as opposed to using something like "N/A".
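A one-liner with pandas covers this case (the data below is hypothetical):

```python
import pandas as pd
import numpy as np

# Hypothetical enrollment data where some states are unknown.
df = pd.DataFrame({"state": ["NY", np.nan, "CA", np.nan]})

# Fill missing states with a sentinel constant rather than guessing a real state.
df["state"] = df["state"].fillna("N/A")
```

Any downstream algorithm then sees "N/A" as just another category, which is often preferable to inventing a plausible-looking but wrong value.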
3. Use attribute mean
Replace missing values of an attribute with the mean (or the median, if the attribute is discrete) value of that attribute across the whole database.
For example, in a database of US family incomes, if the average income of a US family is X, you can use that value to replace missing income values.
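A quick sketch of mean imputation with pandas, using made-up income figures:

```python
import pandas as pd
import numpy as np

# Hypothetical family incomes with two missing entries.
incomes = pd.Series([50_000, 60_000, np.nan, 70_000, np.nan])

# Replace each missing value with the mean of the known values
# (here (50,000 + 60,000 + 70,000) / 3 = 60,000).
incomes = incomes.fillna(incomes.mean())
```

Note that pandas' `mean()` ignores missing entries by default, so the mean is computed only over the known values.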
4. Use attribute mean for all samples belonging to the same class
Instead of using the mean (or median) of an attribute calculated over all the rows in the database, we can limit the calculation to the rows of the relevant class, making the value more relevant to the row we're looking at.
Let's say you have a car pricing database that, among other things, classifies cars as "Luxury" or "Low budget", and you're dealing with missing values in the cost field. Replacing the missing cost of a luxury car with the average cost of all luxury cars is probably more accurate than the value you'd get by factoring in the low budget cars as well.
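With pandas this is a `groupby` plus `transform` (again on hypothetical prices):

```python
import pandas as pd
import numpy as np

# Hypothetical car pricing data, classified as "Luxury" or "Low budget".
cars = pd.DataFrame({
    "segment": ["Luxury", "Luxury", "Low budget", "Low budget", "Luxury"],
    "cost": [90_000, np.nan, 15_000, 17_000, 110_000],
})

# Fill each missing cost with the mean cost of that car's own segment,
# not the global mean over all cars.
cars["cost"] = cars.groupby("segment")["cost"].transform(
    lambda s: s.fillna(s.mean())
)
```

The missing luxury-car cost becomes 100,000 (the mean of 90,000 and 110,000) instead of the much lower global average dragged down by the low budget cars.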
5. Use a data mining algorithm to predict the most probable value
The value can be determined using regression, inference-based tools using a Bayesian formalism, decision trees, or clustering algorithms (k-means/medians, etc.).
For example, we could use a clustering algorithm to create clusters of rows, which are then used to calculate an attribute mean or median as described in technique #3.
Another example would be using a decision tree to predict the probable value of the missing attribute, based on the other attributes in the row.
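As one concrete illustration of the regression variant, here is a sketch that fits a plain least-squares line (via NumPy's `polyfit`) on the complete rows and predicts the missing value from another attribute. The data and the assumed age-to-score relationship are entirely hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the SAT score is missing for one student.
df = pd.DataFrame({
    "age":       [18.0, 19.0, 20.0, 21.0, 22.0, 23.0],
    "sat_score": [1100.0, 1150.0, 1200.0, 1250.0, np.nan, 1350.0],
})

known = df[df["sat_score"].notna()]
# Fit a simple least-squares regression on the complete rows...
slope, intercept = np.polyfit(known["age"], known["sat_score"], deg=1)

# ...then predict the missing attribute from the other attribute.
mask = df["sat_score"].isna()
df.loc[mask, "sat_score"] = slope * df.loc[mask, "age"] + intercept
```

The same fill-in-from-a-model pattern applies if you swap the regression for a decision tree or a Bayesian model: train on the rows where the attribute is known, predict for the rows where it is missing.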
Strategies for Dealing with Missing Data
There are three main strategies for dealing with missing data. The simplest solution to the missing values problem is to reduce the data set by eliminating all samples with missing values. Another solution is to treat missing values as special values. Finally, the missing values problem can be handled by various missing value imputation methods. Unfortunately, imputation methods are only suitable for missing values caused by the missing completely at random (MCAR) mechanism, and some of them also for the missing at random (MAR) mechanism. If missing values are caused by the not missing at random (NMAR) mechanism, the problem must be handled by going back to the source of the data and obtaining more information, or an appropriate model of the missing data mechanism has to be taken into account.
The Missing Values Problem
A missing value is a value that we intended to obtain during data collection (interview, measurement, observation) but did not, for various reasons. Missing values can appear because a respondent did not answer all the questions in a questionnaire, because of mistakes during manual data entry, incorrect measurements, faulty experiments, because some data are censored or anonymous, and for many other reasons. Luengo, J. (2011) identified three problems associated with missing values:
– loss of efficiency,
– complications in handling and analyzing the data,
– bias resulting from differences between missing and complete data.
Loss of efficiency is caused by the time-consuming process of dealing with missing values. This is closely connected to the second problem: most methods and algorithms are unable to deal with missing values, so the missing values problem must be solved during the data preparation phase, prior to any analysis. The third problem, bias, lies in the fact that imputed values are not exactly the same as the true values would have been and should not be handled in the same way. The same problem also occurs if the data set is reduced and some cases (rows of the data set) are removed or ignored.