Data Checks

EvalML provides data checks to help guide you in achieving the highest performing model. These utility functions help deal with problems such as overfitting, abnormal data, and missing data. These data checks can be found under evalml/data_checks. Below we will cover examples for each available data check in EvalML, as well as the DefaultDataChecks used in AutoMLSearch.search.

Missing Data

Missing data, or rows with NaN values, pose many challenges for machine learning pipelines. In the worst case, many algorithms simply will not run with missing data! EvalML pipelines contain imputation components to ensure that doesn’t happen. Imputation works by approximating missing values with existing values. However, if a column contains a high number of missing values, a large percentage of the column would be approximated from a small percentage of observed values. This could leave the column without useful information for machine learning pipelines. HighlyNullDataCheck alerts you to this potential problem by returning the columns that meet or exceed the missing values threshold.

[1]:
import numpy as np
import pandas as pd

from evalml.data_checks import HighlyNullDataCheck

X = pd.DataFrame([[1, 2, 3],
                  [0, 4, np.nan],
                  [1, 4, np.nan],
                  [9, 4, np.nan],
                  [8, 6, np.nan]])

null_check = HighlyNullDataCheck(pct_null_threshold=0.8)
results = null_check.validate(X)

for message in results['warnings']:
    print("Warning:", message['message'])

for message in results['errors']:
    print("Error:", message['message'])
Warning: Column '2' is 80.0% or more null
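
To make the threshold concrete, the following is a minimal pandas-only sketch of the kind of computation involved. It is not EvalML's implementation; the variable names and the "meets or exceeds" comparison are assumptions based on the warning text above.

```python
import numpy as np
import pandas as pd

# Same toy frame as the example above: column 2 is null in 4 of 5 rows.
X = pd.DataFrame([[1, 2, 3],
                  [0, 4, np.nan],
                  [1, 4, np.nan],
                  [9, 4, np.nan],
                  [8, 6, np.nan]])

pct_null_threshold = 0.8  # same value passed to HighlyNullDataCheck above

# Fraction of missing values in each column.
null_fraction = X.isnull().mean()

# Assumption: a column is flagged when its null fraction meets or exceeds the threshold.
highly_null_columns = null_fraction[null_fraction >= pct_null_threshold].index.tolist()
print(highly_null_columns)
```

Only column 2 reaches the 80% threshold here, which matches the warning printed by the data check.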

Abnormal Data

EvalML provides a few data checks to check for abnormal data:

  • NoVarianceDataCheck

  • ClassImbalanceDataCheck

  • TargetLeakageDataCheck

  • InvalidTargetDataCheck

  • IDColumnsDataCheck

  • OutliersDataCheck

  • HighVarianceCVDataCheck

  • MulticollinearityDataCheck

Zero Variance

Data with zero variance indicates that all values are identical. If a feature has zero variance, it is not likely to be a useful feature. Similarly, if the target has zero variance, there is likely something wrong. NoVarianceDataCheck checks if the target or any feature has only one unique value and alerts you to any such columns.

[2]:
from evalml.data_checks import NoVarianceDataCheck
X = pd.DataFrame({"no var col": [0, 0, 0],
                 "good col":[0, 4, 1]})
y = pd.Series([1, 0, 1])
no_variance_data_check = NoVarianceDataCheck()
results = no_variance_data_check.validate(X, y)

for message in results['warnings']:
    print("Warning:", message['message'])

for message in results['errors']:
    print("Error:", message['message'])
Error: no var col has 1 unique value.

Note that you can set NaN to count as a unique value, but NoVarianceDataCheck will still return a warning if there is only one unique non-NaN value in a given column.

[3]:
from evalml.data_checks import NoVarianceDataCheck

X = pd.DataFrame({"no var col": [0, 0, 0],
                 "no var col with nan": [1, np.nan, 1],
                 "good col":[0, 4, 1]})
y = pd.Series([1, 0, 1])

no_variance_data_check = NoVarianceDataCheck(count_nan_as_value=True)
results = no_variance_data_check.validate(X, y)

for message in results['warnings']:
    print("Warning:", message['message'])

for message in results['errors']:
    print("Error:", message['message'])
Warning: no var col with nan has two unique values including nulls. Consider encoding the nulls for this column to be useful for machine learning.
Error: no var col has 1 unique value.
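
The distinction between the warning and the error can be sketched with plain pandas. This is only an illustration of the idea, not the check's actual code; the variable names are made up for this example.

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({"no var col": [0, 0, 0],
                  "no var col with nan": [1, np.nan, 1],
                  "good col": [0, 4, 1]})

# Unique values per column, first ignoring NaN, then counting NaN as its own value.
unique_ignoring_nan = X.nunique(dropna=True)
unique_counting_nan = X.nunique(dropna=False)

# Columns with a single unique value even when NaN is counted (the "error" case).
single_value = unique_counting_nan == 1
single_value_cols = single_value[single_value].index.tolist()

# Columns that vary only because NaN is counted as a value (the "warning" case).
nan_only_variation = (unique_ignoring_nan == 1) & (unique_counting_nan == 2)
nan_only_variation_cols = nan_only_variation[nan_only_variation].index.tolist()

print(single_value_cols)
print(nan_only_variation_cols)
```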

Class Imbalance

For classification problems, the distribution of examples across each class can vary. For small variations, this is normal and expected. However, when the number of examples for each class label is disproportionately biased or skewed towards a particular class (or classes), it can be difficult for machine learning models to predict well. In addition, having a low number of examples for a given class could mean that one or more of the CV folds generated for the training data have few or no examples from that class. This may cause the model to predict only the majority class, ultimately resulting in a poorly performing model.

ClassImbalanceDataCheck checks if the target labels are imbalanced beyond a specified threshold for a certain number of CV folds. It returns DataCheckError messages for any classes that have fewer samples than double the number of CV folds specified (since that indicates a high likelihood of having few or no samples of that class in a given fold), and DataCheckWarning messages for any classes that fall below the set threshold percentage.

[4]:
from evalml.data_checks import ClassImbalanceDataCheck

X = pd.DataFrame([[1, 2, 0, 1],
                  [4, 1, 9, 0],
                  [4, 4, 8, 3],
                  [9, 2, 7, 1],
                  [2, 5, 3, 0]])
y = pd.Series([0, 1, 1, 1, 1])

class_imbalance_check = ClassImbalanceDataCheck(threshold=0.25, num_cv_folds=4)
results = class_imbalance_check.validate(X, y)

for message in results['warnings']:
    print("Warning:", message['message'])

for message in results['errors']:
    print("Error:", message['message'])
Warning: The following labels fall below 25% of the target: [0]
Error: The number of instances of these targets is less than 2 * the number of cross folds = 8 instances: [1, 0]
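
The two rules described above can be reproduced by hand with the standard library. This is a rough sketch under the assumption that the rules are exactly "fraction below threshold" and "count below 2 * num_cv_folds"; the variable names are illustrative and not part of EvalML.

```python
from collections import Counter

y = [0, 1, 1, 1, 1]   # same target as in the example above
threshold = 0.25
num_cv_folds = 4

counts = Counter(y)
total = len(y)

# Warning rule: classes whose share of the target falls below the threshold.
below_threshold = sorted(label for label, n in counts.items() if n / total < threshold)

# Error rule: classes with fewer than 2 * num_cv_folds samples.
too_few_for_cv = sorted(label for label, n in counts.items() if n < 2 * num_cv_folds)

print(below_threshold)
print(too_few_for_cv)
```

Class 0 makes up 20% of the target, so it falls under the 25% threshold, and neither class reaches the 8 samples needed for 4 folds.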

Target Leakage

Target leakage, also known as data leakage, can occur when you train your model on a dataset that includes information that should not be available at the time of prediction. This causes the model to score suspiciously well, but perform poorly in production. TargetLeakageDataCheck checks for features that could potentially be “leaking” information by calculating the Pearson correlation coefficient between each feature and the target, and warns you if any features are highly correlated with the target. Currently, only numerical features are considered.

[5]:
from evalml.data_checks import TargetLeakageDataCheck
X = pd.DataFrame({'leak': [10, 42, 31, 51, 61],
                  'x': [42, 54, 12, 64, 12],
                  'y': [12, 5, 13, 74, 24]})
y = pd.Series([10, 42, 31, 51, 40])

target_leakage_check = TargetLeakageDataCheck(pct_corr_threshold=0.8)
results = target_leakage_check.validate(X, y)

for message in results['warnings']:
    print("Warning:", message['message'])

for message in results['errors']:
    print("Error:", message['message'])
Warning: Column 'leak' is 80.0% or more correlated with the target
Warning: Column 'x' is 80.0% or more correlated with the target
Warning: Column 'y' is 80.0% or more correlated with the target
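
For intuition about a correlation-based leakage test, here is a small standalone sketch using pandas' Pearson correlation. The data, column names, and the "absolute correlation meets or exceeds the threshold" rule are assumptions made for this illustration; the data check itself may compute or report correlation differently.

```python
import pandas as pd

# Hypothetical data: one feature is a linear rescaling of the target, one is unrelated.
X = pd.DataFrame({"scaled_target": [3, 5, 7, 9, 11],
                  "unrelated": [2, 9, 4, 1, 6]})
y = pd.Series([1, 2, 3, 4, 5])

pct_corr_threshold = 0.8

# Pearson correlation between each feature column and the target.
correlations = X.corrwith(y)

# Flag features whose absolute correlation meets or exceeds the threshold.
suspect = correlations.abs() >= pct_corr_threshold
suspect_columns = suspect[suspect].index.tolist()
print(suspect_columns)
```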

Invalid Target Data

The InvalidTargetDataCheck checks if the target data contains any missing or invalid values. Specifically:

  • if any of the target values are missing, a DataCheckError message is returned

  • if the specified problem type is a binary classification problem but the target does not contain exactly two unique values, a DataCheckError message is returned

  • if binary classification target classes are numeric values not equal to {0, 1}, a DataCheckError message is returned because it can cause unpredictable behavior when passed to pipelines

[6]:
from evalml.data_checks import InvalidTargetDataCheck

X = pd.DataFrame({})
y = pd.Series([0, 1, None, None])

invalid_target_check = InvalidTargetDataCheck('binary', 'Log Loss Binary')
results = invalid_target_check.validate(X, y)

for message in results['warnings']:
    print("Warning:", message['message'])

for message in results['errors']:
    print("Error:", message['message'])
Error: 2 row(s) (50.0%) of target values are null
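
The three rules listed above can be sketched as a small standalone function. The helper name `binary_target_problems` and its message strings are hypothetical and exist only for this illustration; the real check returns DataCheckError messages instead.

```python
import pandas as pd

def binary_target_problems(y):
    """Return a list of problems found in a binary target (hypothetical helper)."""
    problems = []

    # Rule 1: missing target values.
    n_null = int(y.isnull().sum())
    if n_null:
        problems.append(f"{n_null} target value(s) are null")

    # Rule 2: a binary target needs exactly two unique non-null values.
    unique_values = set(y.dropna().unique())
    if len(unique_values) != 2:
        problems.append(f"expected exactly 2 unique values, found {len(unique_values)}")
    # Rule 3: numeric binary classes should be 0 and 1.
    elif pd.api.types.is_numeric_dtype(y) and unique_values != {0, 1}:
        problems.append("numeric binary classes should be 0 and 1")

    return problems

print(binary_target_problems(pd.Series([0, 1, None, None])))
print(binary_target_problems(pd.Series([1, 2, 1, 2])))
```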

ID Columns

ID columns in your dataset provide little to no benefit to a machine learning pipeline, as the pipeline cannot extrapolate useful information from unique identifiers. Thus, IDColumnsDataCheck alerts you if such columns exist. In the given example, the ‘user_number’ and ‘id’ columns are both identified as potentially being unique identifiers that should be removed.

[7]:
from evalml.data_checks import IDColumnsDataCheck

X = pd.DataFrame([[0, 53, 6325, 5],
                  [1, 90, 6325, 10],
                  [2, 90, 18, 20]],
                 columns=['user_number', 'cost', 'revenue', 'id'])

id_col_check = IDColumnsDataCheck(id_threshold=0.9)
results = id_col_check.validate(X, y)

for message in results['warnings']:
    print("Warning:", message['message'])

for message in results['errors']:
    print("Error:", message['message'])
Warning: Column 'id' is 90.0% or more likely to be an ID column
Warning: Column 'user_number' is 90.0% or more likely to be an ID column

High Variance Cross-Validation Scores

The HighVarianceCVDataCheck data check is used in AutoMLSearch to detect if the variance between folds in cross-validation is higher than a specified threshold. High variance across cross-validation folds indicates that the underlying model may be overfitting to the fold data; this is unfavorable and can create an underperforming model in production.

The HighVarianceCVDataCheck is unique because it is the only data check that is not run before the search in AutoMLSearch.search() begins, but rather, during each CV fold.
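
Since this check runs inside the search, there is no standalone example here. As a rough sketch of the idea, one common way to quantify variance across folds is the coefficient of variation (standard deviation divided by mean) of the fold scores. The function name, the 0.2 threshold, and the choice of this statistic are assumptions for illustration and may not match what HighVarianceCVDataCheck does internally.

```python
import statistics

def has_high_cv_variance(fold_scores, threshold=0.2):
    """Return True if the fold scores vary more than the threshold allows (hypothetical helper)."""
    mean = statistics.mean(fold_scores)
    if mean == 0:
        return False  # the ratio is undefined when the mean score is zero
    # Coefficient of variation: sample standard deviation relative to the mean.
    return abs(statistics.stdev(fold_scores) / mean) > threshold

stable_scores = [0.81, 0.80, 0.82]    # folds agree closely
unstable_scores = [0.95, 0.50, 0.70]  # folds disagree widely

print(has_high_cv_variance(stable_scores))
print(has_high_cv_variance(unstable_scores))
```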

Multicollinearity Data Check

The MulticollinearityDataCheck data check is used to detect whether any set of features is likely to be multicollinear. Multicollinear features can affect the performance of a model but, more importantly, they may greatly impact model interpretation. EvalML uses mutual information to determine collinearity.

[8]:
from evalml.data_checks import MulticollinearityDataCheck

y = pd.Series([1, 0, 2, 3, 4])
X = pd.DataFrame({'col_1': y,
                  'col_2': y * 3,
                  'col_3': ~y,
                  'col_4': y / 2,
                  'col_5': y + 1,
                  'not_collinear': [0, 1, 0, 0, 0]})

multi_check = MulticollinearityDataCheck(threshold=0.95)
results = multi_check.validate(X, y)

for message in results['warnings']:
    print("Warning:", message['message'])

for message in results['errors']:
    print("Error:", message['message'])
Warning: Columns are likely to be correlated: [('col_1', 'col_2'), ('col_1', 'col_3'), ('col_1', 'col_4'), ('col_1', 'col_5'), ('col_2', 'col_3'), ('col_2', 'col_4'), ('col_2', 'col_5'), ('col_3', 'col_4'), ('col_3', 'col_5'), ('col_4', 'col_5')]

Outliers

Outliers are observations that differ significantly from other observations in the same sample. Many machine learning pipelines suffer in performance if outliers are not dropped from the training set, as they are not representative of the data. OutliersDataCheck() uses the interquartile range (IQR) to notify you if a sample can be considered an outlier.

Below we generate a dataset with some outliers.

[9]:
data = np.tile(np.arange(10) * 0.01, (100, 10))
X = pd.DataFrame(data=data)

# generate some outliers in columns 3, 25, 55, and 72
X.iloc[0, 3] = -10000
X.iloc[3, 25] = 10000
X.iloc[5, 55] = 10000
X.iloc[10, 72] = -10000

We then utilize OutliersDataCheck() to rediscover these outliers.

[10]:
from evalml.data_checks import OutliersDataCheck

outliers_check = OutliersDataCheck()
results = outliers_check.validate(X, y)

for message in results['warnings']:
    print("Warning:", message['message'])

for message in results['errors']:
    print("Error:", message['message'])
Warning: Column(s) '3', '25', '55', '72' are likely to have outlier data.
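
For reference, a textbook IQR rule can be applied to the same data with plain pandas. This sketch assumes the conventional 1.5 * IQR fences; OutliersDataCheck may use a different multiplier or implementation, so treat it only as an illustration of the technique.

```python
import numpy as np
import pandas as pd

# Rebuild the dataset from the example above, including the four injected outliers.
data = np.tile(np.arange(10) * 0.01, (100, 10))
X = pd.DataFrame(data=data)
X.iloc[0, 3] = -10000
X.iloc[3, 25] = 10000
X.iloc[5, 55] = 10000
X.iloc[10, 72] = -10000

# Per-column quartiles and interquartile range.
q1 = X.quantile(0.25)
q3 = X.quantile(0.75)
iqr = q3 - q1

# Conventional fences: values beyond 1.5 * IQR from the quartiles count as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = (X < lower) | (X > upper)

# Columns containing at least one outlier.
has_outlier = is_outlier.any()
outlier_columns = has_outlier[has_outlier].index.tolist()
print(outlier_columns)
```

Because every other value in a column is identical, the IQR of each column is zero and only the four injected values fall outside the fences.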

Data Check Messages

Each data check’s validate method returns a dictionary with ‘warnings’ and ‘errors’ keys, each mapping to a list of messages; warnings are represented by DataCheckWarning (API reference) and errors by DataCheckError (API reference). You can filter the messages returned by a data check by reading the key for the type of message you are interested in. Below, NoVarianceDataCheck returns one DataCheckWarning message and one DataCheckError message, and we print each group separately.

[11]:
from evalml.data_checks import NoVarianceDataCheck, DataCheckError, DataCheckWarning

X = pd.DataFrame({"no var col": [0, 0, 0],
                 "no var col with nan": [1, np.nan, 1],
                 "good col":[0, 4, 1]})
y = pd.Series([1, 0, 1])

no_variance_data_check = NoVarianceDataCheck(count_nan_as_value=True)
results = no_variance_data_check.validate(X, y)

for message in results['warnings']:
    print("Warning:", message['message'])

for message in results['errors']:
    print("Error:", message['message'])
Warning: no var col with nan has two unique values including nulls. Consider encoding the nulls for this column to be useful for machine learning.
Error: no var col has 1 unique value.

Writing Your Own Data Check

If you would prefer to write your own data check, you can do so by extending the DataCheck class and implementing the validate(self, X, y) method. Below, we’ve created a new DataCheck, ZeroVarianceDataCheck, which is similar to NoVarianceDataCheck defined in EvalML. The validate(self, X, y) method should return a dictionary with ‘warnings’ and ‘errors’ as keys mapping to lists of warnings and errors, respectively.

[12]:
from evalml.data_checks import DataCheck, DataCheckWarning

class ZeroVarianceDataCheck(DataCheck):
    def validate(self, X, y):
        messages = {'warnings': [], 'errors': []}
        if not isinstance(X, pd.DataFrame):
            X = pd.DataFrame(X)
        warning_msg = "Column '{}' has zero variance"
        # Record a warning for every column that contains a single unique value
        messages['warnings'].extend([DataCheckWarning(warning_msg.format(column), self.name)
                                     for column in X.columns if len(X[column].unique()) == 1])
        return messages

Defining Collections of Data Checks

For convenience, EvalML provides a DataChecks class to represent a collection of data checks. We will go over DefaultDataChecks (API reference), a collection defined and used in AutoMLSearch.

Default Data Checks

By default, AutoMLSearch.search runs a collection of data checks before it searches and iterates over pipelines. This collection of data checks is stored in the DefaultDataChecks class. It consists of a few data checks that are generally helpful for any machine learning problem. They are:

  • HighlyNullDataCheck

  • IDColumnsDataCheck

  • TargetLeakageDataCheck

  • InvalidTargetDataCheck

  • ClassImbalanceDataCheck (for classification problem types)

  • NoVarianceDataCheck

Writing Your Own Collection of Data Checks

If you would prefer to create your own collection of data checks, you have two options: write your own class by extending DataChecks and setting the data_checks attribute to a list of DataCheck classes or objects, or pass that list directly to the constructor of the DataChecks class. Below, we create two identical collections of data checks using the two different methods.

[13]:
# Create a subclass of `DataChecks`
from evalml.data_checks import DataChecks, HighlyNullDataCheck, InvalidTargetDataCheck, NoVarianceDataCheck, ClassImbalanceDataCheck, TargetLeakageDataCheck
from evalml.problem_types import ProblemTypes, handle_problem_types

class MyCustomDataChecks(DataChecks):

    data_checks = [HighlyNullDataCheck, InvalidTargetDataCheck, NoVarianceDataCheck, TargetLeakageDataCheck]

    def __init__(self, problem_type, objective):
        """
        A collection of basic data checks.
        Arguments:
            problem_type (str): The problem type that is being validated. Can be regression, binary, or multiclass.
            objective (str): The objective passed on to InvalidTargetDataCheck.
        """
        if handle_problem_types(problem_type) == ProblemTypes.REGRESSION:
            super().__init__(self.data_checks,
                             data_check_params={"InvalidTargetDataCheck": {"problem_type": problem_type,
                                                                           "objective": objective}})
        else:
            super().__init__(self.data_checks + [ClassImbalanceDataCheck],
                             data_check_params={"InvalidTargetDataCheck": {"problem_type": problem_type,
                                                                           "objective": objective}})


custom_data_checks = MyCustomDataChecks(problem_type=ProblemTypes.REGRESSION, objective="R2")
for data_check in custom_data_checks.data_checks:
    print(data_check.name)
HighlyNullDataCheck
InvalidTargetDataCheck
NoVarianceDataCheck
TargetLeakageDataCheck
[14]:
# Pass list of data checks to the `data_checks` parameter of DataChecks
same_custom_data_checks = DataChecks(data_checks=[HighlyNullDataCheck, InvalidTargetDataCheck, NoVarianceDataCheck, TargetLeakageDataCheck],
                                    data_check_params={"InvalidTargetDataCheck": {"problem_type": ProblemTypes.REGRESSION,
                                                                                  "objective": "R2"}})
for data_check in same_custom_data_checks.data_checks:
    print(data_check.name)
HighlyNullDataCheck
InvalidTargetDataCheck
NoVarianceDataCheck
TargetLeakageDataCheck