Understanding Data Check Actions

EvalML streamlines the creation and implementation of machine learning models for tabular data. One of its features is data checks, which help determine the health of our data before we train a model on it. Each data check can also return associated actions, which this notebook demonstrates. The default data checks are:

  • HighlyNullDataCheck: Checks whether the rows or columns are highly null

  • IDColumnsDataCheck: Checks for columns that could be ID columns

  • TargetLeakageDataCheck: Checks if any of the input features have high association with the targets

  • InvalidTargetDataCheck: Checks if there are null or other invalid values in the target

  • NoVarianceDataCheck: Checks if either the target or any features have no variance

  • NaturalLanguageNaNDataCheck: Checks if any natural language columns have missing data

  • DateTimeNaNDataCheck: Checks if any datetime columns have missing data

EvalML has additional data checks, which are listed, along with usage examples, elsewhere in the EvalML documentation. Below, we will walk through usage of EvalML’s default data checks and actions.

First, we import the necessary requirements to demonstrate these checks.

[1]:
import woodwork as ww
import pandas as pd
from evalml import AutoMLSearch
from evalml.demos import load_fraud
from evalml.preprocessing import split_data

Let’s look at the input feature data. EvalML uses the Woodwork library to represent this data. The demo data that EvalML returns is a pandas DataFrame and Series with Woodwork typing initialized.

[2]:
X, y = load_fraud(n_rows=1500)
X.head()
             Number of Features
Boolean                       1
Categorical                   6
Numeric                       5

Number of training examples: 1500
Targets
False    86.60%
True     13.40%
Name: fraud, dtype: object
[2]:
card_id store_id datetime amount currency customer_present expiration_date provider lat lng region country
id
0 32261 8516 2019-01-01 00:12:26 24900 CUC True 08/24 Mastercard 38.58894 -89.99038 Fairview Heights US
1 16434 8516 2019-01-01 09:42:03 15789 MYR False 11/21 Discover 38.58894 -89.99038 Fairview Heights US
2 23468 8516 2019-04-17 08:17:01 1883 AUD False 09/27 Discover 38.58894 -89.99038 Fairview Heights US
3 14364 8516 2019-01-30 11:54:30 82120 KRW True 09/20 JCB 16 digit 38.58894 -89.99038 Fairview Heights US
4 29407 8516 2019-05-01 17:59:36 25745 MUR True 09/22 American Express 38.58894 -89.99038 Fairview Heights US

Adding noise and unclean data

This data is already clean and compatible with EvalML’s AutoMLSearch. In order to demonstrate EvalML default data checks, we will add the following:

  • A column of mostly null values (<0.5% non-null)

  • A column with low/no variance

  • A row of null values

  • A missing target value

We will add the first two (the columns) to the whole dataset, and the last two (the null row and the missing target value) only to the training data. Note: these represent only some of the scenarios that EvalML default data checks can catch.

[3]:
# add a column with no variance in the data
X['no_variance'] = [1] * X.shape[0]

# add a column with >99.5% null values
X['mostly_nulls'] = [None] * (X.shape[0] - 5) + list(range(5))

# since we changed the data, let's reinitialize the Woodwork schema
X.ww.init()
# let's split some training and validation data
X_train, X_valid, y_train, y_valid = split_data(X, y, problem_type='binary')
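To make the numbers behind the two synthetic columns concrete, the following standard-library sketch rebuilds them as plain lists and computes the null fraction and the number of unique values. The row count comes from the load_fraud call above. The helper names (null_fraction, n_unique) are illustrative and are not part of EvalML, and the 0.95 cutoff is an assumption taken from the wording of the warning message shown later in this notebook.

```python
# Illustrative only: plain-Python stand-ins for the two columns added above.
n_rows = 1500  # matches load_fraud(n_rows=1500)

no_variance = [1] * n_rows
mostly_nulls = [None] * (n_rows - 5) + list(range(5))


def null_fraction(values):
    """Fraction of entries that are None (hypothetical helper)."""
    return sum(v is None for v in values) / len(values)


def n_unique(values):
    """Number of distinct non-null values (hypothetical helper)."""
    return len({v for v in values if v is not None})


print(f"mostly_nulls null fraction: {null_fraction(mostly_nulls):.4f}")
print(f"no_variance unique values: {n_unique(no_variance)}")

# Assumed cutoff, taken from the "95.0% or more null" warning text.
HIGHLY_NULL_THRESHOLD = 0.95
print("flagged as highly null:", null_fraction(mostly_nulls) >= HIGHLY_NULL_THRESHOLD)
```

With 1,495 of 1,500 entries null, the column sits well above the assumed 95% cutoff, and a column with a single unique value carries no information for a model.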
[4]:
# let's copy the datetime at row 1 for future use
date = X_train.iloc[1]['datetime']

# make row 1 all nan values
X_train.iloc[1] = [None] * X_train.shape[1]

# make one of the target values null
y_train[990] = None

X_train.ww.init()
y_train = ww.init_series(y_train)
# Let's take another look at the new X_train data
X_train
[4]:
card_id store_id datetime amount currency customer_present expiration_date provider lat lng region country no_variance mostly_nulls
id
872 15492.0 2868.0 2019-08-03 02:50:04 80719.0 HNL True 08/27 American Express 5.47090 100.24529 Batu Feringgi MY 1.0 NaN
1477 NaN NaN NaT NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
158 22440.0 6813.0 2019-07-12 11:07:25 1849.0 SEK True 09/20 American Express 26.26490 81.54855 Jais IN 1.0 NaN
808 8096.0 8096.0 2019-06-11 21:33:36 41358.0 MOP True 04/29 VISA 13 digit 59.37722 28.19028 Narva EE 1.0 NaN
336 33270.0 1529.0 2019-03-23 21:44:00 32594.0 CUC False 04/22 Mastercard 51.39323 0.47713 Strood GB 1.0 NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
339 8484.0 5358.0 2019-01-10 07:47:28 89503.0 GMD False 11/24 Maestro 47.30997 8.52462 Adliswil CH 1.0 NaN
1383 17565.0 3929.0 2019-01-15 01:11:02 14264.0 DKK True 06/20 VISA 13 digit 50.72043 11.34046 Rudolstadt DE 1.0 NaN
893 108.0 44.0 2019-05-17 00:53:39 93218.0 SLL True 12/24 JCB 16 digit 15.72892 120.57224 Burgos PH 1.0 NaN
385 29983.0 152.0 2019-06-09 06:50:29 41105.0 RWF False 07/20 JCB 16 digit -6.80000 39.25000 Magomeni TZ 1.0 NaN
1074 26197.0 4927.0 2019-05-22 15:57:27 50481.0 MNT False 05/26 JCB 15 digit 41.00510 -73.78458 Scarsdale US 1.0 NaN

1200 rows × 14 columns

If we call AutoMLSearch.search() on this data, the search will fail due to the columns and issues we’ve added above. Note: we use a try/except here to catch the resulting ValueError that AutoMLSearch raises.

[5]:
automl = AutoMLSearch(X_train=X_train, y_train=y_train, problem_type='binary')
try:
    automl.search()
except ValueError as e:
    # to make the error message more distinct
    print("=" * 80, "\n")
    print("Search errored out! Message received is: {}".format(e))
    print("=" * 80, "\n")
================================================================================

Search errored out! Message received is: Input contains NaN, infinity or a value too large for dtype('float64').
================================================================================

We can use the search_iterative() function provided in EvalML to determine what potential health issues our data has. Note that search_iterative() is a public function available through evalml.automl, and that it is different from the search() method of the AutoMLSearch class. It runs the default data checks on the data and, if there are no errors, automatically runs AutoMLSearch.search().
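The "check first, search only if clean" behavior described above can be sketched in a few lines. This is a hypothetical illustration of the control flow only: run_checks_then_search, fake_checks, and fake_search are made-up stand-ins and do not reproduce or call EvalML's implementation.

```python
# Hypothetical stand-ins: this mimics the described control flow only.
def run_checks_then_search(X, y, run_data_checks, run_search):
    """Run data checks first; only run the search when no errors are found."""
    results = run_data_checks(X, y)
    if results["errors"]:
        # Errors would break the search, so skip it and return None.
        return None, results
    return run_search(X, y), results


def fake_checks(X, y):
    # Flags a null target value as an error, loosely like TARGET_HAS_NULL.
    errors = [{"code": "TARGET_HAS_NULL"}] if None in y else []
    return {"warnings": [], "errors": errors, "actions": []}


def fake_search(X, y):
    return "search-object"


dirty = run_checks_then_search([[1], [2]], [True, None], fake_checks, fake_search)
clean = run_checks_then_search([[1], [2]], [True, False], fake_checks, fake_search)
print(dirty)
print(clean)
```

The first call returns None in place of a search object because the checks reported an error. The second call returns the search result alongside an empty error list.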

[6]:
from evalml.automl import search_iterative
results = search_iterative(X_train, y_train, problem_type='binary')
results
[6]:
(None,
 {'warnings': [{'message': '1 out of 1200 rows are 95.0% or more null',
    'data_check_name': 'HighlyNullDataCheck',
    'level': 'warning',
    'details': {'columns': None,
     'rows': [1477],
     'pct_null_cols': id
     1477    1.0
     dtype: float64},
    'code': 'HIGHLY_NULL_ROWS'},
   {'message': "Columns 'mostly_nulls' are 95.0% or more null",
    'data_check_name': 'HighlyNullDataCheck',
    'level': 'warning',
    'details': {'columns': ['mostly_nulls'],
     'rows': None,
     'pct_null_rows': {'mostly_nulls': 0.9966666666666667}},
    'code': 'HIGHLY_NULL_COLS'}],
  'errors': [{'message': '1 row(s) (0.08333333333333334%) of target values are null',
    'data_check_name': 'InvalidTargetDataCheck',
    'level': 'error',
    'details': {'columns': None,
     'rows': None,
     'num_null_rows': 1,
     'pct_null_rows': 0.08333333333333334},
    'code': 'TARGET_HAS_NULL'},
   {'message': "'no_variance' has 1 unique value.",
    'data_check_name': 'NoVarianceDataCheck',
    'level': 'error',
    'details': {'columns': ['no_variance'], 'rows': None},
    'code': 'NO_VARIANCE'},
   {'message': 'Input datetime column(s) (datetime) contains NaN values. Please impute NaN values or drop these rows or columns.',
    'data_check_name': 'DateTimeNaNDataCheck',
    'level': 'error',
    'details': {'columns': ['datetime'], 'rows': None},
    'code': 'DATETIME_HAS_NAN'}],
  'actions': [{'code': 'DROP_ROWS',
    'data_check_name': 'HighlyNullDataCheck',
    'metadata': {'columns': None, 'rows': [1477]}},
   {'code': 'DROP_COL',
    'data_check_name': 'HighlyNullDataCheck',
    'metadata': {'columns': ['mostly_nulls'], 'rows': None}},
   {'code': 'IMPUTE_COL',
    'data_check_name': 'InvalidTargetDataCheck',
    'metadata': {'columns': None,
     'rows': None,
     'is_target': True,
     'impute_strategy': 'most_frequent'}},
   {'code': 'DROP_COL',
    'data_check_name': 'NoVarianceDataCheck',
    'metadata': {'columns': ['no_variance'], 'rows': None}}]})

The return value of the search_iterative function above is a tuple. The first element is the AutoMLSearch object if the search runs (and None otherwise), and the second element is a dictionary of the warnings, errors, and suggested actions that the default data checks produce for the passed-in X and y data. In this dictionary, warnings are suggestions that can be useful to address to improve the search but will not break AutoMLSearch. Errors, on the other hand, indicate issues that will break AutoMLSearch and need to be addressed by the user.

Above, we can see that there were errors, so the search did not run automatically.
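Because the second element of the tuple is a plain dictionary, it can be summarized with ordinary Python. The sketch below uses an abbreviated, hand-copied version of the output above (only the data_check_name and code fields are kept). The summarize helper is hypothetical and not an EvalML function.

```python
from collections import Counter

# Abbreviated, hand-copied version of the dictionary printed above.
data_check_results = {
    "warnings": [
        {"data_check_name": "HighlyNullDataCheck", "code": "HIGHLY_NULL_ROWS"},
        {"data_check_name": "HighlyNullDataCheck", "code": "HIGHLY_NULL_COLS"},
    ],
    "errors": [
        {"data_check_name": "InvalidTargetDataCheck", "code": "TARGET_HAS_NULL"},
        {"data_check_name": "NoVarianceDataCheck", "code": "NO_VARIANCE"},
        {"data_check_name": "DateTimeNaNDataCheck", "code": "DATETIME_HAS_NAN"},
    ],
}


def summarize(results):
    """Count issues per (level, data check name). Hypothetical helper."""
    return Counter(
        (level, issue["data_check_name"])
        for level in ("warnings", "errors")
        for issue in results[level]
    )


summary = summarize(data_check_results)
for (level, name), count in sorted(summary.items()):
    print(f"{level:8s} {name}: {count}")

# Per the description above, any error prevents the search from running.
search_can_run = not data_check_results["errors"]
print("search can run:", search_can_run)
```

In the notebook itself, the same summary could be applied to results[1] directly instead of the hand-copied dictionary.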

Addressing warnings and errors

We can automatically address the warnings and errors returned by search_iterative by using make_pipeline_from_actions, a utility method that creates a pipeline that will automatically clean up our data. We just need to pass this method DataCheckAction objects and our problem type.

[7]:
# Data check output returns a list of the dictionary version of each action that should be taken to clean up the data
results[1]['actions']
[7]:
[{'code': 'DROP_ROWS',
  'data_check_name': 'HighlyNullDataCheck',
  'metadata': {'columns': None, 'rows': [1477]}},
 {'code': 'DROP_COL',
  'data_check_name': 'HighlyNullDataCheck',
  'metadata': {'columns': ['mostly_nulls'], 'rows': None}},
 {'code': 'IMPUTE_COL',
  'data_check_name': 'InvalidTargetDataCheck',
  'metadata': {'columns': None,
   'rows': None,
   'is_target': True,
   'impute_strategy': 'most_frequent'}},
 {'code': 'DROP_COL',
  'data_check_name': 'NoVarianceDataCheck',
  'metadata': {'columns': ['no_variance'], 'rows': None}}]
[8]:
from evalml.pipelines.utils import make_pipeline_from_actions
from evalml.data_checks import DataCheckAction

# Convert the dictionary form of each action in the data check output into a DataCheckAction object
actions = [
    DataCheckAction.convert_dict_to_action(action)
    for action in results[1]['actions']
]

actions_pipeline = make_pipeline_from_actions("binary", actions)
actions_pipeline.fit(X_train, y_train)
X_train_cleaned, y_train_cleaned = actions_pipeline.transform(X_train, y_train)
print("The new length of X_train is {} and y_train is {}".format(len(X_train_cleaned), len(y_train_cleaned)))
The new length of X_train is 1199 and y_train is 1199
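One detail in the earlier output is worth checking by hand: three data checks reported errors, but the actions list has no entry from DateTimeNaNDataCheck. The sketch below compares the check names copied from the errors and actions lists printed earlier. It only inspects those names and makes no claim about how EvalML pairs errors with actions internally.

```python
# Check names copied from the `errors` and `actions` lists printed earlier.
error_checks = [
    "InvalidTargetDataCheck",
    "NoVarianceDataCheck",
    "DateTimeNaNDataCheck",
]
action_checks = [
    "HighlyNullDataCheck",     # DROP_ROWS for row 1477
    "HighlyNullDataCheck",     # DROP_COL for 'mostly_nulls'
    "InvalidTargetDataCheck",  # IMPUTE_COL on the target
    "NoVarianceDataCheck",     # DROP_COL for 'no_variance'
]

checks_with_actions = set(action_checks)
errors_without_own_action = [
    name for name in error_checks if name not in checks_with_actions
]
print(errors_without_own_action)
```

The datetime error has no action of its own here. Presumably it is resolved as a side effect of the DROP_ROWS action, since row 1477 is the only row we set to null. That reading is consistent with the row count falling from 1200 to 1199.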

Now, we can run search_iterative to completion.

[9]:
results_cleaned = search_iterative(X_train_cleaned, y_train_cleaned, problem_type='binary')
        High coefficient of variation (cv >= 0.5) within cross validation scores.
        Decision Tree Classifier w/ Label Encoder + DateTime Featurizer + Imputer + One Hot Encoder + Oversampler may not perform as estimated on unseen data.

Note that this time, we get an AutoMLSearch object returned to us as the first element of the tuple. We can use and inspect the AutoMLSearch object as needed.

[10]:
automl_object = results_cleaned[0]
automl_object.rankings
[10]:
id pipeline_name search_order mean_cv_score standard_deviation_cv_score validation_score percent_better_than_baseline high_variance_cv parameters
0 3 XGBoost Classifier w/ Label Encoder + DateTime... 3 0.246143 0.040849 0.246143 94.692653 False {'Label Encoder': {'positive_label': None}, 'D...
1 6 Random Forest Classifier w/ Label Encoder + Da... 6 0.269118 0.007188 0.269118 94.197251 False {'Label Encoder': {'positive_label': None}, 'D...
2 4 LightGBM Classifier w/ Label Encoder + DateTim... 4 0.343009 0.059643 0.343009 92.604012 False {'Label Encoder': {'positive_label': None}, 'D...
3 8 Extra Trees Classifier w/ Label Encoder + Date... 8 0.356543 0.006981 0.356543 92.312192 False {'Label Encoder': {'positive_label': None}, 'D...
4 1 Elastic Net Classifier w/ Label Encoder + Date... 1 0.390674 0.022512 0.390674 91.576270 False {'Label Encoder': {'positive_label': None}, 'D...
5 2 Logistic Regression Classifier w/ Label Encode... 2 0.393342 0.022323 0.393342 91.518737 False {'Label Encoder': {'positive_label': None}, 'D...
6 5 CatBoost Classifier w/ Label Encoder + DateTim... 5 0.546942 0.001789 0.546942 88.206800 False {'Label Encoder': {'positive_label': None}, 'D...
7 7 Decision Tree Classifier w/ Label Encoder + Da... 7 1.094903 0.411451 1.094903 76.391644 True {'Label Encoder': {'positive_label': None}, 'D...
8 0 Mode Baseline Binary Classification Pipeline 0 4.637776 0.043230 4.637776 0.000000 False {'Label Encoder': {'positive_label': None}, 'B...

If we check the second element in the tuple, we can see that there are no longer any warnings or errors detected!

[11]:
data_check_results = results_cleaned[1]
data_check_results
[11]:
{'warnings': [], 'errors': [], 'actions': []}

Only addressing DataCheck errors

Previously, we used make_pipeline_from_actions to address all of the warnings and errors returned by search_iterative. We will now show how we can also manually address errors to allow AutoMLSearch to run, and how ignoring warnings will come at the expense of performance.

We can print out the errors first to make it easier to read, and then we’ll create new features and targets from the original training data.

[12]:
results[1]['errors']
[12]:
[{'message': '1 row(s) (0.08333333333333334%) of target values are null',
  'data_check_name': 'InvalidTargetDataCheck',
  'level': 'error',
  'details': {'columns': None,
   'rows': None,
   'num_null_rows': 1,
   'pct_null_rows': 0.08333333333333334},
  'code': 'TARGET_HAS_NULL'},
 {'message': "'no_variance' has 1 unique value.",
  'data_check_name': 'NoVarianceDataCheck',
  'level': 'error',
  'details': {'columns': ['no_variance'], 'rows': None},
  'code': 'NO_VARIANCE'},
 {'message': 'Input datetime column(s) (datetime) contains NaN values. Please impute NaN values or drop these rows or columns.',
  'data_check_name': 'DateTimeNaNDataCheck',
  'level': 'error',
  'details': {'columns': ['datetime'], 'rows': None},
  'code': 'DATETIME_HAS_NAN'}]
[13]:
# copy the training data to new variables
X_train_no_errors = X_train.copy()
y_train_no_errors = y_train.copy()

# We address the errors by looking at the resulting dictionary errors listed

# first, let's address the `TARGET_HAS_NULL` error
y_train_no_errors.fillna(False, inplace=True)

# here, we address the `NO_VARIANCE` error
X_train_no_errors.drop("no_variance", axis=1, inplace=True)

# lastly, we address the `DATETIME_HAS_NAN` error with the date we had saved earlier
X_train_no_errors.iloc[1, 2] = date

# let's reinitialize the Woodwork schema
X_train_no_errors.ww.init()
X_train_no_errors.head()
[13]:
card_id store_id datetime amount currency customer_present expiration_date provider lat lng region country mostly_nulls
id
872 15492.0 2868.0 2019-08-03 02:50:04 80719.0 HNL True 08/27 American Express 5.47090 100.24529 Batu Feringgi MY NaN
1477 NaN NaN 2019-08-05 21:05:57 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
158 22440.0 6813.0 2019-07-12 11:07:25 1849.0 SEK True 09/20 American Express 26.26490 81.54855 Jais IN NaN
808 8096.0 8096.0 2019-06-11 21:33:36 41358.0 MOP True 04/29 VISA 13 digit 59.37722 28.19028 Narva EE NaN
336 33270.0 1529.0 2019-03-23 21:44:00 32594.0 CUC False 04/22 Mastercard 51.39323 0.47713 Strood GB NaN

We can now run search on X_train_no_errors and y_train_no_errors. Note that the search here doesn’t fail since we addressed the errors, but the returned tuple will still contain warnings. This search allows the mostly_nulls column to remain in the features.

[14]:
results_no_errors = search_iterative(X_train_no_errors, y_train_no_errors, problem_type='binary')
results_no_errors
        High coefficient of variation (cv >= 0.5) within cross validation scores.
        Decision Tree Classifier w/ Label Encoder + DateTime Featurizer + Imputer + One Hot Encoder + Oversampler may not perform as estimated on unseen data.
[14]:
(<evalml.automl.automl_search.AutoMLSearch at 0x7f9509fc6fa0>,
 {'warnings': [{'message': "Columns 'mostly_nulls' are 95.0% or more null",
    'data_check_name': 'HighlyNullDataCheck',
    'level': 'warning',
    'details': {'columns': ['mostly_nulls'],
     'rows': None,
     'pct_null_rows': {'mostly_nulls': 0.9966666666666667}},
    'code': 'HIGHLY_NULL_COLS'}],
  'errors': [],
  'actions': [{'code': 'DROP_COL',
    'data_check_name': 'HighlyNullDataCheck',
    'metadata': {'columns': ['mostly_nulls'], 'rows': None}}]})
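Note that the HIGHLY_NULL_ROWS warning from the first run no longer appears, even though row 1477 was never dropped. A plausible explanation, sketched below, is that restoring the datetime value and dropping no_variance pushed the row's null fraction under the cutoff. The column counts are read off the tables shown in this notebook, and the 0.95 cutoff is an assumption based on the "95.0% or more null" message text.

```python
# Column and null counts for row 1477, read off the tables in this notebook.
row_states = {
    "before": {"nulls": 14, "cols": 14},  # entire row set to None
    "after": {"nulls": 12, "cols": 13},   # no_variance dropped, datetime restored
}

THRESHOLD = 0.95  # assumed from the "95.0% or more null" message text

null_fractions = {}
for label, state in row_states.items():
    frac = state["nulls"] / state["cols"]
    null_fractions[label] = frac
    print(f"{label}: {frac:.3f} null, flagged={frac >= THRESHOLD}")
```

At 12 of 13 values null (roughly 92%), the row falls just below the assumed cutoff, so only the column-level warning for mostly_nulls remains.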

Comparing removing only errors versus removing both warnings and errors

Let’s see the differences in model performance when we address only errors versus when we address both warnings and errors. To do this, we compare the performance of the best pipelines on the validation data. Remember that in the search where we only address errors, we still have the mostly_nulls column present in the data, so we leave that column in the validation data for its respective search. We drop the no_variance column from the validation data for both searches.

Additionally, we set some logical types explicitly, since we added noise only to the training data. This keeps the data types consistent between the training and validation data.

[15]:
# drop the no_variance column
X_valid.drop("no_variance", axis=1, inplace=True)

# logical type management
X_valid.ww.init(logical_types={"customer_present": "Categorical"})
y_valid = ww.init_series(y_valid, logical_type="Categorical")

best_pipeline_no_errors = results_no_errors[0].best_pipeline
print("Only dropping errors:", best_pipeline_no_errors.score(X_valid, y_valid, ["Log Loss Binary"]), "\n")

# drop the mostly_nulls column and reinitialize the Woodwork schema
X_valid.drop("mostly_nulls", axis=1, inplace=True)
X_valid.ww.init()

best_pipeline_clean = results_cleaned[0].best_pipeline
print("Addressing all actions:", best_pipeline_clean.score(X_valid, y_valid, ["Log Loss Binary"]), "\n")
Only dropping errors: OrderedDict([('Log Loss Binary', 0.23710868120188716)])

Addressing all actions: OrderedDict([('Log Loss Binary', 0.22485581094121954)])

Here, addressing all action items (warnings and errors) gives a slightly lower, and therefore better, log loss on the validation data (about 0.225) than addressing only errors (about 0.237). While it isn’t guaranteed that addressing all actions will always lead to better performance, we do recommend doing so, since we only raise these issues when we believe the features have problems that could negatively impact or not benefit the search.
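To put the two validation scores side by side, the short sketch below computes the absolute and relative difference. The scores are copied from the output above. Since the metric is log loss, lower values are better.

```python
# Scores copied from the output above (Log Loss Binary; lower is better).
only_errors = 0.23710868120188716
all_actions = 0.22485581094121954

absolute_gain = only_errors - all_actions
relative_gain = absolute_gain / only_errors

print(f"absolute improvement: {absolute_gain:.4f}")
print(f"relative improvement: {relative_gain:.1%}")
```

The gap is about 0.012 in log loss, or roughly 5% relative to the errors-only run. This is a single train/validation split on 1,500 rows, so the size of the difference should be read as indicative rather than conclusive.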