Understanding Data Check Actions

EvalML streamlines the creation and implementation of machine learning models for tabular data. One of the many features it offers is data checks, which help determine the health of our data before we train a model on it. Each data check has associated actions that can be taken to address the issues it finds, and this notebook demonstrates them. Our default data checks are the following:

NullDataCheck: Checks whether the rows or columns are null or highly null
IDColumnsDataCheck: Checks for columns that could be ID columns
TargetLeakageDataCheck: Checks if any of the input features have high association with the targets
InvalidTargetDataCheck: Checks if there are null or other invalid values in the target
NoVarianceDataCheck: Checks if either the target or any features have no variance
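
To make two of these checks concrete, here is a minimal conceptual sketch in plain Python. It is not EvalML's implementation: the helper names, the toy columns, and the 0.5 threshold are all hypothetical, chosen only to illustrate what "highly null" and "no variance" mean.

```python
# Conceptual sketch only: this is not how EvalML implements its data checks.
def null_fraction(values):
    """Return the fraction of entries in `values` that are None."""
    return sum(v is None for v in values) / len(values)


def has_no_variance(values):
    """Return True when the non-null entries take at most one distinct value."""
    return len({v for v in values if v is not None}) <= 1


# Hypothetical toy columns, stored as plain lists of equal length.
columns = {
    "amount": [24900, 15789, 1883, 82120, 25745],
    "constant": [1, 1, 1, 1, 1],
    "sparse": [None, None, None, 7, 9],
}

# Hypothetical threshold for this toy example.
threshold = 0.5

highly_null = [name for name, vals in columns.items() if null_fraction(vals) >= threshold]
no_variance = [name for name, vals in columns.items() if has_no_variance(vals)]

print("highly null columns:", highly_null)
print("no-variance columns:", no_variance)
```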

EvalML also provides additional data checks, documented with usage examples elsewhere in the EvalML documentation. Below, we will walk through usage of EvalML’s default data checks and actions.

First, we import the necessary requirements to demonstrate these checks.

[1]:

import woodwork as ww
import pandas as pd
from evalml import AutoMLSearch
from evalml.demos import load_fraud
from evalml.preprocessing import split_data

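Let’s look at the input feature data. EvalML uses the Woodwork library to represent this data. The demo data that EvalML returns is a pandas DataFrame and Series with Woodwork typing information attached, which is why the cells below call X.ww.init() after modifying the data.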

[2]:

X, y = load_fraud(n_rows=1500)
X.head()

             Number of Features
Boolean                       1
Categorical                   6
Numeric                       5

Number of training examples: 1500
Targets
False    86.60%
True     13.40%
Name: fraud, dtype: object

[2]:

	card_id	store_id	datetime	amount	currency	customer_present	expiration_date	provider	lat	lng	region	country
id
0	32261	8516	2019-01-01 00:12:26	24900	CUC	True	08/24	Mastercard	38.58894	-89.99038	Fairview Heights	US
1	16434	8516	2019-01-01 09:42:03	15789	MYR	False	11/21	Discover	38.58894	-89.99038	Fairview Heights	US
2	23468	8516	2019-04-17 08:17:01	1883	AUD	False	09/27	Discover	38.58894	-89.99038	Fairview Heights	US
3	14364	8516	2019-01-30 11:54:30	82120	KRW	True	09/20	JCB 16 digit	38.58894	-89.99038	Fairview Heights	US
4	29407	8516	2019-05-01 17:59:36	25745	MUR	True	09/22	American Express	38.58894	-89.99038	Fairview Heights	US

Adding noise and unclean data

This data is already clean and compatible with EvalML’s AutoMLSearch. In order to demonstrate EvalML default data checks, we will add the following:

A column of mostly null values (<0.5% non-null)
A column with low/no variance
A row of null values
A missing target value

We will add the two new columns to the whole dataset, and we will add the null row and the missing target value only to the training data. Note: these represent only some of the scenarios that EvalML default data checks can catch.

[3]:

# add a column with no variance in the data
X["no_variance"] = [1 for _ in range(X.shape[0])]

# add a column with >99.5% null values
X["mostly_nulls"] = [None] * (X.shape[0] - 5) + [i for i in range(5)]

# since we changed the data, let's reinitialize the Woodwork typing information
X.ww.init()
# let's split some training and validation data
X_train, X_valid, y_train, y_valid = split_data(X, y, problem_type="binary")
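
As a quick sanity check on the two columns added above, the following plain-Python sketch rebuilds the same lists without pandas and computes the null fraction directly. It assumes the 1500 rows loaded earlier; the variable names are illustrative only.

```python
n_rows = 1500  # matches load_fraud(n_rows=1500) above

# Same construction as the "mostly_nulls" column: all None except 5 values.
mostly_nulls = [None] * (n_rows - 5) + list(range(5))
null_fraction = sum(v is None for v in mostly_nulls) / n_rows
print(f"mostly_nulls is {null_fraction:.4%} null")

# Same construction as the "no_variance" column: a single repeated value.
no_variance = [1] * n_rows
print("distinct values in no_variance:", len(set(no_variance)))
```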

[4]:

# make row 1 all nan values
X_train.iloc[1] = [None] * X_train.shape[1]

# make one of the target values null
y_train[990] = None

X_train.ww.init()
y_train = ww.init_series(y_train, logical_type="Categorical")
# Let's take another look at the new X_train data
X_train

[4]:

	card_id	store_id	datetime	amount	currency	customer_present	expiration_date	provider	lat	lng	region	country	no_variance	mostly_nulls
id
872	15492	2868	2019-08-03 02:50:04	80719	HNL	True	08/27	American Express	5.47090	100.24529	Batu Feringgi	MY	1	<NA>
1477	<NA>	<NA>	NaT	<NA>	NaN	<NA>	NaN	NaN	NaN	NaN	NaN	NaN	<NA>	<NA>
158	22440	6813	2019-07-12 11:07:25	1849	SEK	True	09/20	American Express	26.26490	81.54855	Jais	IN	1	<NA>
808	8096	8096	2019-06-11 21:33:36	41358	MOP	True	04/29	VISA 13 digit	59.37722	28.19028	Narva	EE	1	<NA>
336	33270	1529	2019-03-23 21:44:00	32594	CUC	False	04/22	Mastercard	51.39323	0.47713	Strood	GB	1	<NA>
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
339	8484	5358	2019-01-10 07:47:28	89503	GMD	False	11/24	Maestro	47.30997	8.52462	Adliswil	CH	1	<NA>
1383	17565	3929	2019-01-15 01:11:02	14264	DKK	True	06/20	VISA 13 digit	50.72043	11.34046	Rudolstadt	DE	1	<NA>
893	108	44	2019-05-17 00:53:39	93218	SLL	True	12/24	JCB 16 digit	15.72892	120.57224	Burgos	PH	1	<NA>
385	29983	152	2019-06-09 06:50:29	41105	RWF	False	07/20	JCB 16 digit	-6.80000	39.25000	Magomeni	TZ	1	<NA>
1074	26197	4927	2019-05-22 15:57:27	50481	MNT	False	05/26	JCB 15 digit	41.00510	-73.78458	Scarsdale	US	1	<NA>

1200 rows × 14 columns

If we call AutoMLSearch.search() on this data, the search will fail due to the columns and issues we’ve added above. Note: we use a try/except here to catch the resulting ValueError that AutoMLSearch raises.

[5]:

automl = AutoMLSearch(X_train=X_train, y_train=y_train, problem_type="binary")
try:
    automl.search()
except ValueError as e:
    # to make the error message more distinct
    print("=" * 80, "\n")
    print("Search errored out! Message received is: {}".format(e))
    print("=" * 80, "\n")

================================================================================

Search errored out! Message received is: Input y contains NaN.
================================================================================

We can use the search_iterative() function provided in EvalML to determine what potential health issues our data has. This is a public function available through evalml.automl, and it is different from the search() method of the AutoMLSearch class. search_iterative() runs the default data checks on the data and, if they report no errors, automatically runs AutoMLSearch.search().

[6]:

from evalml.automl import search_iterative

automl, messages = search_iterative(X_train, y_train, problem_type="binary")
automl, messages

One or more pairs of columns did not share enough rows of non-null data to measure the relationship.  The measurement for these columns will be NaN.  Use 'extra_stats=True' to get the shared rows for each pair of columns.

[6]:

(None,
 [{'message': '1 out of 1200 rows are 95.0% or more null',
   'data_check_name': 'NullDataCheck',
   'level': 'warning',
   'details': {'columns': None,
    'rows': [1477],
    'pct_null_cols': id
    1477    1.0
    dtype: float64},
   'code': 'HIGHLY_NULL_ROWS',
   'action_options': [{'code': 'DROP_ROWS',
     'data_check_name': 'NullDataCheck',
     'metadata': {'columns': None, 'rows': [1477]},
     'parameters': {}}]},
  {'message': "Column(s) 'mostly_nulls' are 95.0% or more null",
   'data_check_name': 'NullDataCheck',
   'level': 'warning',
   'details': {'columns': ['mostly_nulls'],
    'rows': None,
    'pct_null_rows': {'mostly_nulls': 0.9966666666666667}},
   'code': 'HIGHLY_NULL_COLS',
   'action_options': [{'code': 'DROP_COL',
     'data_check_name': 'NullDataCheck',
     'metadata': {'columns': ['mostly_nulls'], 'rows': None},
     'parameters': {}}]},
  {'message': '1 row(s) (0.08333333333333334%) of target values are null',
   'data_check_name': 'InvalidTargetDataCheck',
   'level': 'error',
   'details': {'columns': None,
    'rows': [990],
    'num_null_rows': 1,
    'pct_null_rows': 0.08333333333333334},
   'code': 'TARGET_HAS_NULL',
   'action_options': [{'code': 'DROP_ROWS',
     'data_check_name': 'InvalidTargetDataCheck',
     'metadata': {'columns': None, 'rows': [990], 'is_target': True},
     'parameters': {}}]},
  {'message': "'no_variance' has 1 unique value.",
   'data_check_name': 'NoVarianceDataCheck',
   'level': 'warning',
   'details': {'columns': ['no_variance'], 'rows': None},
   'code': 'NO_VARIANCE',
   'action_options': [{'code': 'DROP_COL',
     'data_check_name': 'NoVarianceDataCheck',
     'metadata': {'columns': ['no_variance'], 'rows': None},
     'parameters': {}}]}])

The return value of the search_iterative function above is a tuple. The first element is the AutoMLSearch object if the search runs (and None otherwise), and the second element is a list of dictionaries, one for each warning or error that the default data checks find on the passed-in X and y data. In this list, warnings are suggestions from the data checks that can be useful to address to improve the search but will not break AutoMLSearch. Errors, on the other hand, indicate issues that will break AutoMLSearch and need to be addressed by the user.

Above, we can see that there were errors so search did not automatically run.
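
The distinction between warnings and errors can be sketched in plain Python. The list below is a trimmed copy of the messages shown above, keeping only the level and code keys; the helper split_by_level and the search_can_run flag are hypothetical names used for illustration and are not part of EvalML.

```python
# Trimmed copies of the messages shown above; only the keys used here are kept.
messages = [
    {"level": "warning", "code": "HIGHLY_NULL_ROWS"},
    {"level": "warning", "code": "HIGHLY_NULL_COLS"},
    {"level": "error", "code": "TARGET_HAS_NULL"},
    {"level": "warning", "code": "NO_VARIANCE"},
]


def split_by_level(messages):
    """Return (error_codes, warning_codes) from a list of data check messages."""
    errors = [m["code"] for m in messages if m["level"] == "error"]
    warnings = [m["code"] for m in messages if m["level"] == "warning"]
    return errors, warnings


errors, warnings = split_by_level(messages)

# Mirrors the behavior described above: any error-level message blocks the search.
search_can_run = not errors

print("errors:", errors)
print("warnings:", warnings)
print("search can run:", search_can_run)
```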

Addressing warnings and errors

We can automatically address the warnings and errors returned by search_iterative by using make_pipeline_from_data_check_output, a utility function that creates a pipeline that will automatically clean up our data. We just need to pass it our problem type and the messages produced by the data checks, which have the same format as the output of DataCheck.validate().

[7]:

from evalml.pipelines.utils import make_pipeline_from_data_check_output

actions_pipeline = make_pipeline_from_data_check_output("binary", messages)
actions_pipeline.fit(X_train, y_train)
X_train_cleaned, y_train_cleaned = actions_pipeline.transform(X_train, y_train)
print(
    "The new length of X_train is {} and y_train is {}".format(
        len(X_train_cleaned), len(y_train_cleaned)
    )
)

The new length of X_train is 1198 and y_train is 1198
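
The new length follows from the action options in the messages. As a rough sketch (plain Python, not the pipeline's actual logic), we can collect the rows and columns named in those action options and check the arithmetic against the 1200 rows and 14 columns of X_train shown earlier. The action_options list below is copied by hand from the output above.

```python
# Action options copied from the messages above (code and metadata only).
action_options = [
    {"code": "DROP_ROWS", "metadata": {"columns": None, "rows": [1477]}},
    {"code": "DROP_COL", "metadata": {"columns": ["mostly_nulls"], "rows": None}},
    {"code": "DROP_ROWS", "metadata": {"columns": None, "rows": [990], "is_target": True}},
    {"code": "DROP_COL", "metadata": {"columns": ["no_variance"], "rows": None}},
]

rows_to_drop = set()
cols_to_drop = set()
for action in action_options:
    metadata = action["metadata"]
    if action["code"] == "DROP_ROWS":
        rows_to_drop.update(metadata["rows"])
    elif action["code"] == "DROP_COL":
        cols_to_drop.update(metadata["columns"])

# X_train had 1200 rows and 14 columns before cleaning.
remaining_rows = 1200 - len(rows_to_drop)
remaining_cols = 14 - len(cols_to_drop)

print("rows to drop:", sorted(rows_to_drop))
print("columns to drop:", sorted(cols_to_drop))
print("expected rows after cleaning:", remaining_rows)
```

Two distinct rows are flagged for removal, which is consistent with the 1198 rows reported above.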

Now, we can run search_iterative to completion.

[8]:

results_cleaned = search_iterative(
    X_train_cleaned, y_train_cleaned, problem_type="binary"
)

Note that this time, we get an AutoMLSearch object returned to us as the first element of the tuple. We can use and inspect the AutoMLSearch object as needed.

[9]:

automl_object = results_cleaned[0]
automl_object.rankings

[9]:

	id	pipeline_name	search_order	ranking_score	mean_cv_score	standard_deviation_cv_score	percent_better_than_baseline	high_variance_cv	parameters
0	1	Random Forest Classifier w/ Label Encoder + Da...	1	0.262989	0.262989	0.007899	94.570726	False	{'Label Encoder': {'positive_label': None}, 'D...
1	0	Mode Baseline Binary Classification Pipeline	0	4.843912	4.843912	0.049015	0.000000	False	{'Label Encoder': {'positive_label': None}, 'B...

If we check the second element in the tuple, we can see that there are no longer any warnings or errors detected!

[10]:

data_check_results = results_cleaned[1]
data_check_results

[10]:

[]

Only addressing DataCheck errors

Previously, we used make_pipeline_from_data_check_output to address all of the warnings and errors returned by search_iterative. We will now show how we can also manually address only the errors to allow AutoMLSearch to run, and how ignoring warnings will come at the expense of performance.

We can print out the errors first to make it easier to read, and then we’ll create new features and targets from the original training data.

[11]:

errors = [message for message in messages if message["level"] == "error"]
errors

[11]:

[{'message': '1 row(s) (0.08333333333333334%) of target values are null',
  'data_check_name': 'InvalidTargetDataCheck',
  'level': 'error',
  'details': {'columns': None,
   'rows': [990],
   'num_null_rows': 1,
   'pct_null_rows': 0.08333333333333334},
  'code': 'TARGET_HAS_NULL',
  'action_options': [{'code': 'DROP_ROWS',
    'data_check_name': 'InvalidTargetDataCheck',
    'metadata': {'columns': None, 'rows': [990], 'is_target': True},
    'parameters': {}}]}]

[12]:

# copy the training data to new variables
X_train_no_errors = X_train.copy()
y_train_no_errors = y_train.copy()

# We address the errors listed in the output above

# let's address the `TARGET_HAS_NULL` error by filling in the null target value
y_train_no_errors.fillna(False, inplace=True)

# let's reinitialize the Woodwork typing information
X_train_no_errors.ww.init()
X_train_no_errors.head()

[12]:

	card_id	store_id	datetime	amount	currency	customer_present	expiration_date	provider	lat	lng	region	country	no_variance	mostly_nulls
id
872	15492	2868	2019-08-03 02:50:04	80719	HNL	True	08/27	American Express	5.47090	100.24529	Batu Feringgi	MY	1	<NA>
1477	<NA>	<NA>	NaT	<NA>	NaN	<NA>	NaN	NaN	NaN	NaN	NaN	NaN	<NA>	<NA>
158	22440	6813	2019-07-12 11:07:25	1849	SEK	True	09/20	American Express	26.26490	81.54855	Jais	IN	1	<NA>
808	8096	8096	2019-06-11 21:33:36	41358	MOP	True	04/29	VISA 13 digit	59.37722	28.19028	Narva	EE	1	<NA>
336	33270	1529	2019-03-23 21:44:00	32594	CUC	False	04/22	Mastercard	51.39323	0.47713	Strood	GB	1	<NA>

We can now run search on X_train_no_errors and y_train_no_errors. Note that the search here doesn’t fail since we addressed the errors, but the returned tuple will still contain warnings. This search allows the mostly_nulls column to remain in the features during search.

[13]:

results_no_errors = search_iterative(
    X_train_no_errors, y_train_no_errors, problem_type="binary"
)
results_no_errors

One or more pairs of columns did not share enough rows of non-null data to measure the relationship.  The measurement for these columns will be NaN.  Use 'extra_stats=True' to get the shared rows for each pair of columns.

[13]:

(<evalml.automl.automl_search.AutoMLSearch at 0x7f01239ae1f0>,
 [{'message': '1 out of 1200 rows are 95.0% or more null',
   'data_check_name': 'NullDataCheck',
   'level': 'warning',
   'details': {'columns': None,
    'rows': [1477],
    'pct_null_cols': id
    1477    1.0
    dtype: float64},
   'code': 'HIGHLY_NULL_ROWS',
   'action_options': [{'code': 'DROP_ROWS',
     'data_check_name': 'NullDataCheck',
     'metadata': {'columns': None, 'rows': [1477]},
     'parameters': {}}]},
  {'message': "Column(s) 'mostly_nulls' are 95.0% or more null",
   'data_check_name': 'NullDataCheck',
   'level': 'warning',
   'details': {'columns': ['mostly_nulls'],
    'rows': None,
    'pct_null_rows': {'mostly_nulls': 0.9966666666666667}},
   'code': 'HIGHLY_NULL_COLS',
   'action_options': [{'code': 'DROP_COL',
     'data_check_name': 'NullDataCheck',
     'metadata': {'columns': ['mostly_nulls'], 'rows': None},
     'parameters': {}}]},
  {'message': "'no_variance' has 1 unique value.",
   'data_check_name': 'NoVarianceDataCheck',
   'level': 'warning',
   'details': {'columns': ['no_variance'], 'rows': None},
   'code': 'NO_VARIANCE',
   'action_options': [{'code': 'DROP_COL',
     'data_check_name': 'NoVarianceDataCheck',
     'metadata': {'columns': ['no_variance'], 'rows': None},
     'parameters': {}}]}])