Data Checks

Data checks.

Package Contents

Classes Summary

ClassImbalanceDataCheck

Check if any of the target labels are imbalanced, or if the number of values for each target is below 2 times the number of CV folds. Use for classification problems.

DataCheck

Base class for all data checks.

DataCheckAction

A recommended action returned by a DataCheck.

DataCheckActionCode

Enum for data check action code.

DataCheckActionOption

A recommended action option returned by a DataCheck.

DataCheckError

DataCheckMessage subclass for errors returned by data checks.

DataCheckMessage

Base class for a message returned by a DataCheck, tagged by name.

DataCheckMessageCode

Enum for data check message code.

DataCheckMessageType

Enum for type of data check message: WARNING or ERROR.

DataChecks

A collection of data checks.

DataCheckWarning

DataCheckMessage subclass for warnings returned by data checks.

DateTimeFormatDataCheck

Check if the datetime column has equally spaced intervals and is monotonically increasing or decreasing in order to be supported by time series estimators.

DateTimeNaNDataCheck

Check each column in the input for datetime features and issue an error if NaN values are present.

DefaultDataChecks

A collection of basic data checks that is used by AutoML by default.

HighlyNullDataCheck

Check if there are any highly-null columns and rows in the input.

IDColumnsDataCheck

Check if any of the features are likely to be ID columns.

InvalidTargetDataCheck

Check if the target data is considered invalid.

MulticollinearityDataCheck

Check if any set of features is likely to be multicollinear.

NaturalLanguageNaNDataCheck

Check each column in the input for natural language features and issue an error if NaN values are present.

NoVarianceDataCheck

Check if the target or any of the features have no variance.

OutliersDataCheck

Checks if there are any outliers in input data by using IQR to determine score anomalies.

SparsityDataCheck

Check if there are any columns with sparsely populated values in the input.

TargetDistributionDataCheck

Check if the target data contains certain distributions that may need to be transformed prior to training to improve model performance. Uses the Shapiro-Wilk test when the dataset has <=5000 samples, otherwise uses Jarque-Bera.

TargetLeakageDataCheck

Check if any of the features are highly correlated with the target by using mutual information or Pearson correlation.

TimeSeriesParametersDataCheck

Checks whether the time series parameters are compatible with data splitting.

TimeSeriesSplittingDataCheck

Checks whether the time series target data is compatible with splitting.

UniquenessDataCheck

Check if there are any columns in the input that are either too unique for classification problems or not unique enough for regression problems.

Contents

class evalml.data_checks.ClassImbalanceDataCheck(threshold=0.1, min_samples=100, num_cv_folds=3)[source]

Check if any of the target labels are imbalanced, or if the number of values for each target is below 2 times the number of CV folds. Use for classification problems.

Parameters
  • threshold (float) – The minimum threshold allowed for class imbalance before a warning is raised. This threshold is calculated by comparing the number of samples in each class to the sum of samples in that class and the majority class. For example, a multiclass case with [900, 900, 100] samples per classes 0, 1, and 2, respectively, would have a 0.10 threshold for class 2 (100 / (900 + 100)). Defaults to 0.10.

  • min_samples (int) – The minimum number of samples per accepted class. If the minority class is both below the threshold and min_samples, then we consider this severely imbalanced. Must be greater than 0. Defaults to 100.

  • num_cv_folds (int) – The number of cross-validation folds. Must be positive. Choose 0 to ignore this warning. Defaults to 3.
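The threshold arithmetic described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not evalml's implementation: the helper names `imbalance_ratios` and `below_fold_minimum` are hypothetical, and the strict "fewer than" comparison is inferred from the wording of the examples for this class.

```python
from collections import Counter


def imbalance_ratios(y):
    """Map each class to count / (count + majority count), ignoring None."""
    counts = Counter(label for label in y if label is not None)
    majority = max(counts.values())
    return {label: n / (n + majority) for label, n in counts.items()}


def below_fold_minimum(y, num_cv_folds=3):
    """Return the classes with fewer than 2 * num_cv_folds samples (assumed strict)."""
    counts = Counter(label for label in y if label is not None)
    return sorted(label for label, n in counts.items() if n < 2 * num_cv_folds)


# The [900, 900, 100] case from the threshold description.
y = [0] * 900 + [1] * 900 + [2] * 100
ratios = imbalance_ratios(y)
print(ratios[2])  # 0.1
```

A class would then be flagged when its ratio falls below `threshold`, and reported as an error candidate when it appears in the output of `below_fold_minimum`.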

Methods

name

Return a name describing the data check.

validate

Check if any target labels are imbalanced beyond a threshold for binary and multiclass problems.

name(cls)

Return a name describing the data check.

validate(self, X, y)[source]

Check if any target labels are imbalanced beyond a threshold for binary and multiclass problems.

Ignores NaN values in target labels if they appear.

Parameters
  • X (pd.DataFrame, np.ndarray) – Features. Ignored.

  • y (pd.Series, np.ndarray) – Target labels to check for imbalanced data.

Returns

Dictionary with DataCheckWarnings if imbalance in classes is less than the threshold, and DataCheckErrors if the number of values for each target is below 2 * num_cv_folds.

Return type

dict

Examples

>>> import pandas as pd
...
>>> X = pd.DataFrame()
>>> y = pd.Series([0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

In this binary example, the target class 0 is present in fewer than 10% (threshold=0.10) of instances, and fewer than 2 * the number of cross folds (2 * 3 = 6). Therefore, both a warning and an error are returned as part of the Class Imbalance Data Check. In addition, if a target is present with fewer than min_samples occurrences (default is 100) and is under the threshold, a severe class imbalance warning will be raised.

>>> class_imb_dc = ClassImbalanceDataCheck(threshold=0.10)
>>> assert class_imb_dc.validate(X, y) == {
...     "errors": [{"message": "The number of instances of these targets is less than 2 * the number of cross folds = 6 instances: [0]",
...                 "data_check_name": "ClassImbalanceDataCheck",
...                 "level": "error",
...                 "code": "CLASS_IMBALANCE_BELOW_FOLDS",
...                 "details": {"target_values": [0], "rows": None, "columns": None}}],
...     "warnings": [{"message": "The following labels fall below 10% of the target: [0]",
...                   "data_check_name": "ClassImbalanceDataCheck",
...                   "level": "warning",
...                   "code": "CLASS_IMBALANCE_BELOW_THRESHOLD",
...                   "details": {"target_values": [0], "rows": None, "columns": None}},
...                   {"message": "The following labels in the target have severe class imbalance because they fall under 10% of the target and have less than 100 samples: [0]",
...                   "data_check_name": "ClassImbalanceDataCheck",
...                   "level": "warning",
...                   "code": "CLASS_IMBALANCE_SEVERE",
...                   "details": {"target_values": [0], "rows": None, "columns": None}}],
...      "actions": []}

In this multiclass example, the target class 0 is present in fewer than 30% of observations. However, with 1 CV fold, the minimum number of instances required is 2 * 1 = 2. Therefore a warning, but not an error, is raised.

>>> y = pd.Series([0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2])
>>> class_imb_dc = ClassImbalanceDataCheck(threshold=0.30, min_samples=5, num_cv_folds=1)
>>> assert class_imb_dc.validate(X, y) == {
...     'warnings': [{'message': 'The following labels fall below 30% of the target: [0]',
...                    'data_check_name': 'ClassImbalanceDataCheck',
...                    'level': 'warning',
...                    'code': 'CLASS_IMBALANCE_BELOW_THRESHOLD',
...                    'details': {'target_values': [0], "rows": None, "columns": None}},
...                    {'message': 'The following labels in the target have severe class imbalance because they fall under 30% of the target and have less than 5 samples: [0]',
...                     'data_check_name': 'ClassImbalanceDataCheck',
...                     'level': 'warning',
...                     'code': 'CLASS_IMBALANCE_SEVERE',
...                     'details': {'target_values': [0], "rows": None, "columns": None}}],
...     'errors': [],
...     'actions': []}
...
...
>>> y = pd.Series([0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
>>> class_imb_dc = ClassImbalanceDataCheck(threshold=0.30, num_cv_folds=1)
>>> assert class_imb_dc.validate(X, y) == {'warnings': [], 'errors': [], 'actions': []}

class evalml.data_checks.DataCheck[source]

Base class for all data checks.

Data checks are a set of heuristics used to determine if there are problems with input data.

Methods

name

Return a name describing the data check.

validate

Inspect and validate the input data, run any necessary calculations or algorithms, and return a list of warnings and errors if applicable.

name(cls)

Return a name describing the data check.

abstract validate(self, X, y=None)[source]

Inspect and validate the input data, run any necessary calculations or algorithms, and return a list of warnings and errors if applicable.

Parameters
  • X (pd.DataFrame) – The input data of shape [n_samples, n_features]

  • y (pd.Series, optional) – The target data of length [n_samples]

Returns

Dictionary of DataCheckError and DataCheckWarning messages

Return type

dict (DataCheckMessage)

class evalml.data_checks.DataCheckAction(action_code, data_check_name, metadata=None)[source]

A recommended action returned by a DataCheck.

Parameters
  • action_code (str, DataCheckActionCode) – Action code associated with the action.

  • data_check_name (str) – Name of data check.

  • metadata (dict, optional) – Additional useful information associated with the action. Defaults to None.

Methods

convert_dict_to_action

Convert a dictionary into a DataCheckAction.

to_dict

Return a dictionary form of the data check action.

static convert_dict_to_action(action_dict)[source]

Convert a dictionary into a DataCheckAction.

Parameters

action_dict – Dictionary to convert into action. Should have keys “code”, “data_check_name”, and “metadata”.

Raises

ValueError – If the input dictionary does not have the keys code and metadata, or if the metadata dictionary does not have the keys columns and rows.

Returns

DataCheckAction object from the input dictionary.
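The key requirements listed under Raises can be illustrated with a small standalone validator. This is a sketch only: `check_action_dict` is a hypothetical helper written for this illustration, not part of evalml, and it assumes the error is raised when either set of keys is missing.

```python
def check_action_dict(action_dict):
    """Raise ValueError if the dictionary lacks the keys described above."""
    if "code" not in action_dict or "metadata" not in action_dict:
        raise ValueError("action dictionary needs 'code' and 'metadata' keys")
    metadata = action_dict["metadata"]
    if "columns" not in metadata or "rows" not in metadata:
        raise ValueError("metadata needs 'columns' and 'rows' keys")
    return True


# Shape taken from the 'actions' entries shown in the validate examples.
action = {
    "code": "DROP_COL",
    "data_check_name": "HighlyNullDataCheck",
    "metadata": {"columns": ["all_null"], "rows": None},
}
print(check_action_dict(action))  # True
```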

to_dict(self)[source]

Return a dictionary form of the data check action.

class evalml.data_checks.DataCheckActionCode[source]

Enum for data check action code.

Attributes

DROP_COL

Action code for dropping a column.

DROP_ROWS

Action code for dropping rows.

IMPUTE_COL

Action code for imputing a column.

TRANSFORM_TARGET

Action code for transforming the target data.

Methods

name

The name of the Enum member.

value

The value of the Enum member.

name(self)

The name of the Enum member.

value(self)

The value of the Enum member.

class evalml.data_checks.DataCheckActionOption(action_code, data_check_name, parameters=None, metadata=None)[source]

A recommended action option returned by a DataCheck.

It contains an action code that indicates what the action should be, a data check name that indicates what data check was used to generate the action, and parameters and metadata which can be used to further refine the action.

Parameters
  • action_code (DataCheckActionCode) – Action code associated with the action option.

  • data_check_name (str) – Name of the data check that produced this option.

  • parameters (dict) – Parameters associated with the action option. Defaults to None.

  • metadata (dict, optional) – Additional useful information associated with the action option. Defaults to None.

Examples

>>> parameters = {
...     "global_parameter_name": {
...         "parameter_type": "global",
...         "type": "float",
...         "default_value": 0.0,
...     },
...     "column_parameter_name": {
...         "parameter_type": "column",
...         "columns": {
...             "a": {
...                 "impute_strategy": {
...                     "categories": ["mean", "mode"],
...                     "type": "category",
...                     "default_value": "mean",
...                 },
...                 "constant_fill_value": {"type": "float", "default_value": 0},
...             },
...         },
...     },
... }
>>> data_check_action = DataCheckActionOption(DataCheckActionCode.DROP_COL, None, metadata={}, parameters=parameters)
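As a rough illustration of how a consumer might read such a parameters dictionary, the sketch below flattens out the default values. `collect_defaults` is a hypothetical helper, not an evalml function, and it assumes every entry carries a "parameter_type" of either "global" or "column", as in the example above.

```python
def collect_defaults(parameters):
    """Flatten default values out of a nested parameters dictionary."""
    defaults = {}
    for name, spec in parameters.items():
        if spec["parameter_type"] == "global":
            defaults[name] = spec["default_value"]
        elif spec["parameter_type"] == "column":
            # One nested dictionary of defaults per column.
            defaults[name] = {
                column: {key: option["default_value"] for key, option in options.items()}
                for column, options in spec["columns"].items()
            }
    return defaults


parameters = {
    "global_parameter_name": {"parameter_type": "global", "type": "float", "default_value": 0.0},
    "column_parameter_name": {
        "parameter_type": "column",
        "columns": {
            "a": {
                "impute_strategy": {"categories": ["mean", "mode"], "type": "category", "default_value": "mean"},
                "constant_fill_value": {"type": "float", "default_value": 0},
            },
        },
    },
}
print(collect_defaults(parameters))
```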

Methods

convert_dict_to_action

Convert a dictionary into a DataCheckActionOption.

to_dict

Return a dictionary form of the data check action option.

static convert_dict_to_action(action_dict)[source]

Convert a dictionary into a DataCheckActionOption.

Parameters

action_dict – Dictionary to convert into action. Should have keys “code”, “data_check_name”, and “metadata”.

Raises

ValueError – If the input dictionary does not have the keys code and metadata, or if the metadata dictionary does not have the keys columns and rows.

Returns

DataCheckActionOption object from the input dictionary.

to_dict(self)[source]

Return a dictionary form of the data check action option.

class evalml.data_checks.DataCheckError(message, data_check_name, message_code=None, details=None)[source]

DataCheckMessage subclass for errors returned by data checks.

Attributes

message_type

DataCheckMessageType.ERROR

Methods

to_dict

Return a dictionary form of the data check message.

to_dict(self)

Return a dictionary form of the data check message.

class evalml.data_checks.DataCheckMessage(message, data_check_name, message_code=None, details=None)[source]

Base class for a message returned by a DataCheck, tagged by name.

Parameters
  • message (str) – Message string.

  • data_check_name (str) – Name of data check.

  • message_code (DataCheckMessageCode) – Message code associated with message. Defaults to None.

  • details (dict) – Additional useful information associated with the message. Defaults to None.

Attributes

message_type

None

Methods

to_dict

Return a dictionary form of the data check message.

to_dict(self)[source]

Return a dictionary form of the data check message.

class evalml.data_checks.DataCheckMessageCode[source]

Enum for data check message code.

Attributes

CLASS_IMBALANCE_BELOW_FOLDS

Message code for when the number of values for each target is below 2 * number of CV folds.

CLASS_IMBALANCE_BELOW_THRESHOLD

Message code for when balance in classes is less than the threshold.

CLASS_IMBALANCE_SEVERE

Message code for when balance in classes is less than the threshold and minimum class is less than minimum number of accepted samples.

DATETIME_HAS_NAN

Message code for when input datetime columns contain NaN values.

DATETIME_HAS_UNEVEN_INTERVALS

Message code for when the datetime values have uneven intervals.

DATETIME_INFORMATION_NOT_FOUND

Message code for when datetime information cannot be found or is in an unaccepted format.

DATETIME_IS_NOT_MONOTONIC

Message code for when the datetime values are not monotonically increasing.

HAS_ID_COLUMN

Message code for data that has ID columns.

HAS_OUTLIERS

Message code for when outliers are detected.

HIGH_VARIANCE

Message code for when high variance is detected for cross-validation.

HIGHLY_NULL_COLS

Message code for highly null columns.

HIGHLY_NULL_ROWS

Message code for highly null rows.

IS_MULTICOLLINEAR

Message code for when data is potentially multicollinear.

MISMATCHED_INDICES

Message code for when input target and features have mismatched indices.

MISMATCHED_INDICES_ORDER

Message code for when input target and features have mismatched indices order. The two inputs have the same index values, but shuffled.

MISMATCHED_LENGTHS

Message code for when input target and features have different lengths.

NATURAL_LANGUAGE_HAS_NAN

Message code for when input natural language columns contain NaN values.

NO_VARIANCE

Message code for when data has no variance (1 unique value).

NO_VARIANCE_WITH_NULL

Message code for when data has one unique value and NaN values.

NOT_UNIQUE_ENOUGH

Message code for when data does not possess enough unique values.

TARGET_BINARY_NOT_TWO_UNIQUE_VALUES

Message code for target data for a binary classification problem that does not have two unique values.

TARGET_HAS_NULL

Message code for target data that has null values.

TARGET_INCOMPATIBLE_OBJECTIVE

Message code for target data that has incompatible values for the specified objective.

TARGET_IS_EMPTY_OR_FULLY_NULL

Message code for target data that is empty or has all null values.

TARGET_IS_NONE

Message code for when target is None.

TARGET_LEAKAGE

Message code for when target leakage is detected.

TARGET_LOGNORMAL_DISTRIBUTION

Message code for target data with a lognormal distribution.

TARGET_MULTICLASS_HIGH_UNIQUE_CLASS

Message code for target data for a multiclass classification problem that has an abnormally large number of unique classes relative to the number of target values.

TARGET_MULTICLASS_NOT_ENOUGH_CLASSES

Message code for target data for a multiclass classification problem that does not have more than two unique classes.

TARGET_MULTICLASS_NOT_TWO_EXAMPLES_PER_CLASS

Message code for target data for a multiclass classification problem that does not have two examples per class.

TARGET_UNSUPPORTED_PROBLEM_TYPE

Message code for target data that is being checked against an unsupported problem type.

TARGET_UNSUPPORTED_TYPE

Message code for target data that is of an unsupported type.

TIMESERIES_PARAMETERS_NOT_COMPATIBLE_WITH_SPLIT

Message code for when the time series parameters are too large for the smallest data split.

TIMESERIES_TARGET_NOT_COMPATIBLE_WITH_SPLIT

Message code for when any training and validation split of the time series target doesn’t contain all classes.

TOO_SPARSE

Message code for when multiclass data has values that are too sparsely populated.

TOO_UNIQUE

Message code for when data possesses too many unique values.

Methods

name

The name of the Enum member.

value

The value of the Enum member.

name(self)

The name of the Enum member.

value(self)

The value of the Enum member.

class evalml.data_checks.DataCheckMessageType[source]

Enum for type of data check message: WARNING or ERROR.

Attributes

ERROR

Error message returned by a data check.

WARNING

Warning message returned by a data check.

Methods

name

The name of the Enum member.

value

The value of the Enum member.

name(self)

The name of the Enum member.

value(self)

The value of the Enum member.

class evalml.data_checks.DataChecks(data_checks=None, data_check_params=None)[source]

A collection of data checks.

Parameters
  • data_checks (list (DataCheck)) – List of DataCheck objects.

  • data_check_params (dict) – Parameters for passed DataCheck objects.

Methods

validate

Inspect and validate the input data against data checks and return a list of warnings and errors if applicable.

validate(self, X, y=None)[source]

Inspect and validate the input data against data checks and return a list of warnings and errors if applicable.

Parameters
  • X (pd.DataFrame, np.ndarray) – The input data of shape [n_samples, n_features]

  • y (pd.Series, np.ndarray) – The target data of length [n_samples]

Returns

Dictionary containing DataCheckMessage objects

Return type

dict
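One way to picture how results from several checks could be combined into a single dictionary is sketched below. This is an assumption about the general shape only, not the library's code: `merge_results` is a hypothetical helper, and the per-check dictionaries are abbreviated versions of the "warnings" / "errors" / "actions" structure shown in the examples for the individual checks.

```python
def merge_results(results):
    """Concatenate per-check result dictionaries into one dictionary."""
    merged = {"warnings": [], "errors": [], "actions": []}
    for result in results:
        for key in merged:
            merged[key].extend(result.get(key, []))
    return merged


# Abbreviated results; real messages also carry 'message', 'level', and 'details'.
first = {"warnings": [{"code": "HIGHLY_NULL_COLS"}], "errors": [], "actions": [{"code": "DROP_COL"}]}
second = {"warnings": [], "errors": [{"code": "DATETIME_HAS_NAN"}], "actions": []}
merged = merge_results([first, second])
print([m["code"] for m in merged["errors"]])  # ['DATETIME_HAS_NAN']
```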

class evalml.data_checks.DataCheckWarning(message, data_check_name, message_code=None, details=None)[source]

DataCheckMessage subclass for warnings returned by data checks.

Attributes

message_type

DataCheckMessageType.WARNING

Methods

to_dict

Return a dictionary form of the data check message.

to_dict(self)

Return a dictionary form of the data check message.

class evalml.data_checks.DateTimeFormatDataCheck(datetime_column='index')[source]

Check if the datetime column has equally spaced intervals and is monotonically increasing or decreasing in order to be supported by time series estimators.

Parameters

datetime_column (str, int) – The name of the datetime column. If the datetime values are in the index, then pass “index”.

Methods

name

Return a name describing the data check.

validate

Checks if the target data has equal intervals and is sorted.

name(cls)

Return a name describing the data check.

validate(self, X, y)[source]

Checks if the target data has equal intervals and is sorted.

Parameters
  • X (pd.DataFrame, np.ndarray) – Features.

  • y (pd.Series, np.ndarray) – Target data.

Returns

Dictionary with DataCheckErrors if unequal intervals are found in the datetime column.

Return type

dict (DataCheckError)

Examples

>>> import pandas as pd

The column “dates” has the date 2021-01-31 appended to the end, which is inconsistent with the frequency of the previous 9 dates (1 day).

>>> X = pd.DataFrame(pd.date_range("2021-01-01", periods=9).append(pd.date_range("2021-01-31", periods=1)), columns=["dates"])
>>> y = pd.Series([0, 1, 0, 1, 1, 0, 0, 0, 1, 0])
>>> datetime_format_dc = DateTimeFormatDataCheck(datetime_column="dates")
>>> assert datetime_format_dc.validate(X, y) == {
...     "errors": [{"message": "No frequency could be detected in dates, possibly due to uneven intervals.",
...                 "data_check_name": "DateTimeFormatDataCheck",
...                 "level": "error",
...                 "code": "DATETIME_HAS_UNEVEN_INTERVALS",
...                 "details": {"columns": None, "rows": None}
...                 }],
...     "warnings": [],
...     "actions": []}

The column “Weeks” contains integers instead of datetime data, which will raise an error.

>>> X = pd.DataFrame([1, 2, 3, 4], columns=["Weeks"])
>>> y = pd.Series([0] * 4)
>>> datetime_format_dc = DateTimeFormatDataCheck(datetime_column="Weeks")
>>> assert datetime_format_dc.validate(X, y) == {
...     'warnings': [],
...     'errors': [{'message': 'Datetime information could not be found in the data, or was not in a supported datetime format.',
...                 'data_check_name': 'DateTimeFormatDataCheck',
...                 'level': 'error',
...                 'details': {'columns': None, 'rows': None},
...                 'code': 'DATETIME_INFORMATION_NOT_FOUND'}],
...     'actions': []}

Converting that same integer data to datetime, however, is valid.

>>> X = pd.DataFrame(pd.to_datetime([1, 2, 3, 4]), columns=["Weeks"])
>>> datetime_format_dc = DateTimeFormatDataCheck(datetime_column="Weeks")
>>> assert datetime_format_dc.validate(X, y) == {'warnings': [], 'errors': [], 'actions': []}
>>> X = pd.DataFrame(pd.date_range("2021-01-01", freq='W', periods=10), columns=["Weeks"])
>>> datetime_format_dc = DateTimeFormatDataCheck(datetime_column="Weeks")
>>> assert datetime_format_dc.validate(X, y) == {'warnings': [], 'errors': [], 'actions': []}

While the data passed in is of datetime type, time series requires the datetime information in datetime_column to be monotonically increasing (ascending).

>>> X = X.iloc[::-1]
>>> datetime_format_dc = DateTimeFormatDataCheck(datetime_column="Weeks")
>>> assert datetime_format_dc.validate(X, y) == {
...     'warnings': [],
...     'errors': [{'message': 'Datetime values must be sorted in ascending order.',
...                 'data_check_name': 'DateTimeFormatDataCheck',
...                 'level': 'error',
...                 'details': {'columns': None, 'rows': None},
...                 'code': 'DATETIME_IS_NOT_MONOTONIC'}],
...     'actions': []}
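The two conditions exercised in these examples (even spacing and ascending order) can be approximated with the standard library alone. The sketch below is a simplified stand-in, not how evalml detects frequency: `check_datetimes` is a hypothetical helper, it assumes a fixed-length step (so calendar frequencies such as month-end would need different handling), and it treats repeated timestamps as not ascending.

```python
from datetime import datetime, timedelta


def check_datetimes(values):
    """Return 'ok', 'not_monotonic', or 'uneven_intervals' for a list of datetimes."""
    steps = [later - earlier for earlier, later in zip(values, values[1:])]
    if any(step <= timedelta(0) for step in steps):
        return "not_monotonic"
    if len(set(steps)) > 1:
        return "uneven_intervals"
    return "ok"


start = datetime(2021, 1, 1)
daily = [start + timedelta(days=i) for i in range(9)]
print(check_datetimes(daily))  # ok
print(check_datetimes(daily + [datetime(2021, 1, 31)]))  # uneven_intervals
print(check_datetimes(daily[::-1]))  # not_monotonic
```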

class evalml.data_checks.DateTimeNaNDataCheck[source]

Check each column in the input for datetime features and issue an error if NaN values are present.

Methods

name

Return a name describing the data check.

validate

Check if any datetime columns contain NaN values.

name(cls)

Return a name describing the data check.

validate(self, X, y=None)[source]

Check if any datetime columns contain NaN values.

Parameters
  • X (pd.DataFrame, np.ndarray) – Features.

  • y (pd.Series, np.ndarray) – Ignored. Defaults to None.

Returns

dict with a DataCheckError if NaN values are present in datetime columns.

Return type

dict

Examples

>>> import pandas as pd
>>> import numpy as np
...
>>> dates = [['2-1-21', '3-1-21'],
...         ['2-2-21', '3-2-21'],
...         ['2-3-21', '3-3-21'],
...         ['2-4-21', '3-4-21']]
>>> df = pd.DataFrame(dates, columns=['index', "days"])
>>> dt_nan_dc = DateTimeNaNDataCheck()
>>> assert dt_nan_dc.validate(df) == {'warnings': [], 'errors': [], 'actions': []}

The first value in the column “index” is replaced with NaT, which will raise an error in this data check.

>>> dates[0][0] = np.datetime64('NaT')
>>> df = pd.DataFrame(dates, columns=['index', "days"])
>>> assert dt_nan_dc.validate(df) == {
...     'warnings': [],
...     'errors': [{'message': 'Input datetime column(s) (index) contains NaN values. Please impute NaN values or drop these rows or columns.',
...                 'data_check_name': 'DateTimeNaNDataCheck',
...                 'level': 'error',
...                 'details': {'columns': ['index'], 'rows': None},
...                 'code': 'DATETIME_HAS_NAN'}],
...     'actions': []}

The value None will be treated the same way.

>>> dates[0][1] = None
>>> df = pd.DataFrame(dates, columns=['index', "days"])
>>> assert dt_nan_dc.validate(df) == {
...     'warnings': [],
...     'errors': [{'message': 'Input datetime column(s) (index, days) contains NaN values. Please impute NaN values or drop these rows or columns.',
...                 'data_check_name': 'DateTimeNaNDataCheck',
...                 'level': 'error',
...                 'details': {'columns': ['index', 'days'], 'rows': None},
...                 'code': 'DATETIME_HAS_NAN'}],
...     'actions': []}

As will pd.NA.

>>> dates[0][1] = pd.NA
>>> df = pd.DataFrame(dates, columns=['index', "days"])
>>> assert dt_nan_dc.validate(df) == {
...     'warnings': [],
...     'errors': [{'message': 'Input datetime column(s) (index, days) contains NaN values. Please impute NaN values or drop these rows or columns.',
...                 'data_check_name': 'DateTimeNaNDataCheck',
...                 'level': 'error',
...                 'details': {'columns': ['index', 'days'], 'rows': None},
...                 'code': 'DATETIME_HAS_NAN'}],
...     'actions': []}

class evalml.data_checks.DefaultDataChecks(problem_type, objective, n_splits=3, problem_configuration=None)[source]

A collection of basic data checks that is used by AutoML by default.

Includes:

  • HighlyNullDataCheck

  • HighlyNullRowsDataCheck

  • IDColumnsDataCheck

  • TargetLeakageDataCheck

  • InvalidTargetDataCheck

  • NoVarianceDataCheck

  • ClassImbalanceDataCheck (for classification problem types)

  • DateTimeNaNDataCheck

  • NaturalLanguageNaNDataCheck

  • TargetDistributionDataCheck (for regression problem types)

  • DateTimeFormatDataCheck (for time series problem types)

  • TimeSeriesParametersDataCheck (for time series problem types)

  • TimeSeriesSplittingDataCheck (for time series classification problem types)

Parameters
  • problem_type (str) – The problem type that is being validated. Can be regression, binary, or multiclass.

  • objective (str or ObjectiveBase) – Name or instance of the objective class.

  • n_splits (int) – The number of splits as determined by the data splitter being used. Defaults to 3.

  • datetime_column (str) – The name of the column containing datetime information to be used for time series problems. Defaults to "index", indicating that the datetime information is in the index of X or y.

Methods

validate

Inspect and validate the input data against data checks and return a list of warnings and errors if applicable.

validate(self, X, y=None)

Inspect and validate the input data against data checks and return a list of warnings and errors if applicable.

Parameters
  • X (pd.DataFrame, np.ndarray) – The input data of shape [n_samples, n_features]

  • y (pd.Series, np.ndarray) – The target data of length [n_samples]

Returns

Dictionary containing DataCheckMessage objects

Return type

dict

class evalml.data_checks.HighlyNullDataCheck(pct_null_col_threshold=0.95, pct_null_row_threshold=0.95)[source]

Check if there are any highly-null columns and rows in the input.

Parameters
  • pct_null_col_threshold (float) – If the percentage of NaN values in an input feature exceeds this amount, that column will be considered highly-null. Defaults to 0.95.

  • pct_null_row_threshold (float) – If the percentage of NaN values in an input row exceeds this amount, that row will be considered highly-null. Defaults to 0.95.
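The per-column and per-row null fractions that these thresholds are compared against can be sketched without pandas. This is an illustration only: `null_fractions` is a hypothetical helper, missing values are modeled as `None`, and the table mirrors the DataFrame used in the validate examples for this class. The `>=` comparison is an assumption based on the "or more null" wording of the example messages.

```python
def null_fractions(table):
    """Return (per-column, per-row) fractions of None values in a dict of columns."""
    columns = list(table)
    n_rows = len(table[columns[0]])
    by_column = {c: sum(v is None for v in table[c]) / n_rows for c in columns}
    by_row = [sum(table[c][i] is None for c in columns) / len(columns) for i in range(n_rows)]
    return by_column, by_row


table = {
    "all_null": [None, None, None, None, None],
    "lots_of_null": [None, None, None, None, 5],
    "few_null": ["near", "far", None, "wherever", "nowhere"],
    "no_null": [1, 2, 3, 4, 5],
}
by_column, by_row = null_fractions(table)
threshold = 0.5
print([c for c, frac in by_column.items() if frac >= threshold])  # ['all_null', 'lots_of_null']
print([i for i, frac in enumerate(by_row) if frac >= threshold])  # [0, 1, 2, 3]
```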

Methods

get_null_column_information

Finds columns that are considered highly null (percentage null is greater than threshold) and returns dictionary mapping column name to percentage null and dictionary mapping column name to null indices.

get_null_row_information

Finds rows that are considered highly null (percentage null is greater than threshold).

name

Return a name describing the data check.

validate

Check if there are any highly-null columns or rows in the input.

static get_null_column_information(X, pct_null_col_threshold=0.0)[source]

Finds columns that are considered highly null (percentage null is greater than threshold) and returns dictionary mapping column name to percentage null and dictionary mapping column name to null indices.

Parameters
  • X (pd.DataFrame) – DataFrame to check for highly null columns.

  • pct_null_col_threshold (float) – Percentage threshold for a column to be considered null. Defaults to 0.0.

Returns

Tuple containing: dictionary mapping column name to its null percentage and dictionary mapping column name to null indices in that column.

Return type

tuple

static get_null_row_information(X, pct_null_row_threshold=0.0)[source]

Finds rows that are considered highly null (percentage null is greater than threshold).

Parameters
  • X (pd.DataFrame) – DataFrame to check for highly null rows.

  • pct_null_row_threshold (float) – Percentage threshold for a row to be considered null. Defaults to 0.0.

Returns

Series containing the percentage null for each row.

Return type

pd.Series

name(cls)

Return a name describing the data check.

validate(self, X, y=None)[source]

Check if there are any highly-null columns or rows in the input.

Parameters
  • X (pd.DataFrame, np.ndarray) – Features.

  • y (pd.Series, np.ndarray) – Ignored. Defaults to None.

Returns

dict with a DataCheckWarning if there are any highly-null columns or rows.

Return type

dict

Examples

>>> import pandas as pd
...
>>> class SeriesWrap():
...     def __init__(self, series):
...         self.series = series
...
...     def __eq__(self, series_2):
...         return all(self.series.eq(series_2.series))

With pct_null_col_threshold set to 0.50, any column that has 50% or more of its observations set to null will be included in the warning, as well as the percentage of null values identified (“all_null”: 1.0, “lots_of_null”: 0.8).

>>> df = pd.DataFrame({
...     'all_null': [None, pd.NA, None, None, None],
...     'lots_of_null': [None, None, None, None, 5],
...     'few_null': ["near", "far", pd.NaT, "wherever", "nowhere"],
...     'no_null': [1, 2, 3, 4, 5]
... })
...
>>> highly_null_dc = HighlyNullDataCheck(pct_null_col_threshold=0.50)
>>> assert highly_null_dc.validate(df) == {
...     'warnings': [{'message': "Columns 'all_null', 'lots_of_null' are 50.0% or more null",
...                   'data_check_name': 'HighlyNullDataCheck',
...                   'level': 'warning',
...                   'details': {'columns': ['all_null', 'lots_of_null'],
...                               'rows': None,
...                               'pct_null_rows': {'all_null': 1.0, 'lots_of_null': 0.8}},
...                   'code': 'HIGHLY_NULL_COLS'}],
...     'errors': [],
...     'actions': [{'code': 'DROP_COL',
...                  'data_check_name': 'HighlyNullDataCheck',
...                  'metadata': {'columns': ['all_null', 'lots_of_null'], 'rows': None}}]}

With pct_null_row_threshold set to 0.50, any row with 50% or more of its respective column values set to null will be included in the warning, as well as the offending rows (“rows”: [0, 1, 2, 3]). Since the default value for pct_null_col_threshold is 0.95, “all_null” is also included in the warnings since the percentage of null values in that column is over 95%.

>>> highly_null_dc = HighlyNullDataCheck(pct_null_row_threshold=0.50)
>>> validation_results = highly_null_dc.validate(df)
>>> validation_results['warnings'][0]['details']['pct_null_cols'] = SeriesWrap(validation_results['warnings'][0]['details']['pct_null_cols'])
>>> highly_null_rows = SeriesWrap(pd.Series([0.5, 0.5, 0.75, 0.5]))
>>> assert validation_results == {
...     'warnings': [{'message': '4 out of 5 rows are 50.0% or more null',
...                   'data_check_name': 'HighlyNullDataCheck',
...                   'level': 'warning',
...                   'details': {'columns': None,
...                               'rows': [0, 1, 2, 3],
...                               'pct_null_cols': highly_null_rows},
...                   'code': 'HIGHLY_NULL_ROWS'},
...                  {'message': "Columns 'all_null' are 95.0% or more null",
...                   'data_check_name': 'HighlyNullDataCheck',
...                   'level': 'warning',
...                   'details': {'columns': ['all_null'],
...                               'rows': None,
...                               'pct_null_rows': {'all_null': 1.0}},
...                   'code': 'HIGHLY_NULL_COLS'}],
...     'errors': [],
...     'actions': [{'code': 'DROP_ROWS',
...                  'data_check_name': 'HighlyNullDataCheck',
...                  'metadata': {'columns': None, 'rows': [0, 1, 2, 3]}},
...                 {'code': 'DROP_COL',
...                  'data_check_name': 'HighlyNullDataCheck',
...                  'metadata': {'columns': ['all_null'], 'rows': None}}]}
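
The column and row fractions reported in the examples above can be reproduced by hand. The following is a minimal sketch in plain Python, not the evalml implementation: the helper names null_fraction_by_column and null_fraction_by_row are hypothetical, and only None is treated as null here (the examples above also use pd.NA and pd.NaT).

```python
# Hypothetical helpers illustrating the two thresholds; not part of evalml.
def null_fraction_by_column(table):
    """Return {column name: fraction of entries that are None}."""
    return {name: sum(v is None for v in values) / len(values)
            for name, values in table.items()}


def null_fraction_by_row(table):
    """Return a list with the fraction of None entries in each row."""
    rows = zip(*table.values())
    return [sum(v is None for v in row) / len(row) for row in rows]


# Same shape as the DataFrame above, with every null written as None.
table = {
    "all_null": [None, None, None, None, None],
    "lots_of_null": [None, None, None, None, 5],
    "few_null": ["near", "far", None, "wherever", "nowhere"],
    "no_null": [1, 2, 3, 4, 5],
}

col_fractions = null_fraction_by_column(table)
row_fractions = null_fraction_by_row(table)

# Columns and rows at or above a 0.50 threshold.
highly_null_cols = [name for name, frac in col_fractions.items() if frac >= 0.50]
highly_null_rows = [i for i, frac in enumerate(row_fractions) if frac >= 0.50]
```

With a threshold of 0.50 this selects the columns “all_null” and “lots_of_null” and the rows 0 through 3, matching the warnings shown above.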
class evalml.data_checks.IDColumnsDataCheck(id_threshold=1.0)[source]

Check if any of the features are likely to be ID columns.

Parameters

id_threshold (float) – The probability threshold to be considered an ID column. Defaults to 1.0.

Methods

name

Return a name describing the data check.

validate

Check if any of the features are likely to be ID columns. Currently performs a number of simple checks.

name(cls)

Return a name describing the data check.

validate(self, X, y=None)[source]

Check if any of the features are likely to be ID columns. Currently performs a number of simple checks.

Checks performed are:

  • column name is “id”

  • column name ends in “_id”

  • column contains all unique values (and is categorical / integer type)

Parameters
  • X (pd.DataFrame, np.ndarray) – The input features to check.

  • y (pd.Series) – The target. Defaults to None. Ignored.

Returns

A dictionary of features with column name or index and their probability of being ID columns

Return type

dict

Examples

>>> import pandas as pd

Columns that end in “_id” and are completely unique are likely to be ID columns.

>>> df = pd.DataFrame({
...     'customer_id': [123, 124, 125, 126, 127],
...     'Sales': [10, 42, 31, 51, 61]
... })
...
>>> id_col_check = IDColumnsDataCheck()
>>> assert id_col_check.validate(df) == {
...     "errors": [],
...     "warnings": [{"message": "Columns 'customer_id' are 100.0% or more likely to be an ID column",
...                   "data_check_name": "IDColumnsDataCheck",
...                   "level": "warning",
...                   "code": "HAS_ID_COLUMN",
...                   "details": {"columns": ["customer_id"], "rows": None}}],
...     "actions": [{"code": "DROP_COL",
...                  "data_check_name": "IDColumnsDataCheck",
...                  "metadata": {"columns": ["customer_id"], "rows": None}}]}

Columns named “ID” with all unique values will also be identified as ID columns.

>>> df = df.rename(columns={"customer_id": "ID"})
>>> id_col_check = IDColumnsDataCheck()
>>> assert id_col_check.validate(df) == {
...     "errors": [],
...     "warnings": [{"message": "Columns 'ID' are 100.0% or more likely to be an ID column",
...                   "data_check_name": "IDColumnsDataCheck",
...                   "level": "warning",
...                   "code": "HAS_ID_COLUMN",
...                   "details": {"columns": ["ID"], "rows": None}}],
...     "actions": [{"code": "DROP_COL",
...                  "data_check_name": "IDColumnsDataCheck",
...                  "metadata": {"columns": ["ID"], "rows": None}}]}

Despite being all unique, “Country_Rank” will not be identified as an ID column as id_threshold is set to 1.0 by default and its name doesn’t indicate that it’s an ID.

>>> df = pd.DataFrame({
...    'Country_Rank': [1, 2, 3, 4, 5],
...    'Sales': ["very high", "high", "high", "medium", "very low"]
... })
...
>>> id_col_check = IDColumnsDataCheck()
>>> assert id_col_check.validate(df) == {'warnings': [], 'errors': [], 'actions': []}

However, lowering the threshold will cause this column to be identified as an ID.

>>> id_col_check = IDColumnsDataCheck(id_threshold=0.95)
>>> assert id_col_check.validate(df) == {
...     'warnings': [{'message': "Columns 'Country_Rank' are 95.0% or more likely to be an ID column",
...                   'data_check_name': 'IDColumnsDataCheck',
...                   'level': 'warning',
...                   'details': {'columns': ['Country_Rank'], 'rows': None},
...                   'code': 'HAS_ID_COLUMN'}],
...     'errors': [],
...     'actions': [{'code': 'DROP_COL',
...                  'data_check_name': 'IDColumnsDataCheck',
...                  'metadata': {'columns': ['Country_Rank'], 'rows': None}}]}
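
The three checks listed under validate can be combined into a score per column. The sketch below is an illustration only: the scoring rule (0.95 when one check matches, 1.0 when two or more match) is an assumption chosen to agree with the examples above, not a confirmed description of evalml, and the helper names id_column_scores and likely_id_columns are hypothetical.

```python
# Illustrative sketch only; the scoring rule is an assumption that agrees
# with the documented examples, not evalml's confirmed implementation.
def id_column_scores(table):
    """Score each column: 0.95 for one matching check, 1.0 for two or more."""
    scores = {}
    for name, values in table.items():
        lowered = str(name).lower()
        hits = 0
        if lowered == "id":
            hits += 1
        if lowered.endswith("_id"):
            hits += 1
        # All values unique; the categorical / integer restriction is
        # approximated here by accepting only int and str values.
        if (all(isinstance(v, (int, str)) for v in values)
                and len(set(values)) == len(values)):
            hits += 1
        if hits:
            scores[name] = 1.0 if hits >= 2 else 0.95
    return scores


def likely_id_columns(table, id_threshold=1.0):
    """Return the columns whose score reaches id_threshold."""
    return [name for name, score in id_column_scores(table).items()
            if score >= id_threshold]


customers = {
    "customer_id": [123, 124, 125, 126, 127],
    "Sales": [10, 42, 31, 51, 61],
}
ranked = {
    "Country_Rank": [1, 2, 3, 4, 5],
    "Sales": ["very high", "high", "high", "medium", "very low"],
}
```

Under these assumptions, likely_id_columns(customers) returns only “customer_id”, while “Country_Rank” is returned only once id_threshold is lowered to 0.95.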
class evalml.data_checks.InvalidTargetDataCheck(problem_type, objective, n_unique=100)[source]

Check if the target data is considered invalid.

Target data is considered invalid if:
  • Target is None.

  • Target has NaN or None values.

  • Target is of an unsupported Woodwork logical type.

  • Target and features have different lengths or indices.

  • Target does not have enough instances of a class in a classification problem.

  • Target does not contain numeric data for regression problems.

Parameters
  • problem_type (str or ProblemTypes) – The specific problem type to data check for. e.g. ‘binary’, ‘multiclass’, ‘regression’, ‘time series regression’

  • objective (str or ObjectiveBase) – Name or instance of the objective class.

  • n_unique (int) – Number of unique target values to store when problem type is binary and target incorrectly has more than 2 unique values. Non-negative integer. If None, stores all unique values. Defaults to 100.

Attributes

multiclass_continuous_threshold

0.05

Methods

name

Return a name describing the data check.

validate

Check if the target data is considered invalid. If the input features argument is not None, it will be used to check that the target and features have the same dimensions and indices.

name(cls)

Return a name describing the data check.

validate(self, X, y)[source]

Check if the target data is considered invalid. If the input features argument is not None, it will be used to check that the target and features have the same dimensions and indices.

Target data is considered invalid if:
  • Target is None.

  • Target has NaN or None values.

  • Target is of an unsupported Woodwork logical type.

  • Target and features have different lengths or indices.

  • Target does not have enough instances of a class in a classification problem.

  • Target does not contain numeric data for regression problems.

Parameters
  • X (pd.DataFrame, np.ndarray) – Features. If not None, will be used to check that the target and features have the same dimensions and indices.

  • y (pd.Series, np.ndarray) – Target data to check for invalid values.

Returns

dict with DataCheckErrors if any invalid values are found in the target data.

Return type

dict (DataCheckError)

Examples

>>> import pandas as pd

Target values must be integers, doubles, or booleans.

>>> X = pd.DataFrame({"col": [1, 2, 3, 1]})
>>> y = pd.Series(["cat_1", "cat_2", "cat_1", "cat_2"])
>>> target_check = InvalidTargetDataCheck('regression', 'R2')
>>> assert target_check.validate(X, y) == {
...     'warnings': [],
...     'errors': [{'message': 'Target is unsupported Unknown type. Valid Woodwork logical types include: integer, double, boolean, integer_nullable, boolean_nullable, age_nullable',
...                 'data_check_name': 'InvalidTargetDataCheck',
...                 'level': 'error',
...                 'details': {'columns': None, 'rows': None, 'unsupported_type': 'unknown'},
...                 'code': 'TARGET_UNSUPPORTED_TYPE'},
...                {'message': 'Target data type should be numeric for regression type problems.',
...                 'data_check_name': 'InvalidTargetDataCheck',
...                 'level': 'error',
...                 'details': {'columns': None, 'rows': None},
...                 'code': 'TARGET_UNSUPPORTED_TYPE'}],
...     'actions': []}

The target cannot have null values.

>>> y = pd.Series([None, pd.NA, pd.NaT, None])
>>> assert target_check.validate(X, y) == {
...     'warnings': [],
...     'errors': [{'message': 'Target is either empty or fully null.',
...                 'data_check_name': 'InvalidTargetDataCheck',
...                 'level': 'error',
...                 'details': {'columns': None, 'rows': None},
...                 'code': 'TARGET_IS_EMPTY_OR_FULLY_NULL'}],
...     'actions': []}
...
...
>>> y = pd.Series([1, None, 3, None])
>>> assert target_check.validate(None, y) == {
...     'warnings': [],
...     'errors': [{'message': '2 row(s) (50.0%) of target values are null',
...                 'data_check_name': 'InvalidTargetDataCheck',
...                 'level': 'error',
...                 'details': {'columns': None,
...                             'rows': None,
...                             'num_null_rows': 2,
...                             'pct_null_rows': 50.0},
...                 'code': 'TARGET_HAS_NULL'}],
...     'actions': [{'code': 'IMPUTE_COL',
...                  'data_check_name': 'InvalidTargetDataCheck',
...                  'metadata': {'columns': None,
...                               'rows': None,
...                               'is_target': True,
...                               'impute_strategy': 'mean'}}]}

If the target values don’t match the problem type passed, an error will be raised. In this instance, only two values exist in the target column, but multiclass has been passed as the problem type.

>>> X = pd.DataFrame([i for i in range(50)])
>>> y = pd.Series([i%2 for i in range(50)])
>>> target_check = InvalidTargetDataCheck('multiclass', 'Log Loss Multiclass')
>>> assert target_check.validate(X, y) == {
...     'warnings': [],
...     'errors': [{'message': 'Target has two or less classes, which is too few for multiclass problems.  Consider changing to binary.',
...                 'data_check_name': 'InvalidTargetDataCheck',
...                 'level': 'error',
...                 'details': {'columns': None, 'rows': None, 'num_classes': 2},
...                 'code': 'TARGET_MULTICLASS_NOT_ENOUGH_CLASSES'}],
...     'actions': []}

If the lengths of X and y differ, a warning will be raised. A warning will also be raised for indices that don’t match.

>>> target_check = InvalidTargetDataCheck('regression', 'R2')
>>> X = pd.DataFrame([i for i in range(5)])
>>> y = pd.Series([1, 2, 4, 3], index=[1, 2, 4, 3])
>>> assert target_check.validate(X, y) == {
...     'warnings': [{'message': 'Input target and features have different lengths',
...                   'data_check_name': 'InvalidTargetDataCheck',
...                   'level': 'warning',
...                   'details': {'columns': None,
...                               'rows': None,
...                               'features_length': 5,
...                               'target_length': 4},
...                   'code': 'MISMATCHED_LENGTHS'},
...                  {'message': 'Input target and features have mismatched indices. Details will include the first 10 mismatched indices.',
...                   'data_check_name': 'InvalidTargetDataCheck',
...                   'level': 'warning',
...                   'details': {'columns': None,
...                               'rows': None,
...                               'indices_not_in_features': [],
...                               'indices_not_in_target': [0]},
...                   'code': 'MISMATCHED_INDICES'}],
...     'errors': [],
...     'actions': []}
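
The length and index comparison in the last example amounts to two set differences. A minimal sketch, assuming a hypothetical helper compare_indices whose output keys mirror the “details” entries above; the cap of 10 comes from the warning message, while sorting the mismatches is an assumption.

```python
# Hypothetical helper; it mirrors the shape of the 'details' entries above.
def compare_indices(feature_index, target_index, limit=10):
    """Return mismatched labels between two indices, capped at `limit` each."""
    features = set(feature_index)
    target = set(target_index)
    return {
        "features_length": len(feature_index),
        "target_length": len(target_index),
        "indices_not_in_features": sorted(target - features)[:limit],
        "indices_not_in_target": sorted(features - target)[:limit],
    }


# Five feature rows indexed 0..4 against a target indexed [1, 2, 4, 3].
details = compare_indices(feature_index=range(5), target_index=[1, 2, 4, 3])
```

Here index 0 appears in the features but not in the target, so it is the only entry in “indices_not_in_target”.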
class evalml.data_checks.MulticollinearityDataCheck(threshold=0.9)[source]

Check if any set of features are likely to be multicollinear.

Parameters

threshold (float) – The threshold to be considered. Defaults to 0.9.

Methods

name

Return a name describing the data check.

validate

Check if any set of features are likely to be multicollinear.

name(cls)

Return a name describing the data check.

validate(self, X, y=None)[source]

Check if any set of features are likely to be multicollinear.

Parameters
  • X (pd.DataFrame) – The input features to check.

  • y (pd.Series) – The target. Ignored.

Returns

dict with a DataCheckWarning if there are any potentially multicollinear columns.

Return type

dict

Example

>>> import pandas as pd

Columns in X that are highly correlated with each other will be identified using mutual information.

>>> col = pd.Series([1, 0, 2, 3, 4])
>>> X = pd.DataFrame({"col_1": col, "col_2": col * 3})
>>> y = pd.Series([1, 0, 0, 1, 0])
...
>>> multicollinearity_check = MulticollinearityDataCheck(threshold=1.0)
>>> assert multicollinearity_check.validate(X, y) == {
...     "errors": [],
...     "warnings": [{'message': "Columns are likely to be correlated: [('col_1', 'col_2')]",
...                   "data_check_name": "MulticollinearityDataCheck",
...                   "level": "warning",
...                   "code": "IS_MULTICOLLINEAR",
...                   'details': {"columns": [('col_1', 'col_2')], "rows": None}}],
...     "actions": []}
class evalml.data_checks.NaturalLanguageNaNDataCheck[source]

Checks each column in the input for natural language features and will issue an error if NaN values are present.

Methods

name

Return a name describing the data check.

validate

Check if any natural language columns contain NaN values.

name(cls)

Return a name describing the data check.

validate(self, X, y=None)[source]

Check if any natural language columns contain NaN values.

Parameters
  • X (pd.DataFrame, np.ndarray) – Features.

  • y (pd.Series, np.ndarray) – Ignored. Defaults to None.

Returns

dict with a DataCheckError if NaN values are present in natural language columns.

Return type

dict

Example

>>> import pandas as pd
>>> import woodwork as ww
>>> import numpy as np

Columns containing Natural Language data will raise an error if NaN values are present.

>>> data = pd.DataFrame()
>>> data['A'] = [None, "string_that_is_long_enough_for_natural_language"]
>>> data['B'] = ['string_that_is_long_enough_for_natural_language', 'string_that_is_long_enough_for_natural_language']
>>> data['C'] = np.random.randint(0, 3, size=len(data))
>>> data.ww.init(logical_types={'A': 'NaturalLanguage', 'B': 'NaturalLanguage'})
...
>>> nl_nan_check = NaturalLanguageNaNDataCheck()
>>> assert nl_nan_check.validate(data) == {
...        "warnings": [],
...        "actions": [],
...        "errors": [DataCheckError(message='Input natural language column(s) (A) contains NaN values. Please impute NaN values or drop these rows or columns.',
...                      data_check_name=NaturalLanguageNaNDataCheck.name,
...                      message_code=DataCheckMessageCode.NATURAL_LANGUAGE_HAS_NAN,
...                      details={"columns": ['A']}).to_dict()]
...    }
class evalml.data_checks.NoVarianceDataCheck(count_nan_as_value=False)[source]

Check if the target or any of the features have no variance.

Parameters

count_nan_as_value (bool) – If True, missing values will be counted as their own unique value. Additionally, if True, a DataCheckWarning is returned instead of an error when the feature has mostly missing data and only one unique value. Defaults to False.

Methods

name

Return a name describing the data check.

validate

Check if the target or any of the features have no variance (1 unique value).

name(cls)

Return a name describing the data check.

validate(self, X, y)[source]

Check if the target or any of the features have no variance (1 unique value).

Parameters
  • X (pd.DataFrame, np.ndarray) – The input features.

  • y (pd.Series, np.ndarray) – The target data.

Returns

A dict of warnings/errors corresponding to features or target with no variance.

Return type

dict

Examples

>>> import pandas as pd

Columns or target data that have only one unique value will raise an error.

>>> X = pd.DataFrame([2, 2, 2, 2, 2, 2, 2, 2], columns=["First_Column"])
>>> y = pd.Series([1, 1, 1, 1, 1, 1, 1, 1])
...
>>> novar_dc = NoVarianceDataCheck()
>>> assert novar_dc.validate(X, y) == {
...     'warnings': [],
...     'errors': [{'message': "'First_Column' has 1 unique value.",
...                 'data_check_name': 'NoVarianceDataCheck',
...                 'level': 'error',
...                 'details': {'columns': ['First_Column'], 'rows': None},
...                 'code': 'NO_VARIANCE'},
...                {'message': 'Y has 1 unique value.',
...                 'data_check_name': 'NoVarianceDataCheck',
...                 'level': 'error',
...                 'details': {'columns': ['Y'], 'rows': None},
...                 'code': 'NO_VARIANCE'}],
...     'actions': [{'code': 'DROP_COL',
...                  'data_check_name': 'NoVarianceDataCheck',
...                  'metadata': {'columns': ["First_Column"], 'rows': None}}]}

By default, NaNs will not be counted as distinct values. In the first example, there are still two distinct values besides None. In the second, there are no distinct values as the target is entirely null.

>>> X["First_Column"] = [2, 2, 2, 3, 3, 3, None, None]
>>> y = pd.Series([1, 1, 1, 2, 2, 2, None, None])
>>> assert novar_dc.validate(X, y) == {'warnings': [], 'errors': [], 'actions': []}
...
...
>>> y = pd.Series([None] * 7)
>>> assert novar_dc.validate(X, y) == {
...     'warnings': [],
...     'errors': [{'message': 'Y has 0 unique values.',
...                 'data_check_name': 'NoVarianceDataCheck',
...                 'level': 'error',
...                 'details': {'columns': ['Y'], 'rows': None},
...                 'code': 'NO_VARIANCE'}],
...     'actions': []}

As None is not considered a distinct value by default, there is only one unique value in X and y.

>>> X["First_Column"] = [2, 2, 2, 2, None, None, None, None]
>>> y = pd.Series([1, 1, 1, 1, None, None, None, None])
>>> assert novar_dc.validate(X, y) == {
...     'warnings': [],
...     'errors': [{'message': "'First_Column' has 1 unique value.",
...                 'data_check_name': 'NoVarianceDataCheck',
...                 'level': 'error',
...                 'details': {'columns': ['First_Column'], 'rows': None},
...                 'code': 'NO_VARIANCE'},
...                {'message': 'Y has 1 unique value.',
...                 'data_check_name': 'NoVarianceDataCheck',
...                 'level': 'error',
...                 'details': {'columns': ['Y'], 'rows': None},
...                 'code': 'NO_VARIANCE'}],
...     'actions': [{'code': 'DROP_COL',
...                  'data_check_name': 'NoVarianceDataCheck',
...                  'metadata': {'columns': ['First_Column'], 'rows': None}}]}

If count_nan_as_value is set to True, then NaNs are counted as unique values. In the event that there is an adequate number of unique values only because count_nan_as_value is set to True, a warning will be raised so the user can encode these values.

>>> novar_dc = NoVarianceDataCheck(count_nan_as_value=True)
>>> assert novar_dc.validate(X, y) == {
...     'warnings': [{'message': "'First_Column' has two unique values including nulls. Consider encoding the nulls for this column to be useful for machine learning.",
...                   'data_check_name': 'NoVarianceDataCheck',
...                   'level': 'warning',
...                   'details': {'columns': ['First_Column'], 'rows': None},
...                   'code': 'NO_VARIANCE_WITH_NULL'},
...                  {'message': 'Y has two unique values including nulls. Consider encoding the nulls for this column to be useful for machine learning.',
...                   'data_check_name': 'NoVarianceDataCheck',
...                   'level': 'warning',
...                   'details': {'columns': ['Y'], 'rows': None},
...                   'code': 'NO_VARIANCE_WITH_NULL'}],
...     'errors': [],
...     'actions': [{'code': 'DROP_COL',
...                  'data_check_name': 'NoVarianceDataCheck',
...                  'metadata': {'columns': ['First_Column'], 'rows': None}}]}
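
The effect of count_nan_as_value on the unique-value count can be sketched in plain Python. The helper count_unique is hypothetical, and None stands in for NaN.

```python
# Hypothetical helper, written in plain Python with None standing in for NaN.
def count_unique(values, count_nan_as_value=False):
    """Count distinct values, optionally treating None as its own value."""
    distinct = {v for v in values if v is not None}
    has_null = any(v is None for v in values)
    return len(distinct) + (1 if count_nan_as_value and has_null else 0)


column = [2, 2, 2, 2, None, None, None, None]
all_null = [None] * 7

default_count = count_unique(column)                          # nulls ignored
with_nulls = count_unique(column, count_nan_as_value=True)    # nulls counted
```

For the column above, the default count is 1 (no variance, reported as an error), while counting nulls gives 2 (reported as a warning in the last example). A fully null column has 0 unique values by default.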
class evalml.data_checks.OutliersDataCheck[source]

Checks if there are any outliers in input data by using IQR to determine score anomalies.

Columns with score anomalies are considered to contain outliers.

Methods

get_boxplot_data

Returns box plot information for the given data.

name

Return a name describing the data check.

validate

Check if there are any outliers in a dataframe by using IQR to determine column anomalies. Columns with anomalies are considered to contain outliers.

static get_boxplot_data(data_)[source]

Returns box plot information for the given data.

Parameters

data_ (pd.Series, np.ndarray) – Input data.

Returns

A payload of box plot statistics.

Return type

dict

Examples

>>> import pandas as pd
...
>>> df = pd.DataFrame({
...     'x': [1, 2, 3, 4, 5],
...     'y': [6, 7, 8, 9, 10],
...     'z': [-1, -2, -3, -1201, -4]
... })
>>> box_plot_data = OutliersDataCheck.get_boxplot_data(df['z'])
>>> box_plot_data["score"] = round(box_plot_data["score"], 2)
>>> assert box_plot_data == {
...     'score': 0.89,
...     'pct_outliers': 0.2,
...     'values': {'q1': -4.0,
...                'median': -3.0,
...                'q3': -2.0,
...                'low_bound': -7.0,
...                'high_bound': 1.0,
...                'low_values': [-1201],
...                'high_values': [],
...                'low_indices': [3],
...                'high_indices': []}
...     }
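
The quartiles and bounds in the “values” entry above are consistent with the usual 1.5 × IQR rule. The sketch below reproduces them with the standard library; the multiplier, the quantile method, and the helper name iqr_outliers are assumptions, and the “score” and “pct_outliers” entries are not reproduced.

```python
from statistics import quantiles


# Sketch of an IQR rule; the 1.5 multiplier and the quantile method are
# assumptions that happen to match the bounds in the example above.
def iqr_outliers(data, k=1.5):
    """Return quartiles, bounds, and the values/indices outside the bounds."""
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_bound, high_bound = q1 - k * iqr, q3 + k * iqr
    low = [(i, v) for i, v in enumerate(data) if v < low_bound]
    high = [(i, v) for i, v in enumerate(data) if v > high_bound]
    return {
        "q1": q1, "median": median, "q3": q3,
        "low_bound": low_bound, "high_bound": high_bound,
        "low_values": [v for _, v in low], "high_values": [v for _, v in high],
        "low_indices": [i for i, _ in low], "high_indices": [i for i, _ in high],
    }


result = iqr_outliers([-1, -2, -3, -1201, -4])
```

For the “z” column this gives an IQR of 2.0, bounds of -7.0 and 1.0, and a single low outlier, -1201, at index 3.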
name(cls)

Return a name describing the data check.

validate(self, X, y=None)[source]

Check if there are any outliers in a dataframe by using IQR to determine column anomalies. Columns with anomalies are considered to contain outliers.

Parameters
  • X (pd.DataFrame, np.ndarray) – Input features.

  • y (pd.Series, np.ndarray) – Ignored. Defaults to None.

Returns

A dictionary with warnings if any columns have outliers.

Return type

dict

Examples

>>> import pandas as pd

The column “z” has an outlier so a warning is added to alert the user of its location.

>>> df = pd.DataFrame({
...     'x': [1, 2, 3, 4, 5],
...     'y': [6, 7, 8, 9, 10],
...     'z': [-1, -2, -3, -1201, -4]
... })
...
>>> outliers_check = OutliersDataCheck()
>>> assert outliers_check.validate(df) == {
...     "warnings": [{"message": "Column(s) 'z' are likely to have outlier data.",
...                   "data_check_name": "OutliersDataCheck",
...                   "level": "warning",
...                   "code": "HAS_OUTLIERS",
...                   "details": {"columns": ["z"], "rows": [3], "column_indices": {"z": [3]}}}],
...     "errors": [],
...     "actions": [{"code": "DROP_ROWS",
...                  "data_check_name": "OutliersDataCheck",
...                  "metadata": {"rows": [3], "columns": None}}]}
class evalml.data_checks.SparsityDataCheck(problem_type, threshold, unique_count_threshold=10)[source]

Check if there are any columns with sparsely populated values in the input.

Parameters
  • problem_type (str or ProblemTypes) – The specific problem type to data check for. ‘multiclass’ or ‘time series multiclass’ is the only accepted problem type.

  • threshold (float) – The threshold value, expressed as a fraction of each column’s unique values, below which a column is considered sparse. Should be between 0 and 1.

  • unique_count_threshold (int) – The minimum number of times a unique value has to be present in a column to not be considered “sparse.” Defaults to 10.

Methods

name

Return a name describing the data check.

sparsity_score

Calculate a sparsity score for the given value counts by calculating the percentage of unique values that exceed the count_threshold.

validate

Calculate what percentage of each column’s unique values exceed the count threshold and compare that percentage to the sparsity threshold stored in the class instance.

name(cls)

Return a name describing the data check.

static sparsity_score(col, count_threshold=10)[source]

Calculate a sparsity score for the given value counts by calculating the percentage of unique values that exceed the count_threshold.

Parameters
  • col (pd.Series) – Feature values.

  • count_threshold (int) – The number of instances below which a value is considered sparse. Default is 10.

Returns

Sparsity score, or the percentage of the unique values that exceed count_threshold.

Return type

(float)

validate(self, X, y=None)[source]

Calculate what percentage of each column’s unique values exceed the count threshold and compare that percentage to the sparsity threshold stored in the class instance.

Parameters
  • X (pd.DataFrame, np.ndarray) – Features.

  • y (pd.Series, np.ndarray) – Ignored.

Returns

dict with a DataCheckWarning if there are any sparse columns.

Return type

dict

Examples

>>> import pandas as pd

For multiclass problems, if a column doesn’t have enough representation from unique values, it will be considered sparse.

>>> df = pd.DataFrame({
...    'sparse': [float(x) for x in range(100)],
...    'not_sparse': [float(1) for x in range(100)]
... })
...
>>> sparsity_check = SparsityDataCheck(problem_type="multiclass", threshold=0.5, unique_count_threshold=10)
>>> assert sparsity_check.validate(df) == {
...     "errors": [],
...     "warnings": [{"message": "Input columns ('sparse') for multiclass problem type are too sparse.",
...                   "data_check_name": "SparsityDataCheck",
...                    "level": "warning",
...                    "code": "TOO_SPARSE",
...                    "details": {"columns": ["sparse"], "sparsity_score": {"sparse": 0.0}, "rows": None}}],
...     "actions": [{"code": "DROP_COL",
...                  "data_check_name": "SparsityDataCheck",
...                  "metadata": {"columns": ["sparse"], "rows": None}}]}
>>> df['sparse'] = [float(x % 10) for x in range(100)]
>>> sparsity_check = SparsityDataCheck(problem_type="multiclass", threshold=1, unique_count_threshold=5)
>>> assert sparsity_check.validate(df) == {'warnings': [], 'errors': [], 'actions': []}
>>> sparse_array = pd.Series([1, 1, 1, 2, 2, 3] * 3)
>>> assert SparsityDataCheck.sparsity_score(sparse_array, count_threshold=5) == 0.6666666666666666
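
The score in the last example can be reproduced without pandas. This is a sketch rather than the library code: the standalone function sparsity_score below is a hypothetical reimplementation, and treating “exceed” as “at least count_threshold” is an assumption based on unique_count_threshold being described as a minimum count.

```python
from collections import Counter


# Sketch only. Treating "exceed" as >= is an assumption, based on
# unique_count_threshold being described as a minimum count.
def sparsity_score(values, count_threshold=10):
    """Fraction of unique values that occur at least count_threshold times."""
    counts = Counter(values)
    well_represented = sum(1 for n in counts.values() if n >= count_threshold)
    return well_represented / len(counts)


# The values 1, 2, 3 occur 9, 6 and 3 times; two of the three reach 5.
score = sparsity_score([1, 1, 1, 2, 2, 3] * 3, count_threshold=5)
```

A column of 100 distinct values scores 0.0 with the default threshold, which is why “sparse” is flagged in the first example above.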
class evalml.data_checks.TargetDistributionDataCheck[source]

Check if the target data contains certain distributions that may need to be transformed prior to training to improve model performance. Uses the Shapiro-Wilk test when the dataset has <=5000 samples, otherwise uses Jarque-Bera.

Methods

name

Return a name describing the data check.

validate

Check if the target data has a certain distribution.

name(cls)

Return a name describing the data check.

validate(self, X, y)[source]

Check if the target data has a certain distribution.

Parameters
  • X (pd.DataFrame, np.ndarray) – Features. Ignored.

  • y (pd.Series, np.ndarray) – Target data to check for underlying distributions.

Returns

dict with a DataCheckWarning if certain distributions are found in the target data, or a DataCheckError if the target type is unsupported.

Return type

dict (DataCheckError)

Examples

>>> import pandas as pd

Targets that exhibit a lognormal distribution will raise a warning for the user to transform the target.

>>> y = [0.946, 0.972, 1.154, 0.954, 0.969, 1.222, 1.038, 0.999, 0.973, 0.897]
>>> target_check = TargetDistributionDataCheck()
>>> assert target_check.validate(None, y) == {
...     "errors": [],
...     "warnings": [{"message": "Target may have a lognormal distribution.",
...                   "data_check_name": "TargetDistributionDataCheck",
...                   "level": "warning",
...                   "code": "TARGET_LOGNORMAL_DISTRIBUTION",
...                   "details": {"normalization_method": "shapiro", "statistic": 0.8, "p-value": 0.045, "columns": None, "rows": None}}],
...     "actions": [{'code': 'TRANSFORM_TARGET',
...                  "data_check_name": "TargetDistributionDataCheck",
...                  'metadata': {'transformation_strategy': 'lognormal',
...                               'is_target': True,
...                               "columns": None,
...                               "rows": None}}]}
>>> y = pd.Series([1, 1, 1, 2, 2, 3, 4, 4, 5, 5, 5])
>>> assert target_check.validate(None, y) == {'warnings': [], 'errors': [], 'actions': []}
>>> y = pd.Series(pd.date_range('1/1/21', periods=10))
>>> assert target_check.validate(None, y) == {
...     'warnings': [],
...     'errors': [{'message': 'Target is unsupported datetime type. Valid Woodwork logical types include: integer, double',
...                 'data_check_name': 'TargetDistributionDataCheck',
...                 'level': 'error',
...                 'details': {'columns': None, 'rows': None, 'unsupported_type': 'datetime'},
...                 'code': 'TARGET_UNSUPPORTED_TYPE'}],
...     'actions': []}
class evalml.data_checks.TargetLeakageDataCheck(pct_corr_threshold=0.95, method='mutual')[source]

Check if any of the features are highly correlated with the target by using mutual information or Pearson correlation.

If method=’mutual’, this data check uses mutual information and supports all target and feature types. Otherwise, if method=’pearson’, it uses Pearson correlation and only supports binary with numeric and boolean dtypes. Pearson correlation returns a value in [-1, 1], while mutual information returns a value in [0, 1].

Parameters
  • pct_corr_threshold (float) – The correlation threshold to be considered leakage. Defaults to 0.95.

  • method (string) – The method to determine correlation. Use ‘mutual’ for mutual information, otherwise ‘pearson’ for Pearson correlation. Defaults to ‘mutual’.

Methods

name

Return a name describing the data check.

validate

Check if any of the features are highly correlated with the target by using mutual information or Pearson correlation.

name(cls)

Return a name describing the data check.

validate(self, X, y)[source]

Check if any of the features are highly correlated with the target by using mutual information or Pearson correlation.

If method=’mutual’, supports all target and feature types. Otherwise, if method=’pearson’ only supports binary with numeric and boolean dtypes. Pearson correlation returns a value in [-1, 1], while mutual information returns a value in [0, 1].

Parameters
  • X (pd.DataFrame, np.ndarray) – The input features to check.

  • y (pd.Series, np.ndarray) – The target data.

Returns

dict with a DataCheckWarning if target leakage is detected.

Return type

dict (DataCheckWarning)

Examples

>>> import pandas as pd

Any columns that are strongly correlated with the target will raise a warning. This could be indicative of data leakage.

>>> X = pd.DataFrame({
...    'leak': [10, 42, 31, 51, 61],
...    'x': [42, 54, 12, 64, 12],
...    'y': [13, 5, 13, 74, 24],
... })
>>> y = pd.Series([10, 42, 31, 51, 40])
...
>>> target_leakage_check = TargetLeakageDataCheck(pct_corr_threshold=0.95)
>>> assert target_leakage_check.validate(X, y) == {
...     "warnings": [{"message": "Column 'leak' is 95.0% or more correlated with the target",
...                   "data_check_name": "TargetLeakageDataCheck",
...                   "level": "warning",
...                   "code": "TARGET_LEAKAGE",
...                   "details": {"columns": ["leak"], "rows": None}}],
...     "errors": [],
...     "actions": [{"code": "DROP_COL",
...                  "data_check_name": "TargetLeakageDataCheck",
...                  "metadata": {"columns": ["leak"], "rows": None}}]}

The method can be changed from the default, mutual information, to Pearson correlation.

>>> X['x'] = y / 2
>>> target_leakage_check = TargetLeakageDataCheck(pct_corr_threshold=0.8, method='pearson')
>>> assert target_leakage_check.validate(X, y) == {
...     'warnings': [{'message': "Columns 'leak', 'x' are 80.0% or more correlated with the target",
...                   'data_check_name': 'TargetLeakageDataCheck',
...                   'level': 'warning',
...                   'details': {'columns': ['leak', 'x'], 'rows': None},
...                   'code': 'TARGET_LEAKAGE'}],
...     'errors': [],
...     'actions': [{'code': 'DROP_COL',
...                  "data_check_name": "TargetLeakageDataCheck",
...                  'metadata': {'columns': ['leak', 'x'], 'rows': None}}]}
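
The Pearson variant of the check can be illustrated with a small pure-Python sketch. This is not EvalML's implementation: the helper names pearson and flag_correlated_columns are hypothetical, and comparing the absolute correlation against the threshold with >= is an assumption based on the "80.0% or more" wording of the warning message. The sketch also assumes that no column is constant.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    # Assumes neither sequence is constant (non-zero variance).
    return cov / math.sqrt(var_x * var_y)


def flag_correlated_columns(features, target, threshold):
    """Return the names of columns whose |correlation| with the target meets the threshold."""
    # Using the absolute value and >= is an assumption, not confirmed library behavior.
    return [
        name for name, values in features.items()
        if abs(pearson(values, target)) >= threshold
    ]


target = [10, 42, 31, 51, 40]
features = {
    "leak": [10, 42, 31, 51, 61],
    "x": [v / 2 for v in target],  # exactly proportional to the target
    "y": [13, 5, 13, 74, 24],
}
flagged = flag_correlated_columns(features, target, threshold=0.8)
```

With the data from the example above, the sketch flags the same two columns, 'leak' and 'x', while 'y' stays below the 0.8 threshold.
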
class evalml.data_checks.TimeSeriesParametersDataCheck(problem_configuration, n_splits)[source]

Checks whether the time series parameters are compatible with data splitting.

If gap + max_delay + forecast_horizon >= X.shape[0] // (n_splits + 1), then the feature engineering window is at least as large as the smallest split. The pipeline would then try to create features from data that does not exist, which causes errors.

Parameters
  • problem_configuration (dict) – Dict containing problem_configuration parameters.

  • n_splits (int) – Number of time series splits.

Methods

name

Return a name describing the data check.

validate

Check if the time series parameters are compatible with data splitting.

name(cls)

Return a name describing the data check.

validate(self, X, y=None)[source]

Check if the time series parameters are compatible with data splitting.

Parameters
  • X (pd.DataFrame, np.ndarray) – Features.

  • y (pd.Series, np.ndarray) – Ignored. Defaults to None.

Returns

dict with a DataCheckError if parameters are too big for the split sizes.

Return type

dict

Examples

>>> import pandas as pd

The time series parameters have to be compatible with the data passed. If the window size (gap + max_delay + forecast_horizon) is greater than or equal to the split size, then an error will be raised.

>>> X = pd.DataFrame({
...    'dates': pd.date_range("1/1/21", periods=100),
...    'first': [i for i in range(100)],
... })
>>> y = pd.Series([i for i in range(100)])
...
>>> problem_config = {"gap": 7, "max_delay": 2, "forecast_horizon": 12, "time_index": "dates"}
>>> ts_parameters_check = TimeSeriesParametersDataCheck(problem_configuration=problem_config, n_splits=4)
>>> assert ts_parameters_check.validate(X, y) == {
...     "warnings": [],
...     "errors": [{"message": "Since the data has 100 observations and n_splits=4, the smallest "
...                            "split would have 20 observations. Since 21 (gap + max_delay + forecast_horizon)"
...                            " >= 20, then at least one of the splits would be empty by the time it reaches "
...                            "the pipeline. Please use a smaller number of splits, reduce one or more these "
...                            "parameters, or collect more data.",
...                 "data_check_name": "TimeSeriesParametersDataCheck",
...                 "level": "error",
...                 "code": "TIMESERIES_PARAMETERS_NOT_COMPATIBLE_WITH_SPLIT",
...                 "details": {'columns': None,
...                             'rows': None,
...                             'max_window_size': 21,
...                             'min_split_size': 20}}],
...     "actions": []}
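
The arithmetic behind the error message can be reproduced by hand. The sketch below is a minimal illustration, not the library's code: window_fits_split is a hypothetical helper, and the >= comparison follows the error message in the example above.

```python
def window_fits_split(gap, max_delay, forecast_horizon, n_obs, n_splits):
    """Compare the feature engineering window with the smallest split.

    Returns (is_valid, max_window_size, min_split_size).
    """
    max_window_size = gap + max_delay + forecast_horizon
    # n_splits splits divide the data into n_splits + 1 segments.
    min_split_size = n_obs // (n_splits + 1)
    # Invalid when the window is greater than or equal to the smallest split.
    is_valid = max_window_size < min_split_size
    return is_valid, max_window_size, min_split_size


result = window_fits_split(gap=7, max_delay=2, forecast_horizon=12, n_obs=100, n_splits=4)
```

For the configuration in the example, the window is 7 + 2 + 12 = 21 and the smallest split is 100 // 5 = 20, so the parameters are not compatible with the split.
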
class evalml.data_checks.TimeSeriesSplittingDataCheck(problem_type, n_splits)[source]

Checks whether the time series target data is compatible with splitting.

For time series classification problems, every training and validation split must contain at least one instance of each target class. If any split is missing a class, the estimators cannot train on all potential outcomes, which causes errors during prediction.

Parameters
  • problem_type (str or ProblemTypes) – Problem type.

  • n_splits (int) – Number of time series splits.

Methods

name

Return a name describing the data check.

validate

Check if the training and validation targets are compatible with time series data splitting.

name(cls)

Return a name describing the data check.

validate(self, X, y)[source]

Check if the training and validation targets are compatible with time series data splitting.

Parameters
  • X (pd.DataFrame, np.ndarray) – Ignored. Features.

  • y (pd.Series, np.ndarray) – Target data.

Returns

dict with a DataCheckError if splitting would result in inadequate class representation.

Return type

dict

Example

>>> import pandas as pd

Passing n_splits as 3 means that the data will be segmented into 4 parts to be iterated over for training and validation splits. The first split results in training indices of [0:25] and validation indices of [25:50]. The training indices of the first split result in only one unique value (0). The third split results in training indices of [0:75] and validation indices of [75:100]. The validation indices of the third split result in only one unique value (1).

>>> X = None
>>> y = pd.Series([0 if i < 45 else i % 2 if i < 55 else 1 for i in range(100)])
...
>>> ts_splitting_check = TimeSeriesSplittingDataCheck("time series binary", 3)
>>> assert ts_splitting_check.validate(X, y) == {
...     "errors": [{'message': 'Time Series Binary and Time Series Multiclass problem '
...                             'types require every training and validation split to '
...                             'have at least one instance of all the target classes. '
...                             'The following splits are invalid: [1, 3]',
...                  'data_check_name': 'TimeSeriesSplittingDataCheck',
...                  'level': 'error',
...                  'details': {'columns': None, 'rows': None, 'invalid_splits': {1: {"Training": [0, 25]},
...                                                                                3: {"Validation": [75, 100]}}},
...                  'code': 'TIMESERIES_TARGET_NOT_COMPATIBLE_WITH_SPLIT'}],
...     "warnings": [],
...     "actions": []}
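
The split boundaries described in this example can be reproduced with a short pure-Python sketch. It assumes an expanding-window scheme in which the data is cut into n_splits + 1 equal segments, as described above. The helper names expanding_window_splits and find_invalid_splits are hypothetical and do not come from EvalML.

```python
def expanding_window_splits(n_obs, n_splits):
    """Yield (train_range, validation_range) index pairs for an expanding window."""
    fold_size = n_obs // (n_splits + 1)
    first_validation_start = n_obs - n_splits * fold_size
    for start in range(first_validation_start, n_obs, fold_size):
        # Training always begins at index 0 and grows with every split.
        yield (0, start), (start, start + fold_size)


def find_invalid_splits(y, n_splits):
    """Map 1-based split numbers to the parts that are missing a target class."""
    n_classes = len(set(y))
    invalid = {}
    splits = expanding_window_splits(len(y), n_splits)
    for number, (train, valid) in enumerate(splits, start=1):
        problems = {}
        if len(set(y[train[0]:train[1]])) < n_classes:
            problems["Training"] = list(train)
        if len(set(y[valid[0]:valid[1]])) < n_classes:
            problems["Validation"] = list(valid)
        if problems:
            invalid[number] = problems
    return invalid


y = [0 if i < 45 else i % 2 if i < 55 else 1 for i in range(100)]
invalid_splits = find_invalid_splits(y, n_splits=3)
```

For the target used above, the sketch reports the same invalid splits as the check: the training part of split 1 contains only class 0, and the validation part of split 3 contains only class 1.
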
class evalml.data_checks.UniquenessDataCheck(problem_type, threshold=0.5)[source]

Check if there are any columns in the input that are either too unique for classification problems or not unique enough for regression problems.

Parameters
  • problem_type (str or ProblemTypes) – The specific problem type to data check for, e.g. ‘binary’, ‘multiclass’, ‘regression’, ‘time series regression’.

  • threshold (float) – The threshold to set as an upper bound on uniqueness for classification type problems or as a lower bound for regression type problems. Defaults to 0.50.

Methods

name

Return a name describing the data check.

uniqueness_score

Calculate a uniqueness score for the provided field. NaN values are not considered as unique values in the calculation.

validate

Check if there are any columns in the input that are too unique in the case of classification problems or not unique enough in the case of regression problems.

name(cls)

Return a name describing the data check.

static uniqueness_score(col, drop_na=True)[source]

Calculate a uniqueness score for the provided field. NaN values are not considered as unique values in the calculation.

Based on the Herfindahl–Hirschman Index.

Parameters
  • col (pd.Series) – Feature values.

  • drop_na (bool) – Whether to drop null values when computing the uniqueness score. Defaults to True.

Returns

Uniqueness score.

Return type

(float)
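
One way to read "based on the Herfindahl–Hirschman Index" is as one minus the sum of squared value frequencies. The sketch below assumes that reading; it is consistent with the scores shown in the examples for this class (0.65625, 0.0 and 0.99), but it is not taken from the library source. The name uniqueness_score_sketch is hypothetical, and null values are assumed to have been dropped beforehand.

```python
from collections import Counter


def uniqueness_score_sketch(values):
    """Return 1 minus the sum of squared value frequencies (1 - HHI)."""
    counts = Counter(values)
    total = sum(counts.values())
    # Each distinct value contributes the square of its share of the column.
    return 1 - sum((count / total) ** 2 for count in counts.values())


score = uniqueness_score_sketch([1, 1, 1, 2, 2, 3, 3, 3])
```

Under this reading, a constant column scores 0 because a single value holds the entire share, and a column of 100 distinct values scores 1 - 100 * (1/100)^2 = 0.99.
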

validate(self, X, y=None)[source]

Check if there are any columns in the input that are too unique in the case of classification problems or not unique enough in the case of regression problems.

Parameters
  • X (pd.DataFrame, np.ndarray) – Features.

  • y (pd.Series, np.ndarray) – Ignored. Defaults to None.

Returns

dict with a DataCheckWarning if there are any too unique or not unique enough columns.

Return type

dict

Examples

>>> import pandas as pd

Because the problem type is regression, the column “regression_not_unique_enough” raises a warning for having just one value.

>>> df = pd.DataFrame({
...    'regression_unique_enough': [float(x) for x in range(100)],
...    'regression_not_unique_enough': [float(1) for x in range(100)]
... })
...
>>> uniqueness_check = UniquenessDataCheck(problem_type="regression", threshold=0.8)
>>> assert uniqueness_check.validate(df) == {
...     "errors": [],
...     "warnings": [{"message": "Input columns 'regression_not_unique_enough' for regression problem type are not unique enough.",
...                   "data_check_name": "UniquenessDataCheck",
...                   "level": "warning",
...                   "code": "NOT_UNIQUE_ENOUGH",
...                   "details": {"columns": ["regression_not_unique_enough"], "uniqueness_score": {"regression_not_unique_enough": 0.0}, "rows": None}}],
...     "actions": [{"code": "DROP_COL",
...                  "data_check_name": 'UniquenessDataCheck',
...                  "metadata": {"columns": ["regression_not_unique_enough"], "rows": None}}]}

For multiclass, the column “regression_unique_enough” has too many unique values and will raise an appropriate warning.

>>> uniqueness_check = UniquenessDataCheck(problem_type="multiclass", threshold=0.8)
>>> assert uniqueness_check.validate(df) == {
...     'warnings': [{'message': "Input columns 'regression_unique_enough' for multiclass problem type are too unique.",
...                   'data_check_name': 'UniquenessDataCheck',
...                   'level': 'warning',
...                   'details': {'columns': ['regression_unique_enough'],
...                               'rows': None,
...                               'uniqueness_score': {'regression_unique_enough': 0.99}},
...                   'code': 'TOO_UNIQUE'}],
...     'errors': [],
...     'actions': [{'code': 'DROP_COL',
...                  'data_check_name': 'UniquenessDataCheck',
...                  'metadata': {'columns': ['regression_unique_enough'], 'rows': None}}]}
>>> y = pd.Series([1, 1, 1, 2, 2, 3, 3, 3])
>>> assert UniquenessDataCheck.uniqueness_score(y) == 0.65625