Data Checks¶

Data checks.

Package Contents¶

Classes Summary¶

`ClassImbalanceDataCheck`	Check if any of the target labels are imbalanced, or if the number of values for each target is below 2 times the number of CV folds. Use for classification problems.
`DataCheck`	Base class for all data checks.
`DataCheckAction`	A recommended action returned by a DataCheck.
`DataCheckActionCode`	Enum for data check action code.
`DataCheckError`	DataCheckMessage subclass for errors returned by data checks.
`DataCheckMessage`	Base class for a message returned by a DataCheck, tagged by name.
`DataCheckMessageCode`	Enum for data check message code.
`DataCheckMessageType`	Enum for type of data check message: WARNING or ERROR.
`DataChecks`	A collection of data checks.
`DataCheckWarning`	DataCheckMessage subclass for warnings returned by data checks.
`DateTimeFormatDataCheck`	Check if the datetime column has equally spaced intervals and is monotonically increasing or decreasing in order to be supported by time series estimators.
`DateTimeNaNDataCheck`	Check each datetime column in the input and issue an error if NaN values are present.
`DefaultDataChecks`	A collection of basic data checks that is used by AutoML by default.
`HighlyNullDataCheck`	Check if there are any highly-null columns and rows in the input.
`IDColumnsDataCheck`	Check if any of the features are likely to be ID columns.
`InvalidTargetDataCheck`	Check if the target data contains missing or invalid values.
`MulticollinearityDataCheck`	Check if any set of features are likely to be multicollinear.
`NaturalLanguageNaNDataCheck`	Check each natural language column in the input and issue an error if NaN values are present.
`NoVarianceDataCheck`	Check if the target or any of the features have no variance.
`OutliersDataCheck`	Checks if there are any outliers in input data by using IQR to determine score anomalies.
`SparsityDataCheck`	Check if there are any columns with sparsely populated values in the input.
`TargetDistributionDataCheck`	Check if the target data contains certain distributions that may need to be transformed prior to training to improve model performance. Uses the Shapiro-Wilk test when the dataset has <=5000 samples, otherwise uses Jarque-Bera.
`TargetLeakageDataCheck`	Check if any of the features are highly correlated with the target by using mutual information or Pearson correlation.
`UniquenessDataCheck`	Check if there are any columns in the input that are either too unique for classification problems or not unique enough for regression problems.

Contents¶

class evalml.data_checks.ClassImbalanceDataCheck(threshold=0.1, min_samples=100, num_cv_folds=3)[source]¶

Check if any of the target labels are imbalanced, or if the number of values for each target is below 2 times the number of CV folds. Use for classification problems.

Parameters

threshold (float) – The minimum threshold allowed for class imbalance before a warning is raised. The value compared against this threshold is the number of samples in each class divided by the sum of samples in that class and the majority class. For example, a multiclass case with [900, 900, 100] samples for classes 0, 1, and 2, respectively, would have a ratio of 0.10 for class 2 (100 / (900 + 100)). Defaults to 0.10.
min_samples (int) – The minimum number of samples per accepted class. If the minority class is both below the threshold and min_samples, then we consider this severely imbalanced. Must be greater than 0. Defaults to 100.
num_cv_folds (int) – The number of cross-validation folds. Must be positive. Choose 0 to ignore this warning. Defaults to 3.
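
The ratio described for `threshold` can be sketched without evalml. This is a minimal illustration, not the library's implementation: the helper name `imbalance_ratios` is hypothetical, and treating only ratios strictly below the threshold as flagged is an assumption.

```python
from collections import Counter


def imbalance_ratios(y):
    """Return {label: count / (count + majority_count)} (hypothetical helper)."""
    counts = Counter(y)
    majority = max(counts.values())
    return {label: n / (n + majority) for label, n in counts.items()}


# The multiclass case from the parameter description: 900, 900 and 100 samples.
multiclass_ratios = imbalance_ratios([0] * 900 + [1] * 900 + [2] * 100)

# A binary target with one sample of class 0 and ten samples of class 1.
binary_ratios = imbalance_ratios([0] + [1] * 10)
# Assumption: a label is flagged when its ratio is strictly below the threshold.
flagged = [label for label, ratio in binary_ratios.items() if ratio < 0.10]
```

Here class 2 in the multiclass case gets a ratio of 100 / (100 + 900), and class 0 in the binary case gets 1 / 11, which falls under a threshold of 0.10.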

Methods

`name`	Return a name describing the data check.
`validate`	Check if any target labels are imbalanced beyond a threshold for binary and multiclass problems.

name(cls)¶: Return a name describing the data check.

validate(self, X, y)[source]¶

Check if any target labels are imbalanced beyond a threshold for binary and multiclass problems.

Ignores NaN values in target labels if they appear.

Parameters

X (pd.DataFrame, np.ndarray) – Features. Ignored.
y (pd.Series, np.ndarray) – Target labels to check for imbalanced data.

Returns

Dictionary with DataCheckWarnings if imbalance in classes is less than the threshold, and DataCheckErrors if the number of values for each target is below 2 * num_cv_folds.

Return type

dict

Example

>>> import pandas as pd
>>> X = pd.DataFrame()
>>> y = pd.Series([0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
>>> target_check = ClassImbalanceDataCheck(threshold=0.10)
>>> assert target_check.validate(X, y) == {"errors": [{"message": "The number of instances of these targets is less than 2 * the number of cross folds = 6 instances: [0]",
...                                                    "data_check_name": "ClassImbalanceDataCheck",
...                                                    "level": "error",
...                                                    "code": "CLASS_IMBALANCE_BELOW_FOLDS",
...                                                    "details": {"target_values": [0], "rows": None, "columns": None}}],
...                                      "warnings": [{"message": "The following labels fall below 10% of the target: [0]",
...                                                    "data_check_name": "ClassImbalanceDataCheck",
...                                                    "level": "warning",
...                                                    "code": "CLASS_IMBALANCE_BELOW_THRESHOLD",
...                                                    "details": {"target_values": [0], "rows": None, "columns": None}},
...                                                    {"message": "The following labels in the target have severe class imbalance because they fall under 10% of the target and have less than 100 samples: [0]",
...                                                    "data_check_name": "ClassImbalanceDataCheck",
...                                                    "level": "warning",
...                                                    "code": "CLASS_IMBALANCE_SEVERE",
...                                                    "details": {"target_values": [0], "rows": None, "columns": None}}],
...                                      "actions": []}

class evalml.data_checks.DataCheck[source]¶

Base class for all data checks.

Data checks are a set of heuristics used to determine if there are problems with input data.

Methods

`name`	Return a name describing the data check.
`validate`	Inspect and validate the input data, run any necessary calculations or algorithms, and return a list of warnings and errors if applicable.

name(cls)¶: Return a name describing the data check.

abstract validate(self, X, y=None)[source]¶

Inspect and validate the input data, run any necessary calculations or algorithms, and return a list of warnings and errors if applicable.

Parameters

X (pd.DataFrame) – The input data of shape [n_samples, n_features]
y (pd.Series, optional) – The target data of length [n_samples]

Returns

Dictionary of DataCheckError and DataCheckWarning messages

Return type

dict (DataCheckMessage)

class evalml.data_checks.DataCheckAction(action_code, metadata=None)[source]¶

A recommended action returned by a DataCheck.

Parameters

action_code (DataCheckActionCode) – Action code associated with the action.
metadata (dict, optional) – Additional useful information associated with the action. Defaults to None.

Methods

`to_dict`	Return a dictionary form of the data check action.

to_dict(self)[source]¶: Return a dictionary form of the data check action.

class evalml.data_checks.DataCheckActionCode[source]¶

Enum for data check action code.

Attributes

DROP_COL	Action code for dropping a column.
DROP_ROWS	Action code for dropping rows.
IMPUTE_COL	Action code for imputing a column.
TRANSFORM_TARGET	Action code for transforming the target data.

Methods

`name`	The name of the Enum member.
`value`	The value of the Enum member.

name(self)¶: The name of the Enum member.

value(self)¶: The value of the Enum member.

class evalml.data_checks.DataCheckError(message, data_check_name, message_code=None, details=None)[source]¶

DataCheckMessage subclass for errors returned by data checks.

Attributes

message_type	DataCheckMessageType.ERROR

Methods

`to_dict`	Return a dictionary form of the data check message.

to_dict(self)¶: Return a dictionary form of the data check message.

class evalml.data_checks.DataCheckMessage(message, data_check_name, message_code=None, details=None)[source]¶

Base class for a message returned by a DataCheck, tagged by name.

Parameters

message (str) – Message string.
data_check_name (str) – Name of data check.
message_code (DataCheckMessageCode) – Message code associated with message. Defaults to None.
details (dict) – Additional useful information associated with the message. Defaults to None.

Attributes

message_type	None

Methods

`to_dict`	Return a dictionary form of the data check message.

to_dict(self)[source]¶: Return a dictionary form of the data check message.

class evalml.data_checks.DataCheckMessageCode[source]¶

Enum for data check message code.

Attributes

CLASS_IMBALANCE_BELOW_FOLDS	Message code for when the number of values for each target is below 2 * number of CV folds.
CLASS_IMBALANCE_BELOW_THRESHOLD	Message code for when balance in classes is less than the threshold.
CLASS_IMBALANCE_SEVERE	Message code for when balance in classes is less than the threshold and minimum class is less than minimum number of accepted samples.
DATETIME_HAS_NAN	Message code for when input datetime columns contain NaN values.
DATETIME_HAS_UNEVEN_INTERVALS	Message code for when the datetime values have uneven intervals.
DATETIME_INFORMATION_NOT_FOUND	Message code for when datetime information can not be found or is in an unaccepted format.
DATETIME_IS_NOT_MONOTONIC	Message code for when the datetime values are not monotonically increasing.
HAS_ID_COLUMN	Message code for data that has ID columns.
HAS_OUTLIERS	Message code for when outliers are detected.
HIGH_VARIANCE	Message code for when high variance is detected for cross-validation.
HIGHLY_NULL_COLS	Message code for highly null columns.
HIGHLY_NULL_ROWS	Message code for highly null rows.
IS_MULTICOLLINEAR	Message code for when data is potentially multicollinear.
MISMATCHED_INDICES	Message code for when input target and features have mismatched indices.
MISMATCHED_INDICES_ORDER	Message code for when input target and features have mismatched indices order. The two inputs have the same index values, but shuffled.
MISMATCHED_LENGTHS	Message code for when input target and features have different lengths.
NATURAL_LANGUAGE_HAS_NAN	Message code for when input natural language columns contain NaN values.
NO_VARIANCE	Message code for when data has no variance (1 unique value).
NO_VARIANCE_WITH_NULL	Message code for when data has one unique value and NaN values.
NOT_UNIQUE_ENOUGH	Message code for when data does not possess enough unique values.
TARGET_BINARY_NOT_TWO_UNIQUE_VALUES	Message code for target data for a binary classification problem that does not have two unique values.
TARGET_HAS_NULL	Message code for target data that has null values.
TARGET_INCOMPATIBLE_OBJECTIVE	Message code for target data that has incompatible values for the specified objective.
TARGET_IS_EMPTY_OR_FULLY_NULL	Message code for target data that is empty or has all null values.
TARGET_IS_NONE	Message code for when target is None.
TARGET_LEAKAGE	Message code for when target leakage is detected.
TARGET_LOGNORMAL_DISTRIBUTION	Message code for target data with a lognormal distribution.
TARGET_MULTICLASS_HIGH_UNIQUE_CLASS	Message code for target data for a multiclass classification problem that has an abnormally large number of unique classes relative to the number of target values.
TARGET_MULTICLASS_NOT_ENOUGH_CLASSES	Message code for target data for a multiclass classification problem that does not have more than two unique classes.
TARGET_MULTICLASS_NOT_TWO_EXAMPLES_PER_CLASS	Message code for target data for a multiclass classification problem that does not have two examples per class.
TARGET_UNSUPPORTED_PROBLEM_TYPE	Message code for target data that is being checked against an unsupported problem type.
TARGET_UNSUPPORTED_TYPE	Message code for target data that is of an unsupported type.
TOO_SPARSE	Message code for when multiclass data has values that are too sparsely populated.
TOO_UNIQUE	Message code for when data possesses too many unique values.

Methods

`name`	The name of the Enum member.
`value`	The value of the Enum member.

name(self)¶: The name of the Enum member.

value(self)¶: The value of the Enum member.

class evalml.data_checks.DataCheckMessageType[source]¶

Enum for type of data check message: WARNING or ERROR.

Attributes

ERROR	Error message returned by a data check.
WARNING	Warning message returned by a data check.

Methods

`name`	The name of the Enum member.
`value`	The value of the Enum member.

name(self)¶: The name of the Enum member.

value(self)¶: The value of the Enum member.

class evalml.data_checks.DataChecks(data_checks=None, data_check_params=None)[source]¶

A collection of data checks.

Parameters

data_checks (list (DataCheck)) – List of DataCheck objects.
data_check_params (dict) – Parameters for passed DataCheck objects.

Methods

`validate`	Inspect and validate the input data against data checks and return a list of warnings and errors if applicable.

validate(self, X, y=None)[source]¶

Inspect and validate the input data against data checks and return a list of warnings and errors if applicable.

Parameters

X (pd.DataFrame, np.ndarray) – The input data of shape [n_samples, n_features]
y (pd.Series, np.ndarray) – The target data of length [n_samples]

Returns

Dictionary containing DataCheckMessage objects

Return type

dict
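
The examples in this reference show each check returning a dictionary with "warnings", "errors" and "actions" lists. One way a collection could combine such dictionaries is sketched below. `merge_results` is a hypothetical helper written for illustration; it is not part of evalml, and the source does not say how `DataChecks` merges results internally.

```python
def merge_results(results):
    """Concatenate per-check result dicts into one dict (hypothetical helper)."""
    merged = {"warnings": [], "errors": [], "actions": []}
    for result in results:
        for key in merged:
            # A key missing from one check's result is treated as an empty list.
            merged[key].extend(result.get(key, []))
    return merged


# Two abbreviated, hand-written results in the shape used by the examples.
check_a = {"warnings": [{"code": "HAS_ID_COLUMN"}], "errors": [], "actions": [{"code": "DROP_COL"}]}
check_b = {"warnings": [], "errors": [{"code": "TARGET_HAS_NULL"}], "actions": []}
combined = merge_results([check_a, check_b])
```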

class evalml.data_checks.DataCheckWarning(message, data_check_name, message_code=None, details=None)[source]¶

DataCheckMessage subclass for warnings returned by data checks.

Attributes

message_type	DataCheckMessageType.WARNING

Methods

`to_dict`	Return a dictionary form of the data check message.

to_dict(self)¶: Return a dictionary form of the data check message.

class evalml.data_checks.DateTimeFormatDataCheck(datetime_column='index')[source]¶

Check if the datetime column has equally spaced intervals and is monotonically increasing or decreasing in order to be supported by time series estimators.

Parameters: datetime_column (str, int) – The name of the datetime column. If the datetime values are in the index, then pass “index”.

Methods

`name`	Return a name describing the data check.
`validate`	Checks if the target data has equal intervals and is sorted.

name(cls)¶: Return a name describing the data check.

validate(self, X, y)[source]¶

Checks if the target data has equal intervals and is sorted.

Parameters

X (pd.DataFrame, np.ndarray) – Features.
y (pd.Series, np.ndarray) – Target data.

Returns

Dictionary with DataCheckErrors if unequal intervals are found in the datetime column.

Return type

dict (DataCheckError)

Example

>>> import pandas as pd
>>> X = pd.DataFrame(pd.date_range("2021-01-01", periods=9).append(pd.date_range("2021-01-31", periods=1)), columns=["dates"])
>>> y = pd.Series([0, 1, 0, 1, 1, 0, 0, 0, 1, 0])
>>> datetime_format_check = DateTimeFormatDataCheck(datetime_column="dates")
>>> assert datetime_format_check.validate(X, y) == {
...     "errors": [{"message": "No frequency could be detected in dates, possibly due to uneven intervals.",
...                 "data_check_name": "DateTimeFormatDataCheck",
...                 "level": "error",
...                 "code": "DATETIME_HAS_UNEVEN_INTERVALS",
...                 "details": {"columns": None, "rows": None}
...                 }],
...     "warnings": [],
...     "actions": []}
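
The two conditions this check looks for, equal spacing and monotonic order, can be sketched with the standard library alone. The helper name `is_evenly_spaced_and_monotonic` is hypothetical, and this is not how evalml detects frequency (the error message above suggests it relies on frequency inference).

```python
from datetime import datetime, timedelta


def is_evenly_spaced_and_monotonic(timestamps):
    """True if consecutive timestamps differ by one constant, non-zero step (hypothetical helper)."""
    deltas = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    if not deltas:
        return True
    # One distinct non-zero step implies equal spacing and a strictly
    # increasing or strictly decreasing order.
    return len(set(deltas)) == 1 and deltas[0] != timedelta(0)


start = datetime(2021, 1, 1)
even = [start + timedelta(days=i) for i in range(10)]
# Same shape as the example data: nine consecutive days, then 2021-01-31.
uneven = even[:9] + [datetime(2021, 1, 31)]
```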

class evalml.data_checks.DateTimeNaNDataCheck[source]¶

Check each datetime column in the input and issue an error if NaN values are present.

Methods

`name`	Return a name describing the data check.
`validate`	Check if any datetime columns contain NaN values.

name(cls)¶: Return a name describing the data check.

validate(self, X, y=None)[source]¶

Check if any datetime columns contain NaN values.

Parameters

X (pd.DataFrame, np.ndarray) – Features.
y (pd.Series, np.ndarray) – Ignored. Defaults to None.

Returns

dict with a DataCheckError if NaN values are present in datetime columns.

Return type

dict

Example

>>> import pandas as pd
>>> import woodwork as ww
>>> import numpy as np
>>> dates = np.arange(np.datetime64('2017-01-01'), np.datetime64('2017-01-08'))
>>> dates[0] = np.datetime64('NaT')
>>> df = pd.DataFrame(dates, columns=['index'])
>>> df.ww.init()
>>> dt_nan_check = DateTimeNaNDataCheck()
>>> assert dt_nan_check.validate(df) == {"warnings": [],
...                                             "actions": [],
...                                             "errors": [DataCheckError(message='Input datetime column(s) (index) contains NaN values. Please impute NaN values or drop these rows or columns.',
...                                                                     data_check_name=DateTimeNaNDataCheck.name,
...                                                                     message_code=DataCheckMessageCode.DATETIME_HAS_NAN,
...                                                                     details={"columns": ['index'], "rows": None}).to_dict()]}

class evalml.data_checks.DefaultDataChecks(problem_type, objective, n_splits=3, datetime_column=None)[source]¶

A collection of basic data checks that is used by AutoML by default.

Includes:

HighlyNullDataCheck

HighlyNullRowsDataCheck

IDColumnsDataCheck

TargetLeakageDataCheck

InvalidTargetDataCheck

NoVarianceDataCheck

ClassImbalanceDataCheck (for classification problem types)

DateTimeNaNDataCheck

NaturalLanguageNaNDataCheck

TargetDistributionDataCheck (for regression problem types)

DateTimeFormatDataCheck (for time series problem types)

Parameters

problem_type (str) – The problem type that is being validated. Can be regression, binary, or multiclass.
objective (str or ObjectiveBase) – Name or instance of the objective class.
n_splits (int) – The number of splits as determined by the data splitter being used. Defaults to 3.
datetime_column (str) – The name of the column containing datetime information to be used for time series problems. Pass "index" to indicate that the datetime information is in the index of X or y. Defaults to None.

Methods

`validate`	Inspect and validate the input data against data checks and return a list of warnings and errors if applicable.

validate(self, X, y=None)¶

Inspect and validate the input data against data checks and return a list of warnings and errors if applicable.

Parameters

X (pd.DataFrame, np.ndarray) – The input data of shape [n_samples, n_features]
y (pd.Series, np.ndarray) – The target data of length [n_samples]

Returns

Dictionary containing DataCheckMessage objects

Return type

dict

class evalml.data_checks.HighlyNullDataCheck(pct_null_col_threshold=0.95, pct_null_row_threshold=0.95)[source]¶

Check if there are any highly-null columns and rows in the input.

Parameters

pct_null_col_threshold (float) – If the percentage of NaN values in an input feature exceeds this amount, that column will be considered highly-null. Defaults to 0.95.
pct_null_row_threshold (float) – If the percentage of NaN values in an input row exceeds this amount, that row will be considered highly-null. Defaults to 0.95.
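
The quantities these two thresholds are compared against are the fraction of missing entries per column and per row. A dependency-free sketch follows, using the same data as the class example; the helper names are hypothetical and `None` stands in for NaN.

```python
def null_fraction_by_column(columns):
    """Map each column name to its fraction of None entries (hypothetical helper)."""
    return {
        name: sum(value is None for value in values) / len(values)
        for name, values in columns.items()
    }


def null_fraction_by_row(columns):
    """Return the fraction of None entries in each row (hypothetical helper)."""
    rows = zip(*columns.values())
    return [sum(value is None for value in row) / len(row) for row in rows]


data = {"lots_of_null": [None, None, None, None, 5], "no_null": [1, 2, 3, 4, 5]}
col_fractions = null_fraction_by_column(data)
row_fractions = null_fraction_by_row(data)
```

With this data, `lots_of_null` is 80% null and the first four rows are each 50% null.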

Methods

`name`	Return a name describing the data check.
`validate`	Check if there are any highly-null columns or rows in the input.

name(cls)¶: Return a name describing the data check.

validate(self, X, y=None)[source]¶

Check if there are any highly-null columns or rows in the input.

Parameters

X (pd.DataFrame, np.ndarray) – Features.
y (pd.Series, np.ndarray) – Ignored. Defaults to None.

Returns

dict with a DataCheckWarning if there are any highly-null columns or rows.

Return type

dict

Example

>>> import pandas as pd
>>> class SeriesWrap():
...     def __init__(self, series):
...         self.series = series
...
...     def __eq__(self, series_2):
...         return all(self.series.eq(series_2.series))
...
>>> df = pd.DataFrame({
...    'lots_of_null': [None, None, None, None, 5],
...    'no_null': [1, 2, 3, 4, 5]
... })
>>> null_check = HighlyNullDataCheck(pct_null_col_threshold=0.50, pct_null_row_threshold=0.50)
>>> validation_results = null_check.validate(df)
>>> validation_results['warnings'][0]['details']['pct_null_cols'] = SeriesWrap(validation_results['warnings'][0]['details']['pct_null_cols'])
>>> highly_null_rows = SeriesWrap(pd.Series([0.5, 0.5, 0.5, 0.5]))
>>> assert validation_results == {
...     "errors": [],
...     "warnings": [{"message": "4 out of 5 rows are more than 50.0% null",
...                   "data_check_name": "HighlyNullDataCheck",
...                   "level": "warning",
...                   "code": "HIGHLY_NULL_ROWS",
...                   "details": {"pct_null_cols": highly_null_rows, "columns": None, "rows": [0, 1, 2, 3]}},
...                  {"message": "Columns 'lots_of_null' are 50.0% or more null",
...                   "data_check_name": "HighlyNullDataCheck",
...                   "level": "warning",
...                   "code": "HIGHLY_NULL_COLS",
...                   "details": {"columns": ["lots_of_null"], "pct_null_rows": {"lots_of_null": 0.8}, "null_row_indices": {"lots_of_null": [0, 1, 2, 3]}, "rows": None}}],
...    "actions": [{"code": "DROP_ROWS", "metadata": {"rows": [0, 1, 2, 3], "columns": None}},
...                {"code": "DROP_COL", "metadata": {"columns": ["lots_of_null"], "rows": None}}]}

class evalml.data_checks.IDColumnsDataCheck(id_threshold=1.0)[source]¶

Check if any of the features are likely to be ID columns.

Parameters: id_threshold (float) – The probability threshold to be considered an ID column. Defaults to 1.0.

Methods

`name`	Return a name describing the data check.
`validate`	Check if any of the features are likely to be ID columns. Currently performs a number of simple checks.

name(cls)¶: Return a name describing the data check.

validate(self, X, y=None)[source]¶

Check if any of the features are likely to be ID columns. Currently performs a number of simple checks.

Checks performed are:

column name is “id”

column name ends in “_id”

column contains all unique values (and is categorical / integer type)
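
The three checks listed above can be sketched as independent signals. This is an illustration only: `id_signals` is a hypothetical helper, case-insensitive name matching is an assumption, and how evalml weighs these signals into the probability compared with `id_threshold` is not covered here.

```python
def id_signals(name, values):
    """Return which of the listed ID heuristics a column matches (hypothetical helper)."""
    signals = set()
    lowered = str(name).lower()  # Assumption: name matching ignores case.
    if lowered == "id":
        signals.add("name_is_id")
    if lowered.endswith("_id"):
        signals.add("name_ends_in_id")
    # Assumption: plain int and str values stand in for integer and categorical types.
    typed = all(isinstance(v, (int, str)) and not isinstance(v, bool) for v in values)
    if typed and len(set(values)) == len(values):
        signals.add("all_unique")
    return signals


columns = {"df_id": [0, 1, 2, 3, 4], "x": [10, 42, 31, 51, 61], "y": [42, 54, 12, 64, 12]}
signals = {name: id_signals(name, values) for name, values in columns.items()}
```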

Parameters

X (pd.DataFrame, np.ndarray) – The input features to check.
y (pd.Series) – The target. Defaults to None. Ignored.

Returns

A dictionary of features with column name or index and their probability of being ID columns

Return type

dict

Example

>>> import pandas as pd
>>> df = pd.DataFrame({
...     'df_id': [0, 1, 2, 3, 4],
...     'x': [10, 42, 31, 51, 61],
...     'y': [42, 54, 12, 64, 12]
... })
>>> id_col_check = IDColumnsDataCheck()
>>> assert id_col_check.validate(df) == {
...     "errors": [],
...     "warnings": [{"message": "Columns 'df_id' are 100.0% or more likely to be an ID column",
...                   "data_check_name": "IDColumnsDataCheck",
...                   "level": "warning",
...                   "code": "HAS_ID_COLUMN",
...                   "details": {"columns": ["df_id"], "rows": None}}],
...     "actions": [{"code": "DROP_COL",
...                  "metadata": {"columns": ["df_id"], "rows": None}}]}

class evalml.data_checks.InvalidTargetDataCheck(problem_type, objective, n_unique=100)[source]¶

Check if the target data contains missing or invalid values.

Parameters

problem_type (str or ProblemTypes) – The specific problem type to data check for, e.g. ‘binary’, ‘multiclass’, ‘regression’, ‘time series regression’.
objective (str or ObjectiveBase) – Name or instance of the objective class.
n_unique (int) – Number of unique target values to store when problem type is binary and target incorrectly has more than 2 unique values. Non-negative integer. If None, stores all unique values. Defaults to 100.

Attributes

multiclass_continuous_threshold	0.05

Methods

`name`	Return a name describing the data check.
`validate`	Check if the target data contains missing or invalid values.

name(cls)¶: Return a name describing the data check.

validate(self, X, y)[source]¶

Check if the target data contains missing or invalid values.

Parameters

X (pd.DataFrame, np.ndarray) – Features. Ignored.
y (pd.Series, np.ndarray) – Target data to check for invalid values.

Returns

Dictionary with DataCheckErrors if any invalid values are found in the target data.

Return type

dict (DataCheckError)

Example

>>> import pandas as pd
>>> X = pd.DataFrame({"col": [1, 2, 3, 1]})
>>> y = pd.Series([0, 1, None, None])
>>> target_check = InvalidTargetDataCheck('binary', 'Log Loss Binary')
>>> assert target_check.validate(X, y) == {
...     "errors": [{"message": "2 row(s) (50.0%) of target values are null",
...                 "data_check_name": "InvalidTargetDataCheck",
...                 "level": "error",
...                 "code": "TARGET_HAS_NULL",
...                 "details": {"num_null_rows": 2, "pct_null_rows": 50, "rows": None, "columns": None}}],
...     "warnings": [],
...     "actions": [{"code": "IMPUTE_COL", "metadata": {"impute_strategy": "most_frequent", "is_target": True, "rows": None, "columns": None}}]}

class evalml.data_checks.MulticollinearityDataCheck(threshold=0.9)[source]¶

Check if any set of features are likely to be multicollinear.

Parameters: threshold (float) – The threshold to be considered. Defaults to 0.9.

Methods

`name`	Return a name describing the data check.
`validate`	Check if any set of features are likely to be multicollinear.

name(cls)¶: Return a name describing the data check.

validate(self, X, y=None)[source]¶

Check if any set of features are likely to be multicollinear.

Parameters

X (pd.DataFrame) – The input features to check.
y (pd.Series) – The target. Ignored.

Returns

dict with a DataCheckWarning if there are any potentially multicollinear columns.

Return type

dict

Example

>>> import pandas as pd
>>> col = pd.Series([1, 0, 2, 3, 4])
>>> X = pd.DataFrame({"col_1": col, "col_2": col * 3})
>>> y = pd.Series([1, 0, 0, 1, 0])
>>> multicollinearity_check = MulticollinearityDataCheck(threshold=0.8)
>>> assert multicollinearity_check.validate(X, y) == {
...     "errors": [],
...     "warnings": [{'message': "Columns are likely to be correlated: [('col_1', 'col_2')]",
...                   "data_check_name": "MulticollinearityDataCheck",
...                   "level": "warning",
...                   "code": "IS_MULTICOLLINEAR",
...                   'details': {"columns": [('col_1', 'col_2')], "rows": None}}],
...     "actions": []}
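
As a rough illustration of threshold-based pair detection, the sketch below flags column pairs whose absolute Pearson correlation meets the threshold. Using Pearson correlation is an assumption made for this sketch; the source does not state which association measure the check uses, and `correlated_pairs` is a hypothetical helper.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlated_pairs(columns, threshold):
    """Return column-name pairs whose absolute correlation meets the threshold (hypothetical helper)."""
    return [
        (a, b)
        for a, b in combinations(columns, 2)
        if abs(pearson(columns[a], columns[b])) >= threshold
    ]


col = [1, 0, 2, 3, 4]
pairs = correlated_pairs({"col_1": col, "col_2": [3 * v for v in col]}, threshold=0.8)
```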

class evalml.data_checks.NaturalLanguageNaNDataCheck[source]¶

Check each natural language column in the input and issue an error if NaN values are present.

Methods

`name`	Return a name describing the data check.
`validate`	Check if any natural language columns contain NaN values.

name(cls)¶: Return a name describing the data check.

validate(self, X, y=None)[source]¶

Check if any natural language columns contain NaN values.

Parameters

X (pd.DataFrame, np.ndarray) – Features.
y (pd.Series, np.ndarray) – Ignored. Defaults to None.

Returns

dict with a DataCheckError if NaN values are present in natural language columns.

Return type

dict

Example

>>> import pandas as pd
>>> import woodwork as ww
>>> import numpy as np
>>> data = pd.DataFrame()
>>> data['A'] = [None, "string_that_is_long_enough_for_natural_language"]
>>> data['B'] = ['string_that_is_long_enough_for_natural_language', 'string_that_is_long_enough_for_natural_language']
>>> data['C'] = np.random.randint(0, 3, size=len(data))
>>> data.ww.init(logical_types={'A': 'NaturalLanguage', 'B': 'NaturalLanguage'})
>>> nl_nan_check = NaturalLanguageNaNDataCheck()
>>> assert nl_nan_check.validate(data) == {
...        "warnings": [],
...        "actions": [],
...        "errors": [DataCheckError(message='Input natural language column(s) (A) contains NaN values. Please impute NaN values or drop these rows or columns.',
...                      data_check_name=NaturalLanguageNaNDataCheck.name,
...                      message_code=DataCheckMessageCode.NATURAL_LANGUAGE_HAS_NAN,
...                      details={"columns": ['A']}).to_dict()]
...    }

class evalml.data_checks.NoVarianceDataCheck(count_nan_as_value=False)[source]¶

Check if the target or any of the features have no variance.

Parameters: count_nan_as_value (bool) – If True, missing values will be counted as their own unique value. Additionally, if true, will return a DataCheckWarning instead of an error if the feature has mostly missing data and only one unique value. Defaults to False.

Methods

`name`	Return a name describing the data check.
`validate`	Check if the target or any of the features have no variance (1 unique value).

name(cls)¶: Return a name describing the data check.

validate(self, X, y)[source]¶

Check if the target or any of the features have no variance (1 unique value).

Parameters

X (pd.DataFrame, np.ndarray) – The input features.
y (pd.Series, np.ndarray) – The target data.

Returns

A dict of warnings/errors corresponding to features or target with no variance.

Return type

dict
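
The effect of `count_nan_as_value` on the unique-value count can be sketched as follows. `unique_count` is a hypothetical helper, `None` stands in for NaN, and the sketch covers only the counting, not which message level the check emits.

```python
def unique_count(values, count_nan_as_value=False):
    """Count distinct values, optionally treating None as its own value (hypothetical helper)."""
    distinct = {value for value in values if value is not None}
    has_missing = any(value is None for value in values)
    return len(distinct) + (1 if count_nan_as_value and has_missing else 0)


constant_with_gaps = [7, 7, None, 7]
without_nan = unique_count(constant_with_gaps)
with_nan = unique_count(constant_with_gaps, count_nan_as_value=True)
```

A column like `constant_with_gaps` has one unique value when missing entries are ignored and two when they are counted as a value of their own.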

class evalml.data_checks.OutliersDataCheck[source]¶

Checks if there are any outliers in input data by using IQR to determine score anomalies.

Columns with score anomalies are considered to contain outliers.

Methods

`get_boxplot_data`	Returns box plot information for the given data.
`name`	Return a name describing the data check.
`validate`	Check if there are any outliers in a dataframe by using IQR to determine column anomalies. Columns with anomalies are considered to contain outliers.

static get_boxplot_data(data_)[source]¶

Returns box plot information for the given data.

Parameters: data_ (pd.Series, np.ndarray) – Input data.
Returns: A payload of box plot statistics.
Return type: dict

name(cls)¶: Return a name describing the data check.

validate(self, X, y=None)[source]¶

Check if there are any outliers in a dataframe by using IQR to determine column anomalies. Columns with anomalies are considered to contain outliers.

Parameters

X (pd.DataFrame, np.ndarray) – Input features.
y (pd.Series, np.ndarray) – Ignored. Defaults to None.

Returns

A dictionary with warnings if any columns have outliers.

Return type

dict

Example

>>> import pandas as pd
>>> df = pd.DataFrame({
...     'x': [1, 2, 3, 4, 5],
...     'y': [6, 7, 8, 9, 10],
...     'z': [-1, -2, -3, -1201, -4]
... })
>>> outliers_check = OutliersDataCheck()
>>> assert outliers_check.validate(df) == {
...     "warnings": [{"message": "Column(s) 'z' are likely to have outlier data.",
...                   "data_check_name": "OutliersDataCheck",
...                   "level": "warning",
...                   "code": "HAS_OUTLIERS",
...                   "details": {"columns": ["z"], "rows": [3], "column_indices": {"z": [3]}}}],
...     "errors": [],
...     "actions": [{"code": "DROP_ROWS", "metadata": {"rows": [3], "columns": None}}]}
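
For intuition, a conventional IQR rule is sketched below: values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] are treated as outliers. The 1.5 multiplier and the quartile method are assumptions of this sketch, and `iqr_outlier_indices` is a hypothetical helper; evalml's own scoring may differ.

```python
from statistics import quantiles


def iqr_outlier_indices(values, k=1.5):
    """Return positions of values outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [i for i, value in enumerate(values) if value < low or value > high]


# Columns 'z' and 'x' from the example above.
z_outliers = iqr_outlier_indices([-1, -2, -3, -1201, -4])
x_outliers = iqr_outlier_indices([1, 2, 3, 4, 5])
```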

class evalml.data_checks.SparsityDataCheck(problem_type, threshold, unique_count_threshold=10)[source]¶

Check if there are any columns with sparsely populated values in the input.

Parameters

problem_type (str or ProblemTypes) – The specific problem type to data check for. ‘multiclass’ or ‘time series multiclass’ is the only accepted problem type.
threshold (float) – The threshold value, or percentage of each column’s unique values, below which a column exhibits sparsity. Should be between 0 and 1.
unique_count_threshold (int) – The minimum number of times a unique value has to be present in a column to not be considered “sparse.” Defaults to 10.

Methods

`name`	Return a name describing the data check.
`sparsity_score`	Calculate a sparsity score for the given value counts by calculating the percentage of unique values that exceed the count_threshold.
`validate`	Calculate what percentage of each column’s unique values exceed the count threshold and compare that percentage to the sparsity threshold stored in the class instance.

name(cls)¶: Return a name describing the data check.

static sparsity_score(col, count_threshold=10)[source]¶

Calculate a sparsity score for the given value counts by calculating the percentage of unique values that exceed the count_threshold.

Parameters

col (pd.Series) – Feature values.
count_threshold (int) – The number of instances below which a value is considered sparse. Default is 10.

Returns

Sparsity score, or the percentage of the unique values that exceed count_threshold.

Return type

(float)
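
The score described above can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: it takes "exceed" to mean strictly greater than `count_threshold`, assumes a non-empty input, and works on a list rather than a pd.Series.

```python
from collections import Counter


def sparsity_score(values, count_threshold=10):
    """Fraction of unique values that occur more than count_threshold times.

    Sketch only; assumes `values` is non-empty and that "exceed" is a
    strict comparison.
    """
    counts = Counter(values)
    frequent = sum(1 for c in counts.values() if c > count_threshold)
    return frequent / len(counts)


sparse = [float(x) for x in range(100)]  # 100 unique values, each seen once
not_sparse = [1.0 for _ in range(100)]   # one unique value seen 100 times

scores = {
    "sparse": sparsity_score(sparse),
    "not_sparse": sparsity_score(not_sparse),
}
# A column is treated as too sparse when its score falls below the threshold.
threshold = 0.5
too_sparse = [name for name, score in scores.items() if score < threshold]
print(too_sparse)  # ['sparse']
```

A score of 0.0 means no unique value is frequent enough, and 1.0 means every unique value is.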

validate(self, X, y=None)[source]¶

Calculate what percentage of each column’s unique values exceed the count threshold and compare that percentage to the sparsity threshold stored in the class instance.

Parameters

X (pd.DataFrame, np.ndarray) – Features.
y (pd.Series, np.ndarray) – Ignored.

Returns

dict with a DataCheckWarning if there are any sparse columns.

Return type

dict

Example

>>> import pandas as pd
>>> from evalml.data_checks import SparsityDataCheck
>>> df = pd.DataFrame({
...    'sparse': [float(x) for x in range(100)],
...    'not_sparse': [float(1) for x in range(100)]
... })
>>> sparsity_check = SparsityDataCheck(problem_type="multiclass", threshold=0.5, unique_count_threshold=10)
>>> assert sparsity_check.validate(df) == {
...     "errors": [],
...     "warnings": [{"message": "Input columns ('sparse') for multiclass problem type are too sparse.",
...                   "data_check_name": "SparsityDataCheck",
...                   "level": "warning",
...                   "code": "TOO_SPARSE",
...                   "details": {"columns": ["sparse"], "sparsity_score": {"sparse": 0.0}, "rows": None}}],
...     "actions": [{"code": "DROP_COL",
...                  "metadata": {"columns": ["sparse"], "rows": None}}]}

class evalml.data_checks.TargetDistributionDataCheck[source]¶

Check if the target data contains certain distributions that may need to be transformed prior to training to improve model performance. Uses the Shapiro-Wilk test when the dataset has 5000 or fewer samples; otherwise uses the Jarque-Bera test.

Methods

`name`	Return a name describing the data check.
`validate`	Check if the target data has a certain distribution.

name(cls)¶: Return a name describing the data check.

validate(self, X, y)[source]¶

Check if the target data has a certain distribution.

Parameters

X (pd.DataFrame, np.ndarray) – Features. Ignored.
y (pd.Series, np.ndarray) – Target data to check for underlying distributions.

Returns

dict with a DataCheckWarning if certain distributions are found in the target data.

Return type

dict (DataCheckWarning)

Example

>>> from scipy.stats import lognorm
>>> from evalml.data_checks import TargetDistributionDataCheck
>>> y = [0.946, 0.972, 1.154, 0.954, 0.969, 1.222, 1.038, 0.999, 0.973, 0.897]
>>> target_check = TargetDistributionDataCheck()
>>> assert target_check.validate(None, y) == {
...     "errors": [],
...     "warnings": [{"message": "Target may have a lognormal distribution.",
...                   "data_check_name": "TargetDistributionDataCheck",
...                   "level": "warning",
...                   "code": "TARGET_LOGNORMAL_DISTRIBUTION",
...                   "details": {"shapiro-statistic/pvalue": '0.8/0.045', "columns": None, "rows": None}}],
...     "actions": [{'code': 'TRANSFORM_TARGET', 'metadata': {'transformation_strategy': 'lognormal', 'is_target': True, "columns": None, "rows": None}}]}

class evalml.data_checks.TargetLeakageDataCheck(pct_corr_threshold=0.95, method='mutual')[source]¶

Check if any of the features are highly correlated with the target by using mutual information or Pearson correlation.

If method=’mutual’, this data check uses mutual information and supports all target and feature types. Otherwise, if method=’pearson’, it uses Pearson correlation and only supports binary targets with numeric and boolean dtypes. Pearson correlation returns a value in [-1, 1], while mutual information returns a value in [0, 1].

Parameters

pct_corr_threshold (float) – The correlation threshold to be considered leakage. Defaults to 0.95.
method (string) – The method used to determine correlation. Use ‘mutual’ for mutual information or ‘pearson’ for Pearson correlation. Defaults to ‘mutual’.

Methods

`name`	Return a name describing the data check.
`validate`	Check if any of the features are highly correlated with the target by using mutual information or Pearson correlation.

name(cls)¶: Return a name describing the data check.

validate(self, X, y)[source]¶

Check if any of the features are highly correlated with the target by using mutual information or Pearson correlation.

If method=’mutual’, supports all target and feature types. Otherwise, if method=’pearson’, only supports binary targets with numeric and boolean dtypes. Pearson correlation returns a value in [-1, 1], while mutual information returns a value in [0, 1].

Parameters

X (pd.DataFrame, np.ndarray) – The input features to check.
y (pd.Series, np.ndarray) – The target data.

Returns

dict with a DataCheckWarning if target leakage is detected.

Return type

dict (DataCheckWarning)

Example

>>> import pandas as pd
>>> from evalml.data_checks import TargetLeakageDataCheck
>>> X = pd.DataFrame({
...    'leak': [10, 42, 31, 51, 61],
...    'x': [42, 54, 12, 64, 12],
...    'y': [13, 5, 13, 74, 24],
... })
>>> y = pd.Series([10, 42, 31, 51, 40])
>>> target_leakage_check = TargetLeakageDataCheck(pct_corr_threshold=0.95)
>>> assert target_leakage_check.validate(X, y) == {
...     "warnings": [{"message": "Column 'leak' is 95.0% or more correlated with the target",
...                   "data_check_name": "TargetLeakageDataCheck",
...                   "level": "warning",
...                   "code": "TARGET_LEAKAGE",
...                   "details": {"columns": ["leak"], "rows": None}}],
...     "errors": [],
...     "actions": [{"code": "DROP_COL",
...                  "metadata": {"columns": ["leak"], "rows": None}}]}
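
The example above uses the default method='mutual'. For the 'pearson' option, the thresholding step can be sketched as follows. This is a rough illustration with made-up data, not the library's code: the `pearson` helper, the binary target, and the rule of comparing the absolute correlation against the threshold with `>=` are assumptions.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical binary target and two numeric features.
y = [0, 0, 1, 1, 1]
features = {
    "leak": [0.1, 0.0, 0.9, 1.0, 1.1],  # tracks the target closely
    "noise": [1, 0, 1, 0, 1],           # only weakly related to the target
}

pct_corr_threshold = 0.95
# Pearson lies in [-1, 1], so the magnitude is what gets compared (assumed).
leaky = [
    name for name, col in features.items()
    if abs(pearson(col, y)) >= pct_corr_threshold
]
print(leaky)  # ['leak']
```

Taking the absolute value means a strongly negative correlation is flagged as well, since it reveals the target just as much as a strongly positive one.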

class evalml.data_checks.UniquenessDataCheck(problem_type, threshold=0.5)[source]¶

Check if there are any columns in the input that are either too unique for classification problems or not unique enough for regression problems.

Parameters

problem_type (str or ProblemTypes) – The specific problem type to data check for, e.g. ‘binary’, ‘multiclass’, ‘regression’, ‘time series regression’.
threshold (float) – The threshold to set as an upper bound on uniqueness for classification-type problems or as a lower bound for regression-type problems. Defaults to 0.50.

Methods

`name`	Return a name describing the data check.
`uniqueness_score`	Calculate a uniqueness score for the provided field. NaN values are not considered as unique values in the calculation.
`validate`	Check if there are any columns in the input that are too unique in the case of classification problems or not unique enough in the case of regression problems.

name(cls)¶: Return a name describing the data check.

static uniqueness_score(col)[source]¶

Calculate a uniqueness score for the provided field. NaN values are not considered as unique values in the calculation.

Based on the Herfindahl–Hirschman Index.

Parameters: col (pd.Series) – Feature values.
Returns: Uniqueness score.
Return type: (float)
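
One way to read "based on the Herfindahl–Hirschman Index" is as one minus the sum of squared value proportions. The sketch below takes that reading, which is an assumption and not a confirmed formula. It also assumes a plain list as input, with NaN values dropped before counting.

```python
from collections import Counter
from math import isnan


def uniqueness_score(values):
    """1 - sum(p_i ** 2) over the proportions p_i of each distinct value.

    Sketch only; NaN floats are dropped first and the input is assumed
    to contain at least one non-NaN value.
    """
    kept = [v for v in values if not (isinstance(v, float) and isnan(v))]
    counts = Counter(kept)
    total = len(kept)
    return 1 - sum((c / total) ** 2 for c in counts.values())


all_distinct = [float(x) for x in range(100)]
constant = [1.0 for _ in range(100)]

# For regression the threshold acts as a lower bound on uniqueness.
threshold = 0.8
scores = {
    "all_distinct": uniqueness_score(all_distinct),
    "constant": uniqueness_score(constant),
}
not_unique_enough = [name for name, s in scores.items() if s < threshold]
print(not_unique_enough)  # ['constant']
```

Under this reading a constant column scores exactly 0.0, and a column of 100 distinct values scores about 0.99.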

validate(self, X, y=None)[source]¶

Check if there are any columns in the input that are too unique in the case of classification problems or not unique enough in the case of regression problems.

Parameters

X (pd.DataFrame, np.ndarray) – Features.
y (pd.Series, np.ndarray) – Ignored. Defaults to None.

Returns

dict with a DataCheckWarning if there are any columns that are too unique or not unique enough.

Return type

dict

Example

>>> import pandas as pd
>>> from evalml.data_checks import UniquenessDataCheck
>>> df = pd.DataFrame({
...    'regression_unique_enough': [float(x) for x in range(100)],
...    'regression_not_unique_enough': [float(1) for x in range(100)]
... })
>>> uniqueness_check = UniquenessDataCheck(problem_type="regression", threshold=0.8)
>>> assert uniqueness_check.validate(df) == {
...     "errors": [],
...     "warnings": [{"message": "Input columns 'regression_not_unique_enough' for regression problem type are not unique enough.",
...                   "data_check_name": "UniquenessDataCheck",
...                   "level": "warning",
...                   "code": "NOT_UNIQUE_ENOUGH",
...                   "details": {"columns": ["regression_not_unique_enough"], "uniqueness_score": {"regression_not_unique_enough": 0.0}, "rows": None}}],
...     "actions": [{"code": "DROP_COL",
...                  "metadata": {"columns": ["regression_not_unique_enough"], "rows": None}}]}
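
The dictionaries returned by validate in the examples on this page share the same three keys: "errors", "warnings", and "actions". A small sketch of reading the recommended actions out of such a result follows. The helper `columns_to_drop` is hypothetical, and the dictionary is copied by hand from the example above, not produced by running the check.

```python
# Result copied from the uniqueness example; not generated by EvalML here.
result = {
    "errors": [],
    "warnings": [{
        "message": "Input columns 'regression_not_unique_enough' for "
                   "regression problem type are not unique enough.",
        "data_check_name": "UniquenessDataCheck",
        "level": "warning",
        "code": "NOT_UNIQUE_ENOUGH",
        "details": {"columns": ["regression_not_unique_enough"],
                    "uniqueness_score": {"regression_not_unique_enough": 0.0},
                    "rows": None},
    }],
    "actions": [{"code": "DROP_COL",
                 "metadata": {"columns": ["regression_not_unique_enough"],
                              "rows": None}}],
}


def columns_to_drop(result):
    """Collect column names from every DROP_COL action (hypothetical helper)."""
    cols = []
    for action in result["actions"]:
        if action["code"] == "DROP_COL":
            # "columns" may be None for actions that target rows instead.
            cols.extend(action["metadata"]["columns"] or [])
    return cols


print(columns_to_drop(result))  # ['regression_not_unique_enough']
```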
