invalid_targets_data_check

Data check that checks if the target data contains missing or invalid values.

Module Contents

Classes Summary

InvalidTargetDataCheck

Check if the target data contains missing or invalid values.

Contents

class evalml.data_checks.invalid_targets_data_check.InvalidTargetDataCheck(problem_type, objective, n_unique=100)

Check if the target data contains missing or invalid values.

Parameters
  • problem_type (str or ProblemTypes) – The problem type to run the data check for, e.g. ‘binary’, ‘multiclass’, ‘regression’, or ‘time series regression’.

  • objective (str or ObjectiveBase) – Name or instance of the objective class.

  • n_unique (int or None) – Maximum number of unique target values to store when the problem type is binary and the target incorrectly has more than 2 unique values. Must be a non-negative integer. If None, all unique values are stored. Defaults to 100.
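To make the n_unique cap concrete, here is a minimal sketch of the rule described above. It is not evalml's implementation: the helper name `cap_unique_values` is hypothetical, and keeping values in order of first appearance is an assumption made for this illustration.

```python
from typing import Hashable, List, Optional, Sequence


def cap_unique_values(
    target: Sequence[Hashable], n_unique: Optional[int] = 100
) -> List[Hashable]:
    """Hypothetical helper: return the distinct target values, capped at n_unique."""
    if n_unique is not None and n_unique < 0:
        raise ValueError("n_unique must be a non-negative integer or None")
    # dict.fromkeys drops duplicates while keeping first-appearance order
    # (this ordering is an assumption of the sketch, not documented behavior).
    distinct = list(dict.fromkeys(target))
    return distinct if n_unique is None else distinct[:n_unique]


target = [0, 1, 2, 1, 0, 3]
capped = cap_unique_values(target, n_unique=2)          # at most two distinct values
everything = cap_unique_values(target, n_unique=None)   # no cap applied
```

With a cap of 2 only two of the four distinct values are kept, while None keeps all of them.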

Attributes

  • multiclass_continuous_threshold – 0.05

Methods

  • name – Return a name describing the data check.

  • validate – Check if the target data contains missing or invalid values.

name(cls)

Return a name describing the data check.

validate(self, X, y)

Check if the target data contains missing or invalid values.

Parameters
  • X (pd.DataFrame, np.ndarray) – Features. Used only to compare length and index against the target, as in the last example below; may be None.

  • y (pd.Series, np.ndarray) – Target data to check for invalid values.

Returns

Dictionary with ‘warnings’, ‘errors’, and ‘actions’ lists, populated with DataCheckErrors if any invalid values are found in the target data.

Return type

dict (DataCheckError)
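Because the result is a plain dictionary, it can be inspected with ordinary Python. The sketch below uses a hand-written result in the documented shape (it is not produced by running evalml), and the helper `error_codes` is a hypothetical name introduced only for illustration.

```python
from typing import Any, Dict, List

# Hand-written result in the documented shape; not actual evalml output.
result: Dict[str, List[Dict[str, Any]]] = {
    "warnings": [],
    "errors": [
        {
            "message": "Target is either empty or fully null.",
            "data_check_name": "InvalidTargetDataCheck",
            "level": "error",
            "details": {"columns": None, "rows": None},
            "code": "TARGET_IS_EMPTY_OR_FULLY_NULL",
        }
    ],
    "actions": [],
}


def error_codes(check_result: Dict[str, List[Dict[str, Any]]]) -> List[str]:
    """Hypothetical helper: collect the 'code' of every entry under 'errors'."""
    return [entry["code"] for entry in check_result.get("errors", [])]


codes = error_codes(result)
has_errors = bool(codes)  # True when at least one error was reported
```

Checking the codes rather than the message strings is likely to be more robust, since the messages embed counts and percentages that vary with the data.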

Examples

>>> import pandas as pd
>>> from evalml.data_checks.invalid_targets_data_check import InvalidTargetDataCheck
...
>>> X = pd.DataFrame({"col": [1, 2, 3, 1]})
>>> y = pd.Series(["cat_1", "cat_2", "cat_1", "cat_2"])
>>> target_check = InvalidTargetDataCheck('regression', 'R2')
>>> assert target_check.validate(X, y) == {
...     'warnings': [],
...     'errors': [{'message': 'Target is unsupported Unknown type. Valid Woodwork logical types include: integer, double, boolean',
...                 'data_check_name': 'InvalidTargetDataCheck',
...                 'level': 'error',
...                 'details': {'columns': None, 'rows': None, 'unsupported_type': 'unknown'},
...                 'code': 'TARGET_UNSUPPORTED_TYPE'},
...                {'message': 'Target data type should be numeric for regression type problems.',
...                 'data_check_name': 'InvalidTargetDataCheck',
...                 'level': 'error',
...                 'details': {'columns': None, 'rows': None},
...                 'code': 'TARGET_UNSUPPORTED_TYPE'}],
...     'actions': []}
...
...
>>> y = pd.Series([None, pd.NA, pd.NaT, None])
>>> assert target_check.validate(X, y) == {
...     'warnings': [],
...     'errors': [{'message': 'Target is either empty or fully null.',
...                 'data_check_name': 'InvalidTargetDataCheck',
...                 'level': 'error',
...                 'details': {'columns': None, 'rows': None},
...                 'code': 'TARGET_IS_EMPTY_OR_FULLY_NULL'}],
...     'actions': []}
...
...
>>> y = pd.Series([1, None, 3, None])
>>> assert target_check.validate(None, y) == {
...     'warnings': [],
...     'errors': [{'message': '2 row(s) (50.0%) of target values are null',
...                 'data_check_name': 'InvalidTargetDataCheck',
...                 'level': 'error',
...                 'details': {'columns': None,
...                             'rows': None,
...                             'num_null_rows': 2,
...                             'pct_null_rows': 50.0},
...                 'code': 'TARGET_HAS_NULL'}],
...     'actions': [{'code': 'IMPUTE_COL',
...                  'metadata': {'columns': None,
...                               'rows': None,
...                               'is_target': True,
...                               'impute_strategy': 'mean'}}]}
...
...
>>> X = pd.DataFrame([i for i in range(50)])
>>> y = pd.Series([i%2 for i in range(50)])
>>> target_check = InvalidTargetDataCheck('multiclass', 'Log Loss Multiclass')
>>> assert target_check.validate(X, y) == {
...     'warnings': [],
...     'errors': [{'message': 'Target has two or less classes, which is too few for multiclass problems.  Consider changing to binary.',
...                 'data_check_name': 'InvalidTargetDataCheck',
...                 'level': 'error',
...                 'details': {'columns': None, 'rows': None, 'num_classes': 2},
...                 'code': 'TARGET_MULTICLASS_NOT_ENOUGH_CLASSES'}],
...     'actions': []}
...
...
>>> target_check = InvalidTargetDataCheck('regression', 'R2')
>>> X = pd.DataFrame([i for i in range(5)])
>>> y = pd.Series([1, 2, 4, 3], index=[1, 2, 4, 3])
>>> assert target_check.validate(X, y) == {
...     'warnings': [{'message': 'Input target and features have different lengths',
...                   'data_check_name': 'InvalidTargetDataCheck',
...                   'level': 'warning',
...                   'details': {'columns': None,
...                               'rows': None,
...                               'features_length': 5,
...                               'target_length': 4},
...                   'code': 'MISMATCHED_LENGTHS'},
...                  {'message': 'Input target and features have mismatched indices',
...                   'data_check_name': 'InvalidTargetDataCheck',
...                   'level': 'warning',
...                   'details': {'columns': None,
...                               'rows': None,
...                               'indices_not_in_features': [],
...                               'indices_not_in_target': [0]},
...                   'code': 'MISMATCHED_INDICES'}],
...     'errors': [],
...     'actions': []}
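The index details in the last example can be reproduced with set differences. This is a sketch of the idea only, not evalml's implementation; the helper `index_mismatch` is a hypothetical name, and plain Python sequences stand in for the pandas indices.

```python
from typing import Dict, Hashable, Iterable, List


def index_mismatch(
    features_index: Iterable[Hashable], target_index: Iterable[Hashable]
) -> Dict[str, List[Hashable]]:
    """Hypothetical helper: report labels present in one index but not the other."""
    features_labels = set(features_index)
    target_labels = set(target_index)
    return {
        # Labels the target has that the features lack.
        "indices_not_in_features": sorted(target_labels - features_labels),
        # Labels the features have that the target lacks.
        "indices_not_in_target": sorted(features_labels - target_labels),
    }


# Same indices as the last example: X has labels 0..4, y has labels 1, 2, 4, 3.
details = index_mismatch(range(5), [1, 2, 4, 3])
```

Here the feature label 0 has no counterpart in the target, which matches the 'indices_not_in_target': [0] entry in the warning above.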