target_leakage_data_check

Data check that determines whether any features are highly correlated with the target, using mutual information or Pearson correlation.

Module Contents

Classes Summary

TargetLeakageDataCheck

Check if any of the features are highly correlated with the target by using mutual information or Pearson correlation.

Contents

class evalml.data_checks.target_leakage_data_check.TargetLeakageDataCheck(pct_corr_threshold=0.95, method='mutual')

Check if any of the features are highly correlated with the target by using mutual information or Pearson correlation.

If method='mutual', this data check uses mutual information and supports all target and feature types. If method='pearson', it uses Pearson correlation and supports only binary with numeric and boolean dtypes. Pearson correlation returns a value in [-1, 1], while mutual information returns a value in [0, 1].
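To make the Pearson range concrete, the following is a minimal standard-library sketch, not evalml's implementation. The helper names `pearson` and `flag_leaky_columns` are hypothetical, and comparing the absolute value of the coefficient against the threshold is an assumption made here so that strong negative correlations are flagged as well.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences.

    Does not handle constant sequences (zero variance).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def flag_leaky_columns(features, target, threshold=0.95):
    """Hypothetical helper: names of columns whose |r| meets the threshold.

    Assumption: the absolute value is compared, since r lies in [-1, 1].
    """
    return [
        name
        for name, values in features.items()
        if abs(pearson(values, target)) >= threshold
    ]


# Illustrative data only; not taken from the evalml documentation.
features = {
    "copy": [1, 2, 3, 4, 5],         # identical to the target, r = 1
    "negated": [-1, -2, -3, -4, -5],  # perfectly anti-correlated, r = -1
    "noise": [3, 1, 4, 1, 5],         # weakly correlated
}
target = [1, 2, 3, 4, 5]

flagged = flag_leaky_columns(features, target)
```

Here `flagged` contains `"copy"` and `"negated"` but not `"noise"`, which shows why the sign of a Pearson coefficient matters when it is compared with a positive threshold.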

Parameters

pct_corr_threshold (float) – The correlation threshold at or above which a feature is considered leakage. Defaults to 0.95.
method (string) – The method used to determine correlation. Use 'mutual' for mutual information or 'pearson' for Pearson correlation. Defaults to 'mutual'.

Methods

`name`	Return a name describing the data check.
`validate`	Check if any of the features are highly correlated with the target by using mutual information or Pearson correlation.

name(cls): Return a name describing the data check.

validate(self, X, y)

Check if any of the features are highly correlated with the target by using mutual information or Pearson correlation.

If method='mutual', this check supports all target and feature types. If method='pearson', it supports only binary with numeric and boolean dtypes. Pearson correlation returns a value in [-1, 1], while mutual information returns a value in [0, 1].

Parameters

X (pd.DataFrame, np.ndarray) – The input features to check.
y (pd.Series, np.ndarray) – The target data.

Returns

dict with a DataCheckWarning if target leakage is detected.

Return type

dict (DataCheckWarning)

Examples

>>> import pandas as pd
>>> from evalml.data_checks import TargetLeakageDataCheck

Any columns that are strongly correlated with the target will raise a warning. This could be indicative of data leakage.

>>> X = pd.DataFrame({
...    "leak": [10, 42, 31, 51, 61],
...    "x": [42, 54, 12, 64, 12],
...    "y": [13, 5, 13, 74, 24],
... })
>>> y = pd.Series([10, 42, 31, 51, 40])
>>> target_leakage_check = TargetLeakageDataCheck(pct_corr_threshold=0.95)
>>> assert target_leakage_check.validate(X, y) == [
...     {
...         "message": "Column 'leak' is 95.0% or more correlated with the target",
...         "data_check_name": "TargetLeakageDataCheck",
...         "level": "warning",
...         "code": "TARGET_LEAKAGE",
...         "details": {"columns": ["leak"], "rows": None},
...         "action_options": [
...             {
...                 "code": "DROP_COL",
...                 "data_check_name": "TargetLeakageDataCheck",
...                 "parameters": {},
...                 "metadata": {"columns": ["leak"], "rows": None}
...             }
...         ]
...     }
... ]

The method can be changed from the default, mutual information, to Pearson correlation.

>>> X["x"] = y / 2
>>> target_leakage_check = TargetLeakageDataCheck(pct_corr_threshold=0.8, method="pearson")
>>> assert target_leakage_check.validate(X, y) == [
...     {
...         "message": "Columns 'leak', 'x' are 80.0% or more correlated with the target",
...         "data_check_name": "TargetLeakageDataCheck",
...         "level": "warning",
...         "details": {"columns": ["leak", "x"], "rows": None},
...         "code": "TARGET_LEAKAGE",
...         "action_options": [
...             {
...                 "code": "DROP_COL",
...                 "data_check_name": "TargetLeakageDataCheck",
...                 "parameters": {},
...                 "metadata": {"columns": ["leak", "x"], "rows": None}
...             }
...         ]
...     }
... ]
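As a follow-up to the examples above, the sketch below shows one way the returned structure might be consumed. It assumes the result is a list of dictionaries shaped exactly like the expected output shown in the examples; `columns_flagged_for_drop` is a hypothetical helper, not part of evalml.

```python
def columns_flagged_for_drop(results):
    """Hypothetical helper: collect columns named by DROP_COL action options.

    Assumes each result is a dict with an "action_options" list whose
    entries carry a "code" and a "metadata" dict with a "columns" list.
    """
    columns = []
    for result in results:
        for action in result.get("action_options", []):
            if action.get("code") != "DROP_COL":
                continue
            for col in action["metadata"]["columns"]:
                if col not in columns:  # keep first-seen order, no duplicates
                    columns.append(col)
    return columns


# Result copied from the Pearson example above.
results = [
    {
        "message": "Columns 'leak', 'x' are 80.0% or more correlated with the target",
        "data_check_name": "TargetLeakageDataCheck",
        "level": "warning",
        "details": {"columns": ["leak", "x"], "rows": None},
        "code": "TARGET_LEAKAGE",
        "action_options": [
            {
                "code": "DROP_COL",
                "data_check_name": "TargetLeakageDataCheck",
                "parameters": {},
                "metadata": {"columns": ["leak", "x"], "rows": None},
            }
        ],
    }
]

to_drop = columns_flagged_for_drop(results)
```

With a pandas DataFrame `X`, the resulting list could then be passed to `X.drop(columns=to_drop)` to remove the flagged features before training.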
