target_leakage_data_check

Data check that checks if any of the features are highly correlated with the target by using mutual information or Pearson correlation.

Module Contents

Classes Summary

TargetLeakageDataCheck

Check if any of the features are highly correlated with the target by using mutual information, Pearson correlation, and other correlation metrics.

Contents

class evalml.data_checks.target_leakage_data_check.TargetLeakageDataCheck(pct_corr_threshold=0.95, method='all')

Check if any of the features are highly correlated with the target by using mutual information, Pearson correlation, and other correlation metrics.

If method='mutual_info', this data check uses mutual information and supports all target and feature types. Other correlation metrics only support binary with numeric and boolean dtypes. The correlation value lies in [-1, 1] if other correlation metrics are selected and in [0, 1] if mutual information is selected. The available correlation metrics can be found in Woodwork's dependence_dict method.
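To illustrate why mutual information is reported in [0, 1] and carries no sign, the following is a minimal sketch of a normalized mutual information score for discrete values. It is not Woodwork's implementation: the helper names `entropy` and `normalized_mutual_info` are hypothetical, and normalizing by the mean of the two entropies is an assumption made for this illustration.

```python
from collections import Counter
from math import log


def entropy(values):
    """Shannon entropy (natural log) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log(c / n) for c in Counter(values).values())


def normalized_mutual_info(xs, ys):
    """Mutual information divided by the mean entropy of xs and ys.

    Hypothetical helper for illustration only; the normalization choice
    is an assumption, not taken from Woodwork or EvalML.
    """
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = sum(
        (c / n) * log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )
    denom = (entropy(xs) + entropy(ys)) / 2
    return mi / denom if denom > 0 else 0.0


# A feature that determines the target one-to-one scores close to 1,
# regardless of whether the relationship is increasing or decreasing.
same = normalized_mutual_info([0, 0, 1, 1], [0, 0, 1, 1])
flipped = normalized_mutual_info([0, 0, 1, 1], [1, 1, 0, 0])

# A feature that is independent of the target scores 0.
independent = normalized_mutual_info([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the score depends only on how well one variable predicts the other, a perfectly inverted relationship scores the same as a perfectly aligned one, whereas a signed metric such as Pearson correlation would distinguish the two.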

Parameters

pct_corr_threshold (float) – The correlation threshold to be considered leakage. Defaults to 0.95.
method (string) – The method to determine correlation. Use 'all' or 'max' for the maximum correlation, or give the name of a specific correlation metric (e.g. 'mutual_info' for mutual information, 'pearson' for Pearson correlation). Possible methods can be found in Woodwork's config, under correlation_metrics. Defaults to 'all'.
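As a rough picture of how pct_corr_threshold interacts with a signed metric, here is a minimal sketch that computes Pearson correlation by hand and flags columns at or above the threshold. The helpers `pearson` and `flag_columns` are hypothetical and do not reflect EvalML's internals; in particular, comparing the absolute value of the correlation against the threshold is an assumption of this sketch.

```python
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation coefficient; always lies in [-1, 1]."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def flag_columns(features, target, pct_corr_threshold=0.95):
    """Return the names of features whose correlation meets the threshold.

    Assumption: the threshold is compared against the absolute value,
    so strongly negative correlations would be flagged as well.
    """
    return [
        name
        for name, values in features.items()
        if abs(pearson(values, target)) >= pct_corr_threshold
    ]


target = [10, 42, 31, 51, 40] * 15
features = {
    "leak": [10, 42, 31, 51, 61] * 15,
    "x": [v / 2 for v in target],  # exact linear function of the target
    "y": [13, 5, 13, 74, 24] * 15,
}

flagged_loose = flag_columns(features, target, pct_corr_threshold=0.8)
flagged_strict = flag_columns(features, target, pct_corr_threshold=0.95)
```

With this data, lowering the threshold from 0.95 to 0.8 widens the set of flagged columns, which is the trade-off the parameter controls: a lower threshold catches weaker leaks at the cost of more false alarms.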

Methods

`name`	Return a name describing the data check.
`validate`	Check if any of the features are highly correlated with the target by using mutual information, Pearson correlation, and/or Spearman correlation.

name(cls): Return a name describing the data check.

validate(self, X, y)

Check if any of the features are highly correlated with the target by using mutual information, Pearson correlation, and/or Spearman correlation.

If method='mutual_info' or method='max', supports all target and feature types. Other correlation metrics only support binary with numeric and boolean dtypes. The correlation value lies in [-1, 1] if other correlation metrics are selected and in [0, 1] if mutual information is selected.

Parameters

X (pd.DataFrame, np.ndarray) – The input features to check.
y (pd.Series, np.ndarray) – The target data.

Returns

dict with a DataCheckWarning if target leakage is detected.

Return type

dict (DataCheckWarning)

Examples

>>> import pandas as pd
>>> from evalml.data_checks import TargetLeakageDataCheck

Any columns that are strongly correlated with the target will raise a warning. This could be indicative of data leakage.

>>> X = pd.DataFrame({
...    "leak": [10, 42, 31, 51, 61] * 15,
...    "x": [42, 54, 12, 64, 12] * 15,
...    "y": [13, 5, 13, 74, 24] * 15,
... })
>>> y = pd.Series([10, 42, 31, 51, 40] * 15)
...
>>> target_leakage_check = TargetLeakageDataCheck(pct_corr_threshold=0.95)
>>> assert target_leakage_check.validate(X, y) == [
...     {
...         "message": "Column 'leak' is 95.0% or more correlated with the target",
...         "data_check_name": "TargetLeakageDataCheck",
...         "level": "warning",
...         "code": "TARGET_LEAKAGE",
...         "details": {"columns": ["leak"], "rows": None},
...         "action_options": [
...             {
...                 "code": "DROP_COL",
...                 "data_check_name": "TargetLeakageDataCheck",
...                 "parameters": {},
...                 "metadata": {"columns": ["leak"], "rows": None}
...             }
...         ]
...     }
... ]

The method can be changed from the default ('all') to a specific metric, such as 'pearson'.

>>> X["x"] = y / 2
>>> target_leakage_check = TargetLeakageDataCheck(pct_corr_threshold=0.8, method="pearson")
>>> assert target_leakage_check.validate(X, y) == [
...     {
...         "message": "Columns 'leak', 'x' are 80.0% or more correlated with the target",
...         "data_check_name": "TargetLeakageDataCheck",
...         "level": "warning",
...         "details": {"columns": ["leak", "x"], "rows": None},
...         "code": "TARGET_LEAKAGE",
...         "action_options": [
...             {
...                 "code": "DROP_COL",
...                 "data_check_name": "TargetLeakageDataCheck",
...                 "parameters": {},
...                 "metadata": {"columns": ["leak", "x"], "rows": None}
...             }
...         ]
...     }
... ]
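The list returned by validate in the examples above can be post-processed with ordinary Python. The following is a minimal sketch of collecting the flagged column names, for instance to drop them before training. The helper `leaked_columns` is hypothetical and is not part of EvalML; the `results` literal simply mirrors the output shown in the second example.

```python
def leaked_columns(results):
    """Collect column names from TARGET_LEAKAGE entries, preserving order.

    Hypothetical helper: it relies only on the "code" and "details" keys
    visible in the example output above.
    """
    columns = []
    for entry in results:
        if entry.get("code") != "TARGET_LEAKAGE":
            continue
        for col in entry["details"]["columns"]:
            if col not in columns:
                columns.append(col)
    return columns


# Same structure as the validate output in the example above.
results = [
    {
        "message": "Columns 'leak', 'x' are 80.0% or more correlated with the target",
        "data_check_name": "TargetLeakageDataCheck",
        "level": "warning",
        "details": {"columns": ["leak", "x"], "rows": None},
        "code": "TARGET_LEAKAGE",
        "action_options": [
            {
                "code": "DROP_COL",
                "data_check_name": "TargetLeakageDataCheck",
                "parameters": {},
                "metadata": {"columns": ["leak", "x"], "rows": None},
            }
        ],
    }
]

to_drop = leaked_columns(results)
```

With a pandas DataFrame, the resulting list could then be passed to `X.drop(columns=to_drop)`. Whether dropping is the right response depends on the data: a high correlation is only an indication of possible leakage, not proof of it.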