target_leakage_data_check
Data check that checks if any of the features are highly correlated with the target by using mutual information or Pearson correlation.
Module Contents
Classes Summary
TargetLeakageDataCheck – Check if any of the features are highly correlated with the target by using mutual information or Pearson correlation.
Contents
class evalml.data_checks.target_leakage_data_check.TargetLeakageDataCheck(pct_corr_threshold=0.95, method='mutual')
Check if any of the features are highly correlated with the target by using mutual information or Pearson correlation.
If method='mutual', this data check uses mutual information and supports all target and feature types. Otherwise, if method='pearson', it uses Pearson correlation and only supports binary with numeric and boolean dtypes. Pearson correlation returns a value in [-1, 1], while mutual information returns a value in [0, 1].
- Parameters
pct_corr_threshold (float) – The correlation threshold at or above which a feature is considered leakage. Defaults to 0.95.
method (string) – The method to determine correlation. Use ‘mutual’ for mutual information, otherwise ‘pearson’ for Pearson correlation. Defaults to ‘mutual’.
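The following is a minimal, dependency-free sketch of the Pearson variant, not the library's implementation. It assumes a feature is flagged when the absolute value of its correlation with the target meets the threshold, which is consistent with the [-1, 1] range described above. The helper names `pearson` and `flag_leaky` are hypothetical. The data mirrors the Pearson example from the Examples section.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined for a constant sequence; this sketch
        # treats that case as "no correlation".
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def flag_leaky(features, target, threshold):
    """Return names of features whose |correlation| with target >= threshold.

    Using the absolute value is an assumption of this sketch.
    """
    return [
        name
        for name, values in features.items()
        if abs(pearson(values, target)) >= threshold
    ]


target = [10, 42, 31, 51, 40]
features = {
    "leak": [10, 42, 31, 51, 61],
    "x": [t / 2 for t in target],  # a rescaled copy of the target
    "y": [13, 5, 13, 74, 24],
}

flagged_at_80 = flag_leaky(features, target, threshold=0.8)
flagged_at_95 = flag_leaky(features, target, threshold=0.95)
```

With these values, "leak" has a correlation of roughly 0.88 with the target, so it is flagged at a threshold of 0.8 but not at 0.95, while the rescaled copy "x" is flagged at both.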
Methods
name – Return a name describing the data check.
validate – Check if any of the features are highly correlated with the target by using mutual information or Pearson correlation.
name(cls)
Return a name describing the data check.
validate(self, X, y)
Check if any of the features are highly correlated with the target by using mutual information or Pearson correlation.
If method='mutual', supports all target and feature types. Otherwise, if method='pearson', it only supports binary with numeric and boolean dtypes. Pearson correlation returns a value in [-1, 1], while mutual information returns a value in [0, 1].
- Parameters
X (pd.DataFrame, np.ndarray) – The input features to check.
y (pd.Series, np.ndarray) – The target data.
- Returns
dict with a DataCheckWarning if target leakage is detected.
- Return type
dict (DataCheckWarning)
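The [0, 1] range quoted for mutual information implies a normalized score. The sketch below shows one common normalization for discrete values: mutual information divided by the arithmetic mean of the two entropies. Whether this matches the normalization or binning the library applies internally is an assumption, and the helper names `entropy` and `normalized_mutual_info` are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in nats) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())


def normalized_mutual_info(xs, ys):
    """Mutual information of two discrete sequences, scaled into [0, 1].

    Normalizes by the arithmetic mean of the entropies; this choice is an
    assumption of the sketch, not a documented detail of the data check.
    """
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = sum(
        (c / n) * math.log((c / n) / ((count_x[x] / n) * (count_y[y] / n)))
        for (x, y), c in count_xy.items()
    )
    denom = (entropy(xs) + entropy(ys)) / 2
    # If both sequences are constant there is no information to share.
    return mi / denom if denom > 0 else 0.0


# A feature that determines the target exactly scores 1.
dependent = normalized_mutual_info(["a", "b", "a", "b"], [1, 0, 1, 0])
# A feature that is independent of the target scores 0.
independent = normalized_mutual_info(["a", "a", "b", "b"], [0, 1, 0, 1])
```

Unlike Pearson correlation, this score has no sign and does not require numeric inputs, which fits the statement that the mutual information method supports all target and feature types.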
Examples
>>> import pandas as pd
>>> from evalml.data_checks.target_leakage_data_check import TargetLeakageDataCheck
Any columns that are strongly correlated with the target will raise a warning. This could be indicative of data leakage.
>>> X = pd.DataFrame({
...     "leak": [10, 42, 31, 51, 61],
...     "x": [42, 54, 12, 64, 12],
...     "y": [13, 5, 13, 74, 24],
... })
>>> y = pd.Series([10, 42, 31, 51, 40])
...
>>> target_leakage_check = TargetLeakageDataCheck(pct_corr_threshold=0.95)
>>> assert target_leakage_check.validate(X, y) == [
...     {
...         "message": "Column 'leak' is 95.0% or more correlated with the target",
...         "data_check_name": "TargetLeakageDataCheck",
...         "level": "warning",
...         "code": "TARGET_LEAKAGE",
...         "details": {"columns": ["leak"], "rows": None},
...         "action_options": [
...             {
...                 "code": "DROP_COL",
...                 "data_check_name": "TargetLeakageDataCheck",
...                 "parameters": {},
...                 "metadata": {"columns": ["leak"], "rows": None}
...             }
...         ]
...     }
... ]
The method can be changed from the default, mutual information, to Pearson correlation.
>>> X["x"] = y / 2
>>> target_leakage_check = TargetLeakageDataCheck(pct_corr_threshold=0.8, method="pearson")
>>> assert target_leakage_check.validate(X, y) == [
...     {
...         "message": "Columns 'leak', 'x' are 80.0% or more correlated with the target",
...         "data_check_name": "TargetLeakageDataCheck",
...         "level": "warning",
...         "details": {"columns": ["leak", "x"], "rows": None},
...         "code": "TARGET_LEAKAGE",
...         "action_options": [
...             {
...                 "code": "DROP_COL",
...                 "data_check_name": "TargetLeakageDataCheck",
...                 "parameters": {},
...                 "metadata": {"columns": ["leak", "x"], "rows": None}
...             }
...         ]
...     }
... ]
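A caller will usually want to act on the returned warnings. The sketch below assumes the result has the list-of-dicts shape shown in the examples above and collects the flagged column names so they can be dropped. The helper `leaked_columns` is hypothetical, and a plain dict of columns stands in for the DataFrame so the snippet runs without third-party packages.

```python
def leaked_columns(messages):
    """Collect column names from TARGET_LEAKAGE entries, preserving order."""
    columns = []
    for message in messages:
        if message.get("code") != "TARGET_LEAKAGE":
            continue
        for column in message["details"]["columns"]:
            if column not in columns:
                columns.append(column)
    return columns


# Result copied from the Pearson example above, trimmed to the keys used here.
result = [
    {
        "message": "Columns 'leak', 'x' are 80.0% or more correlated with the target",
        "data_check_name": "TargetLeakageDataCheck",
        "level": "warning",
        "details": {"columns": ["leak", "x"], "rows": None},
        "code": "TARGET_LEAKAGE",
    }
]

# Stand-in for the feature table: column name -> values.
features = {
    "leak": [10, 42, 31, 51, 61],
    "x": [5.0, 21.0, 15.5, 25.5, 20.0],
    "y": [13, 5, 13, 74, 24],
}

to_drop = leaked_columns(result)
kept = {name: values for name, values in features.items() if name not in to_drop}
```

With a pandas DataFrame, the equivalent last step would be `X.drop(columns=to_drop)`.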