target_leakage_data_check#
Data check that checks if any of the features are highly correlated with the target by using mutual information or Pearson correlation.
Module Contents#
Classes Summary#
TargetLeakageDataCheck – Check if any of the features are highly correlated with the target by using mutual information, Pearson correlation, and other correlation metrics.
Contents#
- class evalml.data_checks.target_leakage_data_check.TargetLeakageDataCheck(pct_corr_threshold=0.95, method='all')[source]#
Check if any of the features are highly correlated with the target by using mutual information, Pearson correlation, and other correlation metrics.
If method='mutual_info', this data check uses mutual information and supports all target and feature types. Other correlation metrics only support binary with numeric and boolean dtypes. This method returns a value in [-1, 1] if other correlation metrics are selected and a value in [0, 1] if mutual information is selected. The available correlation metrics can be found in Woodwork's dependence_dict method.
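The thresholding idea described above can be sketched in plain Python. This is only an illustration, not EvalML's or Woodwork's implementation: the helper names `pearson` and `flag_leaky_columns` are hypothetical, and comparing the absolute value of the correlation against the threshold is an assumption made here so that strong negative correlations are also flagged.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: correlation is undefined, treated as 0 here
    return cov / math.sqrt(var_x * var_y)


def flag_leaky_columns(features, target, pct_corr_threshold=0.95):
    """Return names of features whose |correlation| with the target meets the threshold.

    Hypothetical helper: using abs() is an assumption of this sketch.
    """
    return [
        name
        for name, values in features.items()
        if abs(pearson(values, target)) >= pct_corr_threshold
    ]


target = [1, 2, 3, 4, 5]
features = {
    "scaled": [0.5, 1.0, 1.5, 2.0, 2.5],  # target / 2, perfectly correlated
    "negated": [-1, -2, -3, -4, -5],      # perfectly anti-correlated
    "noise": [2, 1, 2, 1, 2],             # uncorrelated with the target
}
flagged = flag_leaky_columns(features, target)
```

Here `flagged` contains "scaled" and "negated" but not "noise", since a Pearson coefficient can be strongly negative as well as strongly positive.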
- Parameters
pct_corr_threshold (float) – The correlation threshold to be considered leakage. Defaults to 0.95.
method (string) – The method to determine correlation. Use 'all' or 'max' for the maximum correlation, or use the name of a specific correlation metric (e.g. 'mutual_info' for mutual information, 'pearson' for Pearson correlation, etc.). Possible methods can be found in Woodwork's config, under correlation_metrics. Defaults to 'all'.
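As a rough illustration of what "maximum correlation" could mean when several metrics are computed, the sketch below keeps, for each column, the score with the largest magnitude across metrics. The function `max_correlation`, the comparison on absolute value, and the scores themselves are all assumptions made for this example. They are not taken from Woodwork or EvalML.

```python
def max_correlation(scores_by_metric):
    """For each column, keep the score with the largest absolute value across metrics."""
    best = {}
    for metric_scores in scores_by_metric.values():
        for column, score in metric_scores.items():
            if column not in best or abs(score) > abs(best[column]):
                best[column] = score
    return best


# Hypothetical, hand-written scores; not output from Woodwork or EvalML.
scores_by_metric = {
    "pearson": {"leak": 0.97, "x": -0.40},
    "mutual_info": {"leak": 0.99, "x": 0.10},
}
combined = max_correlation(scores_by_metric)
```

With these made-up inputs, `combined` maps "leak" to 0.99 and "x" to -0.40, so only "leak" would meet a 0.95 threshold.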
Methods
name – Return a name describing the data check.
validate – Check if any of the features are highly correlated with the target by using mutual information, Pearson correlation, and/or Spearman correlation.
- name(cls)#
Return a name describing the data check.
- validate(self, X, y)[source]#
Check if any of the features are highly correlated with the target by using mutual information, Pearson correlation, and/or Spearman correlation.
If method='mutual_info' or method='max', supports all target and feature types. Other correlation metrics only support binary with numeric and boolean dtypes. This method returns a value in [-1, 1] if other correlation metrics are selected and a value in [0, 1] if mutual information is selected.
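To see why a mutual-information score can apply to any feature type and stay within [0, 1], here is a minimal sketch for discrete values. It normalizes mutual information by the arithmetic mean of the two entropies, which is one common convention. Whether Woodwork uses this exact normalization is not stated here, so treat the choice and the function names as assumptions.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (natural log) of a sequence of hashable values."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())


def mutual_information(xs, ys):
    """Mutual information between two equal-length sequences of hashable values."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def normalized_mutual_information(xs, ys):
    """Mutual information divided by the mean entropy (assumed normalization)."""
    denom = (entropy(xs) + entropy(ys)) / 2
    if denom == 0:
        return 0.0  # both sequences constant: convention chosen for this sketch
    return mutual_information(xs, ys) / denom


# A categorical feature that determines the target exactly.
dependent = normalized_mutual_information(["a", "b", "a", "b"], [1, 0, 1, 0])
# A categorical feature that is independent of the target.
independent = normalized_mutual_information(["a", "a", "b", "b"], [0, 1, 0, 1])
```

The dependent pair scores approximately 1 and the independent pair scores approximately 0. No ordering or arithmetic on the values is needed, only counts, which is why categorical data works.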
- Parameters
X (pd.DataFrame, np.ndarray) – The input features to check.
y (pd.Series, np.ndarray) – The target data.
- Returns
dict with a DataCheckWarning if target leakage is detected.
- Return type
dict (DataCheckWarning)
Examples
>>> import pandas as pd
>>> from evalml.data_checks import TargetLeakageDataCheck
Any columns that are strongly correlated with the target will raise a warning. This could be indicative of data leakage.
>>> X = pd.DataFrame({
...     "leak": [10, 42, 31, 51, 61] * 15,
...     "x": [42, 54, 12, 64, 12] * 15,
...     "y": [13, 5, 13, 74, 24] * 15,
... })
>>> y = pd.Series([10, 42, 31, 51, 40] * 15)
...
>>> target_leakage_check = TargetLeakageDataCheck(pct_corr_threshold=0.95)
>>> assert target_leakage_check.validate(X, y) == [
...     {
...         "message": "Column 'leak' is 95.0% or more correlated with the target",
...         "data_check_name": "TargetLeakageDataCheck",
...         "level": "warning",
...         "code": "TARGET_LEAKAGE",
...         "details": {"columns": ["leak"], "rows": None},
...         "action_options": [
...             {
...                 "code": "DROP_COL",
...                 "data_check_name": "TargetLeakageDataCheck",
...                 "parameters": {},
...                 "metadata": {"columns": ["leak"], "rows": None}
...             }
...         ]
...     }
... ]
The method can be changed from the default to Pearson correlation by passing method="pearson".
>>> X["x"] = y / 2
>>> target_leakage_check = TargetLeakageDataCheck(pct_corr_threshold=0.8, method="pearson")
>>> assert target_leakage_check.validate(X, y) == [
...     {
...         "message": "Columns 'leak', 'x' are 80.0% or more correlated with the target",
...         "data_check_name": "TargetLeakageDataCheck",
...         "level": "warning",
...         "details": {"columns": ["leak", "x"], "rows": None},
...         "code": "TARGET_LEAKAGE",
...         "action_options": [
...             {
...                 "code": "DROP_COL",
...                 "data_check_name": "TargetLeakageDataCheck",
...                 "parameters": {},
...                 "metadata": {"columns": ["leak", "x"], "rows": None}
...             }
...         ]
...     }
... ]
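One way to act on such a result is to collect the columns named in its DROP_COL action options. The sketch below works on plain dictionaries shaped like the expected output of the Pearson example. The helper `columns_to_drop` is hypothetical and not part of EvalML.

```python
def columns_to_drop(results):
    """Collect column names from DROP_COL action options in a list of result dicts."""
    columns = []
    for result in results:
        for action in result.get("action_options", []):
            if action.get("code") == "DROP_COL":
                for column in action["metadata"]["columns"]:
                    if column not in columns:  # keep first-seen order, no duplicates
                        columns.append(column)
    return columns


# Result dictionary copied from the Pearson example.
results = [
    {
        "message": "Columns 'leak', 'x' are 80.0% or more correlated with the target",
        "data_check_name": "TargetLeakageDataCheck",
        "level": "warning",
        "details": {"columns": ["leak", "x"], "rows": None},
        "code": "TARGET_LEAKAGE",
        "action_options": [
            {
                "code": "DROP_COL",
                "data_check_name": "TargetLeakageDataCheck",
                "parameters": {},
                "metadata": {"columns": ["leak", "x"], "rows": None},
            }
        ],
    }
]
to_drop = columns_to_drop(results)
```

The resulting list could then be passed to something like `X.drop(columns=to_drop)` on a pandas DataFrame, once each flagged column has been confirmed as real leakage rather than a legitimately predictive feature.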