target_leakage_data_check
Module Contents
Classes Summary
TargetLeakageDataCheck | Check if any of the features are highly correlated with the target by using mutual information or Pearson correlation.
Contents
class evalml.data_checks.target_leakage_data_check.TargetLeakageDataCheck(pct_corr_threshold=0.95, method='mutual')
Check if any of the features are highly correlated with the target by using mutual information or Pearson correlation.
If method='mutual', this data check uses mutual information and supports all target and feature types. If method='pearson', it uses Pearson correlation and only supports binary with numeric and boolean dtypes. Pearson correlation returns a value in [-1, 1], while mutual information returns a value in [0, 1].
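To make the two value ranges concrete, the following is a minimal sketch of a normalized mutual information score for discrete values, using only the standard library. It is not evalml's implementation: the names `entropy` and `normalized_mutual_info`, the count-based entropy estimate, and the normalization by the mean of the two entropies are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in nats) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())


def normalized_mutual_info(xs, ys):
    """Illustrative normalized mutual information in [0, 1] for discrete data.

    Assumption: normalize by the mean of the two marginal entropies.
    """
    h_x, h_y = entropy(xs), entropy(ys)
    h_xy = entropy(list(zip(xs, ys)))  # joint entropy of the paired values
    mean_h = (h_x + h_y) / 2
    if mean_h == 0:
        return 0.0  # both sequences are constant, so nothing is shared
    score = (h_x + h_y - h_xy) / mean_h
    # Clamp tiny floating-point overshoot so the result stays in [0, 1].
    return min(1.0, max(0.0, score))


target = [0, 0, 1, 1]
same_as_target = normalized_mutual_info([0, 0, 1, 1], target)
unrelated = normalized_mutual_info([0, 1, 0, 1], target)
relabeled = normalized_mutual_info(["b", "b", "a", "a"], target)
```

In this sketch a feature that fully determines the target scores near 1 even when its labels differ from the target's, and an unrelated feature scores near 0. Unlike Pearson correlation, the score has no negative range.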
- Parameters
pct_corr_threshold (float) – The correlation threshold at or above which a feature is considered leakage. Defaults to 0.95.
method (string) – The method used to determine correlation. Use 'mutual' for mutual information or 'pearson' for Pearson correlation. Defaults to 'mutual'.
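How a threshold such as pct_corr_threshold can be applied to Pearson correlation is sketched below with the standard library only. This is an illustration and not the library's code: the helpers `pearson` and `flag_leaky_columns` are hypothetical, and two behaviors are assumptions, namely comparing the absolute value of the coefficient and treating constant columns as uncorrelated.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat a constant column as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def flag_leaky_columns(features, target, pct_corr_threshold=0.95):
    """Return names of columns whose |r| meets or exceeds the threshold.

    Assumption: the sign is ignored, so strong negative correlation also counts.
    """
    return [
        name
        for name, values in features.items()
        if abs(pearson(values, target)) >= pct_corr_threshold
    ]


target = [2, 4, 6, 8, 10]
features = {
    "copy": [1, 2, 3, 4, 5],      # r = 1 with the target
    "mirror": [5, 4, 3, 2, 1],    # r = -1 with the target
    "shuffled": [5, 1, 4, 2, 3],  # r = -0.3 with the target
}
flagged = flag_leaky_columns(features, target)
```

Because the comparison uses the absolute value, both "copy" and "mirror" are flagged at the default threshold, while "shuffled" is not.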
Methods
name | Returns a name describing the data check.
validate | Check if any of the features are highly correlated with the target by using mutual information or Pearson correlation.
name(cls)
Returns a name describing the data check.
validate(self, X, y)
Check if any of the features are highly correlated with the target by using mutual information or Pearson correlation.
If method='mutual', this check supports all target and feature types. If method='pearson', it only supports binary with numeric and boolean dtypes. Pearson correlation returns a value in [-1, 1], while mutual information returns a value in [0, 1].
- Parameters
X (pd.DataFrame, np.ndarray) – The input features to check
y (pd.Series, np.ndarray) – The target data
- Returns
A dict with "warnings", "errors", and "actions" entries. It contains a DataCheckWarning if target leakage is detected.
- Return type
dict (DataCheckWarning)
Example
>>> import pandas as pd
>>> from evalml.data_checks import TargetLeakageDataCheck
>>> X = pd.DataFrame({
...     'leak': [10, 42, 31, 51, 61],
...     'x': [42, 54, 12, 64, 12],
...     'y': [13, 5, 13, 74, 24],
... })
>>> y = pd.Series([10, 42, 31, 51, 40])
>>> target_leakage_check = TargetLeakageDataCheck(pct_corr_threshold=0.95)
>>> assert target_leakage_check.validate(X, y) == {
...     "warnings": [{"message": "Column 'leak' is 95.0% or more correlated with the target",
...                   "data_check_name": "TargetLeakageDataCheck",
...                   "level": "warning",
...                   "code": "TARGET_LEAKAGE",
...                   "details": {"column": "leak"}}],
...     "errors": [],
...     "actions": [{"code": "DROP_COL",
...                  "metadata": {"column": "leak"}}]}
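One way to act on the returned dict is to collect the columns named in its DROP_COL actions. The sketch below works on a plain dict shaped like the example result. The helper `columns_to_drop` is hypothetical and not part of evalml.

```python
def columns_to_drop(result):
    """Collect column names from DROP_COL actions in a validate()-style dict."""
    return [
        action["metadata"]["column"]
        for action in result.get("actions", [])
        if action.get("code") == "DROP_COL"
    ]


# Result dict copied from the example above.
result = {
    "warnings": [{"message": "Column 'leak' is 95.0% or more correlated with the target",
                  "data_check_name": "TargetLeakageDataCheck",
                  "level": "warning",
                  "code": "TARGET_LEAKAGE",
                  "details": {"column": "leak"}}],
    "errors": [],
    "actions": [{"code": "DROP_COL",
                 "metadata": {"column": "leak"}}],
}
to_drop = columns_to_drop(result)
```

With a pandas DataFrame `X`, the flagged columns could then be removed with `X.drop(columns=to_drop)` before modeling.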