target_leakage_data_check

Module Contents

Classes Summary

TargetLeakageDataCheck

Check if any of the features are highly correlated with the target by using mutual information or Pearson correlation.

Contents

class evalml.data_checks.target_leakage_data_check.TargetLeakageDataCheck(pct_corr_threshold=0.95, method='mutual')[source]

Check if any of the features are highly correlated with the target by using mutual information or Pearson correlation.

If method=’mutual’, this data check uses mutual information and supports all target and feature types. Otherwise, if method=’pearson’, it uses Pearson correlation and only supports binary with numeric and boolean dtypes. Pearson correlation returns a value in [-1, 1], while mutual information returns a value in [0, 1].

Parameters
  • pct_corr_threshold (float) – The correlation threshold to be considered leakage. Defaults to 0.95.

  • method (string) – The method to determine correlation. Use ‘mutual’ for mutual information, otherwise ‘pearson’ for Pearson correlation. Defaults to ‘mutual’.

Methods

name

Returns a name describing the data check.

validate

Check if any of the features are highly correlated with the target by using mutual information or Pearson correlation.

name(cls)

Returns a name describing the data check.

validate(self, X, y)[source]

Check if any of the features are highly correlated with the target by using mutual information or Pearson correlation.

If method=’mutual’, supports all target and feature types. Otherwise, if method=’pearson’ only supports binary with numeric and boolean dtypes. Pearson correlation returns a value in [-1, 1], while mutual information returns a value in [0, 1].

Parameters
  • X (pd.DataFrame, np.ndarray) – The input features to check

  • y (pd.Series, np.ndarray) – The target data

Returns

dict with a DataCheckWarning if target leakage is detected.

Return type

dict (DataCheckWarning)

Example

>>> import pandas as pd
>>> X = pd.DataFrame({
...    'leak': [10, 42, 31, 51, 61],
...    'x': [42, 54, 12, 64, 12],
...    'y': [13, 5, 13, 74, 24],
... })
>>> y = pd.Series([10, 42, 31, 51, 40])
>>> target_leakage_check = TargetLeakageDataCheck(pct_corr_threshold=0.95)
>>> assert target_leakage_check.validate(X, y) == {"warnings": [{"message": "Column 'leak' is 95.0% or more correlated with the target",                                                                             "data_check_name": "TargetLeakageDataCheck",                                                                             "level": "warning",                                                                             "code": "TARGET_LEAKAGE",                                                                             "details": {"column": "leak"}}],                                                               "errors": [],                                                               "actions": [{"code": "DROP_COL",                                                                            "metadata": {"column": "leak"}}]}