
Data check that determines whether any of the features are highly correlated with the target, using mutual information or Pearson correlation.

Module Contents

Classes Summary


TargetLeakageDataCheck – Check if any of the features are highly correlated with the target by using mutual information or Pearson correlation.


class evalml.data_checks.target_leakage_data_check.TargetLeakageDataCheck(pct_corr_threshold=0.95, method='mutual')

Check if any of the features are highly correlated with the target by using mutual information or Pearson correlation.

If method='mutual', this data check uses mutual information and supports all target and feature types. Otherwise, if method='pearson', it uses Pearson correlation and only supports binary with numeric and boolean dtypes. Pearson correlation returns a value in [-1, 1], while mutual information returns a value in [0, 1].

Parameters

  • pct_corr_threshold (float) – The correlation threshold at or above which a feature is considered leakage. Defaults to 0.95.

  • method (string) – The method used to determine correlation. Use 'mutual' for mutual information or 'pearson' for Pearson correlation. Defaults to 'mutual'.
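
To illustrate how a correlation threshold such as pct_corr_threshold can be applied, here is a minimal sketch of the Pearson case in plain Python. It is not evalml's implementation: the helper names pearson and flag_leaky_columns are hypothetical, and comparing the absolute value of the coefficient against the threshold is an assumption made so that strongly negative correlations are also flagged.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences.

    Assumes neither sequence is constant (zero variance would divide by zero).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def flag_leaky_columns(features, target, pct_corr_threshold=0.95):
    """Return names of columns whose |r| with the target meets the threshold."""
    # Assumption: the sign is ignored, since r near -1 is as predictive as r near 1.
    return [
        name
        for name, values in features.items()
        if abs(pearson(values, target)) >= pct_corr_threshold
    ]


# Hypothetical data: a dict of column name -> values stands in for a DataFrame.
features = {
    "copy": [1.0, 2.0, 3.0, 4.0, 5.0],          # identical to the target
    "negated": [-1.0, -2.0, -3.0, -4.0, -5.0],  # perfectly anti-correlated
    "noise": [2.0, 1.0, 4.0, 3.0, 0.0],         # weakly correlated
}
target = [1.0, 2.0, 3.0, 4.0, 5.0]

leaky = flag_leaky_columns(features, target)
print(leaky)
```

Here "copy" and "negated" are flagged, while "noise" stays well below the default threshold.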



Return a name describing the data check.



validate(self, X, y)

Check if any of the features are highly correlated with the target by using mutual information or Pearson correlation.

If method='mutual', this check supports all target and feature types. Otherwise, if method='pearson', it only supports binary with numeric and boolean dtypes. Pearson correlation returns a value in [-1, 1], while mutual information returns a value in [0, 1].

Parameters

  • X (pd.DataFrame, np.ndarray) – The input features to check.

  • y (pd.Series, np.ndarray) – The target data.
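
Raw mutual information is not bounded by 1, so a score in [0, 1] implies some normalization. The sketch below shows one common choice for discrete data, dividing by sqrt(H(X) * H(Y)). This is an assumption for illustration only: the normalization and binning that the library applies may differ, and the function names here are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in nats) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    return sum(
        (c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )


def normalized_mi(xs, ys):
    """Scale mutual information into [0, 1] by dividing by sqrt(H(X) * H(Y))."""
    hx, hy = entropy(xs), entropy(ys)
    if hx == 0 or hy == 0:
        return 0.0  # a constant column carries no information about the target
    return mutual_information(xs, ys) / math.sqrt(hx * hy)


# A categorical feature that fully determines the target scores close to 1.
dependent = normalized_mi(["a", "b", "a", "b"], [0, 1, 0, 1])
# A feature that is independent of the target scores 0.
independent = normalized_mi(["a", "a", "b", "b"], [0, 1, 0, 1])
print(dependent, independent)
```

Unlike Pearson correlation, this measure works directly on categorical values, which matches the broader type support described for method='mutual'. Continuous columns would need to be binned first.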


Returns

dict with a DataCheckWarning if target leakage is detected.

Return type

dict (DataCheckWarning)


Example

>>> import pandas as pd
>>> from evalml.data_checks.target_leakage_data_check import TargetLeakageDataCheck
>>> X = pd.DataFrame({
...    'leak': [10, 42, 31, 51, 61],
...    'x': [42, 54, 12, 64, 12],
...    'y': [13, 5, 13, 74, 24],
... })
>>> y = pd.Series([10, 42, 31, 51, 40])
>>> target_leakage_check = TargetLeakageDataCheck(pct_corr_threshold=0.95)
>>> assert target_leakage_check.validate(X, y) == {
...     "warnings": [{"message": "Column 'leak' is 95.0% or more correlated with the target",
...                   "data_check_name": "TargetLeakageDataCheck",
...                   "level": "warning",
...                   "code": "TARGET_LEAKAGE",
...                   "details": {"columns": ["leak"], "rows": None}}],
...     "errors": [],
...     "actions": [{"code": "DROP_COL",
...                  "metadata": {"columns": ["leak"], "rows": None}}]}
>>> X['x'] = y / 2
>>> target_leakage_check = TargetLeakageDataCheck(pct_corr_threshold=0.8, method='pearson')
>>> assert target_leakage_check.validate(X, y) == {
...     'warnings': [{'message': "Columns 'leak', 'x' are 80.0% or more correlated with the target",
...                   'data_check_name': 'TargetLeakageDataCheck',
...                   'level': 'warning',
...                   'details': {'columns': ['leak', 'x'], 'rows': None},
...                   'code': 'TARGET_LEAKAGE'}],
...     'errors': [],
...     'actions': [{'code': 'DROP_COL',
...                  'metadata': {'columns': ['leak', 'x'], 'rows': None}}]}
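
The dictionaries asserted in the example above can be consumed programmatically. The following sketch collects the columns named by DROP_COL actions. The helper columns_to_drop is hypothetical and not part of evalml, and it relies only on the result structure shown in the example.

```python
def columns_to_drop(results):
    """Collect column names from every DROP_COL action in a validate() result."""
    columns = []
    for action in results.get("actions", []):
        if action["code"] == "DROP_COL":
            for column in action["metadata"]["columns"]:
                if column not in columns:  # keep order, skip repeats
                    columns.append(column)
    return columns


# Result copied from the first example above.
results = {
    "warnings": [{"message": "Column 'leak' is 95.0% or more correlated with the target",
                  "data_check_name": "TargetLeakageDataCheck",
                  "level": "warning",
                  "code": "TARGET_LEAKAGE",
                  "details": {"columns": ["leak"], "rows": None}}],
    "errors": [],
    "actions": [{"code": "DROP_COL",
                 "metadata": {"columns": ["leak"], "rows": None}}],
}

to_drop = columns_to_drop(results)
print(to_drop)
```

With a pandas DataFrame, X.drop(columns=to_drop) would then remove the flagged features before training.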