target_leakage_data_check
======================================================

.. py:module:: evalml.data_checks.target_leakage_data_check

.. autoapi-nested-parse::

   Data check that checks if any of the features are highly correlated with the target by using mutual information or Pearson correlation.



Module Contents
---------------

Classes Summary
~~~~~~~~~~~~~~~

.. autoapisummary::

   evalml.data_checks.target_leakage_data_check.TargetLeakageDataCheck


Contents
~~~~~~~~~~~~~~~~~~~

.. py:class:: TargetLeakageDataCheck(pct_corr_threshold=0.95, method='mutual_info')

   Check if any of the features are highly correlated with the target by using mutual information, Pearson correlation, and other correlation metrics.

   If `method='mutual_info'`, this data check uses mutual information and supports all target and feature types.
   Other correlation metrics only support binary with numeric and boolean dtypes. This method returns a value in [-1, 1] if other correlation metrics are selected
   and a value in [0, 1] if mutual information is selected. The available correlation metrics are listed in Woodwork's ``dependence_dict`` method.

   :param pct_corr_threshold: The correlation threshold at or above which a feature is considered leakage. Defaults to 0.95.
   :type pct_corr_threshold: float
   :param method: The method used to determine correlation. Use 'max' for the maximum correlation, or use the name of a specific correlation metric
                  (e.g. 'mutual_info' for mutual information, 'pearson' for Pearson correlation, etc.).
                  Possible methods can be found in Woodwork's config, under `correlation_metrics`. Excludes 'all'.
                  Defaults to 'mutual_info'.
   :type method: string

   **Methods**

   .. autoapisummary::
      :nosignatures:

      evalml.data_checks.target_leakage_data_check.TargetLeakageDataCheck.name
      evalml.data_checks.target_leakage_data_check.TargetLeakageDataCheck.validate


   .. py:method:: name(cls)

      Return a name describing the data check.


   .. py:method:: validate(self, X, y)

      Check if any of the features are highly correlated with the target by using mutual information, Pearson correlation, and/or Spearman correlation.

      If `method='mutual_info'` or `method='max'`, supports all target and feature types. Other correlation metrics only support binary with numeric and boolean dtypes.
      This method returns a value in [-1, 1] if other correlation metrics are selected and a value in [0, 1] if mutual information is selected.

      :param X: The input features to check.
      :type X: pd.DataFrame, np.ndarray
      :param y: The target data.
      :type y: pd.Series, np.ndarray

      :returns: dict with a DataCheckWarning if target leakage is detected.
      :rtype: dict (DataCheckWarning)

      .. rubric:: Examples

      >>> import pandas as pd
      >>> from evalml.data_checks import TargetLeakageDataCheck

      Any column that is strongly correlated with the target raises a warning. This can indicate data leakage.

      >>> X = pd.DataFrame({
      ...     "leak": [10, 42, 31, 51, 61] * 15,
      ...     "x": [42, 54, 12, 64, 12] * 15,
      ...     "y": [13, 5, 13, 74, 24] * 15,
      ... })
      >>> y = pd.Series([10, 42, 31, 51, 40] * 15)
      ...
      >>> target_leakage_check = TargetLeakageDataCheck(pct_corr_threshold=0.95)
      >>> assert target_leakage_check.validate(X, y) == [
      ...     {
      ...         "message": "Column 'leak' is 95.0% or more correlated with the target",
      ...         "data_check_name": "TargetLeakageDataCheck",
      ...         "level": "warning",
      ...         "code": "TARGET_LEAKAGE",
      ...         "details": {"columns": ["leak"], "rows": None},
      ...         "action_options": [
      ...             {
      ...                 "code": "DROP_COL",
      ...                 "data_check_name": "TargetLeakageDataCheck",
      ...                 "parameters": {},
      ...                 "metadata": {"columns": ["leak"], "rows": None}
      ...             }
      ...         ]
      ...     }
      ... ]

      The method can be changed from the default, mutual_info, to pearson.

      >>> X["x"] = y / 2
      >>> target_leakage_check = TargetLeakageDataCheck(pct_corr_threshold=0.8, method="pearson")
      >>> assert target_leakage_check.validate(X, y) == [
      ...     {
      ...         "message": "Columns 'leak', 'x' are 80.0% or more correlated with the target",
      ...         "data_check_name": "TargetLeakageDataCheck",
      ...         "level": "warning",
      ...         "details": {"columns": ["leak", "x"], "rows": None},
      ...         "code": "TARGET_LEAKAGE",
      ...         "action_options": [
      ...             {
      ...                 "code": "DROP_COL",
      ...                 "data_check_name": "TargetLeakageDataCheck",
      ...                 "parameters": {},
      ...                 "metadata": {"columns": ["leak", "x"], "rows": None}
      ...             }
      ...         ]
      ...     }
      ... ]