target_leakage_data_check
======================================================

.. py:module:: evalml.data_checks.target_leakage_data_check

.. autoapi-nested-parse::

   Data check that checks if any of the features are highly correlated with the target by using mutual information or Pearson correlation.
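
To make "highly correlated with the target" concrete, the following is a minimal, library-free sketch of a threshold check based on Pearson correlation. The helper names ``pearson`` and ``flag_leaky_columns`` are hypothetical, comparing the absolute value of the coefficient against the threshold is an assumption of this sketch, and none of it is meant to describe how the data check computes its scores internally.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Sketch convention: a constant column is treated as uncorrelated.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def flag_leaky_columns(features, target, threshold=0.95):
    """Return the names of columns whose |r| with the target meets the threshold.

    Assumption of this sketch: the magnitude of the coefficient is compared,
    so strong negative correlation is flagged as well.
    """
    return [
        name
        for name, values in features.items()
        if abs(pearson(values, target)) >= threshold
    ]


# Illustrative data only.
features = {
    "copy_of_target": [1.0, 2.0, 3.0, 4.0, 5.0],
    "negated_target": [-1.0, -2.0, -3.0, -4.0, -5.0],
    "unrelated": [2.0, 1.0, 2.0, 1.0, 2.0],
}
target = [1.0, 2.0, 3.0, 4.0, 5.0]
leaky = flag_leaky_columns(features, target)
```

Here ``leaky`` contains the first two column names, because both a perfect copy and a perfect negation of the target have a coefficient of magnitude 1, while the alternating column has a coefficient of approximately 0.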


Module Contents
---------------

Classes Summary
~~~~~~~~~~~~~~~

.. autoapisummary::

   evalml.data_checks.target_leakage_data_check.TargetLeakageDataCheck


Contents
~~~~~~~~~~~~~~~~~~~
.. py:class:: TargetLeakageDataCheck(pct_corr_threshold=0.95, method='mutual_info')


   Check if any of the features are highly correlated with the target by using mutual information, Pearson correlation, and other correlation metrics.

   If method='mutual_info', this data check uses mutual information and supports all target and feature types.
   Other correlation metrics only support binary targets with numeric and boolean dtypes. The computed correlation lies in [-1, 1] if one of the other correlation metrics is selected
   and in [0, 1] if mutual information is selected. The available correlation metrics can be found in Woodwork's
   `dependence_dict method <https://woodwork.alteryx.com/en/stable/generated/woodwork.table_accessor.WoodworkTableAccessor.dependence_dict.html#woodwork.table_accessor.WoodworkTableAccessor.dependence_dict>`_.

   :param pct_corr_threshold: The correlation threshold to be considered leakage. Defaults to 0.95.
   :type pct_corr_threshold: float
   :param method: The method to determine correlation. Use 'max' for the maximum correlation, or use the name of a specific correlation metric (e.g. 'mutual_info' for mutual information, 'pearson' for Pearson correlation, etc.).
                  Possible methods can be found in Woodwork's `config <https://woodwork.alteryx.com/en/stable/guides/setting_config_options.html?highlight=config#Viewing-Config-Settings>`_, under `correlation_metrics`.
                  Excludes 'all'. Defaults to 'mutual_info'.
   :type method: string


   **Methods**

   .. autoapisummary::
      :nosignatures:

      evalml.data_checks.target_leakage_data_check.TargetLeakageDataCheck.name
      evalml.data_checks.target_leakage_data_check.TargetLeakageDataCheck.validate

   .. py:method:: name(cls)

      Return a name describing the data check.


   .. py:method:: validate(self, X, y)

      Check if any of the features are highly correlated with the target by using mutual information, Pearson correlation, and/or Spearman correlation.

      If `method='mutual_info'` or `method='max'`, supports all target and feature types. Other correlation metrics only support binary targets with numeric and boolean dtypes.
      The computed correlation lies in [-1, 1] if one of the other correlation metrics is selected and in [0, 1] if mutual information is selected.

      :param X: The input features to check.
      :type X: pd.DataFrame, np.ndarray
      :param y: The target data.
      :type y: pd.Series, np.ndarray

      :returns: List with a DataCheckWarning dict if target leakage is detected.
      :rtype: list (DataCheckWarning)

      .. rubric:: Examples

      >>> import pandas as pd
      >>> from evalml.data_checks import TargetLeakageDataCheck

      Any column that is strongly correlated with the target will raise a warning. This could indicate
      data leakage.

      >>> X = pd.DataFrame({
      ...    "leak": [10, 42, 31, 51, 61] * 15,
      ...    "x": [42, 54, 12, 64, 12] * 15,
      ...    "y": [13, 5, 13, 74, 24] * 15,
      ... })
      >>> y = pd.Series([10, 42, 31, 51, 40] * 15)
      >>> target_leakage_check = TargetLeakageDataCheck(pct_corr_threshold=0.95)
      >>> assert target_leakage_check.validate(X, y) == [
      ...     {
      ...         "message": "Column 'leak' is 95.0% or more correlated with the target",
      ...         "data_check_name": "TargetLeakageDataCheck",
      ...         "level": "warning",
      ...         "code": "TARGET_LEAKAGE",
      ...         "details": {"columns": ["leak"], "rows": None},
      ...         "action_options": [
      ...             {
      ...                 "code": "DROP_COL",
      ...                 "data_check_name": "TargetLeakageDataCheck",
      ...                 "parameters": {},
      ...                 "metadata": {"columns": ["leak"], "rows": None}
      ...             }
      ...         ]
      ...     }
      ... ]


      The method can be changed from the default, mutual_info, to pearson.

      >>> X["x"] = y / 2
      >>> target_leakage_check = TargetLeakageDataCheck(pct_corr_threshold=0.8, method="pearson")
      >>> assert target_leakage_check.validate(X, y) == [
      ...     {
      ...         "message": "Columns 'leak', 'x' are 80.0% or more correlated with the target",
      ...         "data_check_name": "TargetLeakageDataCheck",
      ...         "level": "warning",
      ...         "details": {"columns": ["leak", "x"], "rows": None},
      ...         "code": "TARGET_LEAKAGE",
      ...         "action_options": [
      ...             {
      ...                 "code": "DROP_COL",
      ...                 "data_check_name": "TargetLeakageDataCheck",
      ...                 "parameters": {},
      ...                 "metadata": {"columns": ["leak", "x"], "rows": None}
      ...             }
      ...         ]
      ...     }
      ... ]
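
The warnings in the examples above carry their suggested remediation in ``action_options``. As a rough sketch of how such output could be consumed, the snippet below collects the columns named by every ``DROP_COL`` action. The helper ``columns_to_drop`` is hypothetical and not part of EvalML, and the ``warnings`` list is a hand-written stand-in that only mirrors the shape shown in the examples.

```python
def columns_to_drop(warnings):
    """Collect column names from every DROP_COL action, in order, without duplicates."""
    columns = []
    for warning in warnings:
        for action in warning.get("action_options", []):
            if action.get("code") != "DROP_COL":
                continue
            # "columns" may be None in the documented shape, so fall back to [].
            for column in action["metadata"]["columns"] or []:
                if column not in columns:
                    columns.append(column)
    return columns


# Hand-written stand-in with the same shape as the documented output.
warnings = [
    {
        "message": "Columns 'leak', 'x' are 80.0% or more correlated with the target",
        "data_check_name": "TargetLeakageDataCheck",
        "level": "warning",
        "code": "TARGET_LEAKAGE",
        "details": {"columns": ["leak", "x"], "rows": None},
        "action_options": [
            {
                "code": "DROP_COL",
                "data_check_name": "TargetLeakageDataCheck",
                "parameters": {},
                "metadata": {"columns": ["leak", "x"], "rows": None},
            }
        ],
    }
]
to_drop = columns_to_drop(warnings)
```

With a pandas DataFrame ``X``, the resulting names could then be passed to ``X.drop(columns=to_drop)`` once the flagged columns have been reviewed by hand.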