uniqueness_data_check
==================================================

.. py:module:: evalml.data_checks.uniqueness_data_check

.. autoapi-nested-parse::

   Data check that checks if there are any columns in the input that are either too unique for classification problems or not unique enough for regression problems.


Module Contents
---------------

Classes Summary
~~~~~~~~~~~~~~~

.. autoapisummary::

   evalml.data_checks.uniqueness_data_check.UniquenessDataCheck


Attributes Summary
~~~~~~~~~~~~~~~~~~~

.. autoapisummary::

   evalml.data_checks.uniqueness_data_check.warning_not_unique_enough
   evalml.data_checks.uniqueness_data_check.warning_too_unique


Contents
~~~~~~~~~~~~~~~~~~~
.. py:class:: UniquenessDataCheck(problem_type, threshold=0.5)


   Check if there are any columns in the input that are either too unique for classification problems or not unique enough for regression problems.

   :param problem_type: The specific problem type to data check for,
                        e.g. 'binary', 'multiclass', 'regression', 'time series regression'.
   :type problem_type: str or ProblemTypes
   :param threshold: The threshold to set as an upper bound on uniqueness for classification type problems
                     or as a lower bound on uniqueness for regression type problems.  Defaults to 0.50.
   :type threshold: float


   **Methods**

   .. autoapisummary::
      :nosignatures:

      evalml.data_checks.uniqueness_data_check.UniquenessDataCheck.name
      evalml.data_checks.uniqueness_data_check.UniquenessDataCheck.uniqueness_score
      evalml.data_checks.uniqueness_data_check.UniquenessDataCheck.validate

   .. py:method:: name(cls)

      Return a name describing the data check.


   .. py:method:: uniqueness_score(col, drop_na=True)
      :staticmethod:

      Calculate a uniqueness score for the provided field.  NaN values are not considered as unique values in the calculation.

      Based on the Herfindahl-Hirschman Index.

      :param col: Feature values.
      :type col: pd.Series
      :param drop_na: Whether to drop null values when computing the uniqueness score. Defaults to True.
      :type drop_na: bool

      :returns: Uniqueness score.
      :rtype: float


   .. py:method:: validate(self, X, y=None)

      Check if there are any columns in the input that are too unique in the case of classification problems or not unique enough in the case of regression problems.

      :param X: Features.
      :type X: pd.DataFrame, np.ndarray
      :param y: Ignored.  Defaults to None.
      :type y: pd.Series, np.ndarray

      :returns: dict with a DataCheckWarning if there are any too unique or not unique enough columns.
      :rtype: dict

      .. rubric:: Examples

      >>> import pandas as pd

      Because the problem type is regression, the column "regression_not_unique_enough" raises a warning
      for having just one value.

      >>> df = pd.DataFrame({
      ...    "regression_unique_enough": [float(x) for x in range(100)],
      ...    "regression_not_unique_enough": [float(1) for x in range(100)]
      ... })
      ...
      >>> uniqueness_check = UniquenessDataCheck(problem_type="regression", threshold=0.8)
      >>> assert uniqueness_check.validate(df) == [
      ...     {
      ...         "message": "Input columns 'regression_not_unique_enough' for regression problem type are not unique enough.",
      ...         "data_check_name": "UniquenessDataCheck",
      ...         "level": "warning",
      ...         "code": "NOT_UNIQUE_ENOUGH",
      ...         "details": {"columns": ["regression_not_unique_enough"], "uniqueness_score": {"regression_not_unique_enough": 0.0}, "rows": None},
      ...         "action_options": [
      ...             {
      ...                 "code": "DROP_COL",
      ...                 "parameters": {},
      ...                 "data_check_name": "UniquenessDataCheck",
      ...                 "metadata": {"columns": ["regression_not_unique_enough"], "rows": None}
      ...             }
      ...         ]
      ...     }
      ... ]

      For multiclass, the column "regression_unique_enough" has too many unique values and raises
      an appropriate warning.

      >>> y = pd.Series([1, 1, 1, 2, 2, 3, 3, 3])
      >>> uniqueness_check = UniquenessDataCheck(problem_type="multiclass", threshold=0.8)
      >>> assert uniqueness_check.validate(df) == [
      ...     {
      ...         "message": "Input columns 'regression_unique_enough' for multiclass problem type are too unique.",
      ...         "data_check_name": "UniquenessDataCheck",
      ...         "level": "warning",
      ...         "details": {
      ...             "columns": ["regression_unique_enough"],
      ...             "rows": None,
      ...             "uniqueness_score": {"regression_unique_enough": 0.99}
      ...         },
      ...         "code": "TOO_UNIQUE",
      ...         "action_options": [
      ...             {
      ...                 "code": "DROP_COL",
      ...                 "data_check_name": "UniquenessDataCheck",
      ...                 "parameters": {},
      ...                 "metadata": {"columns": ["regression_unique_enough"], "rows": None}
      ...             }
      ...         ]
      ...     }
      ... ]
      ...
      >>> assert UniquenessDataCheck.uniqueness_score(y) == 0.65625

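The uniqueness score documented for ``UniquenessDataCheck.uniqueness_score`` is described as based on the Herfindahl-Hirschman Index. The following is a minimal sketch of one way such a score can be computed, assuming it is defined as one minus the sum of squared relative frequencies of the distinct values. The helper name ``hhi_uniqueness_score`` is hypothetical and is not part of evalml; the library's actual implementation may differ in its details.

```python
import pandas as pd


def hhi_uniqueness_score(col: pd.Series, drop_na: bool = True) -> float:
    """Hypothetical stand-in for illustration only, not evalml's code.

    Assumes score = 1 - sum(p_i ** 2), where p_i is the relative
    frequency of the i-th distinct value in the column.
    """
    # Relative frequency of each distinct value; NaN is excluded when drop_na is True.
    proportions = col.value_counts(normalize=True, dropna=drop_na)
    # The sum of squared proportions is the Herfindahl-Hirschman Index.
    hhi = (proportions ** 2).sum()
    return float(1 - hhi)


# Same series as in the validate example: proportions are 3/8, 2/8 and 3/8.
y = pd.Series([1, 1, 1, 2, 2, 3, 3, 3])
print(hhi_uniqueness_score(y))  # 0.65625

# A constant column scores 0.0; a column of 100 distinct values scores about 0.99.
constant = pd.Series([1.0] * 100)
distinct = pd.Series([float(x) for x in range(100)])
print(hhi_uniqueness_score(constant), hhi_uniqueness_score(distinct))
```

Under this assumption the sketch reproduces the scores shown in the examples: 0.0 for the constant column, 0.99 for the column of 100 distinct values, and 0.65625 for ``y``. A score close to 0 means one value dominates the column, and a score close to 1 means almost every row holds a different value, which is why ``threshold`` acts as a lower bound for regression problems and an upper bound for classification problems.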

.. py:data:: warning_not_unique_enough
   :annotation: = Input columns {} for {} problem type are not unique enough.

   
.. py:data:: warning_too_unique
   :annotation: = Input columns {} for {} problem type are too unique.