sparsity_data_check
===================

.. py:module:: evalml.data_checks.sparsity_data_check

.. autoapi-nested-parse::

   Data check that checks if there are any columns with sparsely populated values in the input.


Module Contents
---------------

Classes Summary
~~~~~~~~~~~~~~~

.. autoapisummary::

   evalml.data_checks.sparsity_data_check.SparsityDataCheck


Attributes Summary
~~~~~~~~~~~~~~~~~~

.. autoapisummary::

   evalml.data_checks.sparsity_data_check.warning_too_unique


Contents
~~~~~~~~

.. py:class:: SparsityDataCheck(problem_type, threshold, unique_count_threshold=10)

   Check if there are any columns with sparsely populated values in the input.

   :param problem_type: The specific problem type to data check for. 'multiclass' and 'time series multiclass' are the only accepted problem types.
   :type problem_type: str or ProblemTypes
   :param threshold: The threshold value, or percentage of each column's unique values, below which a column exhibits sparsity. Should be between 0 and 1.
   :type threshold: float
   :param unique_count_threshold: The minimum number of times a unique value has to be present in a column to not be considered "sparse." Defaults to 10.
   :type unique_count_threshold: int

   **Methods**

   .. autoapisummary::
      :nosignatures:

      evalml.data_checks.sparsity_data_check.SparsityDataCheck.name
      evalml.data_checks.sparsity_data_check.SparsityDataCheck.sparsity_score
      evalml.data_checks.sparsity_data_check.SparsityDataCheck.validate

   .. py:method:: name(cls)

      Return a name describing the data check.

   .. py:method:: sparsity_score(col, count_threshold=10)
      :staticmethod:

      Calculate a sparsity score for the given column: the percentage of its unique values whose counts exceed the count_threshold.

      :param col: Feature values.
      :type col: pd.Series
      :param count_threshold: The number of instances below which a value is considered sparse. Defaults to 10.
      :type count_threshold: int

      :returns: Sparsity score, or the percentage of the unique values that exceed count_threshold.
      :rtype: (float)

   .. py:method:: validate(self, X, y=None)

      Calculate what percentage of each column's unique values exceed the count threshold and compare that percentage to the sparsity threshold stored in the class instance.

      :param X: Features.
      :type X: pd.DataFrame, np.ndarray
      :param y: Ignored.
      :type y: pd.Series, np.ndarray

      :returns: list containing a DataCheckWarning for each column that is too sparse, or an empty list if there are none.
      :rtype: list

      .. rubric:: Examples

      >>> import pandas as pd

      For multiclass problems, if a column doesn't have enough representation from unique values, it will be considered sparse.

      >>> df = pd.DataFrame({
      ...     "sparse": [float(x) for x in range(100)],
      ...     "not_sparse": [float(1) for x in range(100)]
      ... })
      ...
      >>> sparsity_check = SparsityDataCheck(problem_type="multiclass", threshold=0.5, unique_count_threshold=10)
      >>> assert sparsity_check.validate(df) == [
      ...     {
      ...         "message": "Input columns ('sparse') for multiclass problem type are too sparse.",
      ...         "data_check_name": "SparsityDataCheck",
      ...         "level": "warning",
      ...         "code": "TOO_SPARSE",
      ...         "details": {
      ...             "columns": ["sparse"],
      ...             "sparsity_score": {"sparse": 0.0},
      ...             "rows": None
      ...         },
      ...         "action_options": [
      ...             {
      ...                 "code": "DROP_COL",
      ...                 "data_check_name": "SparsityDataCheck",
      ...                 "parameters": {},
      ...                 "metadata": {"columns": ["sparse"], "rows": None}
      ...             }
      ...         ]
      ...     }
      ... ]
      ...
      >>> df["sparse"] = [float(x % 10) for x in range(100)]
      >>> sparsity_check = SparsityDataCheck(problem_type="multiclass", threshold=1, unique_count_threshold=5)
      >>> assert sparsity_check.validate(df) == []
      ...
      >>> sparse_array = pd.Series([1, 1, 1, 2, 2, 3] * 3)
      >>> assert SparsityDataCheck.sparsity_score(sparse_array, count_threshold=5) == 0.6666666666666666

.. py:data:: warning_too_unique
   :annotation: = Input columns ({}) for {} problem type are too sparse.
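The scoring behavior described above can be sketched in a few lines. This is a minimal illustration of the documented semantics (fraction of a column's unique values that occur at least ``count_threshold`` times), not evalml's actual implementation; the helper name ``sparsity_score_sketch`` is invented for this example.

```python
import pandas as pd

def sparsity_score_sketch(col, count_threshold=10):
    """Sketch of the documented sparsity score: the fraction of a
    column's unique values that appear at least count_threshold times.
    (Illustrative only; not evalml's implementation.)"""
    counts = col.value_counts()
    return float((counts >= count_threshold).sum() / counts.size)

# Mirrors the doctest above: 1 appears 9 times, 2 appears 6 times,
# and 3 appears only 3 times, so 2 of the 3 unique values clear a
# threshold of 5, giving a score of 2/3.
sparse_array = pd.Series([1, 1, 1, 2, 2, 3] * 3)
print(sparsity_score_sketch(sparse_array, count_threshold=5))
```

A score of 0 means no unique value is well represented (every value is "sparse"), while a score of 1 means every unique value meets the count threshold; ``validate`` flags columns whose score falls below the configured ``threshold``.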