outliers_data_check

Data check that detects outliers in input data by using the interquartile range (IQR) to identify score anomalies.

Module Contents

Classes Summary

OutliersDataCheck

Checks if there are any outliers in input data by using IQR to determine score anomalies.

Contents

class evalml.data_checks.outliers_data_check.OutliersDataCheck

Checks if there are any outliers in input data by using IQR to determine score anomalies.

Columns with score anomalies are considered to contain outliers.
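
To make the IQR rule concrete, the following is a minimal sketch of the conventional Tukey fence test, written with the standard library only. It is not EvalML's implementation: the helper name `iqr_outliers` and the multiplier `k=1.5` are assumptions for illustration, and the scoring step that OutliersDataCheck applies on top of the IQR is not reproduced here.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Flag values outside the Tukey fences [q1 - k*IQR, q3 + k*IQR].

    Hypothetical helper for illustration; k=1.5 is the conventional
    multiplier and is an assumption, not taken from EvalML.
    """
    # method="inclusive" treats the data as the full population and
    # interpolates linearly between the sorted data points
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "low_fence": low_fence,
        "high_fence": high_fence,
        # positions of the values that fall outside either fence
        "low_indices": [i for i, v in enumerate(values) if v < low_fence],
        "high_indices": [i for i, v in enumerate(values) if v > high_fence],
    }


result = iqr_outliers([-1, -2, -3, -1201, -4])
print(result["low_fence"], result["high_fence"])  # -7.0 1.0
print(result["low_indices"], result["high_indices"])  # [3] []
```

The textbook upper fence for this data is 1.0, whereas the `get_boxplot_data` example on this page reports a `high_bound` of -1.0, so the bounds in the returned payload are apparently not the raw fences. They may be adjusted in some way, possibly clipped to the range of the data.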

Methods

get_boxplot_data

Returns box plot information for the given data.

name

Return a name describing the data check.

validate

Check if there are any outliers in a dataframe by using IQR to determine column anomalies. Columns with anomalies are considered to contain outliers.

static get_boxplot_data(data_)

Returns box plot information for the given data.

Parameters

data_ (pd.Series, np.ndarray) – Input data.

Returns

A payload of box plot statistics.

Return type

dict

Examples

>>> import pandas as pd
>>> from evalml.data_checks import OutliersDataCheck
>>> df = pd.DataFrame({
...     "x": [1, 2, 3, 4, 5],
...     "y": [6, 7, 8, 9, 10],
...     "z": [-1, -2, -3, -1201, -4]
... })
>>> box_plot_data = OutliersDataCheck.get_boxplot_data(df["z"])
>>> box_plot_data["score"] = round(box_plot_data["score"], 2)
>>> assert box_plot_data == {
...     "score": 0.89,
...     "pct_outliers": 0.2,
...     "values": {"q1": -4.0,
...                "median": -3.0,
...                "q3": -2.0,
...                "low_bound": -7.0,
...                "high_bound": -1.0,
...                "low_values": [-1201],
...                "high_values": [],
...                "low_indices": [3],
...                "high_indices": []}
...     }
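
As a rough illustration of how the payload might be consumed, the sketch below works on a copy of the dictionary from the example above rather than calling EvalML, so it runs without the library installed. The helper `summarize_boxplot_payload` is hypothetical and relies only on the keys shown in that example.

```python
# Payload copied from the documented example for column "z"
# (the score is already rounded to two decimals there)
payload = {
    "score": 0.89,
    "pct_outliers": 0.2,
    "values": {
        "q1": -4.0,
        "median": -3.0,
        "q3": -2.0,
        "low_bound": -7.0,
        "high_bound": -1.0,
        "low_values": [-1201],
        "high_values": [],
        "low_indices": [3],
        "high_indices": [],
    },
}


def summarize_boxplot_payload(payload):
    """Collect the outlier positions and values from a box plot payload.

    Hypothetical helper; it only relies on the keys shown in the example.
    """
    values = payload["values"]
    indices = values["low_indices"] + values["high_indices"]
    outliers = values["low_values"] + values["high_values"]
    return {
        "outlier_indices": indices,
        "outlier_values": outliers,
        "n_outliers": len(outliers),
        "iqr": values["q3"] - values["q1"],
    }


summary = summarize_boxplot_payload(payload)
print(summary["outlier_indices"])  # [3]
print(summary["n_outliers"])  # 1
print(summary["iqr"])  # 2.0
```

One outlier among the five input values is consistent with the reported `pct_outliers` of 0.2.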

name(cls)

Return a name describing the data check.

validate(self, X, y=None)

Check if there are any outliers in a dataframe by using IQR to determine column anomalies. Columns with anomalies are considered to contain outliers.

Parameters
  • X (pd.DataFrame, np.ndarray) – Input features.

  • y (pd.Series, np.ndarray) – Ignored. Defaults to None.

Returns

A list of warning dictionaries describing the columns that have outliers.

Return type

list[dict]

Examples

>>> import pandas as pd
>>> from evalml.data_checks import OutliersDataCheck

The column “z” contains an outlier (-1201 at row index 3), so a warning is added to alert the user to its location.

>>> df = pd.DataFrame({
...     "x": [1, 2, 3, 4, 5],
...     "y": [6, 7, 8, 9, 10],
...     "z": [-1, -2, -3, -1201, -4]
... })
>>> outliers_check = OutliersDataCheck()
>>> assert outliers_check.validate(df) == [
...     {
...         "message": "Column(s) 'z' are likely to have outlier data.",
...         "data_check_name": "OutliersDataCheck",
...         "level": "warning",
...         "code": "HAS_OUTLIERS",
...         "details": {"columns": ["z"], "rows": [3], "column_indices": {"z": [3]}},
...         "action_options": [
...             {
...                 "code": "DROP_ROWS",
...                 "data_check_name": "OutliersDataCheck",
...                 "parameters": {},
...                 "metadata": {"rows": [3], "columns": None}
...             }
...         ]
...     }
... ]
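
The DROP_ROWS action option suggests removing the flagged rows. As a sketch of how a caller might act on that, the hypothetical helper `rows_to_drop` below gathers the row labels from the output structure. It operates on a copy of the result shown above rather than on a live call, and it assumes every DROP_ROWS option stores its rows under `metadata["rows"]` as in that example.

```python
# Result copied from the documented validate() example
results = [
    {
        "message": "Column(s) 'z' are likely to have outlier data.",
        "data_check_name": "OutliersDataCheck",
        "level": "warning",
        "code": "HAS_OUTLIERS",
        "details": {"columns": ["z"], "rows": [3], "column_indices": {"z": [3]}},
        "action_options": [
            {
                "code": "DROP_ROWS",
                "data_check_name": "OutliersDataCheck",
                "parameters": {},
                "metadata": {"rows": [3], "columns": None},
            }
        ],
    }
]


def rows_to_drop(results):
    """Return the sorted, de-duplicated row labels named by DROP_ROWS options.

    Hypothetical helper; assumes the layout shown in the example.
    """
    rows = set()
    for message in results:
        for option in message.get("action_options", []):
            if option["code"] == "DROP_ROWS":
                # metadata["rows"] may be None, so fall back to an empty list
                rows.update(option["metadata"]["rows"] or [])
    return sorted(rows)


print(rows_to_drop(results))  # [3]
```

With pandas, the returned labels could then be passed to `df.drop(index=rows_to_drop(results))` to obtain a frame without the flagged rows.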