outliers_data_check
================================================

.. py:module:: evalml.data_checks.outliers_data_check

.. autoapi-nested-parse::

   Data check that checks if there are any outliers in input data by using IQR to determine score anomalies.


Module Contents
---------------

Classes Summary
~~~~~~~~~~~~~~~

.. autoapisummary::

   evalml.data_checks.outliers_data_check.OutliersDataCheck


Contents
~~~~~~~~~~~~~~~~~~~

.. py:class:: OutliersDataCheck

   Checks if there are any outliers in input data by using IQR to determine score anomalies.

   Columns with score anomalies are considered to contain outliers.

   **Methods**

   .. autoapisummary::
      :nosignatures:

      evalml.data_checks.outliers_data_check.OutliersDataCheck.get_boxplot_data
      evalml.data_checks.outliers_data_check.OutliersDataCheck.name
      evalml.data_checks.outliers_data_check.OutliersDataCheck.validate

   .. py:method:: get_boxplot_data(data_)
      :staticmethod:

      Returns box plot information for the given data.

      :param data_: Input data.
      :type data_: pd.Series, np.ndarray

      :returns: A payload of box plot statistics.
      :rtype: dict

      .. rubric:: Examples

      >>> import pandas as pd
      >>> from evalml.data_checks import OutliersDataCheck
      ...
      >>> df = pd.DataFrame({
      ...     "x": [1, 2, 3, 4, 5],
      ...     "y": [6, 7, 8, 9, 10],
      ...     "z": [-1, -2, -3, -1201, -4]
      ... })
      >>> box_plot_data = OutliersDataCheck.get_boxplot_data(df["z"])
      >>> box_plot_data["score"] = round(box_plot_data["score"], 2)
      >>> assert box_plot_data == {
      ...     "score": 0.89,
      ...     "pct_outliers": 0.2,
      ...     "values": {"q1": -4.0,
      ...                "median": -3.0,
      ...                "q3": -2.0,
      ...                "low_bound": -7.0,
      ...                "high_bound": -1.0,
      ...                "low_values": [-1201],
      ...                "high_values": [],
      ...                "low_indices": [3],
      ...                "high_indices": []}
      ... }

   .. py:method:: name(cls)

      Return a name describing the data check.

   .. py:method:: validate(self, X, y=None)

      Check if there are any outliers in a dataframe by using IQR to determine column anomalies. Columns with anomalies are considered to contain outliers.

      :param X: Input features.
      :type X: pd.DataFrame, np.ndarray
      :param y: Ignored.
         Defaults to None.
      :type y: pd.Series, np.ndarray

      :returns: A list of warning dictionaries if any columns have outliers.
      :rtype: list[dict]

      .. rubric:: Examples

      >>> import pandas as pd
      >>> from evalml.data_checks import OutliersDataCheck

      The column "z" has an outlier, so a warning is added to alert the user of its location.

      >>> df = pd.DataFrame({
      ...     "x": [1, 2, 3, 4, 5],
      ...     "y": [6, 7, 8, 9, 10],
      ...     "z": [-1, -2, -3, -1201, -4]
      ... })
      ...
      >>> outliers_check = OutliersDataCheck()
      >>> assert outliers_check.validate(df) == [
      ...     {
      ...         "message": "Column(s) 'z' are likely to have outlier data.",
      ...         "data_check_name": "OutliersDataCheck",
      ...         "level": "warning",
      ...         "code": "HAS_OUTLIERS",
      ...         "details": {"columns": ["z"], "rows": [3], "column_indices": {"z": [3]}},
      ...         "action_options": [
      ...             {
      ...                 "code": "DROP_ROWS",
      ...                 "data_check_name": "OutliersDataCheck",
      ...                 "parameters": {},
      ...                 "metadata": {"rows": [3], "columns": None}
      ...             }
      ...         ]
      ...     }
      ... ]
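
For intuition, the IQR rule behind this check can be sketched with plain pandas and NumPy. This is a minimal illustration and not the evalml implementation: the helper name ``iqr_outlier_indices`` and the multiplier ``k=1.5`` are assumptions made for this sketch, and the bounds reported by ``get_boxplot_data`` may differ from the unclipped fences computed here (the example above reports a ``high_bound`` of -1.0, whereas the unclipped upper fence for "z" is 1.0).

```python
import numpy as np
import pandas as pd


def iqr_outlier_indices(series, k=1.5):
    """Hypothetical helper: index labels of values outside the IQR fences."""
    q1, q3 = np.percentile(series, [25, 75])
    iqr = q3 - q1
    low_fence = q1 - k * iqr   # values below this are flagged
    high_fence = q3 + k * iqr  # values above this are flagged
    mask = (series < low_fence) | (series > high_fence)
    return series[mask].index.tolist()


df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [6, 7, 8, 9, 10],
    "z": [-1, -2, -3, -1201, -4],
})

# Map each column to its flagged rows, keeping only columns with outliers.
outliers = {col: iqr_outlier_indices(df[col]) for col in df.columns}
outliers = {col: rows for col, rows in outliers.items() if rows}

# Dropping the flagged rows mirrors the DROP_ROWS action suggested above.
rows_to_drop = sorted({row for rows in outliers.values() for row in rows})
cleaned = df.drop(index=rows_to_drop)
```

Here only column "z" is flagged, at row 3, which agrees with the ``details`` payload in the ``validate`` example.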