{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Understanding Data Check Actions\n", "\n", "EvalML streamlines the creation and implementation of machine learning models for tabular data. One of the many features it offers is [data checks](https://evalml.alteryx.com/en/stable/user_guide/data_checks.html), which help determine the health of our data before we train a model on it. These data checks have associated actions with them and will be shown in this notebook. In our default data checks, we have the following checks:\n", "\n", "- `NullDataCheck`: Checks whether the rows or columns are null or highly null\n", "\n", "- `IDColumnsDataCheck`: Checks for columns that could be ID columns\n", "\n", "- `TargetLeakageDataCheck`: Checks if any of the input features have high association with the targets\n", "\n", "- `InvalidTargetDataCheck`: Checks if there are null or other invalid values in the target\n", "\n", "- `NoVarianceDataCheck`: Checks if either the target or any features have no variance\n", "\n", "\n", "EvalML has additional data checks that can be seen [here](https://evalml.alteryx.com/en/stable/api_index.html#data-checks), with usage examples [here](https://evalml.alteryx.com/en/stable/user_guide/data_checks.html). Below, we will walk through usage of EvalML's default data checks and actions.\n", "\n", "\n", "First, we import the necessary requirements to demonstrate these checks." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import woodwork as ww\n", "import pandas as pd\n", "from evalml import AutoMLSearch\n", "from evalml.demos import load_fraud\n", "from evalml.preprocessing import split_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at the input feature data. EvalML uses the [Woodwork](https://woodwork.alteryx.com/en/stable/) library to represent this data. The demo data that EvalML returns is a Woodwork DataTable and DataColumn." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X, y = load_fraud(n_rows=1500)\n", "X.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Adding noise and unclean data\n", "\n", "This data is already clean and compatible with EvalML's ``AutoMLSearch``. In order to demonstrate EvalML default data checks, we will add the following:\n", "\n", "- A column of mostly null values (<0.5% non-null)\n", "\n", "- A column with low/no variance\n", "\n", "- A row of null values\n", "\n", "- A missing target value\n", "\n", "\n", "We will add the first two columns to the whole dataset and we will only add the last two to the training data. Note: these only represent some of the scenarios that EvalML default data checks can catch." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# add a column with no variance in the data\n", "X[\"no_variance\"] = [1 for _ in range(X.shape[0])]\n", "\n", "# add a column with >99.5% null values\n", "X[\"mostly_nulls\"] = [None] * (X.shape[0] - 5) + [i for i in range(5)]\n", "\n", "# since we changed the data, let's reinitialize the woodwork datatable\n", "X.ww.init()\n", "# let's split some training and validation data\n", "X_train, X_valid, y_train, y_valid = split_data(X, y, problem_type=\"binary\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# make row 1 all nan values\n", "X_train.iloc[1] = [None] * X_train.shape[1]\n", "\n", "# make one of the target values null\n", "y_train[990] = None\n", "\n", "X_train.ww.init()\n", "y_train = ww.init_series(y_train)\n", "# Let's take another look at the new X_train data\n", "X_train" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we call `AutoMLSearch.search()` on this data, the search will fail due to the columns and issues we've added above. Note: we use a try/except here to catch the resulting ValueError that AutoMLSearch raises." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "automl = AutoMLSearch(X_train=X_train, y_train=y_train, problem_type=\"binary\")\n", "try:\n", " automl.search()\n", "except ValueError as e:\n", " # to make the error message more distinct\n", " print(\"=\" * 80, \"\\n\")\n", " print(\"Search errored out! Message received is: {}\".format(e))\n", " print(\"=\" * 80, \"\\n\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use the `search_iterative()` function provided in EvalML to determine what potential health issues our data has. We can see that this [search_iterative](https://evalml.alteryx.com/en/latest/autoapi/evalml/automl/index.html#evalml.automl.search_iterative) function is a public method available through `evalml.automl` and is different from the [search](https://evalml.alteryx.com/en/stable/autoapi/evalml/automl/index.html#evalml.automl.AutoMLSearch) function of the `AutoMLSearch` class in EvalML. This `search_iterative()` function allows us to run the default data checks on the data, and, if there are no errors, automatically runs `AutoMLSearch.search()`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from evalml.automl import search_iterative\n", "\n", "automl, messages = search_iterative(X_train, y_train, problem_type=\"binary\")\n", "automl, messages" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The return value of the `search_iterative` function above is a tuple. The first element is the `AutoMLSearch` object if it runs (and `None` otherwise), and the second element is a dictionary of potential warnings and errors that the default data checks find on the passed-in `X` and `y` data. In this dictionary, warnings are suggestions that the data checks give that can useful to address to make the search better but will not break `AutoMLSearch`. On the flip side, errors indicate issues that will break `AutoMLSearch` and need to be addressed by the user.\n", "\n", "Above, we can see that there were errors so search did not automatically run." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Addressing warnings and errors\n", "\n", "We can automatically address the warnings and errors returned by ``search_iterative`` by using ``make_pipeline_from_data_check_output``, a utility method that creates a pipeline that will automatically clean up our data. We just need to pass this method the messages from running ``DataCheck.validate()`` and our problem type." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from evalml.pipelines.utils import make_pipeline_from_data_check_output\n", "\n", "actions_pipeline = make_pipeline_from_data_check_output(\"binary\", messages)\n", "actions_pipeline.fit(X_train, y_train)\n", "X_train_cleaned, y_train_cleaned = actions_pipeline.transform(X_train, y_train)\n", "print(\n", " \"The new length of X_train is {} and y_train is {}\".format(\n", " len(X_train_cleaned), len(X_train_cleaned)\n", " )\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we can run ``search_iterative`` to completion." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "results_cleaned = search_iterative(\n", " X_train_cleaned, y_train_cleaned, problem_type=\"binary\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that this time, we get an `AutoMLSearch` object returned to us as the first element of the tuple. We can use and inspect the `AutoMLSearch` object as needed." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "automl_object = results_cleaned[0]\n", "automl_object.rankings" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we check the second element in the tuple, we can see that there are no longer any warnings or errors detected!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_check_results = results_cleaned[1]\n", "data_check_results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Only addressing DataCheck errors\n", "\n", "\n", "Previously, we used `make_pipeline_from_actions` to address all of the warnings and errors returned by `search_iterative`. We will now show how we can also manually address errors to allow AutoMLSearch to run, and how ignoring warnings will come at the expense of performance.\n", "\n", "We can print out the errors first to make it easier to read, and then we'll create new features and targets from the original training data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "errors = [message for message in messages if message[\"level\"] == \"error\"]\n", "errors" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# copy the DataTables to new variables\n", "X_train_no_errors = X_train.copy()\n", "y_train_no_errors = y_train.copy()\n", "\n", "# We address the errors by looking at the resulting dictionary errors listed\n", "\n", "# let's address the `TARGET_HAS_NULL` error\n", "y_train_no_errors.fillna(False, inplace=True)\n", "\n", "# let's reinitialize the Woodwork DataTable\n", "X_train_no_errors.ww.init()\n", "X_train_no_errors.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now run search on `X_train_no_errors` and `y_train_no_errors`. Note that the search here doesn't fail since we addressed the errors, but there will still exist warnings in the returned tuple. This search allows the `mostly_nulls` column to remain in the features during search." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "results_no_errors = search_iterative(\n", " X_train_no_errors, y_train_no_errors, problem_type=\"binary\"\n", ")\n", "results_no_errors" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.6" } }, "nbformat": 4, "nbformat_minor": 4 }