{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Checks" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "EvalML provides data checks to help guide you in achieving the highest performing model. These utility functions help deal with problems such as overfitting, abnormal data, and missing data. These data checks can be found under `evalml/data_checks`. Below we will cover examples for each available data check in EvalML, as well as the `DefaultDataChecks` collection of data checks." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Missing Data\n", "\n", "Missing data or rows with `NaN` values provide many challenges for machine learning pipelines. In the worst case, many algorithms simply will not run with missing data! EvalML pipelines contain imputation [components](../user_guide/components.ipynb) to ensure that doesn't happen. Imputation works by approximating missing values with existing values. However, if a column contains a high number of missing values, a large percentage of the column would be approximated by a small percentage. This could potentially create a column without useful information for machine learning pipelines. By using `NullDataCheck`, EvalML will alert you to this potential problem by returning the columns that pass the missing values threshold." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "from evalml.data_checks import NullDataCheck\n", "\n", "X = pd.DataFrame(\n", " [[1, 2, 3], [0, 4, np.nan], [1, 4, np.nan], [9, 4, np.nan], [8, 6, np.nan]]\n", ")\n", "\n", "null_check = NullDataCheck(pct_null_col_threshold=0.8, pct_null_row_threshold=0.8)\n", "messages = null_check.validate(X)\n", "\n", "errors = [message for message in messages if message[\"level\"] == \"error\"]\n", "warnings = [message for message in messages if message[\"level\"] == \"warning\"]\n", "\n", "for warning in warnings:\n", " print(\"Warning:\", warning[\"message\"])\n", "\n", "for error in errors:\n", " print(\"Error:\", error[\"message\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Abnormal Data\n", "\n", "EvalML provides a few data checks to check for abnormal data: \n", "\n", "* `NoVarianceDataCheck`\n", "* `ClassImbalanceDataCheck`\n", "* `TargetLeakageDataCheck`\n", "* `InvalidTargetDataCheck`\n", "* `IDColumnsDataCheck`\n", "* `OutliersDataCheck`\n", "* `HighVarianceCVDataCheck`\n", "* `MulticollinearityDataCheck`\n", "* `UniquenessDataCheck`\n", "* `TargetDistributionDataCheck`\n", "* `DateTimeFormatDataCheck`\n", "* `TimeSeriesParametersDataCheck`\n", "* `TimeSeriesSplittingDataCheck`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Zero Variance\n", "\n", "Data with zero variance indicates that all values are identical. If a feature has zero variance, it is not likely to be a useful feature. Similarly, if the target has zero variance, there is likely something wrong. `NoVarianceDataCheck` checks if the target or any feature has only one unique value and alerts you to any such columns." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from evalml.data_checks import NoVarianceDataCheck\n", "\n", "X = pd.DataFrame({\"no var col\": [0, 0, 0], \"good col\": [0, 4, 1]})\n", "y = pd.Series([1, 0, 1])\n", "no_variance_data_check = NoVarianceDataCheck()\n", "messages = no_variance_data_check.validate(X, y)\n", "\n", "warnings = [message for message in messages if message[\"level\"] == \"warning\"]\n", "\n", "for warning in warnings:\n", " print(\"Warning:\", warning[\"message\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that you can set `NaN` to count as a unique value, but `NoVarianceDataCheck` will still return a warning if there is only one unique non-`NaN` value in a given column." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from evalml.data_checks import NoVarianceDataCheck\n", "\n", "X = pd.DataFrame(\n", " {\n", " \"no var col\": [0, 0, 0],\n", " \"no var col with nan\": [1, np.nan, 1],\n", " \"good col\": [0, 4, 1],\n", " }\n", ")\n", "y = pd.Series([1, 0, 1])\n", "\n", "no_variance_data_check = NoVarianceDataCheck(count_nan_as_value=True)\n", "messages = no_variance_data_check.validate(X, y)\n", "\n", "warnings = [message for message in messages if message[\"level\"] == \"warning\"]\n", "\n", "for warning in warnings:\n", " print(\"Warning:\", warning[\"message\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Class Imbalance\n", "\n", "For classification problems, the distribution of examples across each class can vary. For small variations, this is normal and expected. However, when the number of examples for each class label is disproportionately biased or skewed towards a particular class (or classes), it can be difficult for machine learning models to predict well. In addition, having a low number of examples for a given class could mean that one or more of the CV folds generated for the training data could contain only a few or no examples from that class. This may cause the model to only predict the majority class, ultimately resulting in a poorly performing model.\n", "\n", "`ClassImbalanceDataCheck` checks if the target labels are imbalanced beyond a specified threshold for a certain number of CV folds. It returns `DataCheckError` messages for any classes that have fewer samples than double the number of CV folds specified (since that indicates a likelihood of having little to no samples of that class in a given fold), and `DataCheckWarning` messages for any classes that fall below the set threshold percentage."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from evalml.data_checks import ClassImbalanceDataCheck\n", "\n", "X = pd.DataFrame([[1, 2, 0, 1], [4, 1, 9, 0], [4, 4, 8, 3], [9, 2, 7, 1]])\n", "y = pd.Series([0, 1, 1, 1, 1])\n", "\n", "class_imbalance_check = ClassImbalanceDataCheck(threshold=0.25, num_cv_folds=4)\n", "messages = class_imbalance_check.validate(X, y)\n", "\n", "errors = [message for message in messages if message[\"level\"] == \"error\"]\n", "warnings = [message for message in messages if message[\"level\"] == \"warning\"]\n", "\n", "for warning in warnings:\n", " print(\"Warning:\", warning[\"message\"])\n", "\n", "for error in errors:\n", " print(\"Error:\", error[\"message\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Target Leakage\n", "\n", "[Target leakage](https://en.wikipedia.org/wiki/Leakage_(machine_learning)#:~:text=In%20statistics%20and%20machine%20learning,run%20in%20a%20production%20environment.), also known as data leakage, can occur when you train your model on a dataset that includes information that should not be available at the time of prediction. This causes the model to score suspiciously well, but perform poorly in production. `TargetLeakageDataCheck` checks for features that could potentially be \"leaking\" information by calculating the Pearson correlation coefficient between each feature and the target, and warns users if there are features that are highly correlated with the target. Currently, only numerical features are considered." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from evalml.data_checks import TargetLeakageDataCheck\n", "\n", "X = pd.DataFrame(\n", " {\n", " \"leak\": [10, 42, 31, 51, 61] * 5,\n", " \"x\": [42, 54, 12, 64, 12] * 5,\n", " \"y\": [12, 5, 13, 74, 24] * 5,\n", " }\n", ")\n", "y = pd.Series([10, 42, 31, 51, 40] * 5)\n", "\n", "target_leakage_check = TargetLeakageDataCheck(pct_corr_threshold=0.8)\n", "messages = target_leakage_check.validate(X, y)\n", "\n", "errors = [message for message in messages if message[\"level\"] == \"error\"]\n", "warnings = [message for message in messages if message[\"level\"] == \"warning\"]\n", "\n", "for warning in warnings:\n", " print(\"Warning:\", warning[\"message\"])\n", "\n", "for error in errors:\n", " print(\"Error:\", error[\"message\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Invalid Target Data\n", "\n", "The `InvalidTargetDataCheck` checks if the target data contains any missing or invalid values. 
Specifically:\n", "\n", "* if any of the target values are missing, a `DataCheckError` message is returned\n", "* if the specified problem type is a binary classification problem but there are more or fewer than two unique values in the target, a `DataCheckError` message is returned\n", "* if binary classification target classes are numeric values not equal to {0, 1}, a `DataCheckError` message is returned because it can cause unpredictable behavior when passed to pipelines" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from evalml.data_checks import InvalidTargetDataCheck\n", "\n", "X = pd.DataFrame({})\n", "y = pd.Series([0, 1, None, None])\n", "\n", "invalid_target_check = InvalidTargetDataCheck(\"binary\", \"Log Loss Binary\")\n", "messages = invalid_target_check.validate(X, y)\n", "\n", "errors = [message for message in messages if message[\"level\"] == \"error\"]\n", "warnings = [message for message in messages if message[\"level\"] == \"warning\"]\n", "\n", "for warning in warnings:\n", " print(\"Warning:\", warning[\"message\"])\n", "\n", "for error in errors:\n", " print(\"Error:\", error[\"message\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### ID Columns\n", "\n", "ID columns in your dataset provide little to no benefit to a machine learning pipeline as the pipeline cannot extrapolate useful information from unique identifiers. Thus, `IDColumnsDataCheck` reminds you if these columns exist. In the given example, the `user_number` and `revenue_id` columns are both identified as potentially being unique identifiers that should be removed." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from evalml.data_checks import IDColumnsDataCheck\n", "\n", "X = pd.DataFrame(\n", " [[0, 53, 6325, 5], [1, 90, 6325, 10], [2, 90, 18, 20]],\n", " columns=[\"user_number\", \"cost\", \"revenue\", \"revenue_id\"],\n", ")\n", "\n", "id_col_check = IDColumnsDataCheck(id_threshold=0.9)\n", "messages = id_col_check.validate(X)\n", "\n", "errors = [message for message in messages if message[\"level\"] == \"error\"]\n", "warnings = [message for message in messages if message[\"level\"] == \"warning\"]\n", "\n", "for warning in warnings:\n", " print(\"Warning:\", warning[\"message\"])\n", "\n", "for error in errors:\n", " print(\"Error:\", error[\"message\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Primary key columns, however, can be useful. Primary key columns are typically the first column in the dataset, have all unique values, and are either named `ID` or a name that ends with `_id`. Though they are ignored during the modeling process, they can be used as an identifier to query on before or after the modeling process. `IDColumnsDataCheck` will also remind you if it finds that the first column of the DataFrame is a primary key. In the given example, `user_id` is identified as a primary key, while `revenue_id` is identified as a regular unique identifier. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from evalml.data_checks import IDColumnsDataCheck\n", "\n", "X = pd.DataFrame(\n", " [[0, 53, 6325, 5], [1, 90, 6325, 10], [2, 90, 18, 20]],\n", " columns=[\"user_id\", \"cost\", \"revenue\", \"revenue_id\"],\n", ")\n", "\n", "id_col_check = IDColumnsDataCheck(id_threshold=0.9)\n", "messages = id_col_check.validate(X)\n", "\n", "errors = [message for message in messages if message[\"level\"] == \"error\"]\n", "warnings = [message for message in messages if message[\"level\"] == \"warning\"]\n", "\n", "for warning in warnings:\n", " print(\"Warning:\", warning[\"message\"])\n", "\n", "for error in errors:\n", " print(\"Error:\", error[\"message\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Multicollinearity\n", "\n", "The `MulticollinearityDataCheck` data check is used in to detect if are any set of features that are likely to be multicollinear. Multicollinear features affect the performance of a model, but more importantly, it may greatly impact model interpretation. EvalML uses mutual information to determine collinearity." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from evalml.data_checks import MulticollinearityDataCheck\n", "\n", "y = pd.Series([1, 0, 2, 3, 4] * 5)\n", "X = pd.DataFrame(\n", " {\n", " \"col_1\": y,\n", " \"col_2\": y * 3,\n", " \"col_3\": ~y,\n", " \"col_4\": y / 2,\n", " \"col_5\": y + 1,\n", " \"not_collinear\": [0, 1, 0, 0, 0] * 5,\n", " }\n", ")\n", "\n", "multi_check = MulticollinearityDataCheck(threshold=0.95)\n", "messages = multi_check.validate(X)\n", "\n", "errors = [message for message in messages if message[\"level\"] == \"error\"]\n", "warnings = [message for message in messages if message[\"level\"] == \"warning\"]\n", "\n", "for warning in warnings:\n", " print(\"Warning:\", warning[\"message\"])\n", "\n", "for error in errors:\n", " print(\"Error:\", error[\"message\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Uniqueness\n", "The `UniquenessDataCheck` is used to detect columns with either too unique or not unique enough values. For regression type problems, the data is checked for a lower limit of uniqueness. For multiclass type problems, the data is checked for an upper limit." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from evalml.data_checks import UniquenessDataCheck\n", "\n", "X = pd.DataFrame(\n", " {\n", " \"most_unique\": [float(x) for x in range(10)], # [0,1,2,3,4,5,6,7,8,9]\n", " \"more_unique\": [x % 5 for x in range(10)], # [0,1,2,3,4,0,1,2,3,4]\n", " \"unique\": [x % 3 for x in range(10)], # [0,1,2,0,1,2,0,1,2,0]\n", " \"less_unique\": [x % 2 for x in range(10)], # [0,1,0,1,0,1,0,1,0,1]\n", " \"not_unique\": [float(1) for x in range(10)],\n", " }\n", ") # [1,1,1,1,1,1,1,1,1,1]\n", "\n", "uniqueness_check = UniquenessDataCheck(problem_type=\"regression\", threshold=0.5)\n", "messages = uniqueness_check.validate(X)\n", "\n", "errors = [message for message in messages if message[\"level\"] == \"error\"]\n", "warnings = [message for message in messages if message[\"level\"] == \"warning\"]\n", "\n", "for warning in warnings:\n", " print(\"Warning:\", warning[\"message\"])\n", "\n", "for error in errors:\n", " print(\"Error:\", error[\"message\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sparsity\n", "\n", "The `SparsityDataCheck` is used to identify features whose values are sparsely populated, that is, columns in which many of the unique values occur only a few times." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from evalml.data_checks import SparsityDataCheck\n", "\n", "X = pd.DataFrame(\n", " {\n", " \"most_sparse\": [float(x) for x in range(10)], # [0,1,2,3,4,5,6,7,8,9]\n", " \"more_sparse\": [x % 5 for x in range(10)], # [0,1,2,3,4,0,1,2,3,4]\n", " \"sparse\": [x % 3 for x in range(10)], # [0,1,2,0,1,2,0,1,2,0]\n", " \"less_sparse\": [x % 2 for x in range(10)], # [0,1,0,1,0,1,0,1,0,1]\n", " \"not_sparse\": [float(1) for x in range(10)],\n", " }\n", ") # [1,1,1,1,1,1,1,1,1,1]\n", "\n", "\n", "sparsity_check = SparsityDataCheck(\n", " problem_type=\"multiclass\", threshold=0.4, unique_count_threshold=3\n", ")\n", "messages = sparsity_check.validate(X)\n", "\n", "errors = [message for message in messages if message[\"level\"] == \"error\"]\n", "warnings = [message for message in messages if message[\"level\"] == \"warning\"]\n", "\n", "\n", "for warning in warnings:\n", " print(\"Warning:\", warning[\"message\"])\n", "\n", "for error in errors:\n", " print(\"Error:\", error[\"message\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Outliers\n", "\n", "Outliers are observations that differ significantly from other observations in the same sample. Many machine learning pipelines suffer in performance if outliers are not dropped from the training set as they are not representative of the data. `OutliersDataCheck()` uses the interquartile range (IQR) to notify you if a sample can be considered an outlier.\n", "\n", "Below we generate a dataset with some outliers." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = np.tile(np.arange(10) * 0.01, (100, 10))\n", "X = pd.DataFrame(data=data)\n", "\n", "# generate some outliers in columns 3, 25, 55, and 72\n", "X.iloc[0, 3] = -10000\n", "X.iloc[3, 25] = 10000\n", "X.iloc[5, 55] = 10000\n", "X.iloc[10, 72] = -10000" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We then utilize `OutliersDataCheck()` to rediscover these outliers."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from evalml.data_checks import OutliersDataCheck\n", "\n", "outliers_check = OutliersDataCheck()\n", "messages = outliers_check.validate(X)\n", "\n", "errors = [message for message in messages if message[\"level\"] == \"error\"]\n", "warnings = [message for message in messages if message[\"level\"] == \"warning\"]\n", "\n", "for warning in warnings:\n", " print(\"Warning:\", warning[\"message\"])\n", "\n", "for error in errors:\n", " print(\"Error:\", error[\"message\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Target Distribution\n", "\n", "Target data can come in a variety of distributions, such as Gaussian or lognormal. When we work with machine learning models, we feed data into an estimator that learns from the training data provided. Sometimes the data can be significantly spread out with a long tail or outliers, which could lead to a lognormal distribution. This can cause machine learning model performance to suffer.\n", "\n", "To help the estimators better understand the underlying relationships in the data between the features and the target, we can use the `TargetDistributionDataCheck` to identify such a distribution." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from scipy.stats import lognorm\n", "from evalml.data_checks import TargetDistributionDataCheck\n", "\n", "data = np.tile(np.arange(10) * 0.01, (100, 10))\n", "X = pd.DataFrame(data=data)\n", "y = pd.Series(lognorm.rvs(s=0.4, loc=1, scale=1, size=100))\n", "\n", "target_dist_check = TargetDistributionDataCheck()\n", "messages = target_dist_check.validate(X, y)\n", "\n", "errors = [message for message in messages if message[\"level\"] == \"error\"]\n", "warnings = [message for message in messages if message[\"level\"] == \"warning\"]\n", "\n", "for warning in warnings:\n", " print(\"Warning:\", warning[\"message\"])\n", "\n", "for error in errors:\n", " print(\"Error:\", error[\"message\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Datetime Format\n", "\n", "Datetime information is a necessary component of time series problems, but sometimes the data we deal with may contain flaws that make it impossible for time series models to work with it. For example, in order to identify a frequency in the datetime information, there has to be equal interval spacing between data points, e.g. January 1, 2021, January 3, 2021, January 5, 2021, and so on, where each point is separated by two days. If instead there are random jumps in the datetime data, e.g. January 1, 2021, January 3, 2021, January 12, 2021, then a frequency can't be inferred. Another common issue with time series models is that they can't handle datetime information that isn't properly sorted. Datetime values that aren't monotonically increasing (sorted in ascending order) run into this issue, and a frequency cannot be inferred from them.\n", "\n", "To make it easy to verify that the datetime column you're working with is properly spaced and sorted, we can leverage the `DateTimeFormatDataCheck`. When initializing the data check, pass in the name of the column that contains your datetime information (or pass in \"index\" if it's found in either your X or y indices)."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from evalml.data_checks import DateTimeFormatDataCheck\n", "\n", "X = pd.DataFrame(\n", " pd.date_range(\"January 1, 2021\", periods=8, freq=\"2D\"), columns=[\"dates\"]\n", ")\n", "y = pd.Series([1, 2, 4, 2, 1, 2, 3, 1])\n", "\n", "# Replaces the last entry with January 16th instead of January 15th\n", "# so that the data is no longer evenly spaced.\n", "X.iloc[7] = \"January 16, 2021\"\n", "\n", "datetime_format_check = DateTimeFormatDataCheck(datetime_column=\"dates\")\n", "messages = datetime_format_check.validate(X, y)\n", "\n", "errors = [message for message in messages if message[\"level\"] == \"error\"]\n", "warnings = [message for message in messages if message[\"level\"] == \"warning\"]\n", "\n", "for warning in warnings:\n", " print(\"Warning:\", warning[\"message\"])\n", "\n", "for error in errors:\n", " print(\"Error:\", error[\"message\"])\n", "\n", "print(\"--------------------------------\")\n", "\n", "# Reverses the order of the index datetime values to be decreasing.\n", "X = X[::-1]\n", "messages = datetime_format_check.validate(X, y)\n", "\n", "errors = [message for message in messages if message[\"level\"] == \"error\"]\n", "warnings = [message for message in messages if message[\"level\"] == \"warning\"]\n", "\n", "for warning in warnings:\n", " print(\"Warning:\", warning[\"message\"])\n", "\n", "for error in errors:\n", " print(\"Error:\", error[\"message\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Time Series Parameters\n", "\n", "In order to support time series problem types in AutoML, certain conditions have to be met:\n", " - The parameters `gap`, `max_delay`, `forecast_horizon`, and `time_index` have to be passed in to `problem_configuration`.\n", " - The values of `gap`, `max_delay`, and `forecast_horizon` have to be appropriate for the size of the data.\n", "\n", "For the second condition above, this means that the window size (as defined by `gap` + `max_delay` + `forecast_horizon`) has to be less than the number of observations in the data divided by (the number of splits + 1). For example, with 100 observations and 3 splits, the split size would be 100 / (3 + 1) = 25. This means that the window size has to be less than 25."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from evalml.data_checks import TimeSeriesParametersDataCheck\n", "\n", "X = pd.DataFrame(pd.date_range(\"1/1/21\", periods=100), columns=[\"dates\"])\n", "y = pd.Series([i % 2 for i in range(100)])\n", "\n", "problem_config = {\n", " \"gap\": 1,\n", " \"max_delay\": 23,\n", " \"forecast_horizon\": 1,\n", " \"time_index\": \"dates\",\n", "}\n", "\n", "# With 3 splits, the split size will be 25 (100 / (3 + 1)).\n", "# Since gap + max_delay + forecast_horizon is 25, this will\n", "# throw an error for window size.\n", "ts_params_data_check = TimeSeriesParametersDataCheck(\n", " problem_configuration=problem_config, n_splits=3\n", ")\n", "messages = ts_params_data_check.validate(X, y)\n", "\n", "errors = [message for message in messages if message[\"level\"] == \"error\"]\n", "warnings = [message for message in messages if message[\"level\"] == \"warning\"]\n", "\n", "for warning in warnings:\n", " print(\"Warning:\", warning[\"message\"])\n", "\n", "for error in errors:\n", " print(\"Error:\", error[\"message\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Time Series Splitting\n", "\n", "Due to the nature of time series data, splitting cannot involve shuffling and has to be done in a sequential manner. This means splitting the data into `n_splits` + 1 different sections and increasing the size of the training data by the split size every iteration while keeping the test size equal to the split size.\n", "\n", "For time series binary and time series multiclass problems, the training and validation segments of every split must contain target data that has an example of every class found in the entire target set. The reason for this is that many classification machine learning models run into issues if they're trained on data that doesn't contain an instance of a class that they are later expected to predict. For example, with 3 splits and a split size of 25, every training/validation split: (0:25)/(25:50), (0:50)/(50:75), (0:75)/(75:100) must contain at least one instance of all unique target classes in both the training and validation set.\n", " - At least one instance of both classes in a time series binary problem.\n", " - At least one instance of all classes in a time series multiclass problem."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from evalml.data_checks import TimeSeriesSplittingDataCheck\n", "\n", "X = None\n", "y = pd.Series([0 if i < 50 else i % 2 for i in range(100)])\n", "\n", "ts_splitting_check = TimeSeriesSplittingDataCheck(\"time series binary\", 3)\n", "messages = ts_splitting_check.validate(X, y)\n", "\n", "errors = [message for message in messages if message[\"level\"] == \"error\"]\n", "warnings = [message for message in messages if message[\"level\"] == \"warning\"]\n", "\n", "for warning in warnings:\n", " print(\"Warning:\", warning[\"message\"])\n", "\n", "for error in errors:\n", " print(\"Error:\", error[\"message\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Check Messages\n", "\n", "Each data check's `validate` method returns a list of `DataCheckMessage` objects indicating warnings or errors found; warnings are stored as a `DataCheckWarning` [object](../autoapi/evalml/data_checks/index.rst#evalml.data_checks.DataCheckWarning) and errors are stored as a `DataCheckError` [object](../autoapi/evalml/data_checks/index.rst#evalml.data_checks.DataCheckError). You can filter the messages returned by a data check by checking each message's `level`. Below, `NoVarianceDataCheck` returns a list of messages for the problematic columns, and we keep only the warnings by checking the `level` of each message." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from evalml.data_checks import NoVarianceDataCheck\n", "\n", "X = pd.DataFrame(\n", " {\n", " \"no var col\": [0, 0, 0],\n", " \"no var col with nan\": [1, np.nan, 1],\n", " \"good col\": [0, 4, 1],\n", " }\n", ")\n", "y = pd.Series([1, 0, 1])\n", "\n", "no_variance_data_check = NoVarianceDataCheck(count_nan_as_value=True)\n", "messages = no_variance_data_check.validate(X, y)\n", "\n", "warnings = [message for message in messages if message[\"level\"] == \"warning\"]\n", "\n", "for warning in warnings:\n", " print(\"Warning:\", warning[\"message\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Writing Your Own Data Check" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you would prefer to write your own data check, you can do so by extending the `DataCheck` class and implementing the `validate(self, X, y)` method. Below, we've created a new `DataCheck`, `ZeroVarianceDataCheck`, which is similar to the `NoVarianceDataCheck` defined in EvalML. The `validate(self, X, y)` method should return a list of `DataCheckMessage` objects, such as `DataCheckError` or `DataCheckWarning` messages, one for each issue found." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from evalml.data_checks import DataCheck, DataCheckError\n", "\n", "\n", "class ZeroVarianceDataCheck(DataCheck):\n", " def validate(self, X, y):\n", " messages = []\n", " if not isinstance(X, pd.DataFrame):\n", " X = pd.DataFrame(X)\n", " error_msg = \"Column '{}' has zero variance\"\n", " messages.extend(\n", " [\n", " DataCheckError(error_msg.format(column), self.name)\n", " for column in X.columns\n", " if len(X[column].unique()) == 1\n", " ]\n", " )\n", " return messages" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Defining Collections of Data Checks\n", "\n", "For convenience, EvalML provides a `DataChecks` class to represent a collection of data checks. 
We will go over `DefaultDataChecks` ([API reference](../autoapi/evalml/data_checks/index.rst#evalml.data_checks.DefaultDataChecks)), a collection defined to check for some of the most common data issues." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Default Data Checks\n", "\n", "`DefaultDataChecks` is a collection of data checks defined to check for some of the most common data issues. They include:\n", "\n", "* `NullDataCheck`\n", "* `IDColumnsDataCheck`\n", "* `TargetLeakageDataCheck`\n", "* `InvalidTargetDataCheck`\n", "* `TargetDistributionDataCheck` (for regression problem types)\n", "* `ClassImbalanceDataCheck` (for classification problem types)\n", "* `NoVarianceDataCheck`\n", "* `DateTimeFormatDataCheck` (for time series problem types)\n", "* `TimeSeriesParametersDataCheck` (for time series problem types)\n", "* `TimeSeriesSplittingDataCheck` (for time series classification problem types)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Writing Your Own Collection of Data Checks" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you would prefer to create your own collection of data checks, you could either write your own data checks class by extending the `DataChecks` class and setting the `self.data_checks` attribute to the list of `DataCheck` classes or objects, or you could pass that list of data checks to the constructor of the `DataChecks` class. Below, we create two identical collections of data checks using the two different methods." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create a subclass of `DataChecks`\n", "from evalml.data_checks import (\n", " DataChecks,\n", " NullDataCheck,\n", " InvalidTargetDataCheck,\n", " NoVarianceDataCheck,\n", " ClassImbalanceDataCheck,\n", " TargetLeakageDataCheck,\n", ")\n", "from evalml.problem_types import ProblemTypes, handle_problem_types\n", "\n", "\n", "class MyCustomDataChecks(DataChecks):\n", " data_checks = [\n", " NullDataCheck,\n", " InvalidTargetDataCheck,\n", " NoVarianceDataCheck,\n", " TargetLeakageDataCheck,\n", " ]\n", "\n", " def __init__(self, problem_type, objective):\n", " \"\"\"\n", " A collection of basic data checks.\n", " Args:\n", " problem_type (str): The problem type that is being validated. 
Can be regression, binary, or multiclass.\n", " \"\"\"\n", " if handle_problem_types(problem_type) == ProblemTypes.REGRESSION:\n", " super().__init__(\n", " self.data_checks,\n", " data_check_params={\n", " \"InvalidTargetDataCheck\": {\n", " \"problem_type\": problem_type,\n", " \"objective\": objective,\n", " }\n", " },\n", " )\n", " else:\n", " super().__init__(\n", " self.data_checks + [ClassImbalanceDataCheck],\n", " data_check_params={\n", " \"InvalidTargetDataCheck\": {\n", " \"problem_type\": problem_type,\n", " \"objective\": objective,\n", " }\n", " },\n", " )\n", "\n", "\n", "custom_data_checks = MyCustomDataChecks(\n", " problem_type=ProblemTypes.REGRESSION, objective=\"R2\"\n", ")\n", "for data_check in custom_data_checks.data_checks:\n", " print(data_check.name)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Pass list of data checks to the `data_checks` parameter of DataChecks\n", "same_custom_data_checks = DataChecks(\n", " data_checks=[\n", " NullDataCheck,\n", " InvalidTargetDataCheck,\n", " NoVarianceDataCheck,\n", " TargetLeakageDataCheck,\n", " ],\n", " data_check_params={\n", " \"InvalidTargetDataCheck\": {\n", " \"problem_type\": ProblemTypes.REGRESSION,\n", " \"objective\": \"R2\",\n", " }\n", " },\n", ")\n", "for data_check in same_custom_data_checks.data_checks:\n", " print(data_check.name)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.6" } }, "nbformat": 4, "nbformat_minor": 4 }