{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"nbsphinx": "hidden"
},
"outputs": [],
"source": [
"import warnings\n",
"from statsmodels.tools.sm_exceptions import ConvergenceWarning\n",
"from sklearn.exceptions import ConvergenceWarning as sklearnConvergenceWarning\n",
"from evalml.exceptions import ParameterNotUsedWarning\n",
"\n",
"warnings_to_ignore = [\n",
" ConvergenceWarning,\n",
" sklearnConvergenceWarning,\n",
" ParameterNotUsedWarning,\n",
"]\n",
"for warning in warnings_to_ignore:\n",
" warnings.simplefilter(\"ignore\", warning)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# AutoMLSearch for time series problems\n",
"\n",
"In this guide, we'll show how you can use EvalML to perform an automated search of machine learning pipelines for time series problems. Time series support is still being actively developed in EvalML so expect this page to improve over time.\n",
"\n",
"## But first, what is a time series?\n",
"\n",
"A time series is a series of measurements taken at different moments in time ([Wikipedia](https://en.wikipedia.org/wiki/Time_series)). The main difference between a time series dataset and a normal dataset is that the rows of a time series dataset are ordered chronologically, where the relative time between rows is significant. This relationship between the rows does not exist in non-time series datasets. In a non-time-series dataset, you can shuffle the rows and the dataset still has the same meaning. If you shuffle the rows of a time series dataset, the relationship between the rows is completely different!\n",
"\n",
"## What does AutoMLSearch for time series do?\n",
"In a machine learning setting, we are usually interested in using past values of the time series to predict future values. That is what EvalML's time series functionality is built to do. "
]
},
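{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of that idea (an illustration only, not EvalML's actual feature engineering), lagging the target with pandas turns past values into candidate features for predicting the present:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Toy series of daily measurements; shifting by 1 uses yesterday's value\n",
"# as a candidate feature for predicting today's value.\n",
"y_demo = pd.Series([30.0, 31.0, 32.0, 33.0])\n",
"print(y_demo.shift(1).tolist())  # [nan, 30.0, 31.0, 32.0]"
]
},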
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Loading the data\n",
"\n",
"In this guide, we work with daily minimum temperature recordings from Melbourne, Australia from the beginning of 1981 to the end of 1990.\n",
"\n",
"We start by loading the temperature data into two splits. The first split will be a training split consisting of data from 1981 to the end of 1989. This is the data we'll use to find the best pipeline with AutoML. The second split will be a testing split consisting of data from 1990. This is the split we'll use to evaluate how well our pipeline generalizes on unseen data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from evalml.demos import load_weather\n",
"\n",
"X, y = load_weather()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"train_dates, test_dates = X.Date < \"1990-01-01\", X.Date >= \"1990-01-01\"\n",
"X_train, y_train = X.ww.loc[train_dates], y.ww.loc[train_dates]\n",
"X_test, y_test = X.ww.loc[test_dates], y.ww.loc[test_dates]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Visualizing the training set"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import plotly.graph_objects as go"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data = [\n",
" go.Scatter(\n",
" x=X_train[\"Date\"],\n",
" y=y_train,\n",
" mode=\"lines+markers\",\n",
" name=\"Temperature (C)\",\n",
" line=dict(color=\"#1f77b4\"),\n",
" )\n",
"]\n",
"# Let plotly pick the best date format.\n",
"layout = go.Layout(\n",
" title={\"text\": \"Min Daily Temperature, Melbourne 1981-1989\"},\n",
" xaxis={\"title\": \"Time\"},\n",
" yaxis={\"title\": \"Temperature (C)\"},\n",
")\n",
"\n",
"go.Figure(data=data, layout=layout)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Fixing the data\n",
"\n",
"Sometimes, the datasets we work with do not have perfectly consistent DateTime columns. We can use the `TimeSeriesRegularizer` and `TimeSeriesImputer` to correct any discrepancies in our data in a time-series specific way.\n",
"\n",
"To show an example of this, let's create some discrepancies in our training data. We'll also add a couple of extra columns in the X DataFrame to highlight more of the options with these tools."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X[\"Categorical\"] = [str(i % 4) for i in range(len(X))]\n",
"X[\"Categorical\"] = X[\"Categorical\"].astype(\"category\")\n",
"X[\"Numeric\"] = list(range(len(X)))\n",
"\n",
"# Re-split the data since we modified X\n",
"X_train, y_train = X.loc[train_dates], y.ww.loc[train_dates]\n",
"X_test, y_test = X.loc[test_dates], y.ww.loc[test_dates]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X_train.loc[500, \"Date\"] = None\n",
"X_train.loc[1042, \"Date\"] = None\n",
"X_train.loc[1043, \"Date\"] = None\n",
"X_train.loc[231, \"Date\"] = pd.Timestamp(\"1981-08-19\")\n",
"\n",
"X_train.drop(1209, inplace=True)\n",
"X_train.drop(398, inplace=True)\n",
"y_train.drop(1209, inplace=True)\n",
"y_train.drop(398, inplace=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With these changes, there are now NaN values in the training data that our models won't be able to recognize, and there is no longer a clear frequency between the dates."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(f\"Inferred frequency: {pd.infer_freq(X_train['Date'])}\")\n",
"print(f\"NaNs in date column: {X_train['Date'].isna().any()}\")\n",
"print(\n",
" f\"NaNs in other training data columns: {X_train['Categorical'].isna().any() or X_train['Numeric'].isna().any()}\"\n",
")\n",
"print(f\"NaNs in target data: {y_train.isna().any()}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Time Series Regularizer"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can use the `TimeSeriesRegularizer` component to restore the missing and NaN DateTime values we've created in our data. This component is designed to infer the proper frequency using [Woodwork's \"infer_frequency\"](https://woodwork.alteryx.com/en/stable/generated/woodwork.statistics_utils.infer_frequency.html) function and generate a new DataFrame that follows it. In order to maintain as much original information from the input data as possible, all rows with completely correct times are ported over into this new DataFrame. If multiple rows share the same timestamp, only the first occurrence is kept and the duplicates are dropped. Any value that doesn't quite line up with the inferred frequency is shifted to a nearby missing datetime, or dropped if there is none nearby."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from evalml.pipelines.components import TimeSeriesRegularizer\n",
"\n",
"regularizer = TimeSeriesRegularizer(time_index=\"Date\")\n",
"X_train, y_train = regularizer.fit_transform(X_train, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can see that pandas has successfully inferred the frequency of the training data, and there are no more null values within `X_train`. However, by adding rows for the previously missing dates, the regularizer has introduced NaN values into the target data, as well as into the other columns of our training data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(f\"Inferred frequency: {pd.infer_freq(X_train['Date'])}\")\n",
"print(f\"NaNs in date column: {X_train['Date'].isna().any()}\")\n",
"print(\n",
" f\"NaNs in other training data columns: {X_train['Categorical'].isna().any() or X_train['Numeric'].isna().any()}\"\n",
")\n",
"print(f\"NaNs in target data: {y_train.isna().any()}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Time Series Imputer\n",
"\n",
"We could easily use the `Imputer` and `TargetImputer` components to fill in the missing gaps in our `X` and `y` data. However, these tools are not built for time series problems. Their supported imputation strategies of \"mean\", \"most_frequent\", or similar are all static. They don't account for the passing of time, and how neighboring data points may have more predictive power than simply taking the average. The `TimeSeriesImputer` solves this problem by offering three different imputation strategies:\n",
"- \"forwards_fill\": fills in any NaN values with the same value as found in the previous non-NaN cell.\n",
"- \"backwards_fill\": fills in any NaN values with the same value as found in the next non-NaN cell.\n",
"- \"interpolate\": (numeric columns only) fills in any NaN values by linearly interpolating the values of the previous and next non-NaN cells."
]
},
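{
"cell_type": "markdown",
"metadata": {},
"source": [
"To build intuition for these strategies, here is how their plain pandas equivalents behave on a toy series (an illustration only; the `TimeSeriesImputer` applies this kind of logic per column):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"s = pd.Series([1.0, None, None, 4.0])\n",
"print(s.ffill().tolist())  # forwards_fill -> [1.0, 1.0, 1.0, 4.0]\n",
"print(s.bfill().tolist())  # backwards_fill -> [1.0, 4.0, 4.0, 4.0]\n",
"print(s.interpolate().tolist())  # interpolate -> [1.0, 2.0, 3.0, 4.0]"
]
},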
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from evalml.pipelines.components import TimeSeriesImputer\n",
"\n",
"ts_imputer = TimeSeriesImputer(\n",
" categorical_impute_strategy=\"forwards_fill\",\n",
" numeric_impute_strategy=\"backwards_fill\",\n",
" target_impute_strategy=\"interpolate\",\n",
")\n",
"X_train, y_train = ts_imputer.fit_transform(X_train, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, finally, we have a clean, regularly spaced DataFrame that we can use for running AutoMLSearch and fitting models without issue."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(f\"Inferred frequency: {pd.infer_freq(X_train['Date'])}\")\n",
"print(f\"NaNs in date column: {X_train['Date'].isna().any()}\")\n",
"print(\n",
" f\"NaNs in other training data columns: {X_train['Categorical'].isna().any() or X_train['Numeric'].isna().any()}\"\n",
")\n",
"print(f\"NaNs in target data: {y_train.isna().any()}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Trending and Seasonality Decomposition\n",
"Decomposing a target signal into a trend and/or a cyclical signal is a common pre-processing step for time series modeling. Understanding the presence or absence of these component signals can provide additional insight, and decomposing the signal into its constituent components can enable non-time-series-aware estimators to perform better on this data. We have two unique decomposers, the `PolynomialDecomposer` and the `STLDecomposer`.\n",
"\n",
"Let's first take a look at a year's worth of the weather dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"\n",
"length = 365\n",
"X_train_time = X_train.set_index(\"Date\").asfreq(pd.infer_freq(X_train[\"Date\"]))\n",
"y_train_time = y_train.set_axis(X_train[\"Date\"]).asfreq(pd.infer_freq(X_train[\"Date\"]))\n",
"plt.plot(y_train_time[0:length], \"bo\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With the knowledge that this is a weather dataset and the data itself is daily weather data, we can assume that the seasonal data will have a period of approximately 365 data points. Let's build and fit decomposers to detrend and deseasonalize this data."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Polynomial Decomposer"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from evalml.pipelines.components.transformers.preprocessing.polynomial_decomposer import (\n",
" PolynomialDecomposer,\n",
")\n",
"\n",
"pdc = PolynomialDecomposer(degree=1, period=365)\n",
"X_t, y_t = pdc.fit_transform(X_train_time, y_train_time)\n",
"\n",
"plt.plot(y_train_time, \"bo\", label=\"Signal\")\n",
"plt.plot(y_t, \"rx\", label=\"Detrended/Deseasonalized Signal\")\n",
"plt.legend()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The result is the residual signal, with the trend and seasonality removed. If we want to look at what the component identified as the trend and seasonality, we can call the `plot_decomposition()` function."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"res = pdc.plot_decomposition(X_train_time, y_train_time)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It is often helpful to give the decomposer an estimate of the period of the signal's seasonal component before decomposing it. To do that, we can use the `determine_periodicity(X, y)` function of the `Decomposer` class."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"period = pdc.determine_periodicity(X_train_time, y_train_time)\n",
"print(period)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `PolynomialDecomposer` class, if `period` is not explicitly set in the constructor, will set its `period` parameter\n",
"using the `statsmodels` function `freq_to_period`, which considers the frequency of the datetime data. This gives a\n",
"reasonable first guess. For example, if a `PolynomialDecomposer` object is fit with `period` not explicitly set,\n",
"it will take on a default value of 7, which is appropriate for daily data with a weekly seasonal period.\n",
"\n",
"In cases like this one, where the seasonal period is not known beforehand, the `set_period()` convenience function will\n",
"look at the target data, determine a better guess for the period, and configure the `Decomposer` object appropriately."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pdc = PolynomialDecomposer()\n",
"pdc.fit(X_train_time, y_train_time)\n",
"assert pdc.period == 7\n",
"pdc.set_period(X_train_time, y_train_time)\n",
"assert 350 < pdc.period < 370"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### STLDecomposer\n",
"\n",
"The `STLDecomposer` runs on [statsmodels' implementation](https://www.statsmodels.org/dev/generated/statsmodels.tsa.seasonal.STL.html) of [STL decomposition](https://otexts.com/fpp3/stl.html). Let's take a look at how STL decomposes the weather dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from evalml.pipelines.components import STLDecomposer\n",
"\n",
"stl = STLDecomposer()\n",
"X_t, y_t = stl.fit_transform(X_train_time, y_train_time)\n",
"\n",
"res = stl.plot_decomposition(X_train_time, y_train_time)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This doesn't look nearly as good as the `PolynomialDecomposer` did. This is because STL decomposition performs best when the data has a small seasonal period, generally less than 14 time units. The weather dataset's seasonal period of ~365 days does not work as well, since STL extracted a shorter-term seasonality for the decomposition.\n",
"\n",
"We can generate some synthetic data that better highlights where STL performs well. For this example, we'll generate monthly data with an annual seasonal period."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import random\n",
"import numpy as np\n",
"from datetime import datetime\n",
"from sklearn.preprocessing import minmax_scale\n",
"\n",
"\n",
"def generate_synthetic_data(\n",
" period=12,\n",
" num_periods=25,\n",
" scale=10,\n",
" seasonal_scale=2,\n",
" trend_degree=1,\n",
" freq_str=\"M\",\n",
"):\n",
" freq = 2 * np.pi / period\n",
" x = np.arange(0, period * num_periods, 1)\n",
" dts = pd.date_range(datetime.today(), periods=len(x), freq=freq_str)\n",
" X = pd.DataFrame({\"x\": x})\n",
" X = X.set_index(dts)\n",
"\n",
" if trend_degree == 1:\n",
" y_trend = pd.Series(scale * minmax_scale(x + 2))\n",
" elif trend_degree == 2:\n",
" y_trend = pd.Series(scale * minmax_scale(x**2))\n",
" elif trend_degree == 3:\n",
" y_trend = pd.Series(scale * minmax_scale((x - 5) ** 3 + x**2))\n",
" y_seasonal = pd.Series(seasonal_scale * np.sin(freq * x))\n",
" y_random = pd.Series(np.random.normal(0, 1, len(X)))\n",
" y = y_trend + y_seasonal + y_random\n",
" return X, y\n",
"\n",
"\n",
"X_stl, y_stl = generate_synthetic_data()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's see how the `STLDecomposer` does at decomposing this data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"stl = STLDecomposer()\n",
"X_t_stl, y_t_stl = stl.fit_transform(X_stl, y_stl)\n",
"\n",
"res = stl.plot_decomposition(X_stl, y_stl)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"On top of decomposing this type of data well, the statsmodels implementation of STL automatically determines the seasonal period of the data, which is saved at fit time for this component."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"stl = STLDecomposer()\n",
"assert stl.period is None\n",
"stl.fit(X_stl, y_stl)\n",
"print(stl.period)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Running AutoMLSearch\n",
"\n",
"`AutoMLSearch` for time series problems works very similarly to the other problem types with the exception that users need to pass in a new parameter called `problem_configuration`.\n",
"\n",
"The `problem_configuration` is a dictionary specifying the following values:\n",
"\n",
"* **forecast_horizon**: The number of time periods we are trying to forecast. In this example, we're interested in predicting weather for the next 7 days, so the value is 7.\n",
"\n",
"* **gap**: The number of time periods between the end of the training set and the start of the test set. For example, in our case we are interested in predicting the weather for the next 7 days with the data as it is \"today\", so the gap is 0. However, if we had to predict the weather for next Monday-Sunday with the data as it was on the previous Friday, the gap would be 2 (Saturday and Sunday separate Monday from Friday). It is important to select a value that matches the realistic delay between the forecast date and the most recently available data that can be used to make that forecast.\n",
"\n",
"* **max_delay**: The maximum number of rows to look in the past from the current row in order to compute features. In our example, we'll say we can use the previous week's weather to predict the current week's.\n",
"\n",
"* **time_index**: The column of the training dataset that contains the date corresponding to each observation. While only some of the models we run during time series searches require the `time_index`, we require it to be passed in to top level search so that the parameter can reach the models that need it.\n",
"\n",
"Note that the values of these parameters must be in the same units as the training/testing data.\n",
"\n",
"### Visualization of forecast horizon and gap\n",
"![forecast and gap](ts_viz/ts_parameter_viz.png)"
]
},
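{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the `gap` arithmetic concrete, here is the Monday-Friday example from above computed with pandas (hypothetical dates chosen for illustration):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Data is available through a Friday; the forecast starts the following Monday.\n",
"last_available = pd.Timestamp(\"2023-01-06\")  # a Friday\n",
"forecast_start = pd.Timestamp(\"2023-01-09\")  # the following Monday\n",
"\n",
"# gap = number of time periods separating the two (Saturday and Sunday)\n",
"gap = (forecast_start - last_available).days - 1\n",
"print(gap)  # 2"
]
},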
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from evalml.automl import AutoMLSearch\n",
"\n",
"problem_config = {\"gap\": 0, \"max_delay\": 7, \"forecast_horizon\": 7, \"time_index\": \"Date\"}\n",
"\n",
"automl = AutoMLSearch(\n",
" X_train,\n",
" y_train,\n",
" problem_type=\"time series regression\",\n",
" max_batches=1,\n",
" problem_configuration=problem_config,\n",
" automl_algorithm=\"iterative\",\n",
" allowed_model_families=[\n",
" \"xgboost\",\n",
" \"random_forest\",\n",
" \"linear_model\",\n",
" \"extra_trees\",\n",
" ],\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"automl.search()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Understanding what happened under the hood\n",
"\n",
"This is great: ``AutoMLSearch`` is able to find a pipeline that scores an R2 value of 0.44, compared to a baseline pipeline that only scores 0.07. But how did it do that? \n",
"\n",
"### Data Splitting\n",
"\n",
"EvalML uses [rolling origin cross validation](https://robjhyndman.com/hyndsight/tscv/) for time series problems. Basically, we take successive cuts of the training data while keeping the validation set size fixed at `forecast_horizon` number of time units. Note that the splits are not separated by ``gap`` number of units. This is because we need access to all the data to generate features for every row of the validation set. However, the feature engineering done by our pipelines respects the ``gap`` value. This is explained more in the [feature engineering section](#Feature-Engineering).\n",
"\n",
"![cross validation](ts_viz/cv_viz.png)\n",
"\n",
"### Baseline Pipeline\n",
"\n",
"The most naive thing we can do in a time series problem is use the most recently available observation to predict the next observation. In our example, this means we'll use the measurement from ``7`` days ago as the prediction for the current date."
]
},
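{
"cell_type": "markdown",
"metadata": {},
"source": [
"The rolling origin splitting described in the Data Splitting section above can be sketched as follows. This is a simplified illustration, not EvalML's actual splitter: each successive split adds more training data while the validation window stays fixed at `forecast_horizon` rows."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def rolling_origin_splits(n_rows, forecast_horizon, n_splits):\n",
"    # Each validation fold is forecast_horizon rows; training always starts at row 0.\n",
"    splits = []\n",
"    for i in range(n_splits):\n",
"        test_end = n_rows - (n_splits - 1 - i) * forecast_horizon\n",
"        test_start = test_end - forecast_horizon\n",
"        splits.append((range(0, test_start), range(test_start, test_end)))\n",
"    return splits\n",
"\n",
"\n",
"for train_idx, test_idx in rolling_origin_splits(30, forecast_horizon=7, n_splits=3):\n",
"    print(f\"train: 0-{train_idx[-1]}, validation: {test_idx[0]}-{test_idx[-1]}\")"
]
},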
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"baseline = automl.get_pipeline(0)\n",
"baseline.fit(X_train, y_train)\n",
"naive_baseline_preds = baseline.predict_in_sample(\n",
" X_test, y_test, objective=None, X_train=X_train, y_train=y_train\n",
")\n",
"expected_preds = pd.Series(\n",
" pd.concat([y_train.iloc[-7:], y_test]).shift(7).iloc[7:], name=\"target\"\n",
")\n",
"pd.testing.assert_series_equal(expected_preds, naive_baseline_preds)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Feature Engineering\n",
"\n",
"EvalML uses the values of `gap`, `forecast_horizon`, and `max_delay` to calculate a \"window\" of allowed dates that can be used for engineering the features of each row in the validation/test set. The formula for computing the bounds of the window is:\n",
"\n",
"`[t - (max_delay + gap + forecast_horizon), t - (gap + forecast_horizon)]`\n",