datetime_format_data_check#

Data check that checks if the datetime column has equally spaced intervals and is monotonically increasing or decreasing in order to be supported by time series estimators.

Module Contents#

Classes Summary#

DateTimeFormatDataCheck

Check if the datetime column has equally spaced intervals and is monotonically increasing or decreasing in order to be supported by time series estimators.

Contents#

class evalml.data_checks.datetime_format_data_check.DateTimeFormatDataCheck(datetime_column='index', nan_duplicate_threshold=0.75)[source]#

Check if the datetime column has equally spaced intervals and is monotonically increasing or decreasing in order to be supported by time series estimators.

Parameters
  • datetime_column (str, int) – The name of the datetime column. If the datetime values are in the index, then pass “index”.

  • nan_duplicate_threshold (float) – The percentage of values in the datetime_column that must not be duplicate or nan before DATETIME_NO_FREQUENCY_INFERRED is returned instead of DATETIME_HAS_UNEVEN_INTERVALS. For example, if this is set to 0.80, then only 20% of the values in datetime_column can be duplicate or nan. Defaults to 0.75.

Methods

name

Return a name describing the data check.

validate

Checks if the target data has equal intervals and is monotonically increasing.

name(cls)#

Return a name describing the data check.

validate(self, X, y)[source]#

Checks if the target data has equal intervals and is monotonically increasing.

Will return a DataCheckError if the data is not a datetime type, is not increasing, has redundant or missing row(s), contains invalid (NaN or None) values, or has values that don’t align with the assumed frequency.

Parameters
  • X (pd.DataFrame, np.ndarray) – Features.

  • y (pd.Series, np.ndarray) – Target data.

Returns

List with DataCheckErrors if unequal intervals are found in the datetime column.

Return type

dict (DataCheckError)

Examples

>>> import pandas as pd

The column ‘dates’ has a set of two dates with daily frequency, two dates with hourly frequency, and two dates with monthly frequency.

>>> X = pd.DataFrame(pd.date_range("2015-01-01", periods=2).append(pd.date_range("2015-01-08", periods=2, freq="H").append(pd.date_range("2016-03-02", periods=2, freq="M"))), columns=["dates"])
>>> y = pd.Series([0, 1, 0, 1, 1, 0])
>>> datetime_format_dc = DateTimeFormatDataCheck(datetime_column="dates")
>>> assert datetime_format_dc.validate(X, y) == [
...     {
...         "message": "No frequency could be detected in column 'dates', possibly due to uneven intervals or too many duplicate/missing values.",
...         "data_check_name": "DateTimeFormatDataCheck",
...         "level": "error",
...         "code": "DATETIME_NO_FREQUENCY_INFERRED",
...         "details": {"columns": None, "rows": None},
...         "action_options": []
...      }
... ]

The column “dates” has a gap in the values, which implies there are many dates missing.

>>> X = pd.DataFrame(pd.date_range("2021-01-01", periods=9).append(pd.date_range("2021-01-31", periods=50)), columns=["dates"])
>>> y = pd.Series([0, 1, 0, 1, 1, 0, 0, 0, 1, 0])
>>> ww_payload = infer_frequency(X["dates"], debug=True, window_length=5, threshold=0.8)
>>> datetime_format_dc = DateTimeFormatDataCheck(datetime_column="dates")
>>> assert datetime_format_dc.validate(X, y) == [
...     {
...         "message": "Column 'dates' has datetime values missing between start and end date.",
...         "data_check_name": "DateTimeFormatDataCheck",
...         "level": "error",
...         "code": "DATETIME_IS_MISSING_VALUES",
...         "details": {"columns": None, "rows": None},
...         "action_options": []
...      },
...     {
...         "message": "A frequency was detected in column 'dates', but there are faulty datetime values that need to be addressed.",
...         "data_check_name": "DateTimeFormatDataCheck",
...         "level": "error",
...         "code": "DATETIME_HAS_UNEVEN_INTERVALS",
...         "details": {'columns': None, 'rows': None},
...         "action_options": [
...             {
...                 'code': 'REGULARIZE_AND_IMPUTE_DATASET',
...                 'data_check_name': 'DateTimeFormatDataCheck',
...                 'metadata': {
...                         'columns': None,
...                         'is_target': True,
...                         'rows': None
...                 },
...                 'parameters': {
...                         'time_index': {
...                             'default_value': 'dates',
...                             'parameter_type': 'global',
...                             'type': 'str'
...                         },
...                         'frequency_payload': {
...                             'default_value': ww_payload,
...                             'parameter_type': 'global',
...                             'type': 'tuple'
...                         }
...                 }
...             }
...         ]
...     }
... ]

The column “dates” has a repeat of the date 2021-01-09 appended to the end, which is considered redundant and will raise an error.

>>> X = pd.DataFrame(pd.date_range("2021-01-01", periods=9).append(pd.date_range("2021-01-09", periods=1)), columns=["dates"])
>>> y = pd.Series([0, 1, 0, 1, 1, 0, 0, 0, 1, 0])
>>> ww_payload = infer_frequency(X["dates"], debug=True, window_length=5, threshold=0.8)
>>> datetime_format_dc = DateTimeFormatDataCheck(datetime_column="dates")
>>> assert datetime_format_dc.validate(X, y) == [
...     {
...         "message": "Column 'dates' has more than one row with the same datetime value.",
...         "data_check_name": "DateTimeFormatDataCheck",
...         "level": "error",
...         "code": "DATETIME_HAS_REDUNDANT_ROW",
...         "details": {"columns": None, "rows": None},
...         "action_options": []
...      },
...     {
...         "message": "A frequency was detected in column 'dates', but there are faulty datetime values that need to be addressed.",
...         "data_check_name": "DateTimeFormatDataCheck",
...         "level": "error",
...         "code": "DATETIME_HAS_UNEVEN_INTERVALS",
...         "details": {'columns': None, 'rows': None},
...         "action_options": [
...             {
...                 'code': 'REGULARIZE_AND_IMPUTE_DATASET',
...                 'data_check_name': 'DateTimeFormatDataCheck',
...                 'metadata': {
...                         'columns': None,
...                         'is_target': True,
...                         'rows': None
...                 },
...                 'parameters': {
...                         'time_index': {
...                             'default_value': 'dates',
...                             'parameter_type': 'global',
...                             'type': 'str'
...                         },
...                         'frequency_payload': {
...                             'default_value': ww_payload,
...                             'parameter_type': 'global',
...                             'type': 'tuple'
...                         }
...                 }
...             }
...         ]
...     }
... ]

The column “Weeks” has a date that does not follow the weekly pattern, which is considered misaligned.

>>> X = pd.DataFrame(pd.date_range("2021-01-01", freq="W", periods=12).append(pd.date_range("2021-03-22", periods=1)), columns=["Weeks"])
>>> ww_payload = infer_frequency(X["Weeks"], debug=True, window_length=5, threshold=0.8)
>>> datetime_format_dc = DateTimeFormatDataCheck(datetime_column="Weeks")
>>> assert datetime_format_dc.validate(X, y) == [
...     {
...         "message": "Column 'Weeks' has datetime values that do not align with the inferred frequency.",
...         "data_check_name": "DateTimeFormatDataCheck",
...         "level": "error",
...         "details": {"columns": None, "rows": None},
...         "code": "DATETIME_HAS_MISALIGNED_VALUES",
...         "action_options": []
...      },
...     {
...         "message": "A frequency was detected in column 'Weeks', but there are faulty datetime values that need to be addressed.",
...         "data_check_name": "DateTimeFormatDataCheck",
...         "level": "error",
...         "code": "DATETIME_HAS_UNEVEN_INTERVALS",
...         "details": {'columns': None, 'rows': None},
...         "action_options": [
...             {
...                 'code': 'REGULARIZE_AND_IMPUTE_DATASET',
...                 'data_check_name': 'DateTimeFormatDataCheck',
...                 'metadata': {
...                         'columns': None,
...                         'is_target': True,
...                         'rows': None
...                 },
...                 'parameters': {
...                         'time_index': {
...                             'default_value': 'Weeks',
...                             'parameter_type': 'global',
...                             'type': 'str'
...                         },
...                         'frequency_payload': {
...                             'default_value': ww_payload,
...                             'parameter_type': 'global',
...                             'type': 'tuple'
...                         }
...                 }
...             }
...         ]
...     }
... ]

The column “Weeks” has a date that does not follow the weekly pattern, which is considered misaligned.

>>> X = pd.DataFrame(pd.date_range("2021-01-01", freq="W", periods=12).append(pd.date_range("2021-03-22", periods=1)), columns=["Weeks"])
>>> ww_payload = infer_frequency(X["Weeks"], debug=True, window_length=5, threshold=0.8)
>>> datetime_format_dc = DateTimeFormatDataCheck(datetime_column="Weeks")
>>> assert datetime_format_dc.validate(X, y) == [
...     {
...         "message": "Column 'Weeks' has datetime values that do not align with the inferred frequency.",
...         "data_check_name": "DateTimeFormatDataCheck",
...         "level": "error",
...         "details": {"columns": None, "rows": None},
...         "code": "DATETIME_HAS_MISALIGNED_VALUES",
...         "action_options": []
...      },
...     {
...         "message": "A frequency was detected in column 'Weeks', but there are faulty datetime values that need to be addressed.",
...         "data_check_name": "DateTimeFormatDataCheck",
...         "level": "error",
...         "code": "DATETIME_HAS_UNEVEN_INTERVALS",
...         "details": {'columns': None, 'rows': None},
...         "action_options": [
...             {
...                 'code': 'REGULARIZE_AND_IMPUTE_DATASET',
...                 'data_check_name': 'DateTimeFormatDataCheck',
...                 'metadata': {
...                         'columns': None,
...                         'is_target': True,
...                         'rows': None
...                 },
...                 'parameters': {
...                         'time_index': {
...                             'default_value': 'Weeks',
...                             'parameter_type': 'global',
...                             'type': 'str'
...                         },
...                         'frequency_payload': {
...                             'default_value': ww_payload,
...                             'parameter_type': 'global',
...                             'type': 'tuple'
...                         }
...                 }
...             }
...         ]
...     }
... ]

The column “Weeks” passed integers instead of datetime data, which will raise an error.

>>> X = pd.DataFrame([1, 2, 3, 4], columns=["Weeks"])
>>> y = pd.Series([0] * 4)
>>> datetime_format_dc = DateTimeFormatDataCheck(datetime_column="Weeks")
>>> assert datetime_format_dc.validate(X, y) == [
...     {
...         "message": "Datetime information could not be found in the data, or was not in a supported datetime format.",
...         "data_check_name": "DateTimeFormatDataCheck",
...         "level": "error",
...         "details": {"columns": None, "rows": None},
...         "code": "DATETIME_INFORMATION_NOT_FOUND",
...         "action_options": []
...      }
... ]

Converting that same integer data to datetime, however, is valid.

>>> X = pd.DataFrame(pd.to_datetime([1, 2, 3, 4]), columns=["Weeks"])
>>> datetime_format_dc = DateTimeFormatDataCheck(datetime_column="Weeks")
>>> assert datetime_format_dc.validate(X, y) == []
>>> X = pd.DataFrame(pd.date_range("2021-01-01", freq="W", periods=10), columns=["Weeks"])
>>> datetime_format_dc = DateTimeFormatDataCheck(datetime_column="Weeks")
>>> assert datetime_format_dc.validate(X, y) == []

While the data passed in is of datetime type, time series requires the datetime information in datetime_column to be monotonically increasing (ascending).

>>> X = X.iloc[::-1]
>>> datetime_format_dc = DateTimeFormatDataCheck(datetime_column="Weeks")
>>> assert datetime_format_dc.validate(X, y) == [
...     {
...         "message": "Datetime values must be sorted in ascending order.",
...         "data_check_name": "DateTimeFormatDataCheck",
...         "level": "error",
...         "details": {"columns": None, "rows": None},
...         "code": "DATETIME_IS_NOT_MONOTONIC",
...         "action_options": []
...      }
... ]

The first value in the column “index” is replaced with NaT, which will raise an error in this data check.

>>> dates = [["2-1-21", "3-1-21"],
...         ["2-2-21", "3-2-21"],
...         ["2-3-21", "3-3-21"],
...         ["2-4-21", "3-4-21"],
...         ["2-5-21", "3-5-21"],
...         ["2-6-21", "3-6-21"],
...         ["2-7-21", "3-7-21"],
...         ["2-8-21", "3-8-21"],
...         ["2-9-21", "3-9-21"],
...         ["2-10-21", "3-10-21"],
...         ["2-11-21", "3-11-21"],
...         ["2-12-21", "3-12-21"]]
>>> dates[0][0] = None
>>> df = pd.DataFrame(dates, columns=["days", "days2"])
>>> ww_payload = infer_frequency(pd.to_datetime(df["days"]), debug=True, window_length=5, threshold=0.8)
>>> datetime_format_dc = DateTimeFormatDataCheck(datetime_column="days")
>>> assert datetime_format_dc.validate(df, y) == [
...     {
...         "message": "Input datetime column 'days' contains NaN values. Please impute NaN values or drop these rows.",
...         "data_check_name": "DateTimeFormatDataCheck",
...         "level": "error",
...         "details": {"columns": None, "rows": None},
...         "code": "DATETIME_HAS_NAN",
...         "action_options": []
...      },
...     {
...         "message": "A frequency was detected in column 'days', but there are faulty datetime values that need to be addressed.",
...         "data_check_name": "DateTimeFormatDataCheck",
...         "level": "error",
...         "code": "DATETIME_HAS_UNEVEN_INTERVALS",
...         "details": {'columns': None, 'rows': None},
...         "action_options": [
...             {
...                 'code': 'REGULARIZE_AND_IMPUTE_DATASET',
...                 'data_check_name': 'DateTimeFormatDataCheck',
...                 'metadata': {
...                         'columns': None,
...                         'is_target': True,
...                         'rows': None
...                 },
...                 'parameters': {
...                         'time_index': {
...                             'default_value': 'days',
...                             'parameter_type': 'global',
...                             'type': 'str'
...                         },
...                         'frequency_payload': {
...                             'default_value': ww_payload,
...                             'parameter_type': 'global',
...                             'type': 'tuple'
...                         }
...                 }
...             }
...         ]
...     }
... ]
...