id_columns_data_check

Data check that checks if any of the features are likely to be ID columns.

Module Contents

Classes Summary

IDColumnsDataCheck

Check if any of the features are likely to be ID columns.

Contents

class evalml.data_checks.id_columns_data_check.IDColumnsDataCheck(id_threshold=1.0)

Check if any of the features are likely to be ID columns.

Parameters: id_threshold (float) – The minimum probability at which a column is reported as an ID column. Defaults to 1.0.

Methods

`name`	Return a name describing the data check.
`validate`	Check if any of the features are likely to be ID columns. Currently performs a number of simple checks.

name(cls)

Return a name describing the data check.

validate(self, X, y=None)

Check if any of the features are likely to be ID columns. Currently performs a number of simple checks.

Checks performed are:

column name is “id”

column name ends in “_id”

column contains all unique values (and is categorical / integer type)
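
The three checks above can be approximated in plain pandas. The following is a minimal sketch, not evalml's implementation: the helper name `id_column_hints`, the case-insensitive name matching, and the treatment of object dtype as categorical are all assumptions made for illustration. It only reports which checks a column matches and does not compute a probability.

```python
import pandas as pd
from pandas.api.types import is_integer_dtype, is_object_dtype


def id_column_hints(X):
    """Hypothetical helper: map each column to the ID heuristics it matches."""
    hints = {}
    for col in X.columns:
        # Assumption: names are compared case-insensitively.
        name = str(col).lower()
        matched = []
        if name == "id":
            matched.append("named_id")
        if name.endswith("_id"):
            matched.append("ends_with_id")
        series = X[col]
        # Assumption: integer, categorical, and object dtypes count as
        # "categorical / integer type" for the uniqueness check.
        is_candidate_type = (
            is_integer_dtype(series)
            or is_object_dtype(series)
            or isinstance(series.dtype, pd.CategoricalDtype)
        )
        if is_candidate_type and len(series) > 0 and series.nunique() == len(series):
            matched.append("all_unique")
        if matched:
            hints[col] = matched
    return hints


df = pd.DataFrame({
    "df_id": [0, 1, 2, 3, 4],
    "x": [10, 42, 31, 51, 61],
    "y": [42, 54, 12, 64, 12],
})
print(id_column_hints(df))
```

In this sketch `x` also matches the uniqueness check, whereas the documented example reports only `df_id` at the default `id_threshold` of 1.0. How evalml weighs the individual checks into a probability is not specified in this section.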

Parameters

X (pd.DataFrame, np.ndarray) – The input features to check.
y (pd.Series) – The target. Ignored by this check. Defaults to None.

Returns

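A dictionary with "errors", "warnings", and "actions" keys. Each column whose probability of being an ID column meets the threshold produces a warning with code HAS_ID_COLUMN and a DROP_COL action naming that column.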

Return type

dict

Example

>>> import pandas as pd
>>> from evalml.data_checks import IDColumnsDataCheck
>>> df = pd.DataFrame({
...     'df_id': [0, 1, 2, 3, 4],
...     'x': [10, 42, 31, 51, 61],
...     'y': [42, 54, 12, 64, 12]
... })
>>> id_col_check = IDColumnsDataCheck()
>>> assert id_col_check.validate(df) == {
...     "errors": [],
...     "warnings": [{"message": "Column 'df_id' is 100.0% or more likely to be an ID column",
...                   "data_check_name": "IDColumnsDataCheck",
...                   "level": "warning",
...                   "code": "HAS_ID_COLUMN",
...                   "details": {"column": "df_id"}}],
...     "actions": [{"code": "DROP_COL",
...                  "metadata": {"column": "df_id"}}]}
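
The result dictionary can be post-processed with ordinary Python. As a small sketch, the hypothetical helper `columns_to_drop` below (not part of evalml) collects the columns named in DROP_COL actions. It operates on a literal copied from the example above, so it runs without evalml installed.

```python
def columns_to_drop(result):
    """Hypothetical helper: list columns named in DROP_COL actions."""
    return [
        action["metadata"]["column"]
        for action in result.get("actions", [])
        if action.get("code") == "DROP_COL"
    ]


# Literal taken from the documented validate() output above.
result = {
    "errors": [],
    "warnings": [{
        "message": "Column 'df_id' is 100.0% or more likely to be an ID column",
        "data_check_name": "IDColumnsDataCheck",
        "level": "warning",
        "code": "HAS_ID_COLUMN",
        "details": {"column": "df_id"},
    }],
    "actions": [{"code": "DROP_COL", "metadata": {"column": "df_id"}}],
}

print(columns_to_drop(result))  # ['df_id']
```

The returned list can then be passed to `df.drop(columns=...)` if dropping the flagged columns is the desired follow-up.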
