id_columns_data_check
Module Contents
Classes Summary
IDColumnsDataCheck | Check if any of the features are likely to be ID columns.
Contents
class evalml.data_checks.id_columns_data_check.IDColumnsDataCheck(id_threshold=1.0)
Check if any of the features are likely to be ID columns.
- Parameters
id_threshold (float) – Probability at or above which a column is considered an ID column. Defaults to 1.0.
Methods
name | Returns a name describing the data check.
validate | Check if any of the features are likely to be ID columns.
name(cls)
Returns a name describing the data check.
validate(self, X, y=None)
Check if any of the features are likely to be ID columns. Currently performs these simple checks:
- column name is “id”
- column name ends in “_id”
- column contains all unique values (and is categorical / integer type)
- Parameters
X (pd.DataFrame, np.ndarray) – The input features to check
- Returns
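A dictionary with "errors", "warnings", and "actions" entries. Each feature whose probability of being an ID column meets id_threshold is reported by column name or index in a warning, with a matching DROP_COL action.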
- Return type
dict
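The three checks and the role of id_threshold can be illustrated with a small standalone sketch. This is not the evalml implementation: the function names, the plain dict-of-lists input, and the 0.95 / 1.0 scores are all assumptions chosen for illustration (one matching signal gives 0.95, a name match plus uniqueness gives 1.0).

```python
def id_column_scores(columns):
    """Score how likely each column is to be an ID column.

    `columns` maps a column name to a list of values. The 0.95 and 1.0
    scores are illustrative assumptions, not values from the evalml docs.
    """
    scores = {}
    for name, values in columns.items():
        lowered = str(name).lower()
        # Checks 1 and 2: the name is "id" or ends in "_id".
        name_hit = lowered == "id" or lowered.endswith("_id")
        # Check 3: all values are unique, counted only for integer or
        # string (categorical-like) data; booleans are excluded.
        typed = all(
            isinstance(v, (int, str)) and not isinstance(v, bool) for v in values
        )
        unique_hit = typed and len(values) > 0 and len(set(values)) == len(values)
        hits = int(name_hit) + int(unique_hit)
        if hits == 2:
            scores[name] = 1.0
        elif hits == 1:
            scores[name] = 0.95
    return scores


def likely_id_columns(columns, id_threshold=1.0):
    """Keep only the columns whose score meets the threshold."""
    scored = id_column_scores(columns)
    return {name: score for name, score in scored.items() if score >= id_threshold}


data = {
    "df_id": [0, 1, 2, 3, 4],
    "x": [10, 42, 31, 51, 61],
    "y": [42, 54, 12, 64, 12],
}
flagged_strict = likely_id_columns(data)                   # default threshold 1.0
flagged_loose = likely_id_columns(data, id_threshold=0.9)  # lower threshold
```

Under these assumed scores, the default threshold flags only 'df_id', while a lower threshold would also flag 'x', whose values happen to be unique integers. A strict default avoids flagging ordinary numeric features that have no repeated values.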
Example
>>> import pandas as pd
>>> from evalml.data_checks.id_columns_data_check import IDColumnsDataCheck
>>> df = pd.DataFrame({
...     'df_id': [0, 1, 2, 3, 4],
...     'x': [10, 42, 31, 51, 61],
...     'y': [42, 54, 12, 64, 12]
... })
>>> id_col_check = IDColumnsDataCheck()
>>> assert id_col_check.validate(df) == {
...     "errors": [],
...     "warnings": [{"message": "Column 'df_id' is 100.0% or more likely to be an ID column",
...                   "data_check_name": "IDColumnsDataCheck",
...                   "level": "warning",
...                   "code": "HAS_ID_COLUMN",
...                   "details": {"column": "df_id"}}],
...     "actions": [{"code": "DROP_COL", "metadata": {"column": "df_id"}}]
... }
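One way to act on the returned dictionary is to collect the columns named in its DROP_COL actions. The sketch below works on the result shown in the example above, copied as a literal so it runs without evalml installed; the helper name columns_to_drop is hypothetical and not part of the library.

```python
# Result dictionary copied from the example above.
result = {
    "errors": [],
    "warnings": [{
        "message": "Column 'df_id' is 100.0% or more likely to be an ID column",
        "data_check_name": "IDColumnsDataCheck",
        "level": "warning",
        "code": "HAS_ID_COLUMN",
        "details": {"column": "df_id"},
    }],
    "actions": [{"code": "DROP_COL", "metadata": {"column": "df_id"}}],
}


def columns_to_drop(validation_result):
    """Return the column names referenced by DROP_COL actions (hypothetical helper)."""
    return [
        action["metadata"]["column"]
        for action in validation_result.get("actions", [])
        if action.get("code") == "DROP_COL"
    ]


to_drop = columns_to_drop(result)
```

With a pandas DataFrame, the resulting list could then be passed to df.drop(columns=to_drop) before training.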