Data Checks

Submodules
- class_imbalance_data_check
- data_check
- data_check_action
- data_check_action_code
- data_check_message
- data_check_message_code
- data_check_message_type
- data_checks
- datetime_format_data_check
- datetime_nan_data_check
- default_data_checks
- highly_null_data_check
- id_columns_data_check
- invalid_targets_data_check
- multicollinearity_data_check
- natural_language_nan_data_check
- no_variance_data_check
- outliers_data_check
- sparsity_data_check
- target_distribution_data_check
- target_leakage_data_check
- uniqueness_data_check
- utils
Package Contents

Classes Summary

- ClassImbalanceDataCheck: Check if any of the target labels are imbalanced, or if the number of values for each target are below 2 times the number of CV folds. Use for classification problems.
- DataCheck: Base class for all data checks. Data checks are a set of heuristics used to determine if there are problems with input data.
- DataCheckAction: A recommended action returned by a DataCheck.
- DataCheckActionCode: Enum for data check action code.
- DataCheckError: DataCheckMessage subclass for errors returned by data checks.
- DataCheckMessage: Base class for a message returned by a DataCheck, tagged by name.
- DataCheckMessageCode: Enum for data check message code.
- DataCheckMessageType: Enum for type of data check message: WARNING or ERROR.
- DataChecks: A collection of data checks.
- DataCheckWarning: DataCheckMessage subclass for warnings returned by data checks.
- DateTimeFormatDataCheck: Checks if the datetime column has equally spaced intervals and is monotonically increasing or decreasing in order to be supported by time series estimators.
- DateTimeNaNDataCheck: Checks each column in the input for datetime features and will issue an error if NaN values are present.
- DefaultDataChecks: A collection of basic data checks that is used by AutoML by default.
- EmptyDataChecks: An empty collection of data checks.
- HighlyNullDataCheck: Checks if there are any highly-null columns and rows in the input.
- IDColumnsDataCheck: Check if any of the features are likely to be ID columns.
- InvalidTargetDataCheck: Checks if the target data contains missing or invalid values.
- MulticollinearityDataCheck: Check if any set of features are likely to be multicollinear.
- NaturalLanguageNaNDataCheck: Checks each column in the input for natural language features and will issue an error if NaN values are present.
- NoVarianceDataCheck: Check if the target or any of the features have no variance.
- OutliersDataCheck: Checks if there are any outliers in input data by using IQR to determine score anomalies. Columns with score anomalies are considered to contain outliers.
- SparsityDataCheck: Checks if there are any columns with sparsely populated values in the input.
- TargetDistributionDataCheck: Checks if the target data contains certain distributions that may need to be transformed prior to training to improve model performance.
- TargetLeakageDataCheck: Check if any of the features are highly correlated with the target by using mutual information or Pearson correlation.
- UniquenessDataCheck: Checks if there are any columns in the input that are either too unique for classification problems or not unique enough for regression problems.

Contents
class evalml.data_checks.ClassImbalanceDataCheck(threshold=0.1, min_samples=100, num_cv_folds=3)

Check if any of the target labels are imbalanced, or if the number of values for each target are below 2 times the number of CV folds. Use for classification problems.
- Parameters
threshold (float) – The minimum threshold allowed for class imbalance before a warning is raised. This threshold is calculated by comparing the number of samples in each class to the sum of samples in that class and the majority class. For example, a multiclass case with [900, 900, 100] samples per classes 0, 1, and 2, respectively, would have a 0.10 threshold for class 2 (100 / (900 + 100)). Defaults to 0.10.
min_samples (int) – The minimum number of samples per accepted class. If the minority class is both below the threshold and min_samples, then we consider this severely imbalanced. Must be greater than 0. Defaults to 100.
num_cv_folds (int) – The number of cross-validation folds. Must be positive. Choose 0 to ignore this warning. Defaults to 3.
Methods
Returns a name describing the data check.
Checks if any target labels are imbalanced beyond a threshold for binary and multiclass problems.
name(cls)

Returns a name describing the data check.
validate(self, X, y)

Checks if any target labels are imbalanced beyond a threshold for binary and multiclass problems. Ignores NaN values in target labels if they appear.
- Parameters
X (pd.DataFrame, np.ndarray) – Features. Ignored.
y (pd.Series, np.ndarray) – Target labels to check for imbalanced data.
- Returns
Dictionary with DataCheckWarnings if imbalance in classes is less than the threshold, and DataCheckErrors if the number of values for each target is below 2 * num_cv_folds.
- Return type
dict
Example
>>> import pandas as pd
>>> X = pd.DataFrame()
>>> y = pd.Series([0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
>>> target_check = ClassImbalanceDataCheck(threshold=0.10)
>>> assert target_check.validate(X, y) == {"errors": [{"message": "The number of instances of these targets is less than 2 * the number of cross folds = 6 instances: [0]", "data_check_name": "ClassImbalanceDataCheck", "level": "error", "code": "CLASS_IMBALANCE_BELOW_FOLDS", "details": {"target_values": [0]}}],
...                                        "warnings": [{"message": "The following labels fall below 10% of the target: [0]", "data_check_name": "ClassImbalanceDataCheck", "level": "warning", "code": "CLASS_IMBALANCE_BELOW_THRESHOLD", "details": {"target_values": [0]}},
...                                                     {"message": "The following labels in the target have severe class imbalance because they fall under 10% of the target and have less than 100 samples: [0]", "data_check_name": "ClassImbalanceDataCheck", "level": "warning", "code": "CLASS_IMBALANCE_SEVERE", "details": {"target_values": [0]}}],
...                                        "actions": []}
class evalml.data_checks.DataCheck

Base class for all data checks. Data checks are a set of heuristics used to determine if there are problems with input data.
Methods
Returns a name describing the data check.
Inspects and validates the input data, runs any necessary calculations or algorithms, and returns a list of warnings and errors if applicable.
name(cls)

Returns a name describing the data check.
abstract validate(self, X, y=None)

Inspects and validates the input data, runs any necessary calculations or algorithms, and returns a list of warnings and errors if applicable.
- Parameters
X (pd.DataFrame) – The input data of shape [n_samples, n_features]
y (pd.Series, optional) – The target data of length [n_samples]
- Returns
Dictionary of DataCheckError and DataCheckWarning messages
- Return type
dict (DataCheckMessage)
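Custom checks follow this interface: subclass DataCheck and implement validate. A minimal sketch, assuming X arrives as a pd.DataFrame and that the returned dict uses the "warnings"/"errors"/"actions" shape shown in the examples on this page; the check name and column logic below are illustrative, not part of the library:

>>> from evalml.data_checks import DataCheck, DataCheckWarning
>>> class HasNegativeValuesDataCheck(DataCheck):
...     """Hypothetical check: warn when any numeric feature contains negative values."""
...     def validate(self, X, y=None):
...         results = {"warnings": [], "errors": [], "actions": []}
...         for col in X.select_dtypes("number").columns:
...             if (X[col] < 0).any():
...                 results["warnings"].append(
...                     DataCheckWarning(message=f"Column '{col}' contains negative values",
...                                      data_check_name=self.name).to_dict())
...         return results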
class evalml.data_checks.DataCheckAction(action_code, metadata=None)

A recommended action returned by a DataCheck.
- Parameters
action_code (DataCheckActionCode) – Action code associated with the action.
metadata (dict, optional) – Additional useful information associated with the action. Defaults to None.
Methods
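No methods are listed for this class; a minimal construction sketch follows. The to_dict call is an assumption based on the serialized actions (dicts with "code" and "metadata" keys) shown in the validate examples on this page:

>>> from evalml.data_checks import DataCheckAction, DataCheckActionCode
>>> action = DataCheckAction(DataCheckActionCode.DROP_COL, metadata={"column": "mostly_null"})
>>> action.to_dict()  # assumed serializer; expected shape: {"code": "DROP_COL", "metadata": {"column": "mostly_null"}}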
class evalml.data_checks.DataCheckActionCode

Enum for data check action code.
Attributes
DROP_COL
Action code for dropping a column.
DROP_ROWS
Action code for dropping rows.
IMPUTE_COL
Action code for imputing a column.
TRANSFORM_TARGET
Action code for transforming the target data.
Methods
The name of the Enum member.
The value of the Enum member.
name(self)

The name of the Enum member.
value(self)

The value of the Enum member.
class evalml.data_checks.DataCheckError(message, data_check_name, message_code=None, details=None)

DataCheckMessage subclass for errors returned by data checks.
Attributes
message_type
DataCheckMessageType.ERROR
Methods
to_dict(self)
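A short sketch of constructing an error and serializing it; the dict shape (message, data_check_name, level, code, details) matches the "errors" entries shown in the examples on this page, though the message text here is illustrative:

>>> from evalml.data_checks import DataCheckError, DataCheckMessageCode
>>> error = DataCheckError(message="Target is either empty or fully null.",
...                        data_check_name="InvalidTargetDataCheck",
...                        message_code=DataCheckMessageCode.TARGET_IS_EMPTY_OR_FULLY_NULL,
...                        details={})
>>> error.to_dict()
{'message': 'Target is either empty or fully null.', 'data_check_name': 'InvalidTargetDataCheck', 'level': 'error', 'code': 'TARGET_IS_EMPTY_OR_FULLY_NULL', 'details': {}}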
class evalml.data_checks.DataCheckMessage(message, data_check_name, message_code=None, details=None)

Base class for a message returned by a DataCheck, tagged by name.
- Parameters
message (str) – Message string
data_check_name (str) – Name of data check
message_code (DataCheckMessageCode) – Message code associated with message. Defaults to None.
details (dict) – Additional useful information associated with the message. Defaults to None.
Attributes
message_type
None
Methods
class evalml.data_checks.DataCheckMessageCode

Enum for data check message code.
Attributes
CLASS_IMBALANCE_BELOW_FOLDS
Message code for when the number of values for each target is below 2 * number of CV folds.
CLASS_IMBALANCE_BELOW_THRESHOLD
Message code for when balance in classes is less than the threshold.
CLASS_IMBALANCE_SEVERE
Message code for when balance in classes is less than the threshold and minimum class is less than minimum number of accepted samples.
DATETIME_HAS_NAN
Message code for when input datetime columns contain NaN values.
DATETIME_HAS_UNEVEN_INTERVALS
Message code for when the datetime values have uneven intervals.
DATETIME_INFORMATION_NOT_FOUND
Message code for when datetime information can not be found or is in an unaccepted format.
DATETIME_IS_NOT_MONOTONIC
Message code for when the datetime values are not monotonically increasing.
HAS_ID_COLUMN
Message code for data that has ID columns.
HAS_OUTLIERS
Message code for when outliers are detected.
HIGH_VARIANCE
Message code for when high variance is detected for cross-validation.
HIGHLY_NULL_COLS
Message code for highly null columns.
HIGHLY_NULL_ROWS
Message code for highly null rows.
IS_MULTICOLLINEAR
Message code for when data is potentially multicollinear.
MISMATCHED_INDICES
Message code for when input target and features have mismatched indices.
MISMATCHED_INDICES_ORDER
Message code for when input target and features have mismatched indices order. The two inputs have the same index values, but shuffled.
MISMATCHED_LENGTHS
Message code for when input target and features have different lengths.
NATURAL_LANGUAGE_HAS_NAN
Message code for when input natural language columns contain NaN values.
NO_VARIANCE
Message code for when data has no variance (1 unique value).
NO_VARIANCE_WITH_NULL
Message code for when data has one unique value and NaN values.
NOT_UNIQUE_ENOUGH
Message code for when data does not possess enough unique values.
TARGET_BINARY_NOT_TWO_UNIQUE_VALUES
Message code for target data for a binary classification problem that does not have two unique values.
TARGET_HAS_NULL
Message code for target data that has null values.
TARGET_INCOMPATIBLE_OBJECTIVE
Message code for target data that has incompatible values for the specified objective.
TARGET_IS_EMPTY_OR_FULLY_NULL
Message code for target data that is empty or has all null values.
TARGET_IS_NONE
Message code for when target is None.
TARGET_LEAKAGE
Message code for when target leakage is detected.
TARGET_LOGNORMAL_DISTRIBUTION
Message code for target data with a lognormal distribution.
TARGET_MULTICLASS_HIGH_UNIQUE_CLASS
Message code for target data for a multi classification problem that has an abnormally large number of unique classes relative to the number of target values.
TARGET_MULTICLASS_NOT_ENOUGH_CLASSES
Message code for target data for a multi classification problem that does not have more than two unique classes.
TARGET_MULTICLASS_NOT_TWO_EXAMPLES_PER_CLASS
Message code for target data for a multi classification problem that does not have two examples per class.
TARGET_UNSUPPORTED_PROBLEM_TYPE
Message code for target data that is being checked against an unsupported problem type.
TARGET_UNSUPPORTED_TYPE
Message code for target data that is of an unsupported type.
TOO_SPARSE
Message code for when multiclass data has values that are too sparsely populated.
TOO_UNIQUE
Message code for when data possesses too many unique values.
Methods
The name of the Enum member.
The value of the Enum member.
name(self)

The name of the Enum member.
value(self)

The value of the Enum member.
class evalml.data_checks.DataCheckMessageType

Enum for type of data check message: WARNING or ERROR.
Attributes
ERROR
Error message returned by a data check.
WARNING
Warning message returned by a data check.
Methods
The name of the Enum member.
The value of the Enum member.
name(self)

The name of the Enum member.
value(self)

The value of the Enum member.
class evalml.data_checks.DataChecks(data_checks=None, data_check_params=None)

A collection of data checks.
Methods
Inspects and validates the input data against data checks and returns a list of warnings and errors if applicable.
validate(self, X, y=None)

Inspects and validates the input data against data checks and returns a list of warnings and errors if applicable.
- Parameters
X (pd.DataFrame, np.ndarray) – The input data of shape [n_samples, n_features]
y (pd.Series, np.ndarray) – The target data of length [n_samples]
- Returns
Dictionary containing DataCheckMessage objects
- Return type
dict
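A composition sketch. Passing check classes in data_checks and keying data_check_params by check name is an assumption based on the (data_checks=None, data_check_params=None) signature, not something this page confirms:

>>> import pandas as pd
>>> from evalml.data_checks import DataChecks, HighlyNullDataCheck, IDColumnsDataCheck
>>> X = pd.DataFrame({"id": [0, 1, 2, 3, 4], "mostly_null": [None, None, None, None, 5.0]})
>>> y = pd.Series([0, 1, 0, 1, 0])
>>> checks = DataChecks(data_checks=[HighlyNullDataCheck, IDColumnsDataCheck],
...                     data_check_params={"HighlyNullDataCheck": {"pct_null_col_threshold": 0.5}})
>>> results = checks.validate(X, y)  # dict with "warnings", "errors", and "actions" keys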
class evalml.data_checks.DataCheckWarning(message, data_check_name, message_code=None, details=None)

DataCheckMessage subclass for warnings returned by data checks.
Attributes
message_type
DataCheckMessageType.WARNING
Methods
to_dict(self)
class evalml.data_checks.DateTimeFormatDataCheck(datetime_column='index')

Checks if the datetime column has equally spaced intervals and is monotonically increasing or decreasing in order to be supported by time series estimators.
- Parameters
datetime_column (str, int) – The name of the datetime column. If the datetime values are in the index, then pass “index”.
Methods
Returns a name describing the data check.
Checks if the target data has equal intervals and is sorted.
name(cls)

Returns a name describing the data check.
validate(self, X, y)

Checks if the target data has equal intervals and is sorted.
- Parameters
X (pd.DataFrame, np.ndarray) – Features.
y (pd.Series, np.ndarray) – Target data.
- Returns
List with DataCheckErrors if unequal intervals are found in the datetime column.
- Return type
dict (DataCheckError)
Example
>>> import pandas as pd
>>> X = pd.DataFrame(pd.date_range("January 1, 2021", periods=8), columns=["dates"])
>>> y = pd.Series([1, 2, 4, 2, 1, 2, 3, 1])
>>> X.iloc[7] = "January 9, 2021"
>>> datetime_format_check = DateTimeFormatDataCheck()
>>> assert datetime_format_check.validate(X, y) == {"errors": [{"message": "No frequency could be detected in dates, possibly due to uneven intervals.", "data_check_name": "DateTimeFormatDataCheck", "level": "error", "code": "DATETIME_HAS_UNEVEN_INTERVALS", "details": {}}], "warnings": [], "actions": []}
class evalml.data_checks.DateTimeNaNDataCheck

Checks each column in the input for datetime features and will issue an error if NaN values are present.
Methods
Returns a name describing the data check.
Checks if any datetime columns contain NaN values.
name(cls)

Returns a name describing the data check.
validate(self, X, y=None)

Checks if any datetime columns contain NaN values.
- Parameters
X (pd.DataFrame, np.ndarray) – Features.
y (pd.Series, np.ndarray) – Ignored. Defaults to None.
- Returns
dict with a DataCheckError if NaN values are present in datetime columns.
- Return type
dict
Example
>>> import pandas as pd
>>> import woodwork as ww
>>> import numpy as np
>>> dates = np.arange(np.datetime64('2017-01-01'), np.datetime64('2017-01-08'))
>>> dates[0] = np.datetime64('NaT')
>>> df = pd.DataFrame(dates, columns=['index'])
>>> df.ww.init()
>>> dt_nan_check = DateTimeNaNDataCheck()
>>> assert dt_nan_check.validate(df) == {"warnings": [],
...                                      "actions": [],
...                                      "errors": [DataCheckError(message='Input datetime column(s) (index) contains NaN values. Please impute NaN values or drop these rows or columns.',
...                                                                data_check_name=DateTimeNaNDataCheck.name,
...                                                                message_code=DataCheckMessageCode.DATETIME_HAS_NAN,
...                                                                details={"columns": 'index'}).to_dict()]}
class evalml.data_checks.DefaultDataChecks(problem_type, objective, n_splits=3, datetime_column=None)

A collection of basic data checks that is used by AutoML by default. Includes:

- HighlyNullDataCheck
- HighlyNullRowsDataCheck
- IDColumnsDataCheck
- TargetLeakageDataCheck
- InvalidTargetDataCheck
- NoVarianceDataCheck
- ClassImbalanceDataCheck (for classification problem types)
- DateTimeNaNDataCheck
- NaturalLanguageNaNDataCheck
- TargetDistributionDataCheck (for regression problem types)
- DateTimeFormatDataCheck (for time series problem types)
- Parameters
problem_type (str) – The problem type that is being validated. Can be regression, binary, or multiclass.
objective (str or ObjectiveBase) – Name or instance of the objective class.
n_splits (int) – The number of splits as determined by the data splitter being used. Defaults to 3.
datetime_column (str) – The name of the column containing datetime information to be used for time series problems. Defaults to "index", indicating that the datetime information is in the index of X or y.
Methods
Inspects and validates the input data against data checks and returns a list of warnings and errors if applicable.
validate(self, X, y=None)

Inspects and validates the input data against data checks and returns a list of warnings and errors if applicable.
- Parameters
X (pd.DataFrame, np.ndarray) – The input data of shape [n_samples, n_features]
y (pd.Series, np.ndarray) – The target data of length [n_samples]
- Returns
Dictionary containing DataCheckMessage objects
- Return type
dict
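A usage sketch for a binary problem; the string objective name mirrors the InvalidTargetDataCheck example later on this page:

>>> import pandas as pd
>>> from evalml.data_checks import DefaultDataChecks
>>> X = pd.DataFrame({"feature": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})
>>> y = pd.Series([0, 1, 0, 1, 0, 1])
>>> default_checks = DefaultDataChecks(problem_type="binary", objective="Log Loss Binary", n_splits=3)
>>> results = default_checks.validate(X, y)  # runs every check listed above and merges their messages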
class evalml.data_checks.EmptyDataChecks(data_checks=None)

An empty collection of data checks.
Methods
Inspects and validates the input data against data checks and returns a list of warnings and errors if applicable.
validate(self, X, y=None)

Inspects and validates the input data against data checks and returns a list of warnings and errors if applicable.
- Parameters
X (pd.DataFrame, np.ndarray) – The input data of shape [n_samples, n_features]
y (pd.Series, np.ndarray) – The target data of length [n_samples]
- Returns
Dictionary containing DataCheckMessage objects
- Return type
dict
class evalml.data_checks.HighlyNullDataCheck(pct_null_col_threshold=0.95, pct_null_row_threshold=0.95)

Checks if there are any highly-null columns and rows in the input.
- Parameters
pct_null_col_threshold (float) – If the percentage of NaN values in an input feature exceeds this amount, that column will be considered highly-null. Defaults to 0.95.
pct_null_row_threshold (float) – If the percentage of NaN values in an input row exceeds this amount, that row will be considered highly-null. Defaults to 0.95.
Methods
Returns a name describing the data check.
Checks if there are any highly-null columns or rows in the input.
name(cls)

Returns a name describing the data check.
validate(self, X, y=None)

Checks if there are any highly-null columns or rows in the input.
- Parameters
X (pd.DataFrame, np.ndarray) – Features.
y (pd.Series, np.ndarray) – Ignored.
- Returns
dict with a DataCheckWarning if there are any highly-null columns or rows.
- Return type
dict
Example
>>> import pandas as pd
>>> class SeriesWrap():
...     def __init__(self, series):
...         self.series = series
...
...     def __eq__(self, series_2):
...         return all(self.series.eq(series_2.series))
...
>>> df = pd.DataFrame({
...     'lots_of_null': [None, None, None, None, 5],
...     'no_null': [1, 2, 3, 4, 5]
... })
>>> null_check = HighlyNullDataCheck(pct_null_col_threshold=0.50, pct_null_row_threshold=0.50)
>>> validation_results = null_check.validate(df)
>>> validation_results['warnings'][0]['details']['pct_null_cols'] = SeriesWrap(validation_results['warnings'][0]['details']['pct_null_cols'])
>>> highly_null_rows = SeriesWrap(pd.Series([0.5, 0.5, 0.5, 0.5]))
>>> assert validation_results == {"errors": [],
...                               "warnings": [{"message": "4 out of 5 rows are more than 50.0% null", "data_check_name": "HighlyNullDataCheck", "level": "warning", "code": "HIGHLY_NULL_ROWS", "details": {"pct_null_cols": highly_null_rows}},
...                                            {"message": "Column 'lots_of_null' is 50.0% or more null", "data_check_name": "HighlyNullDataCheck", "level": "warning", "code": "HIGHLY_NULL_COLS", "details": {"column": "lots_of_null", "pct_null_rows": 0.8}}],
...                               "actions": [{"code": "DROP_ROWS", "metadata": {"rows": [0, 1, 2, 3]}},
...                                           {"code": "DROP_COL", "metadata": {"column": "lots_of_null"}}]}
class evalml.data_checks.IDColumnsDataCheck(id_threshold=1.0)

Check if any of the features are likely to be ID columns.
- Parameters
id_threshold (float) – The probability threshold to be considered an ID column. Defaults to 1.0.
Methods
Returns a name describing the data check.
Check if any of the features are likely to be ID columns. Currently performs these simple checks:
name(cls)

Returns a name describing the data check.
validate(self, X, y=None)

Check if any of the features are likely to be ID columns. Currently performs these simple checks:

- column name is "id"
- column name ends in "_id"
- column contains all unique values (and is categorical / integer type)
- Parameters
X (pd.DataFrame, np.ndarray) – The input features to check
- Returns
A dictionary of features with column name or index and their probability of being ID columns
- Return type
dict
Example
>>> import pandas as pd
>>> df = pd.DataFrame({
...     'df_id': [0, 1, 2, 3, 4],
...     'x': [10, 42, 31, 51, 61],
...     'y': [42, 54, 12, 64, 12]
... })
>>> id_col_check = IDColumnsDataCheck()
>>> assert id_col_check.validate(df) == {"errors": [], "warnings": [{"message": "Column 'df_id' is 100.0% or more likely to be an ID column", "data_check_name": "IDColumnsDataCheck", "level": "warning", "code": "HAS_ID_COLUMN", "details": {"column": "df_id"}}], "actions": [{"code": "DROP_COL", "metadata": {"column": "df_id"}}]}
class evalml.data_checks.InvalidTargetDataCheck(problem_type, objective, n_unique=100)

Checks if the target data contains missing or invalid values.
- Parameters
problem_type (str or ProblemTypes) – The specific problem type to data check for, e.g. 'binary', 'multiclass', 'regression', 'time series regression'.
objective (str or ObjectiveBase) – Name or instance of the objective class.
n_unique (int) – Number of unique target values to store when problem type is binary and target incorrectly has more than 2 unique values. Non-negative integer. If None, stores all unique values. Defaults to 100.
Attributes
multiclass_continuous_threshold
0.05
Methods
Returns a name describing the data check.
Checks if the target data contains missing or invalid values.
name(cls)

Returns a name describing the data check.
validate(self, X, y)

Checks if the target data contains missing or invalid values.
- Parameters
X (pd.DataFrame, np.ndarray) – Features. Ignored.
y (pd.Series, np.ndarray) – Target data to check for invalid values.
- Returns
List with DataCheckErrors if any invalid values are found in the target data.
- Return type
dict (DataCheckError)
Example
>>> import pandas as pd
>>> X = pd.DataFrame({"col": [1, 2, 3, 1]})
>>> y = pd.Series([0, 1, None, None])
>>> target_check = InvalidTargetDataCheck('binary', 'Log Loss Binary')
>>> assert target_check.validate(X, y) == {"errors": [{"message": "2 row(s) (50.0%) of target values are null", "data_check_name": "InvalidTargetDataCheck", "level": "error", "code": "TARGET_HAS_NULL", "details": {"num_null_rows": 2, "pct_null_rows": 50}}],
...                                        "warnings": [],
...                                        "actions": [{'code': 'IMPUTE_COL', 'metadata': {'column': None, 'impute_strategy': 'most_frequent', 'is_target': True}}]}
class evalml.data_checks.MulticollinearityDataCheck(threshold=0.9)

Check if any set of features are likely to be multicollinear.
- Parameters
threshold (float) – The threshold to be considered. Defaults to 0.9.
Methods
Returns a name describing the data check.
Check if any set of features are likely to be multicollinear.
name(cls)

Returns a name describing the data check.
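No validate example is documented here; a minimal sketch, assuming the validate(X, y=None) pattern used by the other checks on this page and a warning with code IS_MULTICOLLINEAR when a pair of columns crosses the threshold:

>>> import pandas as pd
>>> from evalml.data_checks import MulticollinearityDataCheck
>>> X = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6]})
>>> X["b"] = X["a"] * 2  # "b" is perfectly correlated with "a"
>>> multicollinearity_check = MulticollinearityDataCheck(threshold=0.9)
>>> results = multicollinearity_check.validate(X)  # expect a warning naming the collinear pair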
class evalml.data_checks.NaturalLanguageNaNDataCheck

Checks each column in the input for natural language features and will issue an error if NaN values are present.
Methods
Returns a name describing the data check.
Checks if any natural language columns contain NaN values.
name(cls)

Returns a name describing the data check.
validate(self, X, y=None)

Checks if any natural language columns contain NaN values.
- Parameters
X (pd.DataFrame, np.ndarray) – Features.
y (pd.Series, np.ndarray) – Ignored. Defaults to None.
- Returns
dict with a DataCheckError if NaN values are present in natural language columns.
- Return type
dict
Example
>>> import pandas as pd
>>> import woodwork as ww
>>> import numpy as np
>>> data = pd.DataFrame()
>>> data['A'] = [None, "string_that_is_long_enough_for_natural_language"]
>>> data['B'] = ['string_that_is_long_enough_for_natural_language', 'string_that_is_long_enough_for_natural_language']
>>> data['C'] = np.random.randint(0, 3, size=len(data))
>>> data.ww.init(logical_types={'A': 'NaturalLanguage', 'B': 'NaturalLanguage'})
>>> nl_nan_check = NaturalLanguageNaNDataCheck()
>>> assert nl_nan_check.validate(data) == {
...     "warnings": [],
...     "actions": [],
...     "errors": [DataCheckError(message='Input natural language column(s) (A) contains NaN values. Please impute NaN values or drop these rows or columns.',
...                               data_check_name=NaturalLanguageNaNDataCheck.name,
...                               message_code=DataCheckMessageCode.NATURAL_LANGUAGE_HAS_NAN,
...                               details={"columns": 'A'}).to_dict()]
... }
class evalml.data_checks.NoVarianceDataCheck(count_nan_as_value=False)

Check if the target or any of the features have no variance.
- Parameters
count_nan_as_value (bool) – If True, missing values will be counted as their own unique value. Additionally, if True, will return a DataCheckWarning instead of an error if the feature has mostly missing data and only one unique value. Defaults to False.
Methods
Returns a name describing the data check.
Check if the target or any of the features have no variance (1 unique value).
name(cls)

Returns a name describing the data check.
validate(self, X, y)

Check if the target or any of the features have no variance (1 unique value).
- Parameters
X (pd.DataFrame, np.ndarray) – The input features.
y (pd.Series, np.ndarray) – The target data.
- Returns
dict of warnings/errors corresponding to features or target with no variance.
- Return type
dict
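A minimal sketch; based on the NO_VARIANCE message code listed earlier, the constant column should produce an error while the varying column and target pass:

>>> import pandas as pd
>>> from evalml.data_checks import NoVarianceDataCheck
>>> X = pd.DataFrame({"no_variance": [1, 1, 1, 1], "has_variance": [1, 2, 3, 4]})
>>> y = pd.Series([0, 1, 0, 1])
>>> no_variance_check = NoVarianceDataCheck()
>>> results = no_variance_check.validate(X, y)  # expect an error with code NO_VARIANCE for "no_variance"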
class evalml.data_checks.OutliersDataCheck

Checks if there are any outliers in input data by using IQR to determine score anomalies. Columns with score anomalies are considered to contain outliers.
Methods
Returns a name describing the data check.
Checks if there are any outliers in a dataframe by using IQR to determine column anomalies. Columns with anomalies are considered to contain outliers.
name(cls)

Returns a name describing the data check.
validate(self, X, y=None)

Checks if there are any outliers in a dataframe by using IQR to determine column anomalies. Columns with anomalies are considered to contain outliers.
- Parameters
X (pd.DataFrame, np.ndarray) – Features
y (pd.Series, np.ndarray) – Ignored.
- Returns
A dictionary with warnings if any columns have outliers.
- Return type
dict
Example
>>> import pandas as pd
>>> df = pd.DataFrame({
...     'x': [1, 2, 3, 4, 5],
...     'y': [6, 7, 8, 9, 10],
...     'z': [-1, -2, -3, -1201, -4]
... })
>>> outliers_check = OutliersDataCheck()
>>> assert outliers_check.validate(df) == {"warnings": [{"message": "Column(s) 'z' are likely to have outlier data.", "data_check_name": "OutliersDataCheck", "level": "warning", "code": "HAS_OUTLIERS", "details": {"columns": ["z"]}}], "errors": [], "actions": []}
class evalml.data_checks.SparsityDataCheck(problem_type, threshold, unique_count_threshold=10)

Checks if there are any columns with sparsely populated values in the input.
- Parameters
problem_type (str or ProblemTypes) – The specific problem type to data check for. ‘multiclass’ or ‘time series multiclass’ is the only accepted problem type.
threshold (float) – The threshold value, or percentage of each column's unique values, below which a column exhibits sparsity. Should be between 0 and 1.
unique_count_threshold (int) – The minimum number of times a unique value has to be present in a column to not be considered “sparse.” Defaults to 10.
Methods
Returns a name describing the data check.
Calculates a sparsity score for the given value counts by computing the percentage of unique values that exceed the count threshold.
Calculates what percentage of each column's unique values exceed the count threshold and compares that percentage to the sparsity threshold stored in the class instance.
name(cls)

Returns a name describing the data check.
static sparsity_score(col, count_threshold=10)

Calculates a sparsity score for the given value counts by computing the percentage of unique values that exceed the count_threshold.
- Parameters
col (pd.Series) – Feature values.
count_threshold (int) – The number of instances below which a value is considered sparse. Default is 10.
- Returns
Sparsity score, or the percentage of the unique values that exceed count_threshold.
- Return type
(float)
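A worked sketch of the score: with two unique values, of which only one occurs more than count_threshold times, half of the unique values exceed the threshold:

>>> import pandas as pd
>>> from evalml.data_checks import SparsityDataCheck
>>> col = pd.Series([1] * 15 + [2] * 3)  # value 1 occurs 15 times, value 2 only 3 times
>>> score = SparsityDataCheck.sparsity_score(col, count_threshold=10)  # expected: 0.5, i.e. 1 of 2 unique values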
validate(self, X, y=None)

Calculates what percentage of each column's unique values exceed the count threshold and compares that percentage to the sparsity threshold stored in the class instance.
- Parameters
X (pd.DataFrame, np.ndarray) – Features.
y (pd.Series, np.ndarray) – Ignored.
- Returns
dict with a DataCheckWarning if there are any sparse columns.
- Return type
dict
Example
>>> import pandas as pd
>>> df = pd.DataFrame({
...     'sparse': [float(x) for x in range(100)],
...     'not_sparse': [float(1) for x in range(100)]
... })
>>> sparsity_check = SparsityDataCheck(problem_type="multiclass", threshold=0.5, unique_count_threshold=10)
>>> assert sparsity_check.validate(df) == {"errors": [], "warnings": [{"message": "Input columns (sparse) for multiclass problem type are too sparse.", "data_check_name": "SparsityDataCheck", "level": "warning", "code": "TOO_SPARSE", "details": {"column": "sparse", 'sparsity_score': 0.0}}], "actions": [{"code": "DROP_COL", "metadata": {"column": "sparse"}}]}
class evalml.data_checks.TargetDistributionDataCheck

Checks if the target data contains certain distributions that may need to be transformed prior to training to improve model performance.
Methods
Returns a name describing the data check.
Checks if the target data has a certain distribution.
name(cls)

Returns a name describing the data check.
validate(self, X, y)

Checks if the target data has a certain distribution.
- Parameters
X (pd.DataFrame, np.ndarray) – Features. Ignored.
y (pd.Series, np.ndarray) – Target data to check for underlying distributions.
- Returns
List with DataCheckErrors if certain distributions are found in the target data.
- Return type
dict (DataCheckError)
Example
>>> from scipy.stats import lognorm
>>> X = None
>>> y = [0.946, 0.972, 1.154, 0.954, 0.969, 1.222, 1.038, 0.999, 0.973, 0.897]
>>> target_check = TargetDistributionDataCheck()
>>> assert target_check.validate(X, y) == {"errors": [], "warnings": [{"message": "Target may have a lognormal distribution.", "data_check_name": "TargetDistributionDataCheck", "level": "warning", "code": "TARGET_LOGNORMAL_DISTRIBUTION", "details": {"shapiro-statistic/pvalue": '0.84/0.045'}}], "actions": [{'code': 'TRANSFORM_TARGET', 'metadata': {'column': None, 'transformation_strategy': 'lognormal', 'is_target': True}}]}
class evalml.data_checks.TargetLeakageDataCheck(pct_corr_threshold=0.95, method='mutual')

Check if any of the features are highly correlated with the target by using mutual information or Pearson correlation.
If method=’mutual’, this data check uses mutual information and supports all target and feature types. Otherwise, if method=’pearson’, it uses Pearson correlation and only supports binary with numeric and boolean dtypes. Pearson correlation returns a value in [-1, 1], while mutual information returns a value in [0, 1].
- Parameters
pct_corr_threshold (float) – The correlation threshold to be considered leakage. Defaults to 0.95.
method (string) – The method to determine correlation. Use ‘mutual’ for mutual information, otherwise ‘pearson’ for Pearson correlation. Defaults to ‘mutual’.
Methods
Returns a name describing the data check.
Check if any of the features are highly correlated with the target by using mutual information or Pearson correlation.
name(cls)

Returns a name describing the data check.
validate(self, X, y)

Check if any of the features are highly correlated with the target by using mutual information or Pearson correlation.
If method=’mutual’, supports all target and feature types. Otherwise, if method=’pearson’ only supports binary with numeric and boolean dtypes. Pearson correlation returns a value in [-1, 1], while mutual information returns a value in [0, 1].
- Parameters
X (pd.DataFrame, np.ndarray) – The input features to check
y (pd.Series, np.ndarray) – The target data
- Returns
dict with a DataCheckWarning if target leakage is detected.
- Return type
dict (DataCheckWarning)
Example
>>> import pandas as pd
>>> X = pd.DataFrame({
...     'leak': [10, 42, 31, 51, 61],
...     'x': [42, 54, 12, 64, 12],
...     'y': [13, 5, 13, 74, 24],
... })
>>> y = pd.Series([10, 42, 31, 51, 40])
>>> target_leakage_check = TargetLeakageDataCheck(pct_corr_threshold=0.95)
>>> assert target_leakage_check.validate(X, y) == {"warnings": [{"message": "Column 'leak' is 95.0% or more correlated with the target", "data_check_name": "TargetLeakageDataCheck", "level": "warning", "code": "TARGET_LEAKAGE", "details": {"column": "leak"}}], "errors": [], "actions": [{"code": "DROP_COL", "metadata": {"column": "leak"}}]}
class evalml.data_checks.UniquenessDataCheck(problem_type, threshold=0.5)

Checks if there are any columns in the input that are either too unique for classification problems or not unique enough for regression problems.
- Parameters
problem_type (str or ProblemTypes) – The specific problem type to data check for, e.g. 'binary', 'multiclass', 'regression', 'time series regression'.
threshold (float) – The threshold to set as an upper bound on uniqueness for classification type problems or lower bound on for regression type problems. Defaults to 0.50.
Methods
Returns a name describing the data check.
Calculates a uniqueness score for the provided field. NaN values are not considered unique values in the calculation.
Checks if there are any columns in the input that are too unique in the case of classification problems or not unique enough in the case of regression problems.
name(cls)

Returns a name describing the data check.
static uniqueness_score(col)

Calculates a uniqueness score for the provided field. NaN values are not considered unique values in the calculation.
Based on the Herfindahl–Hirschman Index.
- Parameters
col (pd.Series) – Feature values.
- Returns
Uniqueness score.
- Return type
(float)
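Given the Herfindahl–Hirschman basis, a natural reading is score = 1 - sum of squared value proportions; treat that exact formula as an assumption. A constant column then scores 0.0, which matches the uniqueness_score of 0.0 asserted in the validate example below:

>>> import pandas as pd
>>> from evalml.data_checks import UniquenessDataCheck
>>> constant = pd.Series([1.0] * 100)
>>> score = UniquenessDataCheck.uniqueness_score(constant)  # a single value owns all the mass, so the score is 0.0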
validate(self, X, y=None)

Checks if there are any columns in the input that are too unique in the case of classification problems or not unique enough in the case of regression problems.
- Parameters
X (pd.DataFrame, np.ndarray) – Features.
y (pd.Series, np.ndarray) – Ignored. Defaults to None.
- Returns
dict with a DataCheckWarning if there are any too unique or not unique enough columns.
- Return type
dict
Example
>>> import pandas as pd
>>> df = pd.DataFrame({
...     'regression_unique_enough': [float(x) for x in range(100)],
...     'regression_not_unique_enough': [float(1) for x in range(100)]
... })
>>> uniqueness_check = UniquenessDataCheck(problem_type="regression", threshold=0.8)
>>> assert uniqueness_check.validate(df) == {"errors": [], "warnings": [{"message": "Input columns (regression_not_unique_enough) for regression problem type are not unique enough.", "data_check_name": "UniquenessDataCheck", "level": "warning", "code": "NOT_UNIQUE_ENOUGH", "details": {"column": "regression_not_unique_enough", 'uniqueness_score': 0.0}}], "actions": [{"code": "DROP_COL", "metadata": {"column": "regression_not_unique_enough"}}]}