Data Checks
Data checks.
Submodules
- class_imbalance_data_check
- data_check
- data_check_action
- data_check_action_code
- data_check_message
- data_check_message_code
- data_check_message_type
- data_checks
- datetime_format_data_check
- datetime_nan_data_check
- default_data_checks
- highly_null_data_check
- id_columns_data_check
- invalid_targets_data_check
- multicollinearity_data_check
- natural_language_nan_data_check
- no_variance_data_check
- outliers_data_check
- sparsity_data_check
- target_distribution_data_check
- target_leakage_data_check
- uniqueness_data_check
- utils
Package Contents
Classes Summary
- ClassImbalanceDataCheck – Check if any of the target labels are imbalanced, or if the number of values for each target is below 2 times the number of CV folds. Use for classification problems.
- DataCheck – Base class for all data checks.
- DataCheckAction – A recommended action returned by a DataCheck.
- DataCheckActionCode – Enum for data check action code.
- DataCheckError – DataCheckMessage subclass for errors returned by data checks.
- DataCheckMessage – Base class for a message returned by a DataCheck, tagged by name.
- DataCheckMessageCode – Enum for data check message code.
- DataCheckMessageType – Enum for type of data check message: WARNING or ERROR.
- DataChecks – A collection of data checks.
- DataCheckWarning – DataCheckMessage subclass for warnings returned by data checks.
- DateTimeFormatDataCheck – Check if the datetime column has equally spaced intervals and is monotonically increasing or decreasing in order to be supported by time series estimators.
- DateTimeNaNDataCheck – Check each column in the input for datetime features and issue an error if NaN values are present.
- DefaultDataChecks – A collection of basic data checks used by AutoML by default.
- EmptyDataChecks – An empty collection of data checks.
- HighlyNullDataCheck – Check if there are any highly-null columns or rows in the input.
- IDColumnsDataCheck – Check if any of the features are likely to be ID columns.
- InvalidTargetDataCheck – Check if the target data contains missing or invalid values.
- MulticollinearityDataCheck – Check if any set of features is likely to be multicollinear.
- NaturalLanguageNaNDataCheck – Check each column in the input for natural language features and issue an error if NaN values are present.
- NoVarianceDataCheck – Check if the target or any of the features have no variance.
- OutliersDataCheck – Check if there are any outliers in the input data by using IQR to determine score anomalies.
- SparsityDataCheck – Check if there are any columns with sparsely populated values in the input.
- TargetDistributionDataCheck – Check if the target data contains certain distributions that may need to be transformed prior to training to improve model performance.
- TargetLeakageDataCheck – Check if any of the features are highly correlated with the target by using mutual information or Pearson correlation.
- UniquenessDataCheck – Check if there are any columns in the input that are either too unique for classification problems or not unique enough for regression problems.
Contents

class evalml.data_checks.ClassImbalanceDataCheck(threshold=0.1, min_samples=100, num_cv_folds=3)[source]
Check if any of the target labels are imbalanced, or if the number of values for each target is below 2 times the number of CV folds. Use for classification problems.
- Parameters
threshold (float) – The minimum threshold allowed for class imbalance before a warning is raised. This threshold is calculated by comparing the number of samples in each class to the sum of samples in that class and the majority class. For example, a multiclass case with [900, 900, 100] samples per classes 0, 1, and 2, respectively, would have a 0.10 threshold for class 2 (100 / (900 + 100)). Defaults to 0.10.
min_samples (int) – The minimum number of samples per accepted class. If the minority class is both below the threshold and min_samples, then we consider this severely imbalanced. Must be greater than 0. Defaults to 100.
num_cv_folds (int) – The number of cross-validation folds. Must be positive. Choose 0 to ignore this warning. Defaults to 3.
Methods

name(cls)
Return a name describing the data check.

validate(self, X, y)[source]
Check if any target labels are imbalanced beyond a threshold for binary and multiclass problems.
Ignores NaN values in target labels if they appear.
- Parameters
X (pd.DataFrame, np.ndarray) – Features. Ignored.
y (pd.Series, np.ndarray) – Target labels to check for imbalanced data.
- Returns
- Dictionary with DataCheckWarnings if imbalance in classes is less than the threshold,
and DataCheckErrors if the number of values for each target is below 2 * num_cv_folds.
- Return type
dict
Example
>>> import pandas as pd
>>> X = pd.DataFrame()
>>> y = pd.Series([0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
>>> target_check = ClassImbalanceDataCheck(threshold=0.10)
>>> assert target_check.validate(X, y) == {
...     "errors": [{"message": "The number of instances of these targets is less than 2 * the number of cross folds = 6 instances: [0]",
...                 "data_check_name": "ClassImbalanceDataCheck",
...                 "level": "error",
...                 "code": "CLASS_IMBALANCE_BELOW_FOLDS",
...                 "details": {"target_values": [0]}}],
...     "warnings": [{"message": "The following labels fall below 10% of the target: [0]",
...                   "data_check_name": "ClassImbalanceDataCheck",
...                   "level": "warning",
...                   "code": "CLASS_IMBALANCE_BELOW_THRESHOLD",
...                   "details": {"target_values": [0]}},
...                  {"message": "The following labels in the target have severe class imbalance because they fall under 10% of the target and have less than 100 samples: [0]",
...                   "data_check_name": "ClassImbalanceDataCheck",
...                   "level": "warning",
...                   "code": "CLASS_IMBALANCE_SEVERE",
...                   "details": {"target_values": [0]}}],
...     "actions": []}
class evalml.data_checks.DataCheck[source]
Base class for all data checks.
Data checks are a set of heuristics used to determine if there are problems with input data.
Methods

name(cls)
Return a name describing the data check.

abstract validate(self, X, y=None)[source]
Inspect and validate the input data, run any necessary calculations or algorithms, and return a list of warnings and errors if applicable.
- Parameters
X (pd.DataFrame) – The input data of shape [n_samples, n_features]
y (pd.Series, optional) – The target data of length [n_samples]
- Returns
Dictionary of DataCheckError and DataCheckWarning messages
- Return type
dict (DataCheckMessage)
class evalml.data_checks.DataCheckAction(action_code, metadata=None)[source]
A recommended action returned by a DataCheck.
- Parameters
action_code (DataCheckActionCode) – Action code associated with the action.
metadata (dict, optional) – Additional useful information associated with the action. Defaults to None.
Methods
Return a dictionary form of the data check action.
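The action dictionaries that appear in the examples throughout this page ({"code": ..., "metadata": ...}) follow a simple shape. A minimal standalone sketch of an action object that serializes to that form (illustrative only, not evalml's implementation; the `Action` class here is a hypothetical stand-in):

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    """Stand-in for a DataCheckAction-like object."""

    action_code: str                     # stands in for a DataCheckActionCode enum value
    metadata: dict = field(default_factory=dict)

    def to_dict(self):
        # Mirror the {"code": ..., "metadata": ...} shape used in the doc examples.
        return {"code": self.action_code, "metadata": self.metadata}


action = Action("DROP_COL", {"column": "id"})
print(action.to_dict())  # {'code': 'DROP_COL', 'metadata': {'column': 'id'}}
```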
class evalml.data_checks.DataCheckActionCode[source]
Enum for data check action code.
Attributes
DROP_COL
Action code for dropping a column.
DROP_ROWS
Action code for dropping rows.
IMPUTE_COL
Action code for imputing a column.
TRANSFORM_TARGET
Action code for transforming the target data.
Methods

name(self)
The name of the Enum member.

value(self)
The value of the Enum member.
class evalml.data_checks.DataCheckError(message, data_check_name, message_code=None, details=None)[source]
DataCheckMessage subclass for errors returned by data checks.
Attributes
message_type
DataCheckMessageType.ERROR
Methods

to_dict(self)
Return a dictionary form of the data check message.
class evalml.data_checks.DataCheckMessage(message, data_check_name, message_code=None, details=None)[source]
Base class for a message returned by a DataCheck, tagged by name.
- Parameters
message (str) – Message string.
data_check_name (str) – Name of data check.
message_code (DataCheckMessageCode) – Message code associated with message. Defaults to None.
details (dict) – Additional useful information associated with the message. Defaults to None.
Attributes
message_type
None
Methods
Return a dictionary form of the data check message.
class evalml.data_checks.DataCheckMessageCode[source]
Enum for data check message code.
Attributes
CLASS_IMBALANCE_BELOW_FOLDS
Message code for when the number of values for each target is below 2 * number of CV folds.
CLASS_IMBALANCE_BELOW_THRESHOLD
Message code for when balance in classes is less than the threshold.
CLASS_IMBALANCE_SEVERE
Message code for when balance in classes is less than the threshold and minimum class is less than minimum number of accepted samples.
DATETIME_HAS_NAN
Message code for when input datetime columns contain NaN values.
DATETIME_HAS_UNEVEN_INTERVALS
Message code for when the datetime values have uneven intervals.
DATETIME_INFORMATION_NOT_FOUND
Message code for when datetime information can not be found or is in an unaccepted format.
DATETIME_IS_NOT_MONOTONIC
Message code for when the datetime values are not monotonically increasing.
HAS_ID_COLUMN
Message code for data that has ID columns.
HAS_OUTLIERS
Message code for when outliers are detected.
HIGH_VARIANCE
Message code for when high variance is detected for cross-validation.
HIGHLY_NULL_COLS
Message code for highly null columns.
HIGHLY_NULL_ROWS
Message code for highly null rows.
IS_MULTICOLLINEAR
Message code for when data is potentially multicollinear.
MISMATCHED_INDICES
Message code for when input target and features have mismatched indices.
MISMATCHED_INDICES_ORDER
Message code for when input target and features have mismatched indices order. The two inputs have the same index values, but shuffled.
MISMATCHED_LENGTHS
Message code for when input target and features have different lengths.
NATURAL_LANGUAGE_HAS_NAN
Message code for when input natural language columns contain NaN values.
NO_VARIANCE
Message code for when data has no variance (1 unique value).
NO_VARIANCE_WITH_NULL
Message code for when data has one unique value and NaN values.
NOT_UNIQUE_ENOUGH
Message code for when data does not possess enough unique values.
TARGET_BINARY_NOT_TWO_UNIQUE_VALUES
Message code for target data for a binary classification problem that does not have two unique values.
TARGET_HAS_NULL
Message code for target data that has null values.
TARGET_INCOMPATIBLE_OBJECTIVE
Message code for target data that has incompatible values for the specified objective.
TARGET_IS_EMPTY_OR_FULLY_NULL
Message code for target data that is empty or has all null values.
TARGET_IS_NONE
Message code for when target is None.
TARGET_LEAKAGE
Message code for when target leakage is detected.
TARGET_LOGNORMAL_DISTRIBUTION
Message code for target data with a lognormal distribution.
TARGET_MULTICLASS_HIGH_UNIQUE_CLASS
Message code for target data for a multi classification problem that has an abnormally large number of unique classes relative to the number of target values.
TARGET_MULTICLASS_NOT_ENOUGH_CLASSES
Message code for target data for a multi classification problem that does not have more than two unique classes.
TARGET_MULTICLASS_NOT_TWO_EXAMPLES_PER_CLASS
Message code for target data for a multi classification problem that does not have two examples per class.
TARGET_UNSUPPORTED_PROBLEM_TYPE
Message code for target data that is being checked against an unsupported problem type.
TARGET_UNSUPPORTED_TYPE
Message code for target data that is of an unsupported type.
TOO_SPARSE
Message code for when multiclass data has values that are too sparsely populated.
TOO_UNIQUE
Message code for when data possesses too many unique values.
Methods

name(self)
The name of the Enum member.

value(self)
The value of the Enum member.
class evalml.data_checks.DataCheckMessageType[source]
Enum for type of data check message: WARNING or ERROR.
Attributes
ERROR
Error message returned by a data check.
WARNING
Warning message returned by a data check.
Methods

name(self)
The name of the Enum member.

value(self)
The value of the Enum member.
class evalml.data_checks.DataChecks(data_checks=None, data_check_params=None)[source]
A collection of data checks.
- Parameters
data_checks (list (DataCheck)) – List of DataCheck objects.
data_check_params (dict) – Parameters for passed DataCheck objects.
Methods

validate(self, X, y=None)[source]
Inspect and validate the input data against the data checks and return a list of warnings and errors if applicable.
- Parameters
X (pd.DataFrame, np.ndarray) – The input data of shape [n_samples, n_features]
y (pd.Series, np.ndarray) – The target data of length [n_samples]
- Returns
Dictionary containing DataCheckMessage objects
- Return type
dict
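How a collection like this can aggregate results from its individual checks can be sketched in plain Python. This is an illustrative assumption about the merge logic, not evalml's source; the check callable and its name are hypothetical:

```python
def run_checks(checks, X, y=None):
    """Run each check and merge its warnings/errors/actions into one dict."""
    results = {"warnings": [], "errors": [], "actions": []}
    for check in checks:
        result = check(X, y)
        for key in results:
            results[key].extend(result.get(key, []))
    return results


def no_rows_check(X, y=None):
    # A toy check: error out when the input has no rows.
    if len(X) == 0:
        return {"errors": [{"message": "Input has no rows"}]}
    return {}


print(run_checks([no_rows_check], []))
# {'warnings': [], 'errors': [{'message': 'Input has no rows'}], 'actions': []}
```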
class evalml.data_checks.DataCheckWarning(message, data_check_name, message_code=None, details=None)[source]
DataCheckMessage subclass for warnings returned by data checks.
Attributes
message_type
DataCheckMessageType.WARNING
Methods

to_dict(self)
Return a dictionary form of the data check message.
class evalml.data_checks.DateTimeFormatDataCheck(datetime_column='index')[source]
Check if the datetime column has equally spaced intervals and is monotonically increasing or decreasing in order to be supported by time series estimators.
- Parameters
datetime_column (str, int) – The name of the datetime column. If the datetime values are in the index, then pass “index”.
Methods

name(cls)
Return a name describing the data check.

validate(self, X, y)[source]
Checks if the target data has equal intervals and is sorted.
- Parameters
X (pd.DataFrame, np.ndarray) – Features.
y (pd.Series, np.ndarray) – Target data.
- Returns
List with DataCheckErrors if unequal intervals are found in the datetime column.
- Return type
dict (DataCheckError)
Example
>>> import pandas as pd
>>> X = pd.DataFrame(pd.date_range("January 1, 2021", periods=8), columns=["dates"])
>>> y = pd.Series([1, 2, 4, 2, 1, 2, 3, 1])
>>> X.iloc[7] = "January 9, 2021"
>>> datetime_format_check = DateTimeFormatDataCheck()
>>> assert datetime_format_check.validate(X, y) == {
...     "errors": [{"message": "No frequency could be detected in dates, possibly due to uneven intervals.",
...                 "data_check_name": "EqualIntervalDataCheck",
...                 "level": "error",
...                 "code": "DATETIME_HAS_UNEVEN_INTERVALS",
...                 "details": {}}],
...     "warnings": [],
...     "actions": []}
class evalml.data_checks.DateTimeNaNDataCheck[source]
Check each column in the input for datetime features and issue an error if NaN values are present.
Methods

name(cls)
Return a name describing the data check.

validate(self, X, y=None)[source]
Check if any datetime columns contain NaN values.
- Parameters
X (pd.DataFrame, np.ndarray) – Features.
y (pd.Series, np.ndarray) – Ignored. Defaults to None.
- Returns
dict with a DataCheckError if NaN values are present in datetime columns.
- Return type
dict
Example
>>> import pandas as pd
>>> import woodwork as ww
>>> import numpy as np
>>> dates = np.arange(np.datetime64('2017-01-01'), np.datetime64('2017-01-08'))
>>> dates[0] = np.datetime64('NaT')
>>> df = pd.DataFrame(dates, columns=['index'])
>>> df.ww.init()
>>> dt_nan_check = DateTimeNaNDataCheck()
>>> assert dt_nan_check.validate(df) == {
...     "warnings": [],
...     "actions": [],
...     "errors": [DataCheckError(message='Input datetime column(s) (index) contains NaN values. Please impute NaN values or drop these rows or columns.',
...                               data_check_name=DateTimeNaNDataCheck.name,
...                               message_code=DataCheckMessageCode.DATETIME_HAS_NAN,
...                               details={"columns": 'index'}).to_dict()]}
class evalml.data_checks.DefaultDataChecks(problem_type, objective, n_splits=3, datetime_column=None)[source]
A collection of basic data checks that is used by AutoML by default.
Includes:
HighlyNullDataCheck
HighlyNullRowsDataCheck
IDColumnsDataCheck
TargetLeakageDataCheck
InvalidTargetDataCheck
NoVarianceDataCheck
ClassImbalanceDataCheck (for classification problem types)
DateTimeNaNDataCheck
NaturalLanguageNaNDataCheck
TargetDistributionDataCheck (for regression problem types)
DateTimeFormatDataCheck (for time series problem types)
- Parameters
problem_type (str) – The problem type that is being validated. Can be regression, binary, or multiclass.
objective (str or ObjectiveBase) – Name or instance of the objective class.
n_splits (int) – The number of splits as determined by the data splitter being used. Defaults to 3.
datetime_column (str) – The name of the column containing datetime information to be used for time series problems. Defaults to "index", indicating that the datetime information is in the index of X or y.
Methods

validate(self, X, y=None)
Inspect and validate the input data against the data checks and return a list of warnings and errors if applicable.
- Parameters
X (pd.DataFrame, np.ndarray) – The input data of shape [n_samples, n_features]
y (pd.Series, np.ndarray) – The target data of length [n_samples]
- Returns
Dictionary containing DataCheckMessage objects
- Return type
dict
class evalml.data_checks.EmptyDataChecks(data_checks=None)[source]
An empty collection of data checks.
- Parameters
data_checks (list (DataCheck)) – Ignored.
Methods

validate(self, X, y=None)
Inspect and validate the input data against the data checks and return a list of warnings and errors if applicable.
- Parameters
X (pd.DataFrame, np.ndarray) – The input data of shape [n_samples, n_features]
y (pd.Series, np.ndarray) – The target data of length [n_samples]
- Returns
Dictionary containing DataCheckMessage objects
- Return type
dict
class evalml.data_checks.HighlyNullDataCheck(pct_null_col_threshold=0.95, pct_null_row_threshold=0.95)[source]
Check if there are any highly-null columns and rows in the input.
- Parameters
pct_null_col_threshold (float) – If the percentage of NaN values in an input feature exceeds this amount, that column will be considered highly-null. Defaults to 0.95.
pct_null_row_threshold (float) – If the percentage of NaN values in an input row exceeds this amount, that row will be considered highly-null. Defaults to 0.95.
Methods

name(cls)
Return a name describing the data check.

validate(self, X, y=None)[source]
Check if there are any highly-null columns or rows in the input.
- Parameters
X (pd.DataFrame, np.ndarray) – Features.
y (pd.Series, np.ndarray) – Ignored. Defaults to None.
- Returns
dict with a DataCheckWarning if there are any highly-null columns or rows.
- Return type
dict
Example
>>> import pandas as pd
>>> class SeriesWrap():
...     def __init__(self, series):
...         self.series = series
...
...     def __eq__(self, series_2):
...         return all(self.series.eq(series_2.series))
...
>>> df = pd.DataFrame({
...     'lots_of_null': [None, None, None, None, 5],
...     'no_null': [1, 2, 3, 4, 5]
... })
>>> null_check = HighlyNullDataCheck(pct_null_col_threshold=0.50, pct_null_row_threshold=0.50)
>>> validation_results = null_check.validate(df)
>>> validation_results['warnings'][0]['details']['pct_null_cols'] = SeriesWrap(validation_results['warnings'][0]['details']['pct_null_cols'])
>>> highly_null_rows = SeriesWrap(pd.Series([0.5, 0.5, 0.5, 0.5]))
>>> assert validation_results == {
...     "errors": [],
...     "warnings": [{"message": "4 out of 5 rows are more than 50.0% null",
...                   "data_check_name": "HighlyNullDataCheck",
...                   "level": "warning",
...                   "code": "HIGHLY_NULL_ROWS",
...                   "details": {"pct_null_cols": highly_null_rows}},
...                  {"message": "Column 'lots_of_null' is 50.0% or more null",
...                   "data_check_name": "HighlyNullDataCheck",
...                   "level": "warning",
...                   "code": "HIGHLY_NULL_COLS",
...                   "details": {"column": "lots_of_null", "pct_null_rows": 0.8}}],
...     "actions": [{"code": "DROP_ROWS", "metadata": {"rows": [0, 1, 2, 3]}},
...                 {"code": "DROP_COL", "metadata": {"column": "lots_of_null"}}]}
class evalml.data_checks.IDColumnsDataCheck(id_threshold=1.0)[source]
Check if any of the features are likely to be ID columns.
- Parameters
id_threshold (float) – The probability threshold to be considered an ID column. Defaults to 1.0.
Methods

name(cls)
Return a name describing the data check.

validate(self, X, y=None)[source]
Check if any of the features are likely to be ID columns. Currently performs a number of simple checks.
Checks performed are:
- column name is "id"
- column name ends in "_id"
- column contains all unique values (and is categorical / integer type)
- Parameters
X (pd.DataFrame, np.ndarray) – The input features to check.
y (pd.Series) – The target. Defaults to None. Ignored.
- Returns
A dictionary of features with column name or index and their probability of being ID columns
- Return type
dict
Example
>>> import pandas as pd
>>> df = pd.DataFrame({
...     'df_id': [0, 1, 2, 3, 4],
...     'x': [10, 42, 31, 51, 61],
...     'y': [42, 54, 12, 64, 12]
... })
>>> id_col_check = IDColumnsDataCheck()
>>> assert id_col_check.validate(df) == {
...     "errors": [],
...     "warnings": [{"message": "Column 'df_id' is 100.0% or more likely to be an ID column",
...                   "data_check_name": "IDColumnsDataCheck",
...                   "level": "warning",
...                   "code": "HAS_ID_COLUMN",
...                   "details": {"column": "df_id"}}],
...     "actions": [{"code": "DROP_COL",
...                  "metadata": {"column": "df_id"}}]}
class evalml.data_checks.InvalidTargetDataCheck(problem_type, objective, n_unique=100)[source]
Check if the target data contains missing or invalid values.
- Parameters
problem_type (str or ProblemTypes) – The specific problem type to data check for, e.g. 'binary', 'multiclass', 'regression', 'time series regression'.
objective (str or ObjectiveBase) – Name or instance of the objective class.
n_unique (int) – Number of unique target values to store when problem type is binary and target incorrectly has more than 2 unique values. Non-negative integer. If None, stores all unique values. Defaults to 100.
Attributes
multiclass_continuous_threshold
0.05
Methods

name(cls)
Return a name describing the data check.

validate(self, X, y)[source]
Check if the target data contains missing or invalid values.
- Parameters
X (pd.DataFrame, np.ndarray) – Features. Ignored.
y (pd.Series, np.ndarray) – Target data to check for invalid values.
- Returns
List with DataCheckErrors if any invalid values are found in the target data.
- Return type
dict (DataCheckError)
Example
>>> import pandas as pd
>>> X = pd.DataFrame({"col": [1, 2, 3, 1]})
>>> y = pd.Series([0, 1, None, None])
>>> target_check = InvalidTargetDataCheck('binary', 'Log Loss Binary')
>>> assert target_check.validate(X, y) == {
...     "errors": [{"message": "2 row(s) (50.0%) of target values are null",
...                 "data_check_name": "InvalidTargetDataCheck",
...                 "level": "error",
...                 "code": "TARGET_HAS_NULL",
...                 "details": {"num_null_rows": 2, "pct_null_rows": 50}}],
...     "warnings": [],
...     "actions": [{'code': 'IMPUTE_COL', 'metadata': {'column': None, 'impute_strategy': 'most_frequent', 'is_target': True}}]}
class evalml.data_checks.MulticollinearityDataCheck(threshold=0.9)[source]
Check if any set of features is likely to be multicollinear.
- Parameters
threshold (float) – The threshold to be considered. Defaults to 0.9.
Methods

name(cls)
Return a name describing the data check.

validate(self, X, y=None)[source]
Check if any set of features is likely to be multicollinear.
- Parameters
X (pd.DataFrame) – The input features to check.
y (pd.Series) – The target. Ignored.
- Returns
dict with a DataCheckWarning if there are any potentially multicollinear columns.
- Return type
dict
Example
>>> import pandas as pd
>>> col = pd.Series([1, 0, 2, 3, 4])
>>> X = pd.DataFrame({"col_1": col, "col_2": col * 3})
>>> y = pd.Series([1, 0, 0, 1, 0])
>>> multicollinearity_check = MulticollinearityDataCheck(threshold=0.8)
>>> assert multicollinearity_check.validate(X, y) == {
...     "errors": [],
...     "warnings": [{"message": "Columns are likely to be correlated: [('col_1', 'col_2')]",
...                   "data_check_name": "MulticollinearityDataCheck",
...                   "level": "warning",
...                   "code": "IS_MULTICOLLINEAR",
...                   "details": {"columns": [('col_1', 'col_2')]}}],
...     "actions": []}
class evalml.data_checks.NaturalLanguageNaNDataCheck[source]
Check each column in the input for natural language features and issue an error if NaN values are present.
Methods

name(cls)
Return a name describing the data check.

validate(self, X, y=None)[source]
Check if any natural language columns contain NaN values.
- Parameters
X (pd.DataFrame, np.ndarray) – Features.
y (pd.Series, np.ndarray) – Ignored. Defaults to None.
- Returns
dict with a DataCheckError if NaN values are present in natural language columns.
- Return type
dict
Example
>>> import pandas as pd
>>> import woodwork as ww
>>> import numpy as np
>>> data = pd.DataFrame()
>>> data['A'] = [None, "string_that_is_long_enough_for_natural_language"]
>>> data['B'] = ['string_that_is_long_enough_for_natural_language', 'string_that_is_long_enough_for_natural_language']
>>> data['C'] = np.random.randint(0, 3, size=len(data))
>>> data.ww.init(logical_types={'A': 'NaturalLanguage', 'B': 'NaturalLanguage'})
>>> nl_nan_check = NaturalLanguageNaNDataCheck()
>>> assert nl_nan_check.validate(data) == {
...     "warnings": [],
...     "actions": [],
...     "errors": [DataCheckError(message='Input natural language column(s) (A) contains NaN values. Please impute NaN values or drop these rows or columns.',
...                               data_check_name=NaturalLanguageNaNDataCheck.name,
...                               message_code=DataCheckMessageCode.NATURAL_LANGUAGE_HAS_NAN,
...                               details={"columns": 'A'}).to_dict()]}
class evalml.data_checks.NoVarianceDataCheck(count_nan_as_value=False)[source]
Check if the target or any of the features have no variance.
- Parameters
count_nan_as_value (bool) – If True, missing values will be counted as their own unique value. Additionally, if True, a DataCheckWarning will be returned instead of an error if the feature has mostly missing data and only one unique value. Defaults to False.
Methods

name(cls)
Return a name describing the data check.

validate(self, X, y)[source]
Check if the target or any of the features have no variance (1 unique value).
- Parameters
X (pd.DataFrame, np.ndarray) – The input features.
y (pd.Series, np.ndarray) – The target data.
- Returns
dict of warnings/errors corresponding to features or target with no variance.
- Return type
dict
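As a rough illustration of the heuristic described above (a pure-Python stand-in, not evalml's implementation; the function name is hypothetical): a column with at most one unique value has no variance, and the `count_nan_as_value` flag controls whether missing values count as a value of their own.

```python
def no_variance_columns(columns, count_nan_as_value=False):
    """Return names of columns that have at most one unique value."""
    flagged = []
    for name, values in columns.items():
        if count_nan_as_value:
            present = values                          # None counts as its own value
        else:
            present = [v for v in values if v is not None]
        if len(set(present)) <= 1:
            flagged.append(name)
    return flagged


data = {
    "constant": [7, 7, 7, 7],        # one unique value: no variance
    "varied": [1, 2, 3, 4],          # plenty of variance
    "mostly_none": [None, None, None, 5],
}
print(no_variance_columns(data))                           # ['constant', 'mostly_none']
print(no_variance_columns(data, count_nan_as_value=True))  # ['constant']
```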
class evalml.data_checks.OutliersDataCheck[source]
Checks if there are any outliers in input data by using IQR to determine score anomalies.
Columns with score anomalies are considered to contain outliers.
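The IQR rule referenced above can be sketched in a few lines (a standalone illustration of the general technique, not evalml's code): values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are treated as outliers.

```python
import statistics


def iqr_outliers(values):
    """Return values outside the 1.5 * IQR fences around the quartiles."""
    # method="inclusive" interpolates quartiles from the sorted sample itself,
    # which behaves sensibly for the small samples used here.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]


print(iqr_outliers([-1, -2, -3, -1201, -4]))  # [-1201]
```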
Methods

name(cls)
Return a name describing the data check.

validate(self, X, y=None)[source]
Check if there are any outliers in a dataframe by using IQR to determine column anomalies. Columns with anomalies are considered to contain outliers.
- Parameters
X (pd.DataFrame, np.ndarray) – Input features.
y (pd.Series, np.ndarray) – Ignored. Defaults to None.
- Returns
A dictionary with warnings if any columns have outliers.
- Return type
dict
Example
>>> import pandas as pd
>>> df = pd.DataFrame({
...     'x': [1, 2, 3, 4, 5],
...     'y': [6, 7, 8, 9, 10],
...     'z': [-1, -2, -3, -1201, -4]
... })
>>> outliers_check = OutliersDataCheck()
>>> assert outliers_check.validate(df) == {
...     "warnings": [{"message": "Column(s) 'z' are likely to have outlier data.",
...                   "data_check_name": "OutliersDataCheck",
...                   "level": "warning",
...                   "code": "HAS_OUTLIERS",
...                   "details": {"columns": ["z"]}}],
...     "errors": [],
...     "actions": []}
class evalml.data_checks.SparsityDataCheck(problem_type, threshold, unique_count_threshold=10)[source]
Check if there are any columns with sparsely populated values in the input.
- Parameters
problem_type (str or ProblemTypes) – The specific problem type to data check for. 'multiclass' and 'time series multiclass' are the only accepted problem types.
threshold (float) – The threshold value, or percentage of each column's unique values below which a column is considered sparse. Should be between 0 and 1.
unique_count_threshold (int) – The minimum number of times a unique value has to be present in a column to not be considered “sparse.” Defaults to 10.
Methods

name(cls)
Return a name describing the data check.

static sparsity_score(col, count_threshold=10)[source]
Calculate a sparsity score for the given value counts by calculating the percentage of unique values that exceed the count_threshold.
- Parameters
col (pd.Series) – Feature values.
count_threshold (int) – The number of instances below which a value is considered sparse. Default is 10.
- Returns
Sparsity score, or the percentage of the unique values that exceed count_threshold.
- Return type
(float)
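A standalone sketch of the score as described above (illustrative, not evalml's implementation; whether the comparison is strict is an assumption here): the fraction of a column's unique values that occur at least `count_threshold` times.

```python
from collections import Counter


def sparsity_score(values, count_threshold=10):
    """Fraction of unique values whose occurrence count reaches the threshold."""
    counts = Counter(values)
    above = sum(1 for c in counts.values() if c >= count_threshold)
    return above / len(counts)


dense = [1] * 100          # one unique value, seen 100 times
sparse = list(range(100))  # 100 unique values, each seen once
print(sparsity_score(dense))   # 1.0
print(sparsity_score(sparse))  # 0.0
```

A score of 0.0 matches the `sparsity_score` detail reported for the sparse column in the doctest below.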
validate(self, X, y=None)[source]
Calculate what percentage of each column's unique values exceed the count threshold and compare that percentage to the sparsity threshold stored in the class instance.
- Parameters
X (pd.DataFrame, np.ndarray) – Features.
y (pd.Series, np.ndarray) – Ignored.
- Returns
dict with a DataCheckWarning if there are any sparse columns.
- Return type
dict
Example
>>> import pandas as pd
>>> df = pd.DataFrame({
...     'sparse': [float(x) for x in range(100)],
...     'not_sparse': [float(1) for x in range(100)]
... })
>>> sparsity_check = SparsityDataCheck(problem_type="multiclass", threshold=0.5, unique_count_threshold=10)
>>> assert sparsity_check.validate(df) == {
...     "errors": [],
...     "warnings": [{"message": "Input columns (sparse) for multiclass problem type are too sparse.",
...                   "data_check_name": "SparsityDataCheck",
...                   "level": "warning",
...                   "code": "TOO_SPARSE",
...                   "details": {"column": "sparse", "sparsity_score": 0.0}}],
...     "actions": [{"code": "DROP_COL",
...                  "metadata": {"column": "sparse"}}]}
class evalml.data_checks.TargetDistributionDataCheck[source]
Check if the target data contains certain distributions that may need to be transformed prior to training to improve model performance.
Methods

name(cls)
Return a name describing the data check.

validate(self, X, y)[source]
Check if the target data has a certain distribution.
- Parameters
X (pd.DataFrame, np.ndarray) – Features. Ignored.
y (pd.Series, np.ndarray) – Target data to check for underlying distributions.
- Returns
dict with a DataCheckWarning if certain distributions are found in the target data.
- Return type
dict
Example
>>> from scipy.stats import lognorm
>>> X = None
>>> y = [0.946, 0.972, 1.154, 0.954, 0.969, 1.222, 1.038, 0.999, 0.973, 0.897]
>>> target_check = TargetDistributionDataCheck()
>>> assert target_check.validate(X, y) == {
...     "errors": [],
...     "warnings": [{"message": "Target may have a lognormal distribution.",
...                   "data_check_name": "TargetDistributionDataCheck",
...                   "level": "warning",
...                   "code": "TARGET_LOGNORMAL_DISTRIBUTION",
...                   "details": {"shapiro-statistic/pvalue": '0.84/0.045'}}],
...     "actions": [{"code": "TRANSFORM_TARGET",
...                  "metadata": {"column": None, "transformation_strategy": "lognormal", "is_target": True}}]}
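The `shapiro-statistic/pvalue` detail in the example suggests the check relies on a Shapiro-Wilk normality test. A hedged sketch of that idea follows; `looks_lognormal` is an illustrative heuristic, not the evalml implementation, whose exact statistics and thresholds may differ:

```python
import numpy as np
from scipy.stats import shapiro

def looks_lognormal(y, alpha=0.05):
    # Heuristic: flag a lognormal-like target when Shapiro-Wilk rejects
    # normality on the raw values but not on their logs.
    y = np.asarray(y, dtype=float)
    if (y <= 0).any():
        return False  # log is undefined; a shift would be needed first
    _, p_raw = shapiro(y)
    _, p_log = shapiro(np.log(y))
    return bool(p_raw < alpha <= p_log)

rng = np.random.default_rng(0)
y_skewed = rng.lognormal(mean=0.0, sigma=1.5, size=200)
print(looks_lognormal(y_skewed))
```

A target flagged this way is a candidate for the `TRANSFORM_TARGET` action with a lognormal strategy, as in the example above.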
class evalml.data_checks.TargetLeakageDataCheck(pct_corr_threshold=0.95, method='mutual')[source]¶
Check if any of the features are highly correlated with the target by using mutual information or Pearson correlation.
If method='mutual', this data check uses mutual information and supports all target and feature types. If method='pearson', it uses Pearson correlation and only supports binary problems with numeric and boolean dtypes. Pearson correlation returns a value in [-1, 1], while mutual information returns a value in [0, 1].
- Parameters
pct_corr_threshold (float) – The correlation threshold to be considered leakage. Defaults to 0.95.
method (string) – The method to determine correlation. Use ‘mutual’ for mutual information, otherwise ‘pearson’ for Pearson correlation. Defaults to ‘mutual’.
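For intuition, the Pearson branch can be sketched like this. `find_leaky_columns` is a hypothetical helper, not evalml's implementation, which also supports mutual information and returns structured warnings and actions:

```python
import pandas as pd

def find_leaky_columns(X, y, pct_corr_threshold=0.95):
    # Flag features whose absolute Pearson correlation with the target
    # meets the threshold -- likely leakage candidates.
    y = pd.Series(y).reset_index(drop=True)
    leaky = []
    for col in X.columns:
        corr = X[col].reset_index(drop=True).corr(y)
        if pd.notna(corr) and abs(corr) >= pct_corr_threshold:
            leaky.append(col)
    return leaky

X = pd.DataFrame({
    "leak": [10, 42, 31, 51, 40],   # identical to the target
    "noise": [3, 1, 4, 1, 5],       # unrelated values
})
y = pd.Series([10, 42, 31, 51, 40])
print(find_leaky_columns(X, y))  # ['leak']
```

Absolute correlation is used so that strongly negatively correlated features are also flagged.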
Methods
name: Return a name describing the data check.
validate: Check if any of the features are highly correlated with the target by using mutual information or Pearson correlation.
name(cls)¶
Return a name describing the data check.
validate(self, X, y)[source]¶
Check if any of the features are highly correlated with the target by using mutual information or Pearson correlation.
If method='mutual', all target and feature types are supported. If method='pearson', only binary problems with numeric and boolean dtypes are supported. Pearson correlation returns a value in [-1, 1], while mutual information returns a value in [0, 1].
- Parameters
X (pd.DataFrame, np.ndarray) – The input features to check.
y (pd.Series, np.ndarray) – The target data.
- Returns
dict with a DataCheckWarning if target leakage is detected.
- Return type
dict (DataCheckWarning)
Example
>>> import pandas as pd
>>> X = pd.DataFrame({
...     'leak': [10, 42, 31, 51, 61],
...     'x': [42, 54, 12, 64, 12],
...     'y': [13, 5, 13, 74, 24],
... })
>>> y = pd.Series([10, 42, 31, 51, 40])
>>> target_leakage_check = TargetLeakageDataCheck(pct_corr_threshold=0.95)
>>> assert target_leakage_check.validate(X, y) == {
...     "warnings": [{"message": "Column 'leak' is 95.0% or more correlated with the target",
...                   "data_check_name": "TargetLeakageDataCheck",
...                   "level": "warning",
...                   "code": "TARGET_LEAKAGE",
...                   "details": {"column": "leak"}}],
...     "errors": [],
...     "actions": [{"code": "DROP_COL",
...                  "metadata": {"column": "leak"}}]}
class evalml.data_checks.UniquenessDataCheck(problem_type, threshold=0.5)[source]¶
Check if there are any columns in the input that are either too unique for classification problems or not unique enough for regression problems.
- Parameters
problem_type (str or ProblemTypes) – The specific problem type to data check for, e.g. 'binary', 'multiclass', 'regression', 'time series regression'.
threshold (float) – The threshold to set as an upper bound on uniqueness for classification problems or as a lower bound for regression problems. Defaults to 0.50.
Methods
name: Return a name describing the data check.
uniqueness_score: Calculate a uniqueness score for the provided field. NaN values are not considered as unique values in the calculation.
validate: Check if there are any columns in the input that are too unique in the case of classification problems or not unique enough in the case of regression problems.
name(cls)¶
Return a name describing the data check.
static uniqueness_score(col)[source]¶
Calculate a uniqueness score for the provided field. NaN values are not considered as unique values in the calculation.
Based on the Herfindahl–Hirschman Index.
- Parameters
col (pd.Series) – Feature values.
- Returns
Uniqueness score.
- Return type
(float)
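One plausible HHI-based formulation is sketched below. This is an assumption consistent with the examples on this page (a constant column scores 0.0), not necessarily the exact evalml source:

```python
import pandas as pd

def uniqueness_score(col):
    # 1 minus the Herfindahl-Hirschman Index of value shares, NaNs dropped:
    # 0.0 for a single repeated value, approaching 1.0 as values become unique.
    counts = col.dropna().value_counts()
    if counts.empty:
        return 0.0
    shares = counts / counts.sum()
    return float(1 - (shares ** 2).sum())

constant = pd.Series([1.0] * 100)            # one value -> score 0.0
unique = pd.Series(range(100), dtype=float)  # all distinct -> score near 1

print(uniqueness_score(constant))           # 0.0
print(round(uniqueness_score(unique), 2))   # 0.99
```

Under this formulation, classification problems warn when the score exceeds the threshold (too unique) and regression problems warn when it falls below (not unique enough).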
validate(self, X, y=None)[source]¶
Check if there are any columns in the input that are too unique in the case of classification problems or not unique enough in the case of regression problems.
- Parameters
X (pd.DataFrame, np.ndarray) – Features.
y (pd.Series, np.ndarray) – Ignored. Defaults to None.
- Returns
dict with a DataCheckWarning if there are any columns that are too unique or not unique enough.
- Return type
dict
Example
>>> import pandas as pd
>>> df = pd.DataFrame({
...     'regression_unique_enough': [float(x) for x in range(100)],
...     'regression_not_unique_enough': [float(1) for x in range(100)]
... })
>>> uniqueness_check = UniquenessDataCheck(problem_type="regression", threshold=0.8)
>>> assert uniqueness_check.validate(df) == {
...     "errors": [],
...     "warnings": [{"message": "Input columns (regression_not_unique_enough) for regression problem type are not unique enough.",
...                   "data_check_name": "UniquenessDataCheck",
...                   "level": "warning",
...                   "code": "NOT_UNIQUE_ENOUGH",
...                   "details": {"column": "regression_not_unique_enough", "uniqueness_score": 0.0}}],
...     "actions": [{"code": "DROP_COL",
...                  "metadata": {"column": "regression_not_unique_enough"}}]}