target_distribution_data_check
Data check that checks whether the target data follows a distribution that may need to be transformed prior to training to improve model performance.
Module Contents
Classes Summary
TargetDistributionDataCheck | Check whether the target data follows a distribution that may need to be transformed prior to training to improve model performance. Uses the Shapiro-Wilk test when the dataset has 5000 or fewer samples; otherwise uses the Jarque-Bera test.
Contents
class evalml.data_checks.target_distribution_data_check.TargetDistributionDataCheck
Check whether the target data follows a distribution that may need to be transformed prior to training to improve model performance. Uses the Shapiro-Wilk test when the dataset has 5000 or fewer samples; otherwise uses the Jarque-Bera test.
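The size-based choice between the two normality tests described above can be sketched as follows. This is a minimal illustration, not EvalML's implementation: the helper names `choose_normality_test` and `run_normality_test` and the label `"jarque_bera"` are hypothetical, and the sketch assumes SciPy's `scipy.stats.shapiro` and `scipy.stats.jarque_bera` as the underlying tests.

```python
# Threshold taken from the class description: Shapiro-Wilk up to 5000 samples.
SHAPIRO_MAX_SAMPLES = 5000


def choose_normality_test(n_samples):
    """Return the name of the normality test for a target of this size.

    The "jarque_bera" label is an assumption made for this sketch.
    """
    return "shapiro" if n_samples <= SHAPIRO_MAX_SAMPLES else "jarque_bera"


def run_normality_test(y):
    """Run the selected SciPy test and return (test_name, statistic, p_value)."""
    # SciPy is imported lazily so the selection helper has no dependencies.
    from scipy import stats

    name = choose_normality_test(len(y))
    test = stats.shapiro if name == "shapiro" else stats.jarque_bera
    statistic, p_value = test(y)
    return name, float(statistic), float(p_value)
```

A small p-value from either test suggests that the target is unlikely to be normally distributed. How the data check goes from that result to a lognormal warning is not described on this page, so the sketch stops at the test itself.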
Methods
name | Return a name describing the data check.
validate | Check if the target data has a certain distribution.
name(cls)
Return a name describing the data check.
validate(self, X, y)
Check if the target data has a certain distribution.
Parameters
X (pd.DataFrame, np.ndarray) – Features. Ignored.
y (pd.Series, np.ndarray) – Target data to check for underlying distributions.
Returns
List of dictionaries describing the warnings or DataCheckErrors produced when certain distributions are found in the target data; an empty list if none are found.
Return type
dict (DataCheckError)
Examples
>>> import pandas as pd
>>> from evalml.data_checks.target_distribution_data_check import TargetDistributionDataCheck
Targets that exhibit a lognormal distribution will raise a warning for the user to transform the target.
>>> y = [0.946, 0.972, 1.154, 0.954, 0.969, 1.222, 1.038, 0.999, 0.973, 0.897]
>>> target_check = TargetDistributionDataCheck()
>>> assert target_check.validate(None, y) == [
...     {
...         "message": "Target may have a lognormal distribution.",
...         "data_check_name": "TargetDistributionDataCheck",
...         "level": "warning",
...         "code": "TARGET_LOGNORMAL_DISTRIBUTION",
...         "details": {"normalization_method": "shapiro", "statistic": 0.8, "p-value": 0.045, "columns": None, "rows": None},
...         "action_options": [
...             {
...                 "code": "TRANSFORM_TARGET",
...                 "data_check_name": "TargetDistributionDataCheck",
...                 "parameters": {},
...                 "metadata": {
...                     "transformation_strategy": "lognormal",
...                     "is_target": True,
...                     "columns": None,
...                     "rows": None
...                 }
...             }
...         ]
...     }
... ]
...
>>> y = pd.Series([1, 1, 1, 2, 2, 3, 4, 4, 5, 5, 5])
>>> assert target_check.validate(None, y) == []
...
>>> y = pd.Series(pd.date_range("1/1/21", periods=10))
>>> assert target_check.validate(None, y) == [
...     {
...         "message": "Target is unsupported datetime type. Valid Woodwork logical types include: integer, double",
...         "data_check_name": "TargetDistributionDataCheck",
...         "level": "error",
...         "details": {"columns": None, "rows": None, "unsupported_type": "datetime"},
...         "code": "TARGET_UNSUPPORTED_TYPE",
...         "action_options": []
...     }
... ]
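Because `validate` returns plain dictionaries, as in the examples above, the results can be inspected with ordinary Python. The sketch below groups messages by level and collects the suggested action codes. The helper `summarize_results` is hypothetical and not part of EvalML, and `sample` is an abbreviated copy of the lognormal result shown above.

```python
def summarize_results(results):
    """Group data check messages by level and collect suggested action codes."""
    summary = {"warning": [], "error": [], "actions": []}
    for result in results:
        # Each result carries a "level" ("warning" or "error") and a "message".
        summary.setdefault(result["level"], []).append(result["message"])
        # "action_options" may be empty, as in the datetime example.
        for option in result.get("action_options", []):
            summary["actions"].append(option["code"])
    return summary


# Abbreviated copy of the lognormal result from the example above.
sample = [
    {
        "message": "Target may have a lognormal distribution.",
        "data_check_name": "TargetDistributionDataCheck",
        "level": "warning",
        "code": "TARGET_LOGNORMAL_DISTRIBUTION",
        "action_options": [
            {
                "code": "TRANSFORM_TARGET",
                "data_check_name": "TargetDistributionDataCheck",
                "parameters": {},
                "metadata": {"transformation_strategy": "lognormal", "is_target": True},
            }
        ],
    }
]

summary = summarize_results(sample)
print(summary["warning"])  # messages reported at the "warning" level
print(summary["actions"])  # action codes suggested by the data check
```

An empty list from `validate` yields a summary whose three lists are all empty, which makes it straightforward to test for the case where nothing was flagged.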