target_distribution_data_check
Data check that checks whether the target data follows a distribution that may need to be transformed prior to training to improve model performance.
Module Contents
Classes Summary
TargetDistributionDataCheck | Check whether the target data follows a distribution that may need to be transformed prior to training to improve model performance. Uses the Shapiro-Wilk test when the dataset has <=5000 samples, otherwise uses Jarque-Bera.
Contents
- class evalml.data_checks.target_distribution_data_check.TargetDistributionDataCheck
Check whether the target data follows a distribution that may need to be transformed prior to training to improve model performance. Uses the Shapiro-Wilk test when the dataset has <=5000 samples, otherwise uses Jarque-Bera.
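The sample-size rule above can be sketched as follows. This is an illustrative sketch, not evalml's implementation: the helper names `pick_normality_test` and `run_normality_test` and the label "jarque_bera" are assumptions made for this example. Only the 5000-sample threshold and the "shapiro" label come from this page. The tests themselves are SciPy's `scipy.stats.shapiro` and `scipy.stats.jarque_bera`.

```python
SHAPIRO_MAX_SAMPLES = 5000  # threshold stated in the class description


def pick_normality_test(n_samples):
    """Return the label of the normality test to use for a target of this size."""
    # "jarque_bera" is a hypothetical label chosen for this sketch.
    return "shapiro" if n_samples <= SHAPIRO_MAX_SAMPLES else "jarque_bera"


def run_normality_test(y):
    """Run the selected SciPy test and return (method, statistic, p_value)."""
    # Imported lazily so pick_normality_test has no hard dependency on SciPy.
    from scipy import stats

    method = pick_normality_test(len(y))
    test = stats.shapiro if method == "shapiro" else stats.jarque_bera
    result = test(y)
    # Both SciPy results unpack as (statistic, p-value).
    return method, float(result[0]), float(result[1])
```

Separating the selection from the computation keeps the threshold logic easy to test on its own.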
Methods
name | Return a name describing the data check.
validate | Check if the target data has a certain distribution.
- name(cls)
Return a name describing the data check.
- validate(self, X, y)
Check if the target data has a certain distribution.
- Parameters
X (pd.DataFrame, np.ndarray) – Features. Ignored.
y (pd.Series, np.ndarray) – Target data to check for underlying distributions.
- Returns
List of dicts, one per DataCheckWarning or DataCheckError found in the target data. The list is empty if no issue is found.
- Return type
list[dict] (DataCheckWarning or DataCheckError)
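A caller will typically want to separate warnings from errors and collect any suggested actions. The sketch below works on plain dicts shaped like the results in the examples on this page. The helper names `split_by_level` and `suggested_actions` are hypothetical and are not part of evalml, and `sample_results` is a hand-written, abbreviated stand-in for a real `validate` result.

```python
def split_by_level(results):
    """Split validate() results into (warnings, errors) by their "level" key."""
    warnings = [r for r in results if r.get("level") == "warning"]
    errors = [r for r in results if r.get("level") == "error"]
    return warnings, errors


def suggested_actions(results):
    """Collect the action codes listed under "action_options" in each result."""
    return [
        action["code"]
        for result in results
        for action in result.get("action_options", [])
    ]


# Hand-written sample mirroring the documented result shape (abbreviated).
sample_results = [
    {
        "message": "Target may have a lognormal distribution.",
        "level": "warning",
        "code": "TARGET_LOGNORMAL_DISTRIBUTION",
        "action_options": [{"code": "TRANSFORM_TARGET"}],
    },
]

sample_warnings, sample_errors = split_by_level(sample_results)
sample_actions = suggested_actions(sample_results)
```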
Examples
>>> import pandas as pd
>>> from evalml.data_checks.target_distribution_data_check import TargetDistributionDataCheck

Targets that exhibit a lognormal distribution will raise a warning for the user to transform the target.

>>> y = [0.946, 0.972, 1.154, 0.954, 0.969, 1.222, 1.038, 0.999, 0.973, 0.897]
>>> target_check = TargetDistributionDataCheck()
>>> assert target_check.validate(None, y) == [
...     {
...         "message": "Target may have a lognormal distribution.",
...         "data_check_name": "TargetDistributionDataCheck",
...         "level": "warning",
...         "code": "TARGET_LOGNORMAL_DISTRIBUTION",
...         "details": {"normalization_method": "shapiro", "statistic": 0.8, "p-value": 0.045, "columns": None, "rows": None},
...         "action_options": [
...             {
...                 "code": "TRANSFORM_TARGET",
...                 "data_check_name": "TargetDistributionDataCheck",
...                 "parameters": {},
...                 "metadata": {
...                     "transformation_strategy": "lognormal",
...                     "is_target": True,
...                     "columns": None,
...                     "rows": None
...                 }
...             }
...         ]
...     }
... ]

>>> y = pd.Series([1, 1, 1, 2, 2, 3, 4, 4, 5, 5, 5])
>>> assert target_check.validate(None, y) == []

>>> y = pd.Series(pd.date_range("1/1/21", periods=10))
>>> assert target_check.validate(None, y) == [
...     {
...         "message": "Target is unsupported datetime type. Valid Woodwork logical types include: integer, double, age, age_fractional",
...         "data_check_name": "TargetDistributionDataCheck",
...         "level": "error",
...         "details": {"columns": None, "rows": None, "unsupported_type": "datetime"},
...         "code": "TARGET_UNSUPPORTED_TYPE",
...         "action_options": []
...     }
... ]
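When the check suggests `TRANSFORM_TARGET` with the "lognormal" strategy, one possible response is to train on the logarithm of the target and exponentiate predictions back to the original scale. The following is a minimal standard-library sketch under that assumption. The helpers `log_transform` and `inverse_log_transform` are hypothetical and do not represent evalml's own transformer.

```python
import math


def log_transform(y):
    """Return the natural log of each target value; requires positive values."""
    if any(value <= 0 for value in y):
        raise ValueError("log transform requires strictly positive targets")
    return [math.log(value) for value in y]


def inverse_log_transform(y_log):
    """Map log-scale values (e.g. model predictions) back to the original scale."""
    return [math.exp(value) for value in y_log]


# First few target values from the lognormal example on this page.
y = [0.946, 0.972, 1.154, 0.954, 0.969]
y_log = log_transform(y)
y_restored = inverse_log_transform(y_log)
```

The explicit positivity check matters because the logarithm is undefined for zero and negative targets. Such data would need a different strategy.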