Understanding Data Actions¶
EvalML streamlines the creation and implementation of machine learning models for tabular data. One of the many features it offers is data checks, which assess the health of the data before we train a model on it. Each of these data checks has associated actions, which we will demonstrate in this notebook. Our default data checks include the following:
HighlyNullDataCheck
: Checks whether the rows or columns are highly null
IDColumnsDataCheck
: Checks for columns that could be ID columns
TargetLeakageDataCheck
: Checks if any of the input features have high association with the targets
InvalidTargetDataCheck
: Checks if there are null or other invalid values in the target
NoVarianceDataCheck
: Checks if either the target or any features have no variance
NaturalLanguageNaNDataCheck
: Checks if any natural language columns have missing data
DateTimeNaNDataCheck
: Checks if any datetime columns have missing data
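To make the intent of these checks concrete, here is a minimal sketch of the kind of computation a check like HighlyNullDataCheck performs, written as a plain-Python stand-in. The function name and threshold here are illustrative, not EvalML's API; the real check also inspects rows and returns structured warning dictionaries.

```python
# Illustrative stand-in for a "highly null" data check: flag any column
# whose fraction of missing values meets a threshold.

def highly_null_columns(columns, threshold=0.95):
    """Return {column_name: null_fraction} for columns at or above threshold."""
    flagged = {}
    for name, values in columns.items():
        null_fraction = sum(v is None for v in values) / len(values)
        if null_fraction >= threshold:
            flagged[name] = null_fraction
    return flagged

data = {
    "mostly_nulls": [None] * 19 + [1],  # 95% null
    "amount": [100] * 20,               # fully populated
}
print(highly_null_columns(data))  # {'mostly_nulls': 0.95}
```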
EvalML has additional data checks that can be seen here, with usage examples here. Below, we will walk through usage of EvalML’s default data checks and actions.
First, we import the necessary requirements to demonstrate these checks.
[1]:
import woodwork as ww
import pandas as pd
from evalml import AutoMLSearch
from evalml.demos import load_fraud
from evalml.preprocessing import split_data
Let’s look at the input feature data. EvalML uses the Woodwork library to represent this data. The demo data that EvalML returns is a pandas DataFrame with Woodwork typing information initialized.
[2]:
X, y = load_fraud(n_rows=1500)
X
Number of Features
Boolean 1
Categorical 6
Numeric 5
Number of training examples: 1500
Targets
False 86.60%
True 13.40%
Name: fraud, dtype: object
[2]:
card_id | store_id | datetime | amount | currency | customer_present | expiration_date | provider | lat | lng | region | country | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
id | ||||||||||||
0 | 32261 | 8516 | 2019-01-01 00:12:26 | 24900 | CUC | True | 08/24 | Mastercard | 38.58894 | -89.99038 | Fairview Heights | US |
1 | 16434 | 8516 | 2019-01-01 09:42:03 | 15789 | MYR | False | 11/21 | Discover | 38.58894 | -89.99038 | Fairview Heights | US |
2 | 23468 | 8516 | 2019-04-17 08:17:01 | 1883 | AUD | False | 09/27 | Discover | 38.58894 | -89.99038 | Fairview Heights | US |
3 | 14364 | 8516 | 2019-01-30 11:54:30 | 82120 | KRW | True | 09/20 | JCB 16 digit | 38.58894 | -89.99038 | Fairview Heights | US |
4 | 29407 | 8516 | 2019-05-01 17:59:36 | 25745 | MUR | True | 09/22 | American Express | 38.58894 | -89.99038 | Fairview Heights | US |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1495 | 9389 | 9389 | 2019-06-05 04:09:15 | 70510 | XDR | False | 06/20 | VISA 16 digit | 9.53500 | -13.68778 | Camayenne | GN |
1496 | 21867 | 7301 | 2019-03-02 18:32:43 | 75234 | CNY | True | 02/23 | Diners Club / Carte Blanche | 37.49073 | -2.77259 | Baza | ES |
1497 | 10016 | 7301 | 2019-02-05 06:14:25 | 61331 | LKR | True | 02/27 | VISA 13 digit | 37.49073 | -2.77259 | Baza | ES |
1498 | 15971 | 7301 | 2019-01-03 13:19:19 | 77846 | GTQ | True | 01/24 | JCB 15 digit | 37.49073 | -2.77259 | Baza | ES |
1499 | 23638 | 7301 | 2019-04-26 17:20:41 | 76886 | USD | True | 02/29 | American Express | 37.49073 | -2.77259 | Baza | ES |
1500 rows × 12 columns
Adding noise and unclean data¶
This data is already clean and compatible with EvalML’s AutoMLSearch
. To demonstrate EvalML’s default data checks, we will add the following:
A column of mostly null values (<0.5% non-null)
A column with low/no variance
A row of null values
A missing target value
We will add the first two columns to the whole dataset and we will only add the last two to the training data. Note: these only represent some of the scenarios that EvalML default data checks can catch.
[3]:
# add a column with no variance in the data
X['no_variance'] = [1 for _ in range(X.shape[0])]
# add a column with >99.5% null values
X['mostly_nulls'] = [None] * (X.shape[0] - 5) + [i for i in range(5)]
# since we changed the data, let's reinitialize the woodwork datatable
X.ww.init()
# let's split some training and validation data
X_train, X_valid, y_train, y_valid = split_data(X, y, problem_type='binary')
[4]:
# let's copy the datetime at row 1 for future use
date = X_train.iloc[1]['datetime']
# make row 1 all nan values
X_train.iloc[1] = [None] * X_train.shape[1]
# make one of the target values null
y_train[990] = None
X_train.ww.init()
y_train = ww.init_series(y_train)
# Let's take another look at the new X_train data
X_train
[4]:
card_id | store_id | datetime | amount | currency | customer_present | expiration_date | provider | lat | lng | region | country | no_variance | mostly_nulls | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | ||||||||||||||
872 | 15492.0 | 2868.0 | 2019-08-03 02:50:04 | 80719.0 | HNL | True | 08/27 | American Express | 5.47090 | 100.24529 | Batu Feringgi | MY | 1.0 | NaN |
1477 | NaN | NaN | NaT | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
158 | 22440.0 | 6813.0 | 2019-07-12 11:07:25 | 1849.0 | SEK | True | 09/20 | American Express | 26.26490 | 81.54855 | Jais | IN | 1.0 | NaN |
808 | 8096.0 | 8096.0 | 2019-06-11 21:33:36 | 41358.0 | MOP | True | 04/29 | VISA 13 digit | 59.37722 | 28.19028 | Narva | EE | 1.0 | NaN |
336 | 33270.0 | 1529.0 | 2019-03-23 21:44:00 | 32594.0 | CUC | False | 04/22 | Mastercard | 51.39323 | 0.47713 | Strood | GB | 1.0 | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
339 | 8484.0 | 5358.0 | 2019-01-10 07:47:28 | 89503.0 | GMD | False | 11/24 | Maestro | 47.30997 | 8.52462 | Adliswil | CH | 1.0 | NaN |
1383 | 17565.0 | 3929.0 | 2019-01-15 01:11:02 | 14264.0 | DKK | True | 06/20 | VISA 13 digit | 50.72043 | 11.34046 | Rudolstadt | DE | 1.0 | NaN |
893 | 108.0 | 44.0 | 2019-05-17 00:53:39 | 93218.0 | SLL | True | 12/24 | JCB 16 digit | 15.72892 | 120.57224 | Burgos | PH | 1.0 | NaN |
385 | 29983.0 | 152.0 | 2019-06-09 06:50:29 | 41105.0 | RWF | False | 07/20 | JCB 16 digit | -6.80000 | 39.25000 | Magomeni | TZ | 1.0 | NaN |
1074 | 26197.0 | 4927.0 | 2019-05-22 15:57:27 | 50481.0 | MNT | False | 05/26 | JCB 15 digit | 41.00510 | -73.78458 | Scarsdale | US | 1.0 | NaN |
1200 rows × 14 columns
If we call AutoMLSearch.search()
on this data, the search will fail due to the columns and issues we’ve added above. Note: we use a try/except here to catch the resulting ValueError that AutoMLSearch raises.
[5]:
automl = AutoMLSearch(X_train=X_train, y_train=y_train, problem_type='binary')
try:
automl.search()
except ValueError as e:
# to make the error message more distinct
print("=" * 80, "\n")
print("Search errored out! Message received is: {}".format(e))
print("=" * 80, "\n")
================================================================================
Search errored out! Message received is: Input contains NaN, infinity or a value too large for dtype('float64').
================================================================================
We can use the search_iterative()
function provided in EvalML to determine what potential health issues our data has. search_iterative() is a public function available through evalml.automl
and is distinct from the search() method of the AutoMLSearch
class. It runs the default data checks on the data and, if there are no errors, automatically runs AutoMLSearch.search()
.
[6]:
from evalml.automl import search_iterative
results = search_iterative(X_train, y_train, problem_type='binary')
results
[6]:
(None,
{'warnings': [{'message': '1 out of 1200 rows are 95.0% or more null',
'data_check_name': 'HighlyNullDataCheck',
'level': 'warning',
'details': {'columns': None,
'rows': [1477],
'pct_null_cols': id
1477 1.0
dtype: float64},
'code': 'HIGHLY_NULL_ROWS'},
{'message': "Columns 'mostly_nulls' are 95.0% or more null",
'data_check_name': 'HighlyNullDataCheck',
'level': 'warning',
'details': {'columns': ['mostly_nulls'],
'rows': None,
'pct_null_rows': {'mostly_nulls': 0.9966666666666667},
'null_row_indices': {'mostly_nulls': [872, 1477, 158, 808, 336, 1262, 685, 789, ...]}},
'code': 'HIGHLY_NULL_COLS'}],
'errors': [{'message': '1 row(s) (0.08333333333333334%) of target values are null',
'data_check_name': 'InvalidTargetDataCheck',
'level': 'error',
'details': {'columns': None,
'rows': None,
'num_null_rows': 1,
'pct_null_rows': 0.08333333333333334},
'code': 'TARGET_HAS_NULL'},
{'message': "'no_variance' has 1 unique value.",
'data_check_name': 'NoVarianceDataCheck',
'level': 'error',
'details': {'columns': ['no_variance'], 'rows': None},
'code': 'NO_VARIANCE'},
{'message': 'Input datetime column(s) (datetime) contains NaN values. Please impute NaN values or drop these rows or columns.',
'data_check_name': 'DateTimeNaNDataCheck',
'level': 'error',
'details': {'columns': ['datetime'], 'rows': None},
'code': 'DATETIME_HAS_NAN'}],
'actions': [{'code': 'DROP_ROWS',
'metadata': {'columns': None, 'rows': [1477]}},
{'code': 'DROP_COL',
'metadata': {'columns': ['mostly_nulls'], 'rows': None}},
{'code': 'IMPUTE_COL',
'metadata': {'columns': None,
'rows': None,
'is_target': True,
'impute_strategy': 'most_frequent'}},
{'code': 'DROP_COL',
'metadata': {'columns': ['no_variance'], 'rows': None}}]})
The return value of the search_iterative
function above is a tuple. The first element is the AutoMLSearch
object if the search runs (and None
otherwise), and the second element is a dictionary of potential warnings and errors that the default data checks find on the passed-in X
and y
data. In this dictionary, warnings are suggestions from the data checks that can be useful to address in order to improve the search, but they will not break AutoMLSearch. Errors, on the other hand, will break
AutoMLSearch and must be addressed by the user.
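As a small illustration of working with this dictionary, the warning and error entries can be separated by their codes. The results_dict below is hand-built to mirror the shape of the output above, not produced by a live search:

```python
# Hand-built dictionary mirroring the shape of a data-check results dict
# (illustrative only -- real entries also carry messages and details).
results_dict = {
    "warnings": [
        {"code": "HIGHLY_NULL_ROWS", "level": "warning"},
        {"code": "HIGHLY_NULL_COLS", "level": "warning"},
    ],
    "errors": [
        {"code": "TARGET_HAS_NULL", "level": "error"},
        {"code": "NO_VARIANCE", "level": "error"},
        {"code": "DATETIME_HAS_NAN", "level": "error"},
    ],
}

# Errors must be fixed before AutoMLSearch will run; warnings are optional.
error_codes = [entry["code"] for entry in results_dict["errors"]]
warning_codes = [entry["code"] for entry in results_dict["warnings"]]
print(error_codes)    # ['TARGET_HAS_NULL', 'NO_VARIANCE', 'DATETIME_HAS_NAN']
print(warning_codes)  # ['HIGHLY_NULL_ROWS', 'HIGHLY_NULL_COLS']
```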
Addressing DataCheck errors¶
We will show that we can address errors to allow AutoMLSearch to run. However, ignoring warnings may come at the expense of model performance.
We can print out the errors first to make it easier to read, and then we’ll create new features and targets from the original training data.
[7]:
results[1]['errors']
[7]:
[{'message': '1 row(s) (0.08333333333333334%) of target values are null',
'data_check_name': 'InvalidTargetDataCheck',
'level': 'error',
'details': {'columns': None,
'rows': None,
'num_null_rows': 1,
'pct_null_rows': 0.08333333333333334},
'code': 'TARGET_HAS_NULL'},
{'message': "'no_variance' has 1 unique value.",
'data_check_name': 'NoVarianceDataCheck',
'level': 'error',
'details': {'columns': ['no_variance'], 'rows': None},
'code': 'NO_VARIANCE'},
{'message': 'Input datetime column(s) (datetime) contains NaN values. Please impute NaN values or drop these rows or columns.',
'data_check_name': 'DateTimeNaNDataCheck',
'level': 'error',
'details': {'columns': ['datetime'], 'rows': None},
'code': 'DATETIME_HAS_NAN'}]
[8]:
# copy the data to new variables
X_train_no_errors = X_train.copy()
y_train_no_errors = y_train.copy()
# We address the errors by looking at the resulting dictionary errors listed
# first, let's address the `TARGET_HAS_NULL` error
y_train_no_errors.fillna(False, inplace=True)
# here, we address the `NO_VARIANCE` error
X_train_no_errors.drop("no_variance", axis=1, inplace=True)
# lastly, we address the `DATETIME_HAS_NAN` error with the date we had saved earlier
X_train_no_errors.iloc[1, 2] = date
# let's reinitialize the Woodwork DataTable
X_train_no_errors.ww.init()
X_train_no_errors.head()
[8]:
card_id | store_id | datetime | amount | currency | customer_present | expiration_date | provider | lat | lng | region | country | mostly_nulls | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||
872 | 15492.0 | 2868.0 | 2019-08-03 02:50:04 | 80719.0 | HNL | True | 08/27 | American Express | 5.47090 | 100.24529 | Batu Feringgi | MY | NaN |
1477 | NaN | NaN | 2019-08-05 21:05:57 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
158 | 22440.0 | 6813.0 | 2019-07-12 11:07:25 | 1849.0 | SEK | True | 09/20 | American Express | 26.26490 | 81.54855 | Jais | IN | NaN |
808 | 8096.0 | 8096.0 | 2019-06-11 21:33:36 | 41358.0 | MOP | True | 04/29 | VISA 13 digit | 59.37722 | 28.19028 | Narva | EE | NaN |
336 | 33270.0 | 1529.0 | 2019-03-23 21:44:00 | 32594.0 | CUC | False | 04/22 | Mastercard | 51.39323 | 0.47713 | Strood | GB | NaN |
We can now run search on X_train_no_errors
and y_train_no_errors
. Note that the search doesn’t fail now that we have addressed the errors, but warnings are still present in the returned tuple. This search allows the mostly_nulls
column to remain in the features during search.
[9]:
results_no_errors = search_iterative(X_train_no_errors, y_train_no_errors, problem_type='binary')
results_no_errors
High coefficient of variation (cv >= 0.5) within cross validation scores.
Decision Tree Classifier w/ Label Encoder + DateTime Featurization Component + Imputer + One Hot Encoder + Oversampler may not perform as estimated on unseen data.
[9]:
(<evalml.automl.automl_search.AutoMLSearch at 0x7f658e601af0>,
{'warnings': [{'message': "Columns 'mostly_nulls' are 95.0% or more null",
'data_check_name': 'HighlyNullDataCheck',
'level': 'warning',
'details': {'columns': ['mostly_nulls'],
'rows': None,
'pct_null_rows': {'mostly_nulls': 0.9966666666666667},
'null_row_indices': {'mostly_nulls': [872, 1477, 158, 808, 336, 1262, 685, 789, ...]}},
'code': 'HIGHLY_NULL_COLS'}],
'errors': [],
'actions': [{'code': 'DROP_COL',
'metadata': {'columns': ['mostly_nulls'], 'rows': None}}]})
Addressing all warnings and errors¶
We can look at the actions
key of the dictionary to see how to fix and clean all of the data. Addressing both the warnings and the errors should give us a better model.
[10]:
results[1]['actions']
[10]:
[{'code': 'DROP_ROWS', 'metadata': {'columns': None, 'rows': [1477]}},
{'code': 'DROP_COL', 'metadata': {'columns': ['mostly_nulls'], 'rows': None}},
{'code': 'IMPUTE_COL',
'metadata': {'columns': None,
'rows': None,
'is_target': True,
'impute_strategy': 'most_frequent'}},
{'code': 'DROP_COL', 'metadata': {'columns': ['no_variance'], 'rows': None}}]
There are four actions we can take to clean the data. Three of them ask us to drop a row or column from the features, while one asks us to impute the target value.
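EvalML does not yet ship a one-call helper to apply these actions (see the note at the end of this notebook), but the dispatch logic is straightforward. Below is an illustrative, plain-Python sketch of such a dispatcher; apply_actions is hypothetical (not an EvalML API), it operates on a dict of column lists and a list target, and it treats 'rows' as positional indices, whereas the real action metadata refers to index labels.

```python
# Hypothetical dispatcher over an action list shaped like the output above.
# The action codes (DROP_ROWS, DROP_COL, IMPUTE_COL) match EvalML's output;
# the cleaning logic itself is a simplified stand-in.
from collections import Counter

def apply_actions(columns, target, actions):
    for action in actions:
        meta = action["metadata"]
        if action["code"] == "DROP_COL":
            for name in meta["columns"]:
                columns.pop(name, None)
        elif action["code"] == "DROP_ROWS":
            # delete highest positions first so earlier positions stay valid
            for row in sorted(meta["rows"], reverse=True):
                for values in columns.values():
                    del values[row]
                del target[row]
        elif action["code"] == "IMPUTE_COL" and meta.get("is_target"):
            # 'most_frequent' strategy: fill nulls with the modal target value
            fill = Counter(v for v in target if v is not None).most_common(1)[0][0]
            target[:] = [fill if v is None else v for v in target]
    return columns, target

columns = {"no_variance": [1, 1, 1], "mostly_nulls": [None, None, 7], "amount": [10, 20, 30]}
target = [False, True, None]
actions = [
    {"code": "DROP_ROWS", "metadata": {"columns": None, "rows": [1]}},
    {"code": "DROP_COL", "metadata": {"columns": ["mostly_nulls"], "rows": None}},
    {"code": "IMPUTE_COL", "metadata": {"columns": None, "rows": None,
                                        "is_target": True, "impute_strategy": "most_frequent"}},
    {"code": "DROP_COL", "metadata": {"columns": ["no_variance"], "rows": None}},
]
columns, target = apply_actions(columns, target, actions)
print(columns, target)  # {'amount': [10, 30]} [False, False]
```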
[11]:
# The first action states to drop the row given by the action code
X_train.drop(1477, axis=0, inplace=True)
# we must also drop this for y since we are removing its associated feature input
y_train.drop(index=1477, inplace=True)
print("The new length of X_train is {} and y_train is {}".format(len(X_train),len(y_train)))
The new length of X_train is 1199 and y_train is 1199
[12]:
# Remove the 'mostly_nulls' column from X_train, which is the second action item
X_train.drop('mostly_nulls', axis=1, inplace=True)
X_train.head()
[12]:
card_id | store_id | datetime | amount | currency | customer_present | expiration_date | provider | lat | lng | region | country | no_variance | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||
872 | 15492.0 | 2868.0 | 2019-08-03 02:50:04 | 80719.0 | HNL | True | 08/27 | American Express | 5.47090 | 100.24529 | Batu Feringgi | MY | 1.0 |
158 | 22440.0 | 6813.0 | 2019-07-12 11:07:25 | 1849.0 | SEK | True | 09/20 | American Express | 26.26490 | 81.54855 | Jais | IN | 1.0 |
808 | 8096.0 | 8096.0 | 2019-06-11 21:33:36 | 41358.0 | MOP | True | 04/29 | VISA 13 digit | 59.37722 | 28.19028 | Narva | EE | 1.0 |
336 | 33270.0 | 1529.0 | 2019-03-23 21:44:00 | 32594.0 | CUC | False | 04/22 | Mastercard | 51.39323 | 0.47713 | Strood | GB | 1.0 |
1262 | 22829.0 | 8390.0 | 2019-01-23 04:39:29 | 86149.0 | SCR | True | 03/25 | Maestro | 48.84490 | 17.22635 | Skalica | SK | 1.0 |
[13]:
# Address the null in targets, which is the third action item
y_train.fillna(False, inplace=True)
y_train.isna().any()
[13]:
False
[14]:
# Finally, we can drop the 'no_variance' column, which is the final action item
X_train.drop('no_variance', axis=1, inplace=True)
X_train.head()
[14]:
card_id | store_id | datetime | amount | currency | customer_present | expiration_date | provider | lat | lng | region | country | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
id | ||||||||||||
872 | 15492.0 | 2868.0 | 2019-08-03 02:50:04 | 80719.0 | HNL | True | 08/27 | American Express | 5.47090 | 100.24529 | Batu Feringgi | MY |
158 | 22440.0 | 6813.0 | 2019-07-12 11:07:25 | 1849.0 | SEK | True | 09/20 | American Express | 26.26490 | 81.54855 | Jais | IN |
808 | 8096.0 | 8096.0 | 2019-06-11 21:33:36 | 41358.0 | MOP | True | 04/29 | VISA 13 digit | 59.37722 | 28.19028 | Narva | EE |
336 | 33270.0 | 1529.0 | 2019-03-23 21:44:00 | 32594.0 | CUC | False | 04/22 | Mastercard | 51.39323 | 0.47713 | Strood | GB |
1262 | 22829.0 | 8390.0 | 2019-01-23 04:39:29 | 86149.0 | SCR | True | 03/25 | Maestro | 48.84490 | 17.22635 | Skalica | SK |
[15]:
# let's reinitialize the dataframe using Woodwork and try the search again
X_train.ww.init()
results_cleaned = search_iterative(X_train, y_train, problem_type='binary')
High coefficient of variation (cv >= 0.5) within cross validation scores.
Decision Tree Classifier w/ Label Encoder + DateTime Featurization Component + Imputer + One Hot Encoder + Oversampler may not perform as estimated on unseen data.
Note that this time, we get an AutoMLSearch
object returned to us, along with an empty dictionary of warnings and errors. We can use the AutoMLSearch
object as needed.
[16]:
aml = results_cleaned[0]
aml.rankings
[16]:
id | pipeline_name | search_order | mean_cv_score | standard_deviation_cv_score | validation_score | percent_better_than_baseline | high_variance_cv | parameters | |
---|---|---|---|---|---|---|---|---|---|
0 | 3 | XGBoost Classifier w/ Label Encoder + DateTime... | 3 | 0.246143 | 0.040849 | 0.291312 | 94.692653 | False | {'DateTime Featurization Component': {'feature... |
1 | 6 | Random Forest Classifier w/ Label Encoder + Da... | 6 | 0.269118 | 0.007188 | 0.277285 | 94.197251 | False | {'DateTime Featurization Component': {'feature... |
2 | 4 | LightGBM Classifier w/ Label Encoder + DateTim... | 4 | 0.343009 | 0.059643 | 0.402779 | 92.604012 | False | {'DateTime Featurization Component': {'feature... |
3 | 8 | Extra Trees Classifier w/ Label Encoder + Date... | 8 | 0.356543 | 0.006981 | 0.364553 | 92.312192 | False | {'DateTime Featurization Component': {'feature... |
4 | 1 | Elastic Net Classifier w/ Label Encoder + Date... | 1 | 0.390674 | 0.022512 | 0.408353 | 91.576270 | False | {'DateTime Featurization Component': {'feature... |
5 | 2 | Logistic Regression Classifier w/ Label Encode... | 2 | 0.393342 | 0.022323 | 0.410692 | 91.518737 | False | {'DateTime Featurization Component': {'feature... |
6 | 5 | CatBoost Classifier w/ Label Encoder + DateTim... | 5 | 0.546942 | 0.001789 | 0.548985 | 88.206800 | False | {'DateTime Featurization Component': {'feature... |
7 | 7 | Decision Tree Classifier w/ Label Encoder + Da... | 7 | 1.094903 | 0.411451 | 1.036423 | 76.391644 | True | {'DateTime Featurization Component': {'feature... |
8 | 0 | Mode Baseline Binary Classification Pipeline | 0 | 4.637776 | 0.043230 | 4.662735 | 0.000000 | False | {'Baseline Classifier': {'strategy': 'mode'}} |
[17]:
data_check_results = results_cleaned[1]
data_check_results
[17]:
{'warnings': [], 'errors': [], 'actions': []}
Comparing removing only errors versus removing both warnings and errors¶
Let’s see the differences in model performance when we remove only errors versus both warnings and errors. To do this, we compare the performance of the best pipelines on the validation data. Remember that the search where we only addressed errors still has the mostly_nulls
column in its data, so we leave that column in the validation data for that search. We drop the no_variance
column for both searches.
Additionally, we do some logical type setting since we had added additional noise to just the training data. This allows the data to be of the same types in both training and validation.
[18]:
# drop the no_variance column
X_valid.drop("no_variance", axis=1, inplace=True)
# logical type management
X_valid.ww.init(logical_types={"customer_present": "Categorical"})
y_valid = ww.init_series(y_valid, logical_type="Categorical")
best_pipeline_no_errors = results_no_errors[0].best_pipeline
print("Only dropping errors:", best_pipeline_no_errors.score(X_valid, y_valid, ["Log Loss Binary"]), "\n")
# drop the mostly_nulls column and reinitialize the DataTable
X_valid.drop("mostly_nulls", axis=1, inplace=True)
X_valid.ww.init()
best_pipeline_clean = results_cleaned[0].best_pipeline
print("Addressing all actions:", best_pipeline_clean.score(X_valid, y_valid, ["Log Loss Binary"]), "\n")
Only dropping errors: OrderedDict([('Log Loss Binary', 0.23710868120188716)])
Addressing all actions: OrderedDict([('Log Loss Binary', 0.22485581094121954)])
We can compare model performance when we address all action items (warnings and errors) versus when we address only errors. While addressing all actions is not guaranteed to yield better performance, we recommend doing so, since these issues are only raised when we believe the flagged features could negatively impact, or fail to benefit, the search.
In the future, we aim to provide a helper function to allow users to quickly clean the data by taking in the list of actions and creating an appropriate pipeline of transformers to alter the data.