Using Text Data with EvalML#
In this demo, we will show you how to use EvalML to build models which use text data.
[1]:
import evalml
from evalml import AutoMLSearch
Dataset#
We will use a dataset of SMS text messages, some of which are labeled as spam and the rest as ham (not spam). This dataset originally comes from Kaggle, but has been modified to produce a slightly more even distribution of spam to ham.
[2]:
from urllib.request import urlopen
import pandas as pd
input_data = urlopen(
    "https://featurelabs-static.s3.amazonaws.com/spam_text_messages_modified.csv"
)
data = pd.read_csv(input_data)[:750]
X = data.drop(["Category"], axis=1)
y = data["Category"]
display(X.head())
|   | Message |
|---|---------|
| 0 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 1 | FreeMsg Hey there darling it's been 3 week's n... |
| 2 | WINNER!! As a valued network customer you have... |
| 3 | Had your mobile 11 months or more? U R entitle... |
| 4 | SIX chances to win CASH! From 100 to 20,000 po... |
The spam vs. ham distribution of the data is roughly 3:2, so any machine learning model must get above about 60% accuracy in order to perform better than a trivial baseline model which simply classifies every message as spam (the majority class, as the class proportions below confirm).
[3]:
y.value_counts(normalize=True)
[3]:
Category
spam 0.593333
ham 0.406667
Name: proportion, dtype: float64
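As a quick sanity check on that baseline figure, here is a minimal sketch (plain pandas, using the y series loaded above) that computes the accuracy a majority-class classifier would achieve on this data:

# The trivial baseline predicts the majority class ("spam" here) for every message,
# so its accuracy is just the proportion of the majority class in the labels.
majority_class = y.value_counts().idxmax()
baseline_accuracy = (y == majority_class).mean()
print(f"Baseline predicts '{majority_class}' with accuracy {baseline_accuracy:.3f}")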
In order to properly utilize Woodwork's 'Natural Language' typing, we need to pass in this logical type for the Message column during Woodwork initialization. Otherwise, the column will be treated as an 'Unknown' type and dropped during the search.
[4]:
X.ww.init(logical_types={"Message": "NaturalLanguage"})
Search for best pipeline#
In order to validate the results of the pipeline creation and optimization process, we will save some of our data as a holdout set.
[5]:
X_train, X_holdout, y_train, y_holdout = evalml.preprocessing.split_data(
    X, y, problem_type="binary", test_size=0.2, random_seed=0
)
EvalML uses Woodwork to automatically detect which columns are text columns, so you can run search normally, just as you would if there were no text data. We can print out the logical type of the Message column and verify that it has indeed been inferred as a natural language column.
[6]:
X_train.ww
[6]:
| Column  | Physical Type | Logical Type    | Semantic Tag(s) |
|---------|---------------|-----------------|-----------------|
| Message | string        | NaturalLanguage | []              |
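To check this programmatically rather than by eye, a minimal sketch along these lines should work (it only assumes Woodwork, which EvalML already depends on):

from woodwork.logical_types import NaturalLanguage

# Confirm the Message column kept its NaturalLanguage logical type through the split
assert isinstance(X_train.ww.logical_types["Message"], NaturalLanguage)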
Because the spam/ham labels are binary, we will use AutoMLSearch(X_train=X_train, y_train=y_train, problem_type='binary'). When we call .search(), the search for the best pipeline will begin.
[7]:
automl = AutoMLSearch(
    X_train=X_train,
    y_train=y_train,
    problem_type="binary",
    max_batches=1,
    optimize_thresholds=True,
    verbose=True,
)
automl.search(interactive_plot=False)
AutoMLSearch will use mean CV score to rank pipelines.
*****************************
* Beginning pipeline search *
*****************************
Optimizing for Log Loss Binary.
Lower score is better.
Using SequentialEngine to train and score pipelines.
Searching up to 1 batches for a total of None pipelines.
Allowed model families:
Evaluating Baseline Pipeline: Mode Baseline Binary Classification Pipeline
Mode Baseline Binary Classification Pipeline:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 14.658
*****************************
* Evaluating Batch Number 1 *
*****************************
Random Forest Classifier w/ Label Encoder + Natural Language Featurizer + Imputer + RF Classifier Select From Model:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.249
Search finished after 7.79 seconds
Best pipeline: Random Forest Classifier w/ Label Encoder + Natural Language Featurizer + Imputer + RF Classifier Select From Model
Best pipeline Log Loss Binary: 0.248763
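With the search complete, one way to make use of the holdout set we split off earlier is to inspect the pipeline rankings and score the best pipeline on the held-out data. A minimal sketch follows; the objectives listed are illustrative, and any supported binary classification objectives could be substituted:

# Inspect the leaderboard of evaluated pipelines
automl.rankings

# Score the best pipeline on the holdout data we set aside earlier
best_pipeline = automl.best_pipeline
best_pipeline.score(X_holdout, y_holdout, objectives=["log loss binary", "f1", "auc"])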