Using Text Data with EvalML

This demo shows how to use EvalML to build models that use text data.

[1]:
import evalml
from evalml import AutoMLSearch

Dataset

We will use a dataset of SMS text messages, some of which are categorized as spam and others as not spam ("ham"). The dataset originally comes from Kaggle and has been modified to produce a slightly more even distribution of spam to ham. Only the first 750 rows are loaded here.

[2]:
from urllib.request import urlopen
import pandas as pd

input_data = urlopen(
    "https://featurelabs-static.s3.amazonaws.com/spam_text_messages_modified.csv"
)
data = pd.read_csv(input_data)[:750]

X = data.drop(["Category"], axis=1)
y = data["Category"]

display(X.head())
Message
0 Free entry in 2 a wkly comp to win FA Cup fina...
1 FreeMsg Hey there darling it's been 3 week's n...
2 WINNER!! As a valued network customer you have...
3 Had your mobile 11 months or more? U R entitle...
4 SIX chances to win CASH! From 100 to 20,000 po...

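The class distribution sets the bar for a trivial baseline. In the 750 rows loaded here, the output below shows roughly 59% spam and 41% ham, so a machine learning model must exceed about 59% accuracy to perform better than a trivial baseline that classifies every message as spam. (With a 3:1 ham-to-spam split, that bar would instead be 75%, reached by classifying everything as ham.)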

[3]:
y.value_counts(normalize=True)
[3]:
spam    0.593333
ham     0.406667
Name: Category, dtype: float64
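The baseline reasoning above can be checked with a few lines of standard-library Python. This is a minimal sketch, not part of EvalML: the helper `majority_baseline_accuracy` is a hypothetical name, and the label counts (445 spam, 305 ham) are reconstructed from the proportions printed above for 750 rows rather than read from the dataset.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Return the most common label and the accuracy of always predicting it."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)


# Assumed counts, reconstructed from the printed proportions
# (0.593333 * 750 = 445 spam, 0.406667 * 750 = 305 ham).
labels = ["spam"] * 445 + ["ham"] * 305

label, accuracy = majority_baseline_accuracy(labels)
print(f"Always predict {label!r}: accuracy = {accuracy:.4f}")
```

Any model worth keeping should score above this accuracy on held-out data.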

To use Woodwork's NaturalLanguage logical type for the Message column, we specify it explicitly when initializing Woodwork on the data. Otherwise, the column would be treated as an "Unknown" type and dropped during the search.

[4]:
X.ww.init(logical_types={"Message": "NaturalLanguage"})

Search for best pipeline

To validate the results of the pipeline creation and optimization process, we set aside 20% of the data as a holdout set.

[5]:
X_train, X_holdout, y_train, y_holdout = evalml.preprocessing.split_data(
    X, y, problem_type="binary", test_size=0.2, random_seed=0
)
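To illustrate what a class-preserving holdout split does, here is a minimal standard-library sketch. It is not EvalML's implementation of `split_data`. The function `stratified_split` is a hypothetical helper, it assumes a stratified split is wanted, and the label counts are reconstructed from the proportions shown earlier.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, holdout_idx) so each class keeps its proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, holdout_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within each class
        n_holdout = round(len(indices) * test_size)
        holdout_idx.extend(indices[:n_holdout])
        train_idx.extend(indices[n_holdout:])
    return sorted(train_idx), sorted(holdout_idx)


# Assumed counts: 445 spam and 305 ham out of 750 rows.
labels = ["spam"] * 445 + ["ham"] * 305
train_idx, holdout_idx = stratified_split(labels, test_size=0.2, seed=0)
holdout_counts = Counter(labels[i] for i in holdout_idx)
print(len(train_idx), len(holdout_idx), dict(holdout_counts))
```

Splitting within each class keeps the spam-to-ham ratio of the holdout set close to that of the training set, so scores on the two sets are comparable.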

EvalML uses Woodwork to identify which columns are text columns, so you can run the search as you would if there were no text data. We can print the Woodwork schema of the training data to confirm that the Message column has the NaturalLanguage logical type we assigned.

[6]:
X_train.ww
[6]:
Column   Physical Type  Logical Type     Semantic Tag(s)
Message  string         NaturalLanguage  []

Because the spam/ham labels are binary, we will use AutoMLSearch(X_train=X_train, y_train=y_train, problem_type='binary'). Calling .search() starts the search for the best pipeline.

[7]:
automl = AutoMLSearch(
    X_train=X_train,
    y_train=y_train,
    problem_type="binary",
    max_batches=1,
    optimize_thresholds=True,
    verbose=True,
)

automl.search(interactive_plot=False)
AutoMLSearch will use mean CV score to rank pipelines.

*****************************
* Beginning pipeline search *
*****************************

Optimizing for Log Loss Binary.
Lower score is better.

Using SequentialEngine to train and score pipelines.
Searching up to 1 batches for a total of None pipelines.
Allowed model families:

Evaluating Baseline Pipeline: Mode Baseline Binary Classification Pipeline
Mode Baseline Binary Classification Pipeline:
        Starting cross validation
        Finished cross validation - mean Log Loss Binary: 14.046

*****************************
* Evaluating Batch Number 1 *
*****************************

Logistic Regression Classifier w/ Label Encoder + Natural Language Featurizer + Imputer + Standard Scaler:
        Starting cross validation
        Finished cross validation - mean Log Loss Binary: 0.324
Random Forest Classifier w/ Label Encoder + Natural Language Featurizer + Imputer:
        Starting cross validation
        Finished cross validation - mean Log Loss Binary: 0.209

Search finished after 00:10
Best pipeline: Random Forest Classifier w/ Label Encoder + Natural Language Featurizer + Imputer
Best pipeline Log Loss Binary: 0.209360
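The baseline score of 14.046 in the log above looks large next to the trained pipelines, and a short calculation suggests why. The sketch below is a standard-library illustration under stated assumptions, not EvalML's scoring code: it assumes the mode baseline assigns probability 1 to the majority class, that probabilities are clipped to `[eps, 1 - eps]` with `eps = 1e-15` before taking the logarithm, and that the 600 training rows contain 356 spam and 244 ham messages (reconstructed from the proportions shown earlier).

```python
import math


def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary log loss; the probability of the true class is clipped."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        q = p if y == 1 else 1.0 - p       # probability given to the true class
        q = min(max(q, eps), 1.0 - eps)    # assumed clipping, avoids log(0)
        total += -math.log(q)
    return total / len(y_true)


# Assumed training labels: 1 = spam (356 rows), 0 = ham (244 rows).
y_true = [1] * 356 + [0] * 244

# Mode baseline: always predict spam with probability 1.
baseline = binary_log_loss(y_true, [1.0] * len(y_true))

# An uninformed predictor that always outputs 0.5 scores ln(2).
coin_flip = binary_log_loss(y_true, [0.5] * len(y_true))

print(f"mode baseline: {baseline:.3f}")
print(f"always 0.5:    {coin_flip:.3f}")
```

Under these assumptions, every ham message contributes about 34.5 to the loss, so the mean is dominated by the 41% of rows the baseline gets confidently wrong. Log loss penalizes confident mistakes heavily, which is why pipelines that output calibrated probabilities score far lower.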