
Week 4: Train a text classifier using Amazon SageMaker BlazingText built-in algorithm (lab)

by HYUNHP 2022. 7. 10.

Hello,

 

Today I would like to summarize "Analyze Datasets and Train ML Models using AutoML", the first course of the Practical Data Science Specialization offered by DeepLearning.AI and Amazon Web Services.

In "Analyze Datasets and Train ML Models using AutoML" you learn about exploratory data analysis (EDA), automated machine learning (AutoML), and text classification algorithms. The course is organized as follows.

 

~ Explore the Use Case and Analyze the Dataset

~ Data Bias and Feature Importance

~ Use Automated Machine Learning to train a Text Classifier

~ Built-in algorithms

This post covers the week 4 lab of "Analyze Datasets and Train ML Models using AutoML": "Train a text classifier using Amazon SageMaker BlazingText built-in algorithm".


■ Introduction

 

In this lab you will use SageMaker BlazingText built-in algorithm to predict the sentiment for each customer review. BlazingText is a variant of FastText which is based on word2vec. For more information on BlazingText, see the documentation here: https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext.html


CHAPTER 1. 'Prepare dataset'

 

CHAPTER 2. 'Train the model'

 

CHAPTER 3. 'Deploy the model'

 

CHAPTER 4. 'Test the model'


CHAPTER 0. 'Install and import modules'

 

# please ignore warning messages during the installation
!pip install --disable-pip-version-check -q sagemaker==2.35.0
!pip install --disable-pip-version-check -q nltk==3.5
import boto3
import sagemaker
import pandas as pd
import numpy as np
import botocore

config = botocore.config.Config(user_agent_extra='dlai-pds/c1/w4')

# low-level service client of the boto3 session
sm = boto3.client(service_name='sagemaker', 
                  config=config)

sm_runtime = boto3.client('sagemaker-runtime',
                          config=config)

sess = sagemaker.Session(sagemaker_client=sm,
                         sagemaker_runtime_client=sm_runtime)

bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
region = sess.boto_region_name

import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format='retina'

CHAPTER 1. 'Prepare dataset'

 

Let's adapt the dataset into a format that BlazingText understands. The BlazingText format is as follows:

__label__<label> "<features>"

 

Here are some examples:

__label__-1 "this is bad"
__label__0 "this is ok"
__label__1 "this is great"

Sentiment is one of three classes: negative (-1), neutral (0), or positive (1). BlazingText requires that __label__ is prepended to each sentiment value.
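
As a quick illustration of the format, here is a minimal sketch that builds one training line from a sentiment value and an already tokenized review. The helper name to_blazingtext_line is hypothetical (it is not part of BlazingText or the lab), and dropping embedded double quotes is an assumption made for this example.

```python
def to_blazingtext_line(sentiment, text):
    """Build one line in the '__label__<label> "<features>"' layout shown above."""
    # Double quotes inside the text would clash with the surrounding quotes,
    # so this sketch simply drops them (an assumption made for illustration).
    cleaned = str(text).replace('"', '')
    return '__label__{} "{}"'.format(sentiment, cleaned)


examples = [(-1, 'this is bad'), (0, 'this is ok'), (1, 'this is great')]
lines = [to_blazingtext_line(sentiment, text) for sentiment, text in examples]
for line in lines:
    print(line)
```

The printed lines match the three examples listed above.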

You will tokenize the review_body with the Natural Language Toolkit (nltk) for the model training. The nltk documentation can be found at https://www.nltk.org/. You will also use nltk later in this lab to tokenize reviews to use as inputs to the deployed model.


□ Load the dataset


Load the dataset into a pandas DataFrame:

 

!aws s3 cp 's3://dlai-practical-data-science/data/balanced/womens_clothing_ecommerce_reviews_balanced.csv' ./

path = './womens_clothing_ecommerce_reviews_balanced.csv'

df = pd.read_csv(path, delimiter=',')
df.head()


□ Transform the dataset


Now you will prepend __label__ to each sentiment value and tokenize the review body using the nltk module. Let's import the module and download the tokenizer:

 

import nltk
nltk.download('punkt')

 

The output of word tokenization can be converted into a string separated by spaces and saved in the dataframe. The transformed sentences are then easier for the model to process.

Let's define a prepare_data function which you will apply later to transform both training and validation datasets.


□ Exercise 1

 

Apply the tokenizer to each of the reviews in the review_body column of the dataframe df.

 

def tokenize(review):
    # delete commas and quotation marks, apply tokenization and join back into a string separating by spaces
    return ' '.join([str(token) for token in nltk.word_tokenize(str(review).replace(',', '').replace('"', '').lower())])
    
def prepare_data(df):
    df['sentiment'] = df['sentiment'].map(lambda sentiment : '__label__{}'.format(str(sentiment).replace('__label__', '')))
    ### BEGIN SOLUTION - DO NOT delete this comment for grading purposes
    df['review_body'] = df['review_body'].map(lambda review : tokenize(review)) # Replace all None
    ### END SOLUTION - DO NOT delete this comment for grading purposes
    return df

 

Apply the prepare_data function to the dataset.

 

df_blazingtext = df[['sentiment', 'review_body']].reset_index(drop=True)
df_blazingtext = prepare_data(df_blazingtext)
df_blazingtext.head()


□ Split the dataset into train and validation sets


Split the data into train (90%) and validation (10%) sets and visualize the split with a pie chart. You can do the split using the train_test_split function from scikit-learn.

 

from sklearn.model_selection import train_test_split

# Split all data into 90% train and 10% holdout
df_train, df_validation = train_test_split(df_blazingtext, 
                                           test_size=0.10,
                                           stratify=df_blazingtext['sentiment'])

labels = ['train', 'validation']
sizes = [len(df_train.index), len(df_validation.index)]
explode = (0.1, 0)  

fig1, ax1 = plt.subplots()

ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%', startangle=90)

# Equal aspect ratio ensures that pie is drawn as a circle.
ax1.axis('equal')  

plt.show()
print(len(df_train))
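
Because the split passes stratify=df_blazingtext['sentiment'], each class should make up roughly the same share of both sets. A small sketch for checking that, using only the standard library; the helper class_proportions and the toy label lists are hypothetical stand-ins for df_train['sentiment'] and df_validation['sentiment'].

```python
from collections import Counter


def class_proportions(labels):
    """Return {label: fraction of rows} for an iterable of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Toy labels standing in for df_train['sentiment'] and df_validation['sentiment'].
train_labels = ['__label__-1'] * 30 + ['__label__0'] * 30 + ['__label__1'] * 30
validation_labels = ['__label__-1'] * 3 + ['__label__0'] * 3 + ['__label__1'] * 3

train_props = class_proportions(train_labels)
validation_props = class_proportions(validation_labels)
print(train_props)
print(validation_props)
```

In the notebook, the same function could be called on the two sentiment columns directly; similar proportions in both outputs suggest the stratification worked.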


□ Upload the train and validation datasets to S3 bucket


You will use these to train and validate your model. Let's save them to the S3 bucket.
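
The upload calls that follow expect blazingtext_train_path and blazingtext_validation_path to point at local files, but this post does not show how those files are created. Below is a minimal sketch of one way to write them in the BlazingText layout with the standard csv module. The helper write_blazingtext_file, the toy rows, and the file name are all assumptions for illustration, not the lab's own code.

```python
import csv
import os
import tempfile


def write_blazingtext_file(rows, path):
    """Write (label, text) pairs as space-separated lines without a header."""
    with open(path, 'w', newline='', encoding='utf-8') as f:
        # QUOTE_MINIMAL wraps any field that contains the delimiter (a space)
        # in double quotes, which yields lines like: __label__1 "this is great"
        writer = csv.writer(f, delimiter=' ', quoting=csv.QUOTE_MINIMAL,
                            lineterminator='\n')
        writer.writerows(rows)
    return path


# Toy rows; in the notebook they would come from the split dataframes, e.g.
# zip(df_train['sentiment'], df_train['review_body']).
sample_rows = [('__label__1', 'this is great'), ('__label__-1', 'this is bad')]
sample_path = os.path.join(tempfile.gettempdir(), 'blazingtext_sample.csv')
blazingtext_sample_path = write_blazingtext_file(sample_rows, sample_path)

with open(blazingtext_sample_path, encoding='utf-8') as f:
    print(f.read())
```

In the notebook, blazingtext_train_path and blazingtext_validation_path could be produced the same way, for example with hypothetical file names such as ./train.csv and ./validation.csv.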

 

train_s3_uri = sess.upload_data(bucket=bucket, key_prefix='blazingtext/data', path=blazingtext_train_path)
validation_s3_uri = sess.upload_data(bucket=bucket, key_prefix='blazingtext/data', path=blazingtext_validation_path)

 


 

CHAPTER 2. 'Train the model'

Set up the BlazingText estimator. For more information on Estimators, see the SageMaker Python SDK documentation here: https://sagemaker.readthedocs.io/.

 

□ Exercise 2

Set up the container image to use for training with the BlazingText algorithm.

Instructions: Use the sagemaker.image_uris.retrieve function with the blazingtext algorithm, following this template:

image_uri = sagemaker.image_uris.retrieve(
    region=region,
    framework='...' # the name of framework or algorithm
)

Solution:

image_uri = sagemaker.image_uris.retrieve(
    region=region,
    ### BEGIN SOLUTION - DO NOT delete this comment for grading purposes
    framework='blazingtext' # Replace None
    ### END SOLUTION - DO NOT delete this comment for grading purposes
)

□ Exercise 3

 

Create an estimator instance passing the container image and other instance parameters.

 

Instructions: Pass the container image prepared above into the sagemaker.estimator.Estimator constructor.

Note: For the purposes of this lab, you will use a relatively small instance type. Refer to the Amazon SageMaker documentation for additional instance types that may work for your use case outside of this lab.

 

estimator = sagemaker.estimator.Estimator(
    ### BEGIN SOLUTION - DO NOT delete this comment for grading purposes
    image_uri=image_uri, # Replace None
    ### END SOLUTION - DO NOT delete this comment for grading purposes
    role=role, 
    instance_count=1, 
    instance_type='ml.m5.large',
    volume_size=30,
    max_run=7200,
    sagemaker_session=sess
)

 

Configure the hyper-parameters for BlazingText. You are using BlazingText for a supervised classification task. For more information on the hyper-parameters, see the documentation here: https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext-tuning.html

 

The hyperparameters that have the greatest impact on word2vec objective metrics are: learning_rate and vector_dim.

 

estimator.set_hyperparameters(mode='supervised',   # supervised (text classification)
                              epochs=10,           # number of complete passes through the dataset: 5 - 15
                              learning_rate=0.01,  # step size for the  numerical optimizer: 0.005 - 0.01
                              min_count=2,         # discard words that appear less than this number: 0 - 100                              
                              vector_dim=300,      # number of dimensions in vector space: 32-300
                              word_ngrams=3)       # number of words in a word n-gram: 1 - 3
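
The comments in the call above give a suggested range for each hyperparameter. As a hedged sketch, the check below flags values that fall outside those ranges before a training job is started. The ranges are copied from the comments in this post and are not limits enforced by the SDK; the helper out_of_range is hypothetical.

```python
# Ranges copied from the comments in the set_hyperparameters call above.
SUGGESTED_RANGES = {
    'epochs': (5, 15),
    'learning_rate': (0.005, 0.01),
    'min_count': (0, 100),
    'vector_dim': (32, 300),
    'word_ngrams': (1, 3),
}


def out_of_range(hyperparameters, ranges=SUGGESTED_RANGES):
    """Return the names of hyperparameters that fall outside their range."""
    problems = []
    for name, (low, high) in ranges.items():
        value = hyperparameters.get(name)
        if value is not None and not (low <= value <= high):
            problems.append(name)
    return problems


chosen = {'mode': 'supervised', 'epochs': 10, 'learning_rate': 0.01,
          'min_count': 2, 'vector_dim': 300, 'word_ngrams': 3}
print(out_of_range(chosen))                    # []
print(out_of_range({**chosen, 'epochs': 50}))  # ['epochs']
```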

 

To call the fit method of the estimator instance, you need to set up the input data channels. These can be organized as a dictionary:

data_channels = {
    'train': ..., # training data
    'validation': ... # validation data
}

where training and validation data are the Amazon SageMaker channels for S3 input data sources.


□ Exercise 4

 

Create a train data channel.

Instructions: Pass the S3 input path for training data into the sagemaker.inputs.TrainingInput function.

 

train_data = sagemaker.inputs.TrainingInput(
    ### BEGIN SOLUTION - DO NOT delete this comment for grading purposes
    train_s3_uri, # Replace None
    ### END SOLUTION - DO NOT delete this comment for grading purposes
    distribution='FullyReplicated', 
    content_type='text/plain', 
    s3_data_type='S3Prefix'
)

□ Exercise 5

 

Create a validation data channel.

Instructions: Pass the S3 input path for validation data into the sagemaker.inputs.TrainingInput function.

 

validation_data = sagemaker.inputs.TrainingInput(
    ### BEGIN SOLUTION - DO NOT delete this comment for grading purposes
    validation_s3_uri, # Replace None
    ### END SOLUTION - DO NOT delete this comment for grading purposes
    distribution='FullyReplicated', 
    content_type='text/plain', 
    s3_data_type='S3Prefix'
)

□ Exercise 6

 

Organize the data channels defined above as a dictionary.

 

data_channels = {
    ### BEGIN SOLUTION - DO NOT delete this comment for grading purposes
    'train': train_data, # Replace None
    'validation': validation_data # Replace None
    ### END SOLUTION - DO NOT delete this comment for grading purposes
}

□ Exercise 7

 

Start fitting the model to the dataset.

Instructions: Call the fit method of the estimator passing the configured train and validation inputs (data channels), following this template:

estimator.fit(
    inputs=..., # train and validation input
    wait=False # do not wait for the job to complete before continuing
)

Solution:

estimator.fit(
    ### BEGIN SOLUTION - DO NOT delete this comment for grading purposes
    inputs=data_channels, # Replace None
    ### END SOLUTION - DO NOT delete this comment for grading purposes
    wait=False
)

training_job_name = estimator.latest_training_job.name
print('Training Job Name:  {}'.format(training_job_name))

# Training Job Name:  blazingtext-2022-05-13-06-59-22-439

Wait for the training job to complete.

This cell will take approximately 5-10 minutes to run.

 

%%time

estimator.latest_training_job.wait(logs=False)
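
The wait call blocks until the job finishes. For reference, a rough sketch of the same idea written as explicit polling with the low-level client, which calls describe_training_job and reads TrainingJobStatus. The function wait_for_training_job and the FakeSageMakerClient class are hypothetical; the fake client exists only so the example runs without AWS access.

```python
import time

TERMINAL_STATUSES = {'Completed', 'Failed', 'Stopped'}


def wait_for_training_job(sm_client, job_name, poll_seconds=30, sleep=time.sleep):
    """Poll describe_training_job until the job reaches a terminal status."""
    while True:
        description = sm_client.describe_training_job(TrainingJobName=job_name)
        status = description['TrainingJobStatus']
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)


class FakeSageMakerClient:
    """Stand-in for the boto3 client so the sketch runs without AWS access."""

    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def describe_training_job(self, TrainingJobName):
        return {'TrainingJobName': TrainingJobName,
                'TrainingJobStatus': next(self._statuses)}


fake_client = FakeSageMakerClient(['InProgress', 'InProgress', 'Completed'])
final_status = wait_for_training_job(fake_client, 'example-job',
                                     poll_seconds=0, sleep=lambda seconds: None)
print(final_status)
```

In the notebook, the real client could be passed instead, for example wait_for_training_job(sm, training_job_name).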

 


 

CHAPTER 3. 'Deploy the model'

 

Now deploy the trained model as an Endpoint.

This cell will take approximately 5-10 minutes to run.

 

%%time

text_classifier = estimator.deploy(initial_instance_count=1,
                                   instance_type='ml.m5.large',
                                   serializer=sagemaker.serializers.JSONSerializer(),
                                   deserializer=sagemaker.deserializers.JSONDeserializer())

print()
print('Endpoint name:  {}'.format(text_classifier.endpoint_name))

CHAPTER 4. 'Test the model'

 

Import the nltk library to convert the raw reviews into tokens that BlazingText recognizes.

import nltk
nltk.download('punkt')

 

Specify sample reviews to predict the sentiment.

reviews = ['This product is great!',
           'OK, but not great',
           'This is not the right product.']

 

Tokenize the reviews and specify the payload to use when calling the REST API.

tokenized_reviews = [' '.join(nltk.word_tokenize(review)) for review in reviews]

payload = {"instances" : tokenized_reviews}
print(payload)
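
One thing to watch: the tokenize function used for training lowercases the text and removes commas and double quotes, while the list comprehension above only tokenizes. Below is a hedged sketch of applying the same cleanup at inference time. The helper normalize_review is hypothetical, and the whitespace tokenizer (str.split) is a stand-in so the example runs without nltk; in the notebook, nltk.word_tokenize could be passed as the tokenizer instead.

```python
def normalize_review(review, tokenizer=str.split):
    """Mirror the training-time cleanup: drop commas and quotes, lowercase, tokenize."""
    cleaned = str(review).replace(',', '').replace('"', '').lower()
    return ' '.join(tokenizer(cleaned))


sample_reviews = ['This product is great!',
                  'OK, but not great',
                  'This is not the right product.']

normalized_payload = {'instances': [normalize_review(review) for review in sample_reviews]}
print(normalized_payload)
```

Whether this changes the predictions depends on the model, so treat it as something to try rather than a required step.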

Now you can predict the sentiment for each review. Call the predict method of the text classifier passing the tokenized sentence instances (payload) into the data argument.

 

predictions = text_classifier.predict(data=payload)
for prediction in predictions:
    print('Predicted class: {}'.format(prediction['label'][0].replace('__label__', '')))
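
Each prediction carries the label with its __label__ prefix. A small sketch of pulling out both the class and its probability from a response; the sample values are made up, the 'prob' key is an assumption based on the documented BlazingText response format, and parse_prediction is a hypothetical helper.

```python
LABEL_PREFIX = '__label__'


def parse_prediction(prediction):
    """Return (class_name, probability) for one BlazingText prediction dict."""
    label = prediction['label'][0]
    # Remove the prefix explicitly; str.lstrip would strip a *set of characters*
    # rather than the prefix, which breaks for labels such as 'bad'.
    if label.startswith(LABEL_PREFIX):
        label = label[len(LABEL_PREFIX):]
    return label, prediction['prob'][0]


# Made-up response for illustration; the 'prob' key is an assumption.
sample_predictions = [{'label': ['__label__1'], 'prob': [0.91]},
                      {'label': ['__label__-1'], 'prob': [0.64]}]
parsed = [parse_prediction(p) for p in sample_predictions]
print(parsed)
```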

■ Wrap-up

That concludes my notes on the week 4 lab of "Analyze Datasets and Train ML Models using AutoML": "Train a text classifier using Amazon SageMaker BlazingText built-in algorithm".

 

I hope you have a great day.

Likes and comments are appreciated :)

 

Thank you.

 
