
Week 3: Train a model with Amazon SageMaker Autopilot (lab)

by HYUNHP 2022. 7. 10.

Hello,

 

Today I'd like to summarize "Analyze Datasets and Train ML Models using AutoML", the first course of the Practical Data Science Specialization offered by DeepLearning.AI and Amazon Web Services.

 

In "Analyze Datasets and Train ML Models using AutoML", you learn about exploratory data analysis (EDA), automated machine learning (AutoML), and text classification algorithms. The course is organized as follows.

 

~ Explore the Use Case and Analyze the Dataset

~ Data Bias and Feature Importance

~ Use Automated Machine Learning to train a Text Classifier

~ Built-in algorithms

 

This post covers the lab for week 3 of "Analyze Datasets and Train ML Models using AutoML": "Train a model with Amazon SageMaker Autopilot".


Introduction

In this lab, you will use Amazon SageMaker Autopilot to train a BERT-based natural language processing (NLP) model. The model will analyze customer feedback and classify the messages into positive (1), neutral (0), and negative (-1) sentiment.


CHAPTER 1. 'Review transformed dataset'

 

CHAPTER 2. 'Configure the Autopilot job'

 

CHAPTER 3. 'Launch the Autopilot job'

 

CHAPTER 4. 'Track Autopilot job progress'

 

CHAPTER 5. 'Feature engineering'

 

CHAPTER 6. 'Model training and tuning'

 

CHAPTER 7. 'Review all output in S3 bucket'

 

CHAPTER 8. 'Deploy and test best candidate model'


CHAPTER 0. 'Import library'

 

# please ignore warning messages during the installation
!pip install --disable-pip-version-check -q sagemaker==2.35.0

import boto3
import sagemaker
import pandas as pd
import numpy as np
import botocore
import time
import json

config = botocore.config.Config(user_agent_extra='dlai-pds/c1/w3')

# low-level service client of the boto3 session
sm = boto3.client(service_name='sagemaker', 
                  config=config)

sm_runtime = boto3.client('sagemaker-runtime',
                          config=config)

sess = sagemaker.Session(sagemaker_client=sm,
                         sagemaker_runtime_client=sm_runtime)

bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
region = sess.boto_region_name

import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format='retina'

CHAPTER 1. 'Review transformed dataset'

 

Let's transform the dataset into a format that Autopilot recognizes: a comma-separated file with the label first, followed by the features, as shown here:

 

sentiment,review_body
-1,"this is bad"
0,"this is ok"
1,"this is great"
...

 

Sentiment is one of three classes: negative (-1), neutral (0), or positive (1). Autopilot requires that the target variable, sentiment, comes first and that the set of features, just review_body in this case, comes next.
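
As a minimal sketch of the layout described above, the snippet below reorders the columns of a small in-memory CSV so that the label comes first. The sample rows and the helper name `label_first` are illustrative assumptions and are not part of the lab.

```python
import csv
import io


def label_first(csv_text, label="sentiment"):
    """Return CSV text with the `label` column moved to the first position."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Put the label first and keep the other columns in their original order.
    fields = [label] + [name for name in reader.fieldnames if name != label]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return out.getvalue()


# Hypothetical sample data with the label in the last column.
sample = 'review_body,sentiment\n"this is bad",-1\n"this is ok, I guess",0\n'
reordered = label_first(sample)
print(reordered)
```

Fields that contain a comma stay quoted, so the review text remains a single column.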


!aws s3 cp 's3://dlai-practical-data-science/data/balanced/womens_clothing_ecommerce_reviews_balanced.csv' ./

path = './womens_clothing_ecommerce_reviews_balanced.csv'

df = pd.read_csv(path, delimiter=',')
df.head()

path_autopilot = './womens_clothing_ecommerce_reviews_balanced_for_autopilot.csv'

df[['sentiment', 'review_body']].to_csv(path_autopilot, 
                                        sep=',', 
                                        index=False)

 

 

CHAPTER 2. 'Configure the Autopilot job'

 

□ Upload data to S3 bucket

 

autopilot_train_s3_uri = sess.upload_data(bucket=bucket, key_prefix='autopilot/data', path=path_autopilot)

!aws s3 ls $autopilot_train_s3_uri

□ S3 output for generated assets

 

Set the S3 output path for the Autopilot outputs. This includes Jupyter notebooks (analysis), Python scripts (feature engineering), and trained models.

 

model_output_s3_uri = 's3://{}/autopilot'.format(bucket)

□ Configure the Autopilot job

 

Create the Autopilot job name.

 

import time

timestamp = int(time.time())

auto_ml_job_name = 'automl-dm-{}'.format(timestamp)
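
The job name has to satisfy the service's naming rules, so it can help to validate it before launching. The sketch below assumes a limit of 32 characters made of alphanumerics and hyphens; that constraint is my reading of the CreateAutoMLJob API and should be checked against the current documentation. The helper `make_job_name` is hypothetical.

```python
import re
import time

# Assumed constraint: at most 32 alphanumeric characters, optionally separated by hyphens.
JOB_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,31}$")


def make_job_name(prefix="automl-dm", now=None):
    """Build '<prefix>-<unix timestamp>' and fail early if the name looks invalid."""
    timestamp = int(time.time() if now is None else now)
    name = "{}-{}".format(prefix, timestamp)
    if not JOB_NAME_PATTERN.match(name):
        raise ValueError("invalid AutoML job name: {!r}".format(name))
    return name


# A fixed timestamp keeps the example reproducible.
example_name = make_job_name(now=1657411200)
print(example_name)
```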

 

When configuring the Autopilot job, you need to specify the maximum number of candidates to explore (max_candidates), the input and output S3 locations, and the target column to predict. In this case, you want to predict sentiment from the review text.


■ Exercise 1

 

Configure the Autopilot job.

 

Instructions: Create an instance of the sagemaker.automl.automl.AutoML estimator class passing the required configuration parameters. Target attribute for predictions here is sentiment.

 

automl = sagemaker.automl.automl.AutoML(
    target_attribute_name='...', # the name of the target attribute for predictions
    base_job_name=..., # Autopilot job name
    output_path=..., # output data path
    max_candidates=..., # maximum number of candidates
    sagemaker_session=sess,
    role=role,
    max_runtime_per_training_job_in_seconds=1200,
    total_job_runtime_in_seconds=7200
)

max_candidates = 3

automl = sagemaker.automl.automl.AutoML(
    ### BEGIN SOLUTION - DO NOT delete this comment for grading purposes
    target_attribute_name='sentiment', # Replace None
    base_job_name=auto_ml_job_name, # Replace None
    output_path=model_output_s3_uri, # Replace None
    ### END SOLUTION - DO NOT delete this comment for grading purposes
    max_candidates=max_candidates,
    sagemaker_session=sess,
    role=role,
    max_runtime_per_training_job_in_seconds=1200,
    total_job_runtime_in_seconds=7200
)
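
A quick sanity check on the time limits configured above: the sketch below computes the worst-case training time under the assumption that the candidates train one after another. Autopilot may run them in parallel, and data analysis and feature engineering also consume part of the total budget, so this is only a rough upper bound. The helper name is hypothetical.

```python
def worst_case_training_seconds(max_candidates, max_runtime_per_training_job_in_seconds):
    """Upper bound on training time if every candidate runs sequentially at its limit."""
    return max_candidates * max_runtime_per_training_job_in_seconds


total_job_runtime_in_seconds = 7200
budget = worst_case_training_seconds(3, 1200)
fits = budget <= total_job_runtime_in_seconds
print(budget, fits)
```

With the values used in this lab, the three candidates need at most 3600 seconds, which leaves the other half of the 7200-second budget for the remaining steps.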

 


 

CHAPTER 3. 'Launch the Autopilot job'

 

■ Exercise 2

 

Launch the Autopilot job.

 

Instructions: Call the fit function of the configured estimator, passing the S3 bucket input data path and the Autopilot job name.

 

automl.fit(
    ..., # input data path
    job_name=auto_ml_job_name, # Autopilot job name
    wait=False, 
    logs=False
)

automl.fit(
    ### BEGIN SOLUTION - DO NOT delete this comment for grading purposes
    autopilot_train_s3_uri, # Replace None
    ### END SOLUTION - DO NOT delete this comment for grading purposes
    job_name=auto_ml_job_name, 
    wait=False, 
    logs=False
)

CHAPTER 4. 'Track Autopilot job progress'


Once the Autopilot job has been launched, you can track the job progress directly from the notebook using the SDK capabilities.

 

□ Autopilot job description

 

Function describe_auto_ml_job of the Amazon SageMaker service returns the information about the AutoML job in dictionary format. You can review the response syntax and response elements in the documentation.

 

job_description_response = automl.describe_auto_ml_job(job_name=auto_ml_job_name)

□ Autopilot job status

 

To track the job progress you can use two response elements: AutoMLJobStatus and AutoMLJobSecondaryStatus, which correspond to the primary (Completed | InProgress | Failed | Stopped | Stopping) and secondary (AnalyzingData | FeatureEngineering | ModelTuning etc.) job states respectively. To see if the AutoML job has started, you can check the existence of the AutoMLJobStatus and AutoMLJobSecondaryStatus elements in the job description response.

 

In this notebook, you will use the following scheme to track the job progress:

 

while 'AutoMLJobStatus' not in job_description_response.keys() or 'AutoMLJobSecondaryStatus' not in job_description_response.keys():
    job_description_response = automl.describe_auto_ml_job(job_name=auto_ml_job_name)
    print('[INFO] Autopilot job has not yet started. Please wait. ')
    # function `json.dumps` encodes JSON string for printing.
    print(json.dumps(job_description_response, indent=4, sort_keys=True, default=str))
    print('[INFO] Waiting for Autopilot job to start...')
    time.sleep(15)

print('[OK] AutoML job started.')
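
The same status-check scheme is reused throughout this lab, so it can be factored into one helper. The sketch below is my own illustration, not part of the lab: `wait_for` and its parameters are hypothetical, and the fake responses stand in for `automl.describe_auto_ml_job` so that the example runs without AWS access.

```python
import time


def wait_for(describe, is_done, poll_seconds=15, max_polls=240, sleep=time.sleep):
    """Poll `describe()` until `is_done(response)` is true, then return the response."""
    for _ in range(max_polls):
        response = describe()
        if is_done(response):
            return response
        sleep(poll_seconds)
    raise TimeoutError("condition not met after {} polls".format(max_polls))


# Hypothetical stand-in for the describe call: the status keys appear on the third poll.
fake_responses = iter([
    {},
    {},
    {"AutoMLJobStatus": "InProgress", "AutoMLJobSecondaryStatus": "Starting"},
])

started = wait_for(
    describe=lambda: next(fake_responses),
    is_done=lambda r: "AutoMLJobStatus" in r and "AutoMLJobSecondaryStatus" in r,
    sleep=lambda seconds: None,  # skip the real waiting in this demo
)
print(started["AutoMLJobSecondaryStatus"])
```

Capping the number of polls means a job that never starts raises an error instead of blocking the notebook indefinitely.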

□ Review the SageMaker processing jobs

 

Autopilot creates the required SageMaker processing jobs during the run:

  • The first processing job (data splitter) checks the data sanity, performs stratified shuffling, and splits the data into training and validation sets.
  • The second processing job (candidate generator) first streams through the data to compute statistics for the dataset. It then uses these statistics to identify the problem type and the possible type of each predictor column: numeric, categorical, natural language, etc.

from IPython.display import display, HTML

display(HTML('<b>Review <a target="_blank" href="https://console.aws.amazon.com/sagemaker/home?region={}#/processing-jobs/">processing jobs</a></b>'.format(region)))
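
To make the data splitter's "stratified shuffling" more concrete, here is a small pure-Python sketch of a stratified train/validation split. It is not Autopilot's implementation: the function name, the 20% validation fraction, and the toy labels are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, validation_fraction=0.2, seed=0):
    """Return (train_indices, validation_indices) that keep per-class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train, validation = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within each class
        cut = int(round(len(indices) * validation_fraction))
        validation.extend(indices[:cut])
        train.extend(indices[cut:])
    return sorted(train), sorted(validation)


# Toy balanced dataset: ten examples for each of the three sentiment classes.
labels = [-1] * 10 + [0] * 10 + [1] * 10
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx))
```

Because the split is done per class, the validation set keeps the same class balance as the full dataset.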

□ Wait for the data analysis step to finish

 

Here you will use the same scheme as above to check the completion of the data analysis step. This step can be identified with the (primary) job status value InProgress and secondary job status values Starting and then AnalyzingData.

 

This cell will take approximately 10 minutes to run.

 

%%time

job_status = job_description_response['AutoMLJobStatus']
job_sec_status = job_description_response['AutoMLJobSecondaryStatus']

if job_status not in ('Stopped', 'Failed'):
    while job_status == 'InProgress' and job_sec_status in ('Starting', 'AnalyzingData'):
        job_description_response = automl.describe_auto_ml_job(job_name=auto_ml_job_name)
        job_status = job_description_response['AutoMLJobStatus']
        job_sec_status = job_description_response['AutoMLJobSecondaryStatus']
        print(job_status, job_sec_status)
        time.sleep(15)
    print('[OK] Data analysis phase completed.\n')
    
print(json.dumps(job_description_response, indent=4, sort_keys=True, default=str))
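
One Python detail is worth knowing when testing a status against a set of allowed values: parentheses around a single string do not create a tuple, so `in` silently becomes a substring search. The sketch below shows the difference; the helper `in_phase` is a hypothetical name.

```python
status = "Progress"  # not a real job status, used only to expose the pitfall

# ("InProgress") is just the string "InProgress", so `in` does a substring search.
substring_match = status in ("InProgress")
# A trailing comma makes a one-element tuple, so `in` compares whole values.
tuple_match = status in ("InProgress",)
print(substring_match, tuple_match)


def in_phase(response, secondary_statuses):
    """True while the job is InProgress and in one of the given secondary statuses.

    `secondary_statuses` must be a tuple or list of strings, not a bare string.
    """
    return (
        response.get("AutoMLJobStatus") == "InProgress"
        and response.get("AutoMLJobSecondaryStatus") in secondary_statuses
    )


print(in_phase(
    {"AutoMLJobStatus": "InProgress", "AutoMLJobSecondaryStatus": "AnalyzingData"},
    ("Starting", "AnalyzingData"),
))
```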

■ Exercise 3

 

Check if the Autopilot job artifacts have been generated.

 

Instructions: Use status check scheme described above. The generation of artifacts can be identified by existence of AutoMLJobArtifacts element in the keys of the job description response.

 

### BEGIN SOLUTION - DO NOT delete this comment for grading purposes
# get the information about the running Autopilot job
job_description_response = automl.describe_auto_ml_job(job_name = auto_ml_job_name) # Replace None

# keep in the while loop until the Autopilot job artifacts will be generated
while 'AutoMLJobArtifacts' not in job_description_response.keys(): # Replace all None
    # update the information about the running Autopilot job
    job_description_response = automl.describe_auto_ml_job(job_name = auto_ml_job_name) # Replace None
    ### END SOLUTION - DO NOT delete this comment for grading purposes
    print('[INFO] Autopilot job has not yet generated the artifacts. Please wait. ')
    print(json.dumps(job_description_response, indent=4, sort_keys=True, default=str))
    print('[INFO] Waiting for AutoMLJobArtifacts...')
    time.sleep(15)

print('[OK] AutoMLJobArtifacts generated.')

■ Exercise 4

 

Check if the notebooks have been created.

 

Instructions: Use status check scheme described above. Notebooks creation can be identified by existence of DataExplorationNotebookLocation element in the keys of the job_description_response['AutoMLJobArtifacts'] dictionary.

 

### BEGIN SOLUTION - DO NOT delete this comment for grading purposes
# get the information about the running Autopilot job
job_description_response = automl.describe_auto_ml_job(job_name = auto_ml_job_name) # Replace None

# keep in the while loop until the notebooks will be created
while 'DataExplorationNotebookLocation' not in job_description_response['AutoMLJobArtifacts'].keys(): # Replace all None
    # update the information about the running Autopilot job
    job_description_response = automl.describe_auto_ml_job(job_name = auto_ml_job_name) # Replace None
    ### END SOLUTION - DO NOT delete this comment for grading purposes
    print('[INFO] Autopilot job has not yet generated the notebooks. Please wait. ')
    print(json.dumps(job_description_response, indent=4, sort_keys=True, default=str))
    print('[INFO] Waiting for DataExplorationNotebookLocation...')
    time.sleep(15)

print('[OK] DataExplorationNotebookLocation found.')
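
The nested key check used in Exercises 3 and 4 can also be written with chained `dict.get` calls, which avoids a KeyError while the artifacts are still missing. In the sketch below the response, bucket, and key are made-up examples, and both helper names are hypothetical.

```python
from urllib.parse import urlparse


def notebook_location(response):
    """Return the data exploration notebook S3 URI, or None if it is not available yet."""
    return response.get("AutoMLJobArtifacts", {}).get("DataExplorationNotebookLocation")


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError("not an S3 URI: {!r}".format(uri))
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical job description; the bucket and object key are invented for illustration.
sample_response = {
    "AutoMLJobArtifacts": {
        "DataExplorationNotebookLocation": "s3://example-bucket/autopilot/job/notebooks/DataExploration.ipynb"
    }
}

print(notebook_location({}))  # artifacts not generated yet
bucket_name, object_key = split_s3_uri(notebook_location(sample_response))
print(bucket_name, object_key)
```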

 

Review the generated resources in S3 directly. In the S3 console, you can find the notebooks in the notebooks folder and download them by clicking Actions/Object actions -> Download as/Download.


CHAPTER 5. 'Feature engineering'

 

■ Exercise 5

 

Check the completion of the feature engineering step.

 

Instructions: Use status check scheme described above. Feature engineering step can be identified with the (primary) job status value InProgress and secondary job status value FeatureEngineering.

 

This cell will take approximately 10 minutes to run.

 

%%time

job_description_response = automl.describe_auto_ml_job(job_name=auto_ml_job_name)
job_status = job_description_response['AutoMLJobStatus']
job_sec_status = job_description_response['AutoMLJobSecondaryStatus']
print(job_status)
print(job_sec_status)
if job_status not in ('Stopped', 'Failed'):
    ### BEGIN SOLUTION - DO NOT delete this comment for grading purposes
    while job_status == 'InProgress' and job_sec_status == 'FeatureEngineering': # Replace all None
    ### END SOLUTION - DO NOT delete this comment for grading purposes
        job_description_response = automl.describe_auto_ml_job(job_name=auto_ml_job_name)
        job_status = job_description_response['AutoMLJobStatus']
        job_sec_status = job_description_response['AutoMLJobSecondaryStatus']
        print(job_status, job_sec_status)
        time.sleep(5)
    print('[OK] Feature engineering phase completed.\n')
    
print(json.dumps(job_description_response, indent=4, sort_keys=True, default=str))

CHAPTER 6. 'Model training and tuning'

 

When you launched the Autopilot job, you requested that 3 model candidates are generated and compared. Therefore, you should see three (3) SageMaker training jobs below.


■ Exercise 6

 

Check the completion of the model tuning step.

Instructions: Use status check scheme described above. Model tuning step can be identified with the (primary) job status value InProgress and secondary job status value ModelTuning.

This cell will take approximately 5-10 minutes to run.

 

%%time

job_description_response = automl.describe_auto_ml_job(job_name=auto_ml_job_name)
job_status = job_description_response['AutoMLJobStatus']
job_sec_status = job_description_response['AutoMLJobSecondaryStatus']
print(job_status)
print(job_sec_status)
if job_status not in ('Stopped', 'Failed'):
    ### BEGIN SOLUTION - DO NOT delete this comment for grading purposes
    while job_status == 'InProgress' and job_sec_status == 'ModelTuning': # Replace all None
    ### END SOLUTION - DO NOT delete this comment for grading purposes
        job_description_response = automl.describe_auto_ml_job(job_name=auto_ml_job_name)
        job_status = job_description_response['AutoMLJobStatus']
        job_sec_status = job_description_response['AutoMLJobSecondaryStatus']
        print(job_status, job_sec_status)
        time.sleep(5)
    print('[OK] Model tuning phase completed.\n')
    
print(json.dumps(job_description_response, indent=4, sort_keys=True, default=str))

Finally, you can check the completion of the Autopilot job looking for the Completed job status.

 

%%time

from pprint import pprint

job_description_response = automl.describe_auto_ml_job(job_name=auto_ml_job_name)
pprint(job_description_response)
job_status = job_description_response['AutoMLJobStatus']
job_sec_status = job_description_response['AutoMLJobSecondaryStatus']
print('Job status:  {}'.format(job_status))
print('Secondary job status:  {}'.format(job_sec_status))
if job_status not in ('Stopped', 'Failed'):
    while job_status != 'Completed':
        job_description_response = automl.describe_auto_ml_job(job_name=auto_ml_job_name)
        job_status = job_description_response['AutoMLJobStatus']
        job_sec_status = job_description_response['AutoMLJobSecondaryStatus']
        print('Job status:  {}'.format(job_status))
        print('Secondary job status:  {}'.format(job_sec_status))        
        time.sleep(10)
    print('[OK] Autopilot job completed.\n')
else:
    print('Job status: {}'.format(job_status))
    print('Secondary job status: {}'.format(job_sec_status))

 

Before moving to the next section make sure the status above indicates Autopilot job completed.


□ Compare model candidates

Once model tuning is complete, you can view all the candidates (pipeline evaluations with different hyperparameter combinations) that were explored by AutoML and sort them by their final performance metric.

 

■ Exercise 7

 

List candidates generated by Autopilot sorted by accuracy from highest to lowest.

Instructions: Use list_candidates function passing the Autopilot job name auto_ml_job_name with the accuracy field FinalObjectiveMetricValue. It returns the list of candidates with the information about them.

candidates = automl.list_candidates(
    job_name=..., # Autopilot job name
    sort_by='...' # accuracy field name
)
candidates = automl.list_candidates(
    ### BEGIN SOLUTION - DO NOT delete this comment for grading purposes
    job_name=auto_ml_job_name, # Replace None
    sort_by='FinalObjectiveMetricValue' # Replace None
    ### END SOLUTION - DO NOT delete this comment for grading purposes
)
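
To show what sorting by the final objective metric means, the sketch below ranks a few hand-written candidates locally. The candidate names and metric values are invented; only the `CandidateName` and `FinalAutoMLObjectiveMetric` keys follow the shape of the entries returned by `list_candidates`. The helper `objective_value` is hypothetical.

```python
def objective_value(candidate):
    """Return the final objective metric value, or -inf when it is missing."""
    metric = candidate.get("FinalAutoMLObjectiveMetric") or {}
    return metric.get("Value", float("-inf"))


# Hypothetical candidates; real entries carry many more fields.
sample_candidates = [
    {"CandidateName": "candidate-a",
     "FinalAutoMLObjectiveMetric": {"MetricName": "validation:accuracy", "Value": 0.54}},
    {"CandidateName": "candidate-b",
     "FinalAutoMLObjectiveMetric": {"MetricName": "validation:accuracy", "Value": 0.61}},
    {"CandidateName": "candidate-c"},  # a candidate without a final metric
]

# Highest metric first; candidates without a metric sink to the bottom.
ranked = sorted(sample_candidates, key=objective_value, reverse=True)
for rank, candidate in enumerate(ranked, start=1):
    print(rank, candidate["CandidateName"], objective_value(candidate))
```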

□ Review best candidate

Now that you have successfully completed the Autopilot job on the dataset and visualized the trials, you can get the information about the best candidate model and review it.


■ Exercise 8


Get the information about the generated best candidate job.

Instructions: Use best_candidate function passing the Autopilot job name. This function will give an error if candidates have not been generated.

 

candidates = automl.list_candidates(job_name=auto_ml_job_name)

if candidates != []:
    best_candidate = automl.best_candidate(
        ### BEGIN SOLUTION - DO NOT delete this comment for grading purposes
        job_name=auto_ml_job_name # Replace None
        ### END SOLUTION - DO NOT delete this comment for grading purposes
    )
    print(json.dumps(best_candidate, indent=4, sort_keys=True, default=str))

CHAPTER 7. 'Review all output in S3 bucket'

 

You will see the artifacts generated by Autopilot including the following:

data-processor-models/        # "models" learned to transform raw data into features 
documentation/                # explainability and other documentation about your model
preprocessed-data/            # data for train and validation
sagemaker-automl-candidates/  # candidate models which autopilot compares
transformed-data/             # candidate-specific data for train and validation
tuning/                       # candidate-specific tuning results
validations/                  # validation results
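
One way to get an overview of these folders is to group the object keys under the job's output prefix by their first path segment. The sketch below works on a hand-written list of keys; in practice the keys would come from listing the bucket. The keys, the prefix, and the helper name are all assumptions.

```python
from collections import Counter


def count_by_folder(keys, job_prefix):
    """Count objects per top-level folder under `job_prefix`."""
    prefix = job_prefix.rstrip("/") + "/"
    counts = Counter()
    for key in keys:
        if not key.startswith(prefix):
            continue  # ignore objects that belong to other jobs
        relative = key[len(prefix):]
        if "/" in relative:
            counts[relative.split("/", 1)[0] + "/"] += 1
    return counts


# Hypothetical object keys, invented for illustration.
sample_keys = [
    "autopilot/automl-dm-1/preprocessed-data/header/headers.csv",
    "autopilot/automl-dm-1/preprocessed-data/tuning_data/train/chunk_0.csv",
    "autopilot/automl-dm-1/tuning/best/model.tar.gz",
    "autopilot/automl-dm-10/tuning/best/model.tar.gz",
]
folder_counts = count_by_folder(sample_keys, "autopilot/automl-dm-1")
print(dict(folder_counts))
```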

CHAPTER 8. 'Deploy and test best candidate model'

 

□ Deploy best candidate model

 

While batch transformations are supported, in this example you will deploy the model as a REST endpoint.

First, you need to customize the inference response. The inference containers generated by SageMaker Autopilot allow you to select the response content for predictions. By default, the inference containers are configured to return only the predicted_label, but you can add probability to the list of inference response keys.

 

inference_response_keys = ['predicted_label', 'probability']
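
With two response keys, each prediction carries two values. The sketch below parses one CSV response line into a dictionary, assuming the endpoint returns the values on a single line in the same order as the keys; that ordering is an assumption to verify against a real response. The helper `parse_prediction` and the sample line are hypothetical.

```python
def parse_prediction(csv_line, response_keys):
    """Map one CSV response line onto the configured response keys."""
    values = csv_line.strip().split(",")
    if len(values) != len(response_keys):
        raise ValueError("expected {} values, got {}".format(len(response_keys), len(values)))
    result = dict(zip(response_keys, values))
    if "probability" in result:
        result["probability"] = float(result["probability"])
    return result


# Made-up response line, not an actual model output.
parsed = parse_prediction("1,0.9876\n", ["predicted_label", "probability"])
print(parsed)
```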

 

Now you will create a SageMaker endpoint from the best candidate generated by Autopilot. Wait for SageMaker to deploy the endpoint.

This cell will take approximately 5-10 minutes to run.

 

autopilot_model = automl.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.large',
    candidate=best_candidate,
    inference_response_keys=inference_response_keys,
    predictor_cls=sagemaker.predictor.Predictor,
    serializer=sagemaker.serializers.JSONSerializer(),
    deserializer=sagemaker.deserializers.JSONDeserializer()
)

print('\nEndpoint name:  {}'.format(autopilot_model.endpoint_name))

□ Test the model


Invoke a few predictions for the actual reviews using the deployed endpoint.

 

#sm_runtime = boto3.client('sagemaker-runtime')

review_list = ['This product is great!',
               'OK, but not great.',
               'This is not the right product.']

for review in review_list:
    
    # remove commas from the review since we're passing the inputs as a CSV
    review = review.replace(",", "")

    response = sm_runtime.invoke_endpoint(
        EndpointName=autopilot_model.endpoint_name, # endpoint name
        ContentType='text/csv', # type of input data
        Accept='text/csv', # type of the inference in the response
        Body=review # review text
        )

    response_body=response['Body'].read().decode('utf-8').strip().split(',')

    print('Review: ', review, ' Predicted class: {}'.format(response_body[0]))

print("(-1 = Negative, 0=Neutral, 1=Positive)")
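
Removing commas changes the review text. An alternative is to quote the field with the standard `csv` module, sketched below. This assumes the endpoint parses standard quoted CSV, which I have not verified against the lab's endpoint. The helper names and the label mapping dictionary are hypothetical.

```python
import csv
import io

# Readable names for the three classes used in this lab.
LABEL_NAMES = {"-1": "Negative", "0": "Neutral", "1": "Positive"}


def to_csv_row(text):
    """Encode one review as a single CSV field, quoting it when it contains commas."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="").writerow([text])
    return buffer.getvalue()


def label_name(predicted_label):
    """Translate a predicted label string into a readable class name."""
    return LABEL_NAMES.get(predicted_label.strip(), "Unknown")


payload = to_csv_row("OK, but not great.")
print(payload)
print(label_name("0"))
```

The quoted payload could then be passed as `Body` in place of the comma-stripped review.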

You used Amazon SageMaker Autopilot to automatically find the best model, hyper-parameters, and feature-engineering scripts for our dataset. Autopilot uses a uniquely-transparent approach to AutoML by generating re-usable Python scripts and notebooks.


■ Wrap-up

 

That wraps up the lab for week 3 of "Analyze Datasets and Train ML Models using AutoML": "Train a model with Amazon SageMaker Autopilot".

 

I hope you have a great day.

Likes and comments are appreciated :)

 

Thank you.
