
Create and fine-tune sentence transformers for enhanced classification accuracy


Sentence transformers are powerful deep learning models that convert sentences into high-quality, fixed-length embeddings that capture their semantic meaning. These embeddings are useful for various natural language processing (NLP) tasks such as text classification, clustering, semantic search, and information retrieval.
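
To make the idea of embeddings capturing semantic meaning concrete, the following is a minimal sketch of how two embeddings are typically compared using cosine similarity. The three vectors and their names are hypothetical 4-dimensional stand-ins, not output from any model discussed in this post.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings; real sentence transformers emit hundreds of dimensions.
toy_car = [0.9, 0.1, 0.0, 0.2]
toy_truck = [0.8, 0.2, 0.1, 0.3]
yoga_mat = [0.1, 0.9, 0.7, 0.0]

sim_same = cosine_similarity(toy_car, toy_truck)  # two items from one category
sim_diff = cosine_similarity(toy_car, yoga_mat)   # items from different categories
print(f"same category: {sim_same:.3f}, different category: {sim_diff:.3f}")
```

In this toy setup, the two vectors chosen to represent the same category score higher than the pair from different categories, which is the property a downstream classifier relies on.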

In this post, we showcase how to fine-tune a sentence transformer specifically for classifying an Amazon product into its product category (such as toys or sporting goods). We showcase two different sentence transformers, paraphrase-MiniLM-L6-v2 and a proprietary Amazon large language model (LLM) called M5_ASIN_SMALL_V2.0, and compare their results. M5 LLMs are BERT-based LLMs fine-tuned on internal Amazon product catalog data using product title, bullet points, description, and more. They are currently being used for use cases such as automated product classification and similar product recommendations. Our hypothesis is that M5_ASIN_SMALL_V2.0 will perform better for the use case of Amazon product category classification because it was fine-tuned with Amazon product data. We test this hypothesis in the experiment illustrated in this post.

Solution overview

In this post, we demonstrate how to fine-tune a sentence transformer with Amazon product data and how to use the resulting sentence transformer to improve classification accuracy of product categories using an XGBoost decision tree. For this demonstration, we use a public Amazon product dataset called Amazon Product Dataset 2020 from a Kaggle competition. This dataset contains the following attributes and fields:

  • Domain name – amazon.com
  • Date range – January 1, 2020, through January 31, 2020
  • File extension – CSV
  • Available fields – Uniq Id, Product Name, Brand Name, Asin, Category, Upc Ean Code, List Price, Selling Price, Quantity, Model Number, About Product, Product Specification, Technical Details, Shipping Weight, Product Dimensions, Image, Variants, SKU, Product Url, Stock, Product Details, Dimensions, Color, Ingredients, Direction To Use, Is Amazon Seller, Size Quantity Variant, and Product Description
  • Label field – Category

Prerequisites

Before you begin, install the following packages. You can do this in either an Amazon SageMaker notebook or your local Jupyter notebook by running the following commands:

!pip install sentencepiece --quiet
!pip install sentence_transformers --quiet
!pip install xgboost --quiet
!pip install scikit-learn --quiet

Preprocess the data

The first step needed for fine-tuning a sentence transformer is to preprocess the Amazon product data so that the sentence transformer can consume the data and fine-tune effectively. This involves normalizing the text data, defining the product’s main category by extracting the first category from the Category field, and selecting the most important fields from the dataset that contribute to classifying the product’s main category accurately. We use the following code for preprocessing:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = pd.read_csv('marketing_sample_for_amazon_com-ecommerce__20200101_20200131__10k_data.csv')
# Normalize column names, for example "Product Name" becomes "product_name"
data.columns = data.columns.str.lower().str.replace(' ', '_')
# The main category is the first segment of the pipe-delimited Category field
data['main_category'] = data['category'].str.split("|").str[0]
# Concatenate the most informative text fields, treating missing values as empty strings
data["all_text"] = data.apply(
    lambda r: " ".join(
        [
            str(r["product_name"]) if pd.notnull(r["product_name"]) else "",
            str(r["about_product"]) if pd.notnull(r["about_product"]) else "",
            str(r["product_specification"]) if pd.notnull(r["product_specification"]) else "",
            str(r["technical_details"]) if pd.notnull(r["technical_details"]) else ""
        ]
    ),
    axis=1
)
# Encode the main category as an integer label
label_encoder = LabelEncoder()
labels_transform = label_encoder.fit_transform(data['main_category'])
data['label'] = labels_transform
data[['all_text', 'label']]

The following screenshot shows an example of what our dataset looks like after it has been preprocessed.
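
As an illustration of what the preprocessing step produces, the following library-free sketch applies the same three ideas (first-category extraction, text concatenation with missing values treated as empty strings, and integer label encoding in sorted order) to a few toy records. The records, the helper names main_category and encode_labels, and the .strip() call are assumptions added for illustration; they are not part of the dataset or the code in this post.

```python
def main_category(category):
    """Return the first segment of a pipe-delimited category path."""
    # .strip() is an illustrative addition that removes padding around the delimiter.
    return category.split("|")[0].strip()

def encode_labels(values):
    """Map each distinct value to an integer, assigning IDs in sorted order."""
    classes = sorted(set(values))
    index = {c: i for i, c in enumerate(classes)}
    return [index[v] for v in values], classes

# Hypothetical records shaped like rows of the product dataset.
records = [
    {"product_name": "Wooden Puzzle", "about_product": None,
     "category": "Toys & Games | Puzzles | Jigsaw Puzzles"},
    {"product_name": "Yoga Mat", "about_product": "Non-slip surface",
     "category": "Sports & Outdoors | Exercise & Fitness"},
    {"product_name": "Toy Car", "about_product": "Die-cast",
     "category": "Toys & Games | Vehicles"},
]

text_fields = ["product_name", "about_product"]
all_text = [
    " ".join(str(r[f]) if r[f] is not None else "" for f in text_fields)
    for r in records
]
labels, classes = encode_labels([main_category(r["category"]) for r in records])
print(classes)  # distinct main categories in sorted order
print(labels)   # one integer label per record
```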

Fine-tune the sentence transformer paraphrase-MiniLM-L6-v2

The first sentence transformer we fine-tune is called paraphrase-MiniLM-L6-v2. It uses the popular BERT model as its underlying architecture to transform product description text into a 384-dimensional dense vector embedding that will be consumed by our XGBoost classifier for product category classification. We use the following code to fine-tune paraphrase-MiniLM-L6-v2 using the preprocessed Amazon product data:

from sentence_transformers import SentenceTransformer
model_name = "paraphrase-MiniLM-L6-v2"
model = SentenceTransformer(model_name)

The first step is to define a classification head that represents the 24 product categories that an Amazon product can be classified into. This classification head is used to train the sentence transformer to be more effective at transforming product descriptions according to the 24 product categories. The idea is that product descriptions within the same category should be transformed into vector embeddings that are closer to each other than to the embeddings of product descriptions from different categories.

The following code is the first part of fine-tuning the sentence transformer:

import torch.nn as nn

# Define the classification head: one linear layer over the sentence embedding
class ClassificationHead(nn.Module):
    def __init__(self, embedding_dim, num_classes):
        super(ClassificationHead, self).__init__()
        self.linear = nn.Linear(embedding_dim, num_classes)

    def forward(self, features):
        x = features['sentence_embedding']
        x = self.linear(x)
        return x

# Define the number of classes for the classification task.
num_classes = 24
print('class number:', num_classes)
classification_head = ClassificationHead(model.get_sentence_embedding_dimension(), num_classes)

# Combine the SentenceTransformer model and the classification head.
class SentenceTransformerWithHead(nn.Module):
    def __init__(self, transformer, head):
        super(SentenceTransformerWithHead, self).__init__()
        self.transformer = transformer
        self.head = head

    def forward(self, input):
        features = self.transformer(input)
        logits = self.head(features)
        return logits

model_with_head = SentenceTransformerWithHead(model, classification_head)

We then set the fine-tuning parameters. For this post, we train for five epochs, optimize for cross-entropy loss, and use the AdamW optimization method. We chose five epochs because, after testing various epoch values, we observed that the loss was minimized at epoch 5. This made it the optimal number of training iterations for achieving the best classification results.
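
To show what the cross-entropy objective measures, the following sketch computes it by hand for a single example. It is a plain-Python illustration of the standard formula (softmax followed by the negative log-probability of the true class), not the PyTorch implementation; the logit values are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log-probability assigned to the true class, for one example."""
    return -math.log(softmax(logits)[target])

# Hypothetical logits for a 3-class problem; class 0 has the highest score.
logits = [2.0, 0.5, -1.0]
loss_correct = cross_entropy(logits, 0)  # true class is the top-scoring one
loss_wrong = cross_entropy(logits, 2)    # true class is the lowest-scoring one

# A head that scores all 24 categories equally has a loss of ln(24), about 3.18.
uniform_loss = cross_entropy([0.0] * 24, 3)
print(loss_correct, loss_wrong, uniform_loss)
```

The loss is small when the head assigns high probability to the true category and large otherwise, so a loss well below ln(24) during training indicates the head is doing better than guessing uniformly.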

The following code is the second part of fine-tuning the sentence transformer:

import os
os.environ["TORCH_USE_CUDA_DSA"] = "1"
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

from sentence_transformers import SentenceTransformer, InputExample, LoggingHandler
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
# transformers.AdamW is deprecated (and absent from newer releases);
# torch.optim.AdamW is the usual replacement.
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

train_sentences = data['all_text']
train_labels = data['label']
# Training parameters
num_epochs = 5
batch_size = 2
learning_rate = 2e-5

# Wrap each (text, label) pair in an InputExample.
train_examples = [InputExample(texts=[s], label=l) for s, l in zip(train_sentences, train_labels)]

# Customize collate_fn to convert InputExample objects into a list of texts and a label tensor.
def collate_fn(batch):
    texts = [example.texts[0] for example in batch]
    labels = torch.tensor([example.label for example in batch])
    return texts, labels

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=batch_size, collate_fn=collate_fn)

# Define the loss function, optimizer, and learning rate scheduler.
criterion = nn.CrossEntropyLoss()
optimizer = AdamW(model_with_head.parameters(), lr=learning_rate)
total_steps = len(train_dataloader) * num_epochs
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=total_steps)

# Move model_with_head to the same device as the sentence transformer.
model_with_head = model_with_head.to(model.device)

# Training loop
loss_list = []
for epoch in range(num_epochs):
    model_with_head.train()
    for step, (texts, labels) in enumerate(train_dataloader):
        labels = labels.to(model.device)
        optimizer.zero_grad()

        # Tokenize the text and pass it through the transformer and classification head.
        inputs = model.tokenize(texts)
        input_ids = inputs['input_ids'].to(model.device)
        input_attention_mask = inputs['attention_mask'].to(model.device)
        inputs_final = {'input_ids': input_ids, 'attention_mask': input_attention_mask}

        logits = model_with_head(inputs_final)

        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()
        scheduler.step()
        if step % 100 == 0:
            print(f"Epoch {epoch}, Step {step}, Loss: {loss.item()}")

    # Note: this records the loss of the last batch in the epoch, not an epoch average.
    print(f'Epoch {epoch+1}/{num_epochs}, Loss: {loss.item()}')
    model_save_path = f'./intermediate-output/epoch-{epoch}'
    model.save(model_save_path)
    loss_list.append(loss.item())
# Save the final model
model_final_save_path = "st_ft_epoch_5"
model.save(model_final_save_path)
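
The learning rate scheduler in the preceding code decays the learning rate linearly over training. The following sketch reproduces that shape in plain Python as we understand it: a linear ramp up during warmup (unused here, because num_warmup_steps=0) followed by a linear decay to zero. The function name linear_schedule_factor is hypothetical, and the step count assumes roughly 10,000 rows, a batch size of 2, and 5 epochs.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 to 1 during warmup.
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

learning_rate = 2e-5
steps_per_epoch = 10_000 // 2      # assumed dataset size divided by batch size
total_steps = steps_per_epoch * 5  # five epochs

for step in (0, total_steps // 2, total_steps):
    factor = linear_schedule_factor(step, 0, total_steps)
    print(step, learning_rate * factor)
```

Under these assumptions, the learning rate starts at 2e-5, is halved by the midpoint of training, and reaches zero at the final step.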

To observe whether our resulting fine-tuned sentence transformer improves our product category classification accuracy, we use it as our text embedder in the XGBoost classifier in the next step.

XGBoost classification

XGBoost (Extreme Gradient Boosting) classification is a machine learning technique used for classification tasks. It’s an implementation of the gradient boosting framework designed to be efficient, flexible, and portable. For this post, we have XGBoost consume the product description text embedding output of our sentence transformers and observe product category classification accuracy. We use the following code to classify Amazon products into their respective categories with the standard paraphrase-MiniLM-L6-v2 sentence transformer, before it was fine-tuned:

from sklearn.model_selection import train_test_split
import xgboost as xgb
from sklearn.metrics import accuracy_score

# Embed every product text with the stock (not fine-tuned) sentence transformer
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
data['text_embedding'] = data['all_text'].apply(lambda x: model.encode(str(x)))
text_embeddings = pd.DataFrame(data['text_embedding'].tolist(), index=data.index, dtype=float)

# Convert numeric columns stored as strings to floats
numeric_columns = ['selling_price', 'shipping_weight', 'product_dimensions']  # Add more columns as needed
for col in numeric_columns:
    data[col] = pd.to_numeric(data[col], errors="coerce")

# Convert categorical columns to category type
categorical_columns = ['model_number', 'is_amazon_seller']  # Add more columns as needed
for col in categorical_columns:
    data[col] = data[col].astype('category')

# Combine the tabular features with the embedding dimensions
X_0 = data[['selling_price', 'model_number', 'is_amazon_seller']]
X = pd.concat([X_0, text_embeddings], axis=1)
label_encoder = LabelEncoder()
data['main_category_encoded'] = label_encoder.fit_transform(data['main_category'])
y = data['main_category_encoded']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Re-encode the labels to ensure they are consecutive integers starting from 0
unique_labels = sorted(set(y_train) | set(y_test))
label_mapping = {label: idx for idx, label in enumerate(unique_labels)}

y_train = y_train.map(label_mapping)
y_test = y_test.map(label_mapping)

# Enable categorical support for XGBoost
dtrain = xgb.DMatrix(X_train, label=y_train, enable_categorical=True)
dtest = xgb.DMatrix(X_test, label=y_test, enable_categorical=True)

param = {
    'max_depth': 6,
    'eta': 0.3,
    'objective': 'multi:softmax',
    'num_class': len(label_mapping),
    'eval_metric': 'mlogloss'
}

num_round = 100
bst = xgb.train(param, dtrain, num_round)

# Evaluate the model
y_pred = bst.predict(dtest)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

Accuracy: 0.78
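
The label re-encoding and accuracy computation in the preceding code can be sketched without any libraries. XGBoost’s multi:softmax objective expects labels in the range 0 to num_class - 1, which is what the re-encoding guarantees. The helper names remap_labels and accuracy and the label values are hypothetical. In the post’s code, the labels already come from a LabelEncoder fit on the full dataset, so the mapping is effectively the identity there.

```python
def remap_labels(train, test):
    """Map the labels seen in train or test to consecutive integers from 0."""
    unique = sorted(set(train) | set(test))
    mapping = {label: idx for idx, label in enumerate(unique)}
    return [mapping[v] for v in train], [mapping[v] for v in test], mapping

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical labels with gaps (3 and 7 are not consecutive with 0).
y_train_toy, y_test_toy, mapping = remap_labels([0, 3, 7, 3], [7, 0])
print(mapping)      # gaps removed
print(y_train_toy)  # remapped training labels
print(accuracy(y_test_toy, [2, 1]))  # one of two predictions is correct
```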

We observe a 78% accuracy using the stock paraphrase-MiniLM-L6-v2 sentence transformer. To observe the results of the fine-tuned paraphrase-MiniLM-L6-v2 sentence transformer, we need to update the beginning of the code as follows. All other code remains the same.

# Load the fine-tuned sentence transformer saved earlier
model = SentenceTransformer('st_ft_epoch_5')
data['text_embedding_finetuned'] = data['all_text'].apply(lambda x: model.encode(str(x)))
text_embeddings = pd.DataFrame(data['text_embedding_finetuned'].tolist(), index=data.index, dtype=float)
X_pa_finetuned = pd.concat([X_0, text_embeddings], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X_pa_finetuned, y, test_size=0.2, random_state=42)

# Re-encode the labels to ensure they are consecutive integers starting from 0
unique_labels = sorted(set(y_train) | set(y_test))
label_mapping = {label: idx for idx, label in enumerate(unique_labels)}

y_train = y_train.map(label_mapping)
y_test = y_test.map(label_mapping)

# Build and train the XGBoost model
# Enable categorical support for XGBoost
dtrain = xgb.DMatrix(X_train, label=y_train, enable_categorical=True)
dtest = xgb.DMatrix(X_test, label=y_test, enable_categorical=True)

param = {
    'max_depth': 6,
    'eta': 0.3,
    'objective': 'multi:softmax',
    'num_class': len(label_mapping),
    'eval_metric': 'mlogloss'
}

num_round = 100
bst = xgb.train(param, dtrain, num_round)

y_pred = bst.predict(dtest)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

# Optionally, convert the predicted labels back to the original category labels
inverse_label_mapping = {idx: label for label, idx in label_mapping.items()}
y_pred_labels = pd.Series(y_pred).map(inverse_label_mapping)

Accuracy: 0.94

With the fine-tuned paraphrase-MiniLM-L6-v2 sentence transformer, we observe a 94% accuracy, a 16-percentage-point increase from the baseline of 78% accuracy. From this observation, we conclude that fine-tuning paraphrase-MiniLM-L6-v2 is effective for classifying Amazon product data into product categories.
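
The size of this improvement can be expressed in a few ways, which the following arithmetic sketch works through using only the two accuracy figures reported above. The variable names are ours.

```python
baseline, fine_tuned = 0.78, 0.94

# Absolute gain in percentage points.
point_gain = round((fine_tuned - baseline) * 100, 1)                 # 16.0

# Gain relative to the baseline accuracy.
relative_gain = round((fine_tuned - baseline) / baseline * 100, 1)   # 20.5

# Share of the baseline's errors that fine-tuning eliminated
# (error rate falls from 22% to 6%).
error_reduction = round(
    ((1 - baseline) - (1 - fine_tuned)) / (1 - baseline) * 100, 1
)                                                                    # 72.7

print(point_gain, relative_gain, error_reduction)
```

In other words, the 16-point gain corresponds to about a 20.5% relative increase in accuracy and removes roughly 73% of the baseline’s misclassifications.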

Fine-tune the sentence transformer M5_ASIN_SMALL_V20

Now we create a sentence transformer from a BERT-based model called M5_ASIN_SMALL_V2.0. It’s a 40-million-parameter BERT-based model trained at M5, an internal team at Amazon specializing in fine-tuning LLMs using Amazon product data. It was distilled from a larger teacher model (approximately 5 billion parameters), which was pre-trained on a large amount of unlabeled ASIN data and pre-fine-tuned on a set of Amazon supervised learning tasks (multi-task pre-fine-tuning). It’s a multi-task, multi-lingual, multi-locale, and multi-modal BERT-based encoder-only model trained on text and structured data input. Its neural network architectural details are as follows:

Model backbone:
 Hidden size: 384
 Number of hidden layers: 24
 Number of attention heads: 16
 Intermediate size: 1536
 Vocabulary size: 256,035
Number of backbone parameters: 42,587,904
Number of word embedding parameters (bert.embedding.*): 98,517,504
Total number of parameters: 141,259,023
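
As a sanity check on these figures, the following sketch estimates the backbone parameter count from the hidden size, intermediate size, and number of layers. It assumes a standard BERT encoder layer layout (four attention projections, a two-layer feed-forward block, and two LayerNorms, all with biases); the actual M5 architecture is proprietary and may differ.

```python
hidden, intermediate, layers = 384, 1536, 24

# Query, key, value, and output projections, each with weights and a bias.
attention = 4 * (hidden * hidden + hidden)

# Feed-forward block: hidden -> intermediate -> hidden, with biases.
feed_forward = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden)

# Two LayerNorms per layer, each with a scale and a shift vector.
layer_norms = 2 * (2 * hidden)

per_layer = attention + feed_forward + layer_norms
backbone_estimate = per_layer * layers

reported_backbone = 42_587_904  # figure listed above
print(per_layer, backbone_estimate, reported_backbone - backbone_estimate)
```

Under this assumption, the estimate comes to 42,587,136 parameters, within 768 (2 × 384) of the reported backbone figure, which is consistent with the description of a roughly 40-million-parameter model.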

Because M5_ASIN_SMALL_V20 was pre-trained on Amazon product data specifically, we hypothesize that building a sentence transformer from it will increase the accuracy of product category classification. We complete the following steps to build a sentence transformer from M5_ASIN_SMALL_V20, fine-tune it, and input it into an XGBoost classifier to observe the accuracy impact:

  1. Load a pre-trained M5 model that you want to use as the base encoder.
  2. Use the M5 model within the SentenceTransformer framework to create a sentence transformer.
  3. Add a pooling layer to create fixed-size sentence embeddings from the variable-length output of the BERT model.
  4. Combine the M5 model and pooling layer into a single model.
  5. Fine-tune the model on a relevant dataset.

See the following code for Steps 1–3:

from sentence_transformers import SentenceTransformer, models
from transformers import AutoTokenizer

# Step 1: Load Pre-trained M5 Model
model_path = "M5_ASIN_SMALL_V20"  # or your custom model path
transformer_model = models.Transformer(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Step 2: Define Pooling Layer
pooling_model = models.Pooling(transformer_model.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=True)

# Step 3: Create SentenceTransformer Model
model_mean_m5_base = SentenceTransformer(modules=[transformer_model, pooling_model])
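
The pooling layer configured with pooling_mode_mean_tokens=True averages the token embeddings of a sentence into one fixed-size vector. The following plain-Python sketch illustrates the idea for a single sentence, skipping padding positions via the attention mask. It is a conceptual illustration with hypothetical 2-dimensional token vectors, not the library’s implementation.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the embeddings of real tokens, ignoring padding (mask == 0)."""
    dim = len(token_embeddings[0])
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m == 1]
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

# Three token positions; the last one is padding and must not affect the result.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

sentence_embedding = mean_pool(tokens, mask)
print(sentence_embedding)  # [2.0, 3.0]
```

Whatever the sentence length, the output has the same dimensionality as a single token embedding, which is what allows a downstream classifier to consume it as a fixed-width feature vector.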

The rest of the code remains the same as fine-tuning for the paraphrase-MiniLM-L6-v2 sentence transformer, except that we use the fine-tuned M5 sentence transformer instead to create embeddings for the texts in the dataset:

loaded_model = SentenceTransformer('m5_ft_epoch_5_mean')
data['text_embedding_m5'] = data['all_text'].apply(lambda x: loaded_model.encode(str(x)))

Results

Before fine-tuning, M5_ASIN_SMALL_V20 performs similarly to paraphrase-MiniLM-L6-v2, with 78% accuracy. However, the fine-tuned M5_ASIN_SMALL_V20 sentence transformer performs better than the fine-tuned paraphrase-MiniLM-L6-v2: its accuracy is 98%, compared to 94% for the fine-tuned paraphrase-MiniLM-L6-v2. We fine-tuned the sentence transformers for 5 epochs, because experiments showed this was the optimal number to minimize loss. The following graph summarizes our observations of accuracy improvement with fine-tuning for 5 epochs in a single comparison chart.

Clean up

We recommend using GPUs to fine-tune the sentence transformers, for example, ml.g5.4xlarge or ml.g4dn.16xlarge. Be sure to clean up resources to avoid incurring additional costs.

If you’re using a SageMaker notebook instance, refer to Clean up Amazon SageMaker notebook instance resources. If you’re using Amazon SageMaker Studio, refer to Delete or stop your Studio running instances, applications, and spaces.

Conclusion

In this post, we explored sentence transformers and how to use them effectively for text classification tasks. We dived deep into the sentence transformer paraphrase-MiniLM-L6-v2, demonstrated how to use a BERT-based model like M5_ASIN_SMALL_V20 to create a sentence transformer, showed how to fine-tune sentence transformers, and showed the accuracy effects of fine-tuning sentence transformers.

Fine-tuning sentence transformers has proven to be highly effective for classifying product descriptions into categories, significantly improving prediction accuracy. As a next step, we encourage you to explore different sentence transformers from Hugging Face.

Lastly, if you want to explore M5, note that it’s proprietary to Amazon and you can only access it as an Amazon partner or customer as of the time of this publication. Connect with your Amazon point of contact if you’re an Amazon partner or customer wanting to use M5, and they will guide you through M5’s offerings and how it can be used for your use case.


About the Authors

Kara Yang is a Data Scientist at AWS Professional Services in the San Francisco Bay Area, with extensive experience in AI/ML. She specializes in leveraging cloud computing, machine learning, and Generative AI to help customers address complex business challenges across various industries. Kara is passionate about innovation and continuous learning.

Farshad Harirchi is a Principal Data Scientist at AWS Professional Services. He helps customers across industries, from retail to industrial and financial services, with the design and development of generative AI and machine learning solutions. Farshad brings extensive experience in the entire machine learning and MLOps stack. Outside of work, he enjoys traveling, playing outdoor sports, and exploring board games.

James Poquiz is a Data Scientist with AWS Professional Services based in Orange County, California. He has a BS in Computer Science from the University of California, Irvine and has several years of experience working in the data domain, having played many different roles. Today he works on implementing and deploying scalable ML solutions to achieve business outcomes for AWS clients.
