Fine-Tuning BERT for Sentiment Analysis

Introduction

Using a pre-trained language model like BERT (Bidirectional Encoder Representations from Transformers), we can leverage contextual embeddings to better understand and analyze natural language text. This blog walks through the process of fine-tuning BERT for sentiment classification, building a classifier on top of the transformer to adapt it to a specific use case.

BERT is a transformer-based model pre-trained on a large corpus of text with two objectives: masked language modeling and next sentence prediction. This pre-training, combined with the self-attention mechanism of the transformer architecture, enables BERT to capture a deep understanding of language nuances, context, and grammar.
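
To get a feel for masked language modeling before fine-tuning anything, we can ask the plain pre-trained model to fill in a blanked-out word using the transformers fill-mask pipeline (a quick illustrative snippet, not part of the fine-tuning code below; the example sentence is made up):

from transformers import pipeline

# ask the pre-trained (not yet fine-tuned) model to fill in the masked token
fill_mask = pipeline("fill-mask", model = "bert-base-uncased")
for prediction in fill_mask("This movie was absolutely [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))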

The Dataset

We will be using the IMDB Movie Review dataset. Each record consists of a review (free text) and its sentiment, either positive or negative. We will not go in depth on how to preprocess text data for modeling, since this article focuses on the fine-tuning part; of course, more careful text preprocessing is good practice and would likely improve the results.

We will start by loading the data and lightly cleaning it

import gc
import re
import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModel, get_cosine_schedule_with_warmup
from sklearn.model_selection import train_test_split


reviews = pd.read_csv("IMDB Dataset.csv")
print(reviews.shape)
reviews.head()
reviews["sentiment"].value_counts()

# (50000, 2)

#                                                  review      sentiment
# 0	One of the other reviewers has mentioned that ...	positive
# 1	A wonderful little production. <br /><br />The...	positive
# 2	I thought this was a wonderful way to spend ti...	positive
# 3	Basically there's a family where a little boy ...	negative
# 4	Petter Mattei's "Love in the Time of Money" is...	positive

# sentiment
# positive    25000
# negative    25000
# Name: count, dtype: int64

Now we will convert the sentiment into an integer label column and clean the review text

reviews["label"] = 1
reviews.loc[reviews["sentiment"] == "negative", "label"] = 0

def clean_review(review):
    html_tag = re.compile('<.*?>')
    cleaned_review = re.sub(html_tag, "", review).split()
    return " ".join(cleaned_review)

print("## before cleaning")
text = reviews.review[0]
print(text[:200])
# ## before cleaning
# One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me abo


print("\n## after cleaning")
cleaned_text = clean_review(text)
print(cleaned_text[:200])

# ## after cleaning
# One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.The first thing that struck me about Oz was it

## cleaning the review column
reviews["cleaned_review"] = reviews["review"].apply(lambda x: clean_review(x))

Let’s now split the data into training and test sets. We will use only 1,000 records for training and 1,000 for testing.

X_train, X_test, y_train, y_test = train_test_split(
                                         reviews["cleaned_review"],
                                         reviews["label"], 
                                         # to get 1k for training
                                         test_size = 0.98,
                                         random_state = 13)
y_train.value_counts()
# label
# 1    534
# 0    466
# Name: count, dtype: int64

y_test[:1000].value_counts()
# label
# 0    501
# 1    499
# Name: count, dtype: int64

Fine-Tuning BERT for Classification

BERT excels at understanding text and producing contextual embeddings that capture its essence very well. These embeddings are useful out of the box, but we would like to adapt (fine-tune) the model to our data so that it aligns with the task’s requirements, i.e. mapping a review to a sentiment. The process involves adding a task-specific layer on top of BERT’s output and training the combined model on our dataset. Technically speaking, we will add a linear layer on top of BERT’s contextualized embedding output, `pooler_output`.

`pooler_output` is the embedding of the [CLS] (classification) token passed through an additional linear layer and activation. During pre-training it is used to predict whether Sentence 2 directly follows Sentence 1 in the next sentence prediction task, so the [CLS] token learns to represent the entire sequence, i.e. sentence-level understanding. Since `pooler_output` is essentially the [CLS] embedding transformed by a linear layer, we will use it as a contextualized representation of the whole input sequence, and add a task-specific linear layer on top of it to fine-tune the model for our task, i.e. classifying a review as positive or negative.
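
To make this concrete, we can probe the shapes BERT returns for a toy input: `last_hidden_state` holds one embedding per token, while `pooler_output` is a single vector per sequence (a quick check, assuming `bert-base-uncased` with hidden size 768; the weights are downloaded if not cached):

probe_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
probe_bert = AutoModel.from_pretrained("bert-base-uncased")

encoded = probe_tokenizer("What a great movie!", return_tensors = "pt")
with torch.no_grad():
    output = probe_bert(**encoded)

print(output.last_hidden_state.shape)   # (batch, num_tokens, 768): one embedding per token
print(output.pooler_output.shape)       # (batch, 768): one vector for the whole sequence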

Model and Process Definition

Let’s now build the model and its tokenizer

class BertSentimentClassifier(torch.nn.Module):
    def __init__(self, model_name):
        super(BertSentimentClassifier, self).__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        # binary classification
        self.cls_head = torch.nn.Linear(self.bert.config.hidden_size, 1) 
        self.loss_fn = torch.nn.BCELoss()

    def forward(self, input_ids, attention_mask, token_type_ids, labels = None):
        bert_output = self.bert(input_ids = input_ids,
                                attention_mask = attention_mask,
                                token_type_ids = token_type_ids)
        logits = self.cls_head(bert_output.pooler_output)
        probs = torch.sigmoid(logits).squeeze(-1)
        loss = None
        if labels is not None:
            loss = self.loss_fn(probs, labels)
        return loss, probs


## tokenizer
BERT_MODEL = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(BERT_MODEL)

## model's device if GPU exists
if torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"

## model
model = BertSentimentClassifier(BERT_MODEL).to(device)
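
Before training, a quick smoke test of the forward pass can confirm that the shapes and the loss computation behave as expected (an optional check on a tiny made-up batch):

sample = tokenizer(["What a great movie!", "Terrible, I walked out."],
                   return_tensors = "pt",
                   padding = True,
                   truncation = True).to(device)
sample_labels = torch.FloatTensor([1.0, 0.0]).to(device)
with torch.no_grad():
    sample_loss, sample_probs = model(input_ids = sample.input_ids,
                                      attention_mask = sample.attention_mask,
                                      token_type_ids = sample.token_type_ids,
                                      labels = sample_labels)
print(sample_loss.item(), sample_probs.shape)  # scalar loss and probabilities of shape (2,)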

We will use the Adam optimizer with a cosine schedule that includes a few warm-up steps, i.e. the learning rate is gradually increased during the first steps, which helps stabilize training. We will train the model for only 3 epochs.

optimizer = torch.optim.Adam(model.parameters(), lr = 5e-5)
scheduler = get_cosine_schedule_with_warmup(
                optimizer,
                num_warmup_steps = 10,
                num_training_steps = 100)

epochs = 3
batch_size = 20
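
The warm-up behaviour is easy to picture by printing the schedule itself. The snippet below is an optional check that uses a throwaway parameter, optimizer and scheduler (so the real scheduler’s state is not advanced) to show the learning rate ramping up over the first 10 steps and then decaying along the cosine curve:

# preview the schedule on a throwaway optimizer/scheduler pair
preview_param = torch.nn.Parameter(torch.zeros(1))
preview_opt = torch.optim.Adam([preview_param], lr = 5e-5)
preview_sched = get_cosine_schedule_with_warmup(preview_opt,
                                                num_warmup_steps = 10,
                                                num_training_steps = 100)
for step in range(100):
    if step % 10 == 0:
        print(step, preview_sched.get_last_lr()[0])
    preview_opt.step()
    preview_sched.step()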

Model Training

Now we define the training loop, the core of the fine-tuning process, and run the training

model.train()
losses = []
for epoch in range(epochs):
    in_losses = []
    print(f"epoch: {epoch}")
    total_records = 0
    correct_records = 0
    # train on batches
    for i in range(0, len(X_train), batch_size):
        batch_data = tokenizer(X_train[i:i+batch_size].tolist(),
                               return_tensors = "pt",
                               padding = True,
                               truncation = True).to(device)
        batch_y = torch.FloatTensor(y_train[i:i+batch_size].tolist()).to(device)
        optimizer.zero_grad()
        loss, probs = model(input_ids = batch_data.input_ids,
                            attention_mask = batch_data.attention_mask,
                            token_type_ids = batch_data.token_type_ids,
                            labels = batch_y)
        loss.backward()
        optimizer.step()
        scheduler.step()
        # store the scalar loss only, so the computation graph can be freed
        in_losses.append(loss.item())
        total_records += batch_size
        # count correct predictions: probability >= 0.5 means positive
        correct_records += torch.sum((probs >= 0.5) == batch_y).item()
        # clear cache to avoid cuda out of memory
        torch.cuda.empty_cache()
        _ = gc.collect()
        
    epoch_loss = sum(in_losses) / len(in_losses)
    losses.append(epoch_loss)
    accuracy = correct_records / total_records
    print(f"train loss: {epoch_loss}, accuracy: {accuracy}")

# epoch: 0
# train loss: 0.5525103211402893, accuracy: 0.695
# epoch: 1
# train loss: 0.23345758020877838, accuracy: 0.921
# epoch: 2
# train loss: 0.15327207744121552, accuracy: 0.957

Great! The model has improved a lot in just three epochs.
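
At this point it is worth persisting the fine-tuned weights so they can be reloaded later without retraining (a minimal sketch; the file name is arbitrary):

# save only the learned weights; the architecture is re-created from the class
torch.save(model.state_dict(), "bert_sentiment_classifier.pt")

# later, rebuild the model and load the weights back
restored_model = BertSentimentClassifier(BERT_MODEL)
restored_model.load_state_dict(torch.load("bert_sentiment_classifier.pt", map_location = device))
restored_model = restored_model.to(device)
restored_model.eval()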

Model Evaluation

Now it is time to evaluate the model on unseen data to measure its performance and generalization

model.eval()
total_records = 0
correct_records = 0
batch_size = 20
with torch.no_grad():  # no gradients are needed during evaluation
    for i in range(0, len(X_test[:1000]), batch_size):
        batch_data = tokenizer(X_test[i:i+batch_size].tolist(),
                               return_tensors = "pt",
                               padding = True,
                               truncation = True).to(device)
        batch_y = torch.FloatTensor(y_test[i:i+batch_size].tolist()).to(device)
        _, probs = model(input_ids = batch_data.input_ids,
                         attention_mask = batch_data.attention_mask,
                         token_type_ids = batch_data.token_type_ids,
                         labels = batch_y)
        total_records += batch_size
        correct_records += torch.sum((probs >= 0.5) == batch_y).item()

        torch.cuda.empty_cache()
        _ = gc.collect()
    
accuracy = correct_records / total_records
print(f"accuracy: {accuracy}")

# accuracy: 0.897
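
Accuracy alone can hide class-specific behaviour, so it can be useful to also look at precision, recall and F1. A sketch of doing this with scikit-learn, collecting the predictions for the same 1,000 test reviews in a second pass:

from sklearn.metrics import classification_report

all_preds, all_labels = [], []
with torch.no_grad():
    for i in range(0, 1000, batch_size):
        batch_data = tokenizer(X_test[i:i+batch_size].tolist(),
                               return_tensors = "pt",
                               padding = True,
                               truncation = True).to(device)
        _, batch_probs = model(input_ids = batch_data.input_ids,
                               attention_mask = batch_data.attention_mask,
                               token_type_ids = batch_data.token_type_ids)
        all_preds.extend((batch_probs >= 0.5).long().cpu().tolist())
        all_labels.extend(y_test[i:i+batch_size].tolist())

print(classification_report(all_labels, all_preds, target_names = ["negative", "positive"]))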

The model gives very good results on unseen data, reaching around 90% accuracy.
As a side note, we previously built a sentiment classifier on the same dataset using CNN and LSTM models in an earlier post.
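
Finally, a small helper makes it easy to score arbitrary text with the fine-tuned model (a sketch; the example sentence is made up):

def predict_sentiment(text):
    encoded = tokenizer(clean_review(text),
                        return_tensors = "pt",
                        truncation = True).to(device)
    with torch.no_grad():
        _, prob = model(input_ids = encoded.input_ids,
                        attention_mask = encoded.attention_mask,
                        token_type_ids = encoded.token_type_ids)
    label = "positive" if prob.item() >= 0.5 else "negative"
    return label, round(prob.item(), 3)

print(predict_sentiment("A beautifully shot film, but the story falls completely flat."))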

Conclusion

BERT is a powerful model for many NLP tasks, and fine-tuning it on a specific task can give excellent results. By adding just a single linear layer on top of BERT’s output, we achieved a major performance boost, since the model excels at capturing a deep understanding of the text. We have only scratched the surface of what BERT and other encoder transformers are capable of when it comes to text embeddings.
