Introduction
Using a pre-trained language model like BERT (Bidirectional Encoder Representations from Transformers), we can leverage contextual embeddings to better understand and analyze natural language text. This post walks through fine-tuning BERT for sentiment classification: we build a small classifier on top of the transformer to adapt it to a specific use case.
BERT is a transformer-based model pre-trained on a large corpus of text using self-attention together with objectives such as masked language modeling and next sentence prediction. This pre-training enables BERT to capture a deep understanding of language nuances, context, and grammar.
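To get a feel for what masked language modeling looks like in practice, here is a minimal, standalone sketch (separate from the fine-tuning workflow below) using the Hugging Face fill-mask pipeline; the example sentence is just an illustration.

from transformers import pipeline

# BERT's masked language modeling objective: predict the most likely
# tokens for the [MASK] position given the surrounding context.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("The movie was absolutely [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))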
The Dataset
We will use the IMDB Movie Review dataset. Each record consists of a review (free text) and its sentiment, either positive or negative. We will not go in depth on how to handle and preprocess text data for modeling, since this article focuses on the fine-tuning part; of course, more thorough text preprocessing is best practice and would likely improve results.
We will start by loading the data and taking a quick look at it
import gc
import re

import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModel, get_cosine_schedule_with_warmup
from sklearn.model_selection import train_test_split

reviews = pd.read_csv("IMDB Dataset.csv")
print(reviews.shape)
reviews.head()
reviews["sentiment"].value_counts()

# (50000, 2)
#                                               review sentiment
# 0  One of the other reviewers has mentioned that ...  positive
# 1  A wonderful little production. <br /><br />The...  positive
# 2  I thought this was a wonderful way to spend ti...  positive
# 3  Basically there's a family where a little boy ...  negative
# 4  Petter Mattei's "Love in the Time of Money" is...  positive
# sentiment
# positive    25000
# negative    25000
# Name: count, dtype: int64
Now we will convert the label column to integers and clean the text
reviews["label"] = 1 reviews.loc[reviews["sentiment"] == "negative", "label"] = 0 def clean_review(review): html_tag = re.compile('<.*?>') cleaned_review = re.sub(html_tag, "", review).split() return " ".join(cleaned_review) print("## before cleaning") text = reviews.review[0] print(text[:200]) # ## before cleaning # One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me abo print("\n## after cleaning") cleaned_text = clean_review(text) print(cleaned_text[:200]) # ## after cleaning # One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.The first thing that struck me about Oz was it ## cleaning the review column reviews["cleaned_review"] = reviews["review"].apply(lambda x: clean_review(x))
Let’s now split the data into training and test sets. We will use only 1,000 records for training and evaluate on 1,000 of the held-out records.
X_train, X_test, y_train, y_test = train_test_split(
    reviews["cleaned_review"],
    reviews["label"],
    # to get 1k for training
    test_size = 0.98,
    random_state = 13)

y_train.value_counts()
# label
# 1    534
# 0    466
# Name: count, dtype: int64

y_test[:1000].value_counts()
# label
# 0    501
# 1    499
# Name: count, dtype: int64
Fine-Tuning BERT for Classification
BERT excels at understanding text and producing contextual embeddings that capture the essence of a passage very well. These embeddings are useful in many situations, including ours, but we would like to adapt (fine-tune) the model to our data so that it aligns with the task’s requirements, i.e. mapping the text to a sentiment. The process involves adding a task-specific layer on top of BERT’s output and training the model on our dataset. Technically speaking, we will add a linear layer on top of BERT’s contextualized embedding, the `pooler_output`.
`pooler_output` is the embedding of the CLS (classification) token passed through an additional dense layer. During pre-training, BERT uses it in the next sentence prediction task to decide whether Sentence 2 directly follows Sentence 1, so the CLS token acts as a representation of the entire sequence, i.e. sentence-level understanding. Since `pooler_output` is essentially the CLS embedding transformed by a linear layer, we will use it as the contextualized representation of the input sequence, and on top of it we will add a task-specific linear layer to fine-tune the model for our task: classifying the input as positive or negative.
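To make this concrete, here is a small standalone sketch (independent of the classifier we build next) that runs one sentence through the base BERT model and compares the shapes of `last_hidden_state` and `pooler_output`; the example sentence is arbitrary.

import torch
from transformers import AutoModel, AutoTokenizer

bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

inputs = bert_tokenizer("What a great movie!", return_tensors="pt")
with torch.no_grad():
    outputs = bert(**inputs)

# one embedding per token: (batch_size, sequence_length, hidden_size)
print(outputs.last_hidden_state.shape)   # e.g. torch.Size([1, 7, 768])
# one embedding per sequence, derived from the CLS token: (batch_size, hidden_size)
print(outputs.pooler_output.shape)       # torch.Size([1, 768])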
Model and Process Definition
Let’s now build the model and its tokenizer
class BertSentimentClassifier(torch.nn.Module):
    def __init__(self, model_name):
        super(BertSentimentClassifier, self).__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        # binary classification
        self.cls_head = torch.nn.Linear(self.bert.config.hidden_size, 1)
        self.loss_fn = torch.nn.BCELoss()

    def forward(self, input_ids, attention_mask, token_type_ids, labels = None):
        bert_output = self.bert(input_ids = input_ids,
                                attention_mask = attention_mask,
                                token_type_ids = token_type_ids)
        logits = self.cls_head(bert_output.pooler_output)
        probs = torch.nn.functional.sigmoid(logits).squeeze(-1)
        loss = None
        if labels is not None:
            loss = self.loss_fn(probs, labels)
        return loss, probs

## tokenizer
BERT_MODEL = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(BERT_MODEL)

## model's device if GPU exists
if torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"

## model
model = BertSentimentClassifier(BERT_MODEL).to(device)
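Before launching training, an optional sanity check (not part of the original flow) is to run a single tokenized example through the model and confirm the output shape; the sample sentence here is just a placeholder.

sample = tokenizer(["A wonderful little production."],
                   return_tensors = "pt",
                   padding = True,
                   truncation = True).to(device)

with torch.no_grad():
    # no labels passed, so the returned loss is None
    _, sample_probs = model(input_ids = sample.input_ids,
                            attention_mask = sample.attention_mask,
                            token_type_ids = sample.token_type_ids)

print(sample_probs.shape)  # torch.Size([1]) -- one probability per input review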
We will use the Adam optimizer with a scheduler that includes some warm-up steps, i.e. gradually increasing the learning rate over the first few steps, which helps stabilize training. We will train the model for only 3 epochs.
optimizer = torch.optim.Adam(model.parameters(), lr = 5e-5)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps = 10,
    num_training_steps = 100)

epochs = 3
batch_size = 20
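To see what the warm-up actually does, here is a small standalone sketch, using a dummy parameter rather than the real model, that prints the scheduled learning rate for the first few steps: it ramps up linearly over the 10 warm-up steps and then decays along a cosine curve.

import torch
from transformers import get_cosine_schedule_with_warmup

dummy_param = torch.nn.Parameter(torch.zeros(1))
demo_optimizer = torch.optim.Adam([dummy_param], lr = 5e-5)
demo_scheduler = get_cosine_schedule_with_warmup(
    demo_optimizer, num_warmup_steps = 10, num_training_steps = 100)

for step in range(15):
    # learning rate currently set on the optimizer
    print(step, demo_scheduler.get_last_lr()[0])
    demo_optimizer.step()
    demo_scheduler.step()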
Model Training
Now we define the training loop, the main part, and run the training
model.train()
losses = []

for epoch in range(epochs):
    in_losses = []
    print(f"epoch: {epoch}")
    total_records = 0
    correct_records = 0

    # train on batches
    for i in range(0, len(X_train), batch_size):
        batch_data = tokenizer(X_train[i:i+batch_size].tolist(),
                               return_tensors = "pt",
                               padding = True,
                               truncation = True).to(device)
        batch_y = torch.FloatTensor(y_train[i:i+batch_size].tolist()).to(device)

        optimizer.zero_grad()
        loss, probs = model(input_ids = batch_data.input_ids,
                            attention_mask = batch_data.attention_mask,
                            token_type_ids = batch_data.token_type_ids,
                            labels = batch_y)
        loss.backward()
        optimizer.step()
        scheduler.step()

        # store only the scalar loss so the computation graph is not kept around
        in_losses.append(loss.item())
        total_records += batch_size
        correct_records += torch.sum((probs >= 0.5).float() == batch_y).item()

        # clear cache to avoid cuda out of memory
        torch.cuda.empty_cache()
        _ = gc.collect()

    epoch_loss = sum(in_losses) / len(in_losses)
    losses.append(epoch_loss)
    accuracy = correct_records / total_records
    print(f"train loss: {epoch_loss}, accuracy: {accuracy}")

# epoch: 0
# train loss: 0.5525103211402893, accuracy: 0.695
# epoch: 1
# train loss: 0.23345758020877838, accuracy: 0.921
# epoch: 2
# train loss: 0.15327207744121552, accuracy: 0.957
Great! The model has improved a lot in just three epochs.
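If you want to reuse the fine-tuned classifier later without retraining, one minimal option (not covered in this walkthrough) is to save the model's weights; the file name below is just an example.

# persist the fine-tuned weights
torch.save(model.state_dict(), "bert_sentiment_classifier.pt")

# later: rebuild the architecture and load the weights back
# restored = BertSentimentClassifier(BERT_MODEL).to(device)
# restored.load_state_dict(torch.load("bert_sentiment_classifier.pt", map_location = device))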
Model Evaluation
Now it is time to evaluate the model on unseen data to measure its performance and generalization
model.eval()

total_records = 0
correct_records = 0
batch_size = 20

# no gradients are needed for evaluation
with torch.no_grad():
    for i in range(0, len(X_test[:1000]), batch_size):
        batch_data = tokenizer(X_test[i:i+batch_size].tolist(),
                               return_tensors = "pt",
                               padding = True,
                               truncation = True).to(device)
        batch_y = torch.FloatTensor(y_test[i:i+batch_size].tolist()).to(device)

        _, probs = model(input_ids = batch_data.input_ids,
                         attention_mask = batch_data.attention_mask,
                         token_type_ids = batch_data.token_type_ids,
                         labels = batch_y)

        total_records += batch_size
        correct_records += torch.sum((probs >= 0.5).float() == batch_y).item()

        torch.cuda.empty_cache()
        _ = gc.collect()

accuracy = correct_records / total_records
print(f"accuracy: {accuracy}")
# accuracy: 0.897
The model generalizes well to unseen reviews, reaching close to 90% accuracy.
As a side note, we previously built a sentiment classifier on the same dataset using CNN and LSTM models in an earlier post.
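As a quick usage example (not part of the original evaluation), we can feed the fine-tuned model a raw review and read off the predicted sentiment; the review string is made up, and we reuse the same `clean_review` function and 0.5 threshold as above.

model.eval()
new_review = "The plot was predictable and the acting felt flat."

encoded = tokenizer([clean_review(new_review)],
                    return_tensors = "pt",
                    padding = True,
                    truncation = True).to(device)

with torch.no_grad():
    _, prob = model(input_ids = encoded.input_ids,
                    attention_mask = encoded.attention_mask,
                    token_type_ids = encoded.token_type_ids)

print("positive" if prob.item() >= 0.5 else "negative")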
Conclusion
BERT is a powerful model that helps with many NLP tasks, and fine-tuning it on a specific task can give excellent results. By adding just a single linear layer on top of BERT’s output, we achieve a major performance boost, since the model excels at capturing a deep understanding of the text. Here we have only scratched the surface of what BERT and other encoder transformers can do when it comes to text embeddings.