Understanding SoTA Language Models (BERT, RoBERTa,...

Hi everyone,

There are a ton of language models out there today, many of which have their own way of learning “self-supervised” language representations that can be used by downstream tasks.

In this article, I summarize the current trends and share some key insights that tie these novel approaches together. 😃 (Slide credits: Devlin et al., Stanford CS224n)

Problem: Context-free/Atomic Word Representations

In my previous post, we started with context-free approaches like word2vec and GloVe embeddings. The drawback of these approaches is that they assign each word a single fixed vector, so they do not account for the context the word appears in, e.g. “open a bank account” vs. “on the river bank”. The word bank has different meanings depending on the context it is used in.
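A minimal sketch of the problem, assuming a toy lookup table with made-up 3-dimensional vectors (these are not real word2vec or GloVe values): the vector for "bank" comes out identical in both sentences.

```python
# Toy context-free embedding table; the vectors are made up for illustration.
EMBEDDINGS = {
    "open": [0.1, 0.3, 0.5],
    "a": [0.0, 0.1, 0.0],
    "bank": [0.7, 0.2, 0.9],
    "account": [0.6, 0.1, 0.8],
    "on": [0.2, 0.0, 0.1],
    "the": [0.0, 0.0, 0.1],
    "river": [0.4, 0.9, 0.2],
}

def embed(sentence):
    """Look up one fixed vector per word, ignoring the surrounding words."""
    return [EMBEDDINGS[word] for word in sentence.split()]

finance = embed("open a bank account")
nature = embed("on the river bank")

# "bank" is at index 2 in the first sentence and index 3 in the second.
print(finance[2] == nature[3])  # True
```

Whatever a downstream model does with these vectors, it starts from the same representation for both senses of the word.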

Solution #1: Contextual Word Representations

With ELMo, the community started building forward (left-to-right) and backward (right-to-left) sequence language models, and used the concatenated embeddings extracted from both models as pre-trained embeddings for downstream modeling tasks such as classification (sentiment etc.).

Potential drawback:

ELMo can be considered a “weakly bi-directional” model, as it trains 2 separate models (one per direction) and only concatenates their outputs.
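The "weakly bi-directional" idea can be sketched roughly as follows. This is a toy stand-in, not ELMo itself: the hypothetical `run_rnn` helper uses a plain tanh recurrence with random weights in place of ELMo's trained LSTMs, and all sizes are assumptions. The point is only that each direction is computed on its own and the two are joined by concatenation at the end.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 4  # toy hidden size (assumption for illustration)

def run_rnn(inputs, w_in, w_rec):
    """Plain tanh recurrence: the state at step t depends only on inputs 0..t."""
    state = np.zeros(DIM)
    states = []
    for x in inputs:
        state = np.tanh(w_in @ x + w_rec @ state)
        states.append(state)
    return np.stack(states)

tokens = rng.normal(size=(5, DIM))  # 5 toy token embeddings

# Two separate models, each with its own (random, untrained) weights.
fwd_w_in, fwd_w_rec = rng.normal(size=(DIM, DIM)), rng.normal(size=(DIM, DIM))
bwd_w_in, bwd_w_rec = rng.normal(size=(DIM, DIM)), rng.normal(size=(DIM, DIM))

forward = run_rnn(tokens, fwd_w_in, fwd_w_rec)               # left to right
backward = run_rnn(tokens[::-1], bwd_w_in, bwd_w_rec)[::-1]  # right to left, re-aligned

# Each token gets the concatenation of its forward and backward state.
contextual = np.concatenate([forward, backward], axis=1)
print(contextual.shape)  # (5, 8)
```

Within either model, a token's state never sees both its left and right context; the two views only meet after the fact.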

Solution #2: Truly bi-directional Contextual Representations

To address the “weakly bi-directional” drawback and the information bottleneck that comes with LSTMs / recurrent approaches, the community turned to the Transformer architecture. Unlike LSTMs/RNNs, Transformers are entirely feedforward networks. Here is a quick summary of the architecture:

Tip: If you are new to transformers but are familiar with a vanilla Multi-Layer Perceptron (MLP) or fully connected neural network, you can think of a transformer as an MLP/standard NN with some fancy bells and whistles on top.

But, what makes the transformer so much more effective?

2 key ideas:

1. Every word has an opportunity to learn a representation with respect to every other word in the sentence (truly bi-directional). Think of every word as a feature given as input to a fully connected network. To build on this idea, let’s consider the transformer as a fully connected network with 1 hidden layer, as shown below:

If x1 and x5 are 2 words/tokens from my earlier example (on the river bank), x1 now has access to x5 regardless of the distance between them (the word on can learn a representation that depends on the context provided by the word bank).

2. Every layer can be expressed as one big matrix multiplication over all tokens at once (parallel computation), whereas an LSTM performs one multiplication per token, in sequence. This makes the transformer much faster than an LSTM on parallel hardware.
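Both ideas show up in a minimal sketch of single-head self-attention, the core Transformer operation. The weights are random and the sizes are toy values (assumptions for illustration); a real Transformer adds multiple heads, residual connections, layer normalization and feedforward sublayers on top of this.

```python
import numpy as np

rng = np.random.default_rng(0)
SEQ_LEN, DIM = 5, 8  # toy sizes (assumptions for illustration)

x = rng.normal(size=(SEQ_LEN, DIM))  # one row per token: x1 ... x5
w_q = rng.normal(size=(DIM, DIM))
w_k = rng.normal(size=(DIM, DIM))
w_v = rng.normal(size=(DIM, DIM))

def self_attention(x):
    """Single-head scaled dot-product self-attention over all tokens at once."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(DIM)                # (SEQ_LEN, SEQ_LEN): every pair of tokens
    scores -= scores.max(axis=1, keepdims=True)    # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over each row
    return weights @ v, weights

out, weights = self_attention(x)
print(weights.shape)      # (5, 5)
print(weights[0, 4] > 0)  # True: x1 attends directly to x5
```

There is no loop over time steps: the whole layer is a handful of matrix products, and the weight linking x1 to x5 is computed the same way as the weight linking neighbouring tokens.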

Problem with bi-directional models:

But language models (LMs) are supposed to model P(w_{t+1} | w_1, ..., w_t). How does the model learn anything if you expose all the words to it?

BERT builds on this idea, using transformers to learn Masked Language Modeling (MLM), which changes the task to P(w_masked | w_1, ..., w_t), i.e. predicting the masked words given the remaining (unmasked) words on both sides.

Tradeoff: In MLM, you mask and predict only ~15% of the words in a sentence, whereas a left-to-right LM predicts 100% of the words in the sentence (higher sample efficiency).
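The tradeoff can be illustrated with a simplified masking sketch. The `mask_tokens` helper is hypothetical and makes two simplifying assumptions: it selects each token independently with probability 0.15, and it always substitutes `[MASK]`. BERT's actual procedure replaces a selected token with `[MASK]` only 80% of the time, with a random token 10% of the time, and leaves it unchanged the remaining 10%.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked inputs, labels); labels are None where nothing is predicted."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK_TOKEN)
            labels.append(token)  # the model must recover this word
        else:
            inputs.append(token)
            labels.append(None)   # no loss on unmasked positions
    return inputs, labels

tokens = "the man went to the store to buy a gallon of milk".split()
inputs, labels = mask_tokens(tokens)

mlm_targets = sum(label is not None for label in labels)
# A left-to-right LM predicts every token from its prefix
# (the first one typically from a start-of-sequence symbol).
lr_targets = len(tokens)

print(mlm_targets, "of", len(tokens), "positions predicted by MLM")
print(lr_targets, "of", len(tokens), "positions predicted by a left-to-right LM")
```

Only the masked positions contribute to the MLM loss, so each pass over a sentence yields far fewer training signals than a left-to-right model gets from the same text.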

The input to the model also changes with respect to the previous LSTM-based approach. The input is now the sum of 3 embeddings:

1. Token embeddings – (Same as embeddings fed into the LSTM model)

2. Segment Embeddings –

Simply tells the model which sentence a token belongs to, e.g. “Sentence A: The man went to buy milk. Sentence B: The store was closed”.

3. Position Embeddings –

Can be thought of as the token’s index in the sequence, e.g. The – 0, man – 1 and so on.
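A minimal sketch of how the three embeddings combine, assuming random toy tables and a tiny vocabulary (real BERT learns these tables and uses WordPiece tokens). The `[CLS]` and `[SEP]` markers follow BERT's input convention; in BERT the three embeddings are summed element-wise to form the model input.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 4  # toy embedding size (assumption for illustration)

tokens = ["[CLS]", "the", "man", "went", "[SEP]", "the", "store", "[SEP]"]
segments = [0, 0, 0, 0, 0, 1, 1, 1]  # 0 = sentence A, 1 = sentence B

vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
token_table = rng.normal(size=(len(vocab), DIM))
segment_table = rng.normal(size=(2, DIM))
position_table = rng.normal(size=(len(tokens), DIM))  # one row per position

token_ids = np.array([vocab[t] for t in tokens])
positions = np.arange(len(tokens))

# The model input is the element-wise sum of the three embeddings.
model_input = (
    token_table[token_ids]
    + segment_table[segments]
    + position_table[positions]
)
print(model_input.shape)  # (8, 4)
```

The word "the" appears at positions 1 and 5 with the same token embedding, but its final input vectors differ because the segment and position embeddings differ.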