Improving Language Understanding by Generative Pre-Training
Paper Notes
January 2025
Unsupervised Pre-Training
Given an unsupervised corpus of tokens $\mathcal{U} = \{u_1, \dots, u_n\}$, the objective is to maximize the likelihood (equivalent to minimizing the negative log-likelihood):

$$L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \dots, u_{i-1}; \Theta)$$

where $\Theta$ are the model parameters and $k$ is the size of the context window.
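As a quick illustration (not from the paper), here is a minimal PyTorch-style sketch of this objective, assuming a hypothetical `model` that maps a window of token ids to next-token logits:

```python
import torch
import torch.nn.functional as F

def lm_loss(model, tokens, k):
    """Mean negative log-likelihood of each token given its k-token context.

    `model` and `tokens` are assumed: `model(context)` returns logits over the
    vocabulary for the next token; `tokens` is a 1-D LongTensor of token ids.
    """
    losses = []
    for i in range(k, len(tokens)):
        context = tokens[i - k:i].unsqueeze(0)   # (1, k) window u_{i-k}, ..., u_{i-1}
        logits = model(context)                  # (1, vocab_size)
        target = tokens[i].unsqueeze(0)          # u_i
        losses.append(F.cross_entropy(logits, target))
    # maximizing L_1 is equivalent to minimizing this mean NLL
    return torch.stack(losses).mean()
```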
They use a Transformer decoder, applying multi-headed self-attention over the input context tokens, followed by position-wise feed-forward layers to produce an output probability distribution over target tokens:

$$h_0 = U W_e + W_p$$
$$h_l = \text{transformer\_block}(h_{l-1}) \quad \forall l \in [1, n]$$
$$P(u) = \text{softmax}(h_n W_e^T)$$

where $U$ is the context vector of tokens, $W_e$ is the token embedding matrix, and $W_p$ is the positional encoding matrix.
Before the softmax, we compute the vector-matrix product $h_n W_e^T$ in order to return the logits of the final hidden representation, $h_n$.
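As a rough sketch (not the paper's code), a decoder-only forward pass with tied input/output embeddings might look like this; module names and hyperparameter defaults are placeholders:

```python
import torch
import torch.nn as nn

class TinyGPTDecoder(nn.Module):
    """Illustrative decoder-only LM with tied embeddings; not the paper's implementation."""
    def __init__(self, vocab_size=40000, d_model=768, n_layers=12, n_heads=12, k=512):
        super().__init__()
        self.W_e = nn.Embedding(vocab_size, d_model)          # token embedding matrix
        self.W_p = nn.Parameter(torch.zeros(k, d_model))      # learned positional encodings
        block = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, n_layers)  # decoder behavior comes from the causal mask

    def forward(self, tokens):                                 # tokens: (batch, seq) of ids
        seq = tokens.size(1)
        h = self.W_e(tokens) + self.W_p[:seq]                  # h_0 = U W_e + W_p
        mask = nn.Transformer.generate_square_subsequent_mask(seq)
        h = self.blocks(h, mask=mask)                          # h_l = transformer_block(h_{l-1})
        return h @ self.W_e.weight.T                           # logits h_n W_e^T (softmax applied later)
```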
If $a \cdot b = \|a\| \, \|b\| \cos\theta$, where $\theta$ is the angle between $a$ and $b$, then $a \cdot b$ is largest when the two vectors point in similar directions.

Therefore, when we multiply each column vector of $W_e^T$ (i.e., each token embedding) with $h_n$, we're extracting a similarity metric: the higher the value of the $i$th entry of the output vector $h_n W_e^T$, the higher the likelihood that the next word is the token at index $i$.

If $W_e$ represents the tokens for each word as a $d$-dimensional vector, this becomes a reliable way to predict the next word.
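A toy numeric example of this intuition (all values made up):

```python
import torch

h_n = torch.tensor([1.0, 0.0, 1.0])              # final hidden state (d = 3 for illustration)
W_e = torch.tensor([[0.9, 0.1, 0.9],             # token 0: nearly parallel to h_n
                    [-1.0, 0.0, -1.0],           # token 1: opposite direction
                    [0.0, 1.0, 0.0]])            # token 2: orthogonal
logits = h_n @ W_e.T                             # dot product with each token embedding
print(logits)                                    # tensor([ 1.8000, -2.0000,  0.0000])
print(torch.softmax(logits, dim=0))              # most probability mass lands on token 0
```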
Supervised Fine-Tuning
After training the model via the objective $L_1$, they perform supervised fine-tuning.
Given a labeled dataset $\mathcal{C}$, with input tokens $x^1, \dots, x^m$ and a label $y$, where $y$ can be a sequence or a single index, the model is trained to predict $P(y \mid x^1, \dots, x^m)$.
We add a final linear layer, $W_y$, in order to transform the final hidden representation $h_l^m$ into a $k$-dimensional logit vector which we can feed into a softmax:

$$P(y \mid x^1, \dots, x^m) = \text{softmax}(h_l^m W_y)$$

In this case, $k$ is the number of output classes for the fine-tuning task.
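A minimal sketch of this classification head (module and variable names are my own):

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Linear layer W_y mapping the final token's hidden state to k class logits."""
    def __init__(self, d_model=768, num_classes=2):
        super().__init__()
        self.W_y = nn.Linear(d_model, num_classes, bias=False)

    def forward(self, hidden_states):              # (batch, seq, d_model) from the pre-trained decoder
        h_last = hidden_states[:, -1, :]            # h_l^m: hidden state of the last input token
        return self.W_y(h_last)                     # logits; softmax / cross-entropy applied outside
```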
As the objective, they maximize:

$$L_2(\mathcal{C}) = \sum_{(x, y)} \log P(y \mid x^1, \dots, x^m)$$

alongside $L_1$ as an auxiliary objective:

$$L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C})$$

This was found to improve generalization, as the auxiliary objective $L_1$ constrains the model from overfitting to $\mathcal{C}$.
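A rough sketch of the combined loss, written as quantities to minimize; the tensor shapes are assumptions, and the default $\lambda = 0.5$ matches the value reported in the paper:

```python
import torch.nn.functional as F

def finetune_loss(clf_logits, labels, lm_logits, lm_targets, lam=0.5):
    """L3 = L2 + lambda * L1, expressed as negative log-likelihoods to minimize.

    Assumed shapes: clf_logits (batch, k), labels (batch,),
    lm_logits (batch, seq, vocab), lm_targets (batch, seq).
    """
    l2 = F.cross_entropy(clf_logits, labels)                               # -L2: classification NLL
    l1 = F.cross_entropy(lm_logits.flatten(0, 1), lm_targets.flatten())    # -L1: auxiliary LM NLL
    return l2 + lam * l1
```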
Task-Specific Input Transformations
Textual Entailment: Premise and hypothesis concatenated into one string (is the hypothesis true or false given the premise?).
Similarity: There is no inherent ordering of the two sentences, so they run two passes with the different sentence orderings and process each independently. The two resulting sequence representations $h_l^m$ are then added element-wise and fed into a single linear layer for binary classification.
Multiple Choice: Given a context document $z$, a question $q$, and a set of possible answers $\{a_k\}$, they concatenate the document context and question with each possible answer, adding a delimiter token "$" in between.

Each sequence is processed independently, and the resulting scores are normalized with a softmax to get an output distribution over all possible answers.
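A sketch of the multiple-choice transformation and scoring, assuming hypothetical `tokenize` and `score` helpers (the latter standing in for the pre-trained model plus its linear head):

```python
import torch

def multiple_choice_probs(score, tokenize, context, question, answers, delim="$"):
    """Score [context; question; $; answer_k] independently, then softmax over answers."""
    logits = []
    for answer in answers:
        sequence = tokenize(f"{context} {question} {delim} {answer}")
        logits.append(score(sequence))                    # scalar score for this answer
    return torch.softmax(torch.tensor(logits), dim=0)     # distribution over the answers
```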
Experiments
Unsupervised Pre-Training
- BookCorpus Dataset for pre-training -- 7,000+ unpublished books from a variety of genres.
Model Specifications
See Attention
- Trained a 12-layer decoder only transformer with masked self-attention heads.
- 768-dimensional states (the embedding size and, typically, the q/k/v size) and 12 attention heads (giving a per-head dimension of 768 / 12 = 64)
- Adam optimizer with a max learning rate of 2.5e-4
- Learning rate was increased linearly from zero over the first 2000 updates and then annealed to 0 using a cosine schedule (see the sketch after this list).
- Trained for 100 epochs on minibatches of 64 randomly sampled sequences of 512 tokens.
- Weight initialization of $N(0, 0.02)$.
- Use BPE with 40,000 merges.
- Residual, embedding, and attention dropout with a rate of 0.1
- Use decoupled weight decay.
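A sketch of how this optimization setup might look; the exact anneal shape and the total step count are assumptions, and `model` is assumed to exist:

```python
import math
import torch

def lr_at(step, max_lr=2.5e-4, warmup_steps=2000, total_steps=100_000):
    """Linear warmup from 0 to max_lr, then cosine anneal back to 0."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# AdamW gives decoupled weight decay (the paper reports w = 0.01 on non-bias/gain weights).
# optimizer = torch.optim.AdamW(model.parameters(), lr=2.5e-4, weight_decay=0.01)
# for step, batch in enumerate(loader):
#     for group in optimizer.param_groups:
#         group["lr"] = lr_at(step)
#     ...  # forward, backward, optimizer.step()
```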