Transformer (deep learning architecture)

1.1. Architecture

Main components:

  • Tokenizers: convert text into tokens
  • Embedding layer: convert tokens and positions of the tokens into vector representations
  • Transformer layers: carry out repeated transformations on the vector representations, extracting more and more linguistic information.
    • Alternating attention and feedforward layers.
    • Two major types of transformer layers: encoder layers and decoder layers
  • Un-embedding layer: convert the final vector representations back to a probability distribution over the tokens
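
To make the data flow concrete, here is a minimal sketch of how these components compose; the names `embed`, `layers`, and `unembed` are illustrative placeholders rather than a real API:

```python
def transformer_forward(token_ids, embed, layers, unembed):
    """Illustrative pipeline: tokens -> embeddings -> stacked layers -> token probabilities."""
    x = embed(token_ids)      # (seq_len, d_model) vectors for tokens and their positions
    for layer in layers:      # each layer alternates attention and feedforward sublayers
        x = layer(x)          # repeated transformations extract more linguistic information
    return unembed(x)         # (seq_len, vocab_size) probability distribution over tokens
```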

1.1.1. Embedding

Each token is converted into an embedding vector via a lookup table. Equivalently, the embedding layer multiplies a one-hot representation of the token by an embedding matrix $M$.
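
As a small NumPy sketch (the vocabulary size and matrix values are made up for illustration), the table lookup and the one-hot multiplication produce the same vector:

```python
import numpy as np

vocab_size, d_model = 10, 4
M = np.random.randn(vocab_size, d_model)  # embedding matrix M, one row per token

token_id = 7
lookup = M[token_id]                      # lookup-table view: select row 7

one_hot = np.zeros(vocab_size)
one_hot[token_id] = 1.0
via_matmul = one_hot @ M                  # one-hot row vector times M

assert np.allclose(lookup, via_matmul)    # both give the same embedding vector
```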

1.1.2. Un-embedding

Whereas an embedding layer converts a token into a vector, an un-embedding layer converts a vector into a probability distribution over tokens.

The un-embedding layer is a linear-softmax layer: $\mathrm{UnEmbed}(x) = \mathrm{softmax}(xW + b)$
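
A minimal NumPy sketch of this linear-softmax layer; the shapes are assumptions for illustration:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)     # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def unembed(x, W, b):
    """x: (seq_len, d_model), W: (d_model, vocab_size), b: (vocab_size,).
    Returns one probability distribution over the vocabulary per position."""
    return softmax(x @ W + b)
```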

1.1.3. Encoder-decoder (overview)

The original transformer model used an encoder-decoder architecture. The encoder consists of encoding layers that process all the input tokens together one layer after another, while the decoder consists of decoding layers that iteratively process the encoder's output and the decoder's output tokens so far.

Both the encoder and decoder layers have a feed-forward neural network for additional processing of their outputs and contain residual connections and layer normalization steps.

1.1.4. Feedforward network

The feedforward network (FFN) modules in a Transformer are two-layer multilayer perceptrons.
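
A sketch of such a two-layer MLP in NumPy; the ReLU nonlinearity and the shapes are assumptions for illustration (the hidden width $d_{ffn}$ is often several times $d_{model}$ in practice):

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Position-wise feedforward network applied to each position independently.
    x: (seq_len, d_model), W1: (d_model, d_ffn), W2: (d_ffn, d_model)."""
    h = np.maximum(0.0, x @ W1 + b1)  # first linear layer followed by ReLU (assumed here)
    return h @ W2 + b2                # second linear layer maps back to d_model
```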

1.1.5. Scaled dot-product attention

Attention head

The attention mechanisms used in the Transformer architecture are scaled dot-product attention units.

Each vector $x_{i,query}$ in the query sequence is multiplied by a matrix $W^Q$ to produce a query vector $q_i = x_{i,query} W^Q$. The matrix of all query vectors is the query matrix: $Q = X_{query} W^Q$.

Similarly, we construct the key matrix $K = X_{key} W^K$ and the value matrix $V = X_{value} W^V$.

It is usually the case that all of $W^Q, W^K, W^V$ are square matrices.

Attention weights are calculated using the query and key vectors: the attention weight from token $i$ to token $j$ is the dot product between $q_i$ and $k_j$. The attention weights are divided by the square root of the dimension of the key vectors, $\sqrt{d_k}$, which stabilizes gradients during training, and passed through a softmax which normalizes the weights.

The matrices $Q$, $K$ and $V$ are defined as the matrices whose $i$-th rows are the vectors $q_i$, $k_i$, and $v_i$ respectively. Then the attention head can be written as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\mathsf{T}}}{\sqrt{d_k}}\right) V$$

The attention mechanism requires the following three equalities to hold:

$$\ell_{seq,key} = \ell_{seq,value}, \qquad d_{query} = d_{key}, \qquad d_{value} = d_{head}$$

that is, the key and value sequences must have the same length, the query and key vectors must share a dimension so that their dot products are defined, and the dimension of the value vectors is the output dimension of the head.

Scaled dot-product attention, block diagram.

If the attention is used in a self-attention fashion, then $X_{query} = X_{key} = X_{value}$. If the attention is used in a cross-attention fashion, then usually $X_{query} \neq X_{key} = X_{value}$.
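
A NumPy sketch of a single scaled dot-product attention head, covering both the self-attention and cross-attention cases above; all shapes and the random inputs are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention_head(X_query, X_key, X_value, Wq, Wk, Wv):
    Q = X_query @ Wq                          # (len_query, d_k) query vectors
    K = X_key @ Wk                            # (len_key,   d_k) key vectors
    V = X_value @ Wv                          # (len_key,   d_v) value vectors
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # dot products scaled by sqrt(d_k)
    return softmax(scores) @ V                # softmax-weighted sum of value vectors

d_model, d_k = 8, 8
Wq, Wk, Wv = (np.random.randn(d_model, d_k) for _ in range(3))

X = np.random.randn(5, d_model)               # one sequence of 5 token vectors
self_out = attention_head(X, X, X, Wq, Wk, Wv)            # self-attention

X_enc = np.random.randn(7, d_model)           # e.g. output vectors from an encoder
cross_out = attention_head(X, X_enc, X_enc, Wq, Wk, Wv)   # cross-attention
```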

Multihead attention

In a transformer model, an attention head consists of three matrices: $W^Q$, $W^K$, and $W^V$. $W^Q$ and $W^K$ determine the relevance between tokens for attention scoring, while $W^V$, together with the output projection $W^O$, determines how attended tokens affect subsequent layers and the output logits.

Multiple attention heads in a layer allow the model to capture different definitions of "relevance". As tokens progress through layers, the scope of attention can expand, enabling the model to grasp more complex and long-range dependencies.

The outputs from all attention heads are concatenated and fed into the feed-forward neural network layers.

Concretely, let the multiple attention heads be indexed by $i$; then we have

$$\mathrm{MultiHead}(X) = \mathrm{Concat}_{i \in [n_{head}]}\big(\mathrm{Attention}(X W_i^Q, X W_i^K, X W_i^V)\big)\, W^O$$

where the matrix $X$ is the concatenation of word embeddings, the matrices $W_i^Q, W_i^K, W_i^V$ are "projection matrices" owned by the individual attention head $i$, and $W^O$ is a final projection matrix owned by the whole multi-head attention block.

It is theoretically possible for each attention head to have a different head dimension $d_{head}$, but that is rarely the case in practice.

As an example, in the smallest GPT-2 model, there are only self-attention mechanisms. It has the following dimensions: $d_{emb} = 768$, $n_{head} = 12$, $d_{head} = 64$. Since $12 \times 64 = 768$, its output projection matrix $W^O \in \mathbb{R}^{(12 \times 64) \times 768}$ is a square matrix.
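
A sketch of multi-head attention with the GPT-2-small shapes quoted above ($d_{emb}=768$, $n_{head}=12$, $d_{head}=64$); the per-head computation mirrors the earlier single-head sketch, and the random weights are placeholders:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_head=12):
    """X: (seq_len, 768); Wq/Wk/Wv: lists of per-head projections (768, 64);
    Wo: (12 * 64, 768) output projection owned by the whole multi-head block."""
    heads = []
    for i in range(n_head):
        Q, K, V = X @ Wq[i], X @ Wk[i], X @ Wv[i]
        A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
        heads.append(A @ V)                      # (seq_len, 64) per head
    return np.concatenate(heads, axis=-1) @ Wo   # concat to (seq_len, 768), then project

X = np.random.randn(10, 768)
Wq = [np.random.randn(768, 64) for _ in range(12)]
Wk = [np.random.randn(768, 64) for _ in range(12)]
Wv = [np.random.randn(768, 64) for _ in range(12)]
Wo = np.random.randn(12 * 64, 768)               # square, since 12 * 64 == 768
out = multi_head_attention(X, Wq, Wk, Wv, Wo)    # (10, 768)
```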

Masked attention

It may be necessary to cut out attention links between some word pairs. For example, the decoder, when decoding the token at position $t$, should not have access to the token at position $t+1$. This may be accomplished before the softmax stage by adding a mask matrix $M$ that is $-\infty$ at entries where the attention link must be cut and $0$ elsewhere:

$$\mathrm{MaskedAttention}(Q, K, V) = \mathrm{softmax}\left(M + \frac{QK^{\mathsf{T}}}{\sqrt{d_k}}\right) V$$

For example, the following matrix is commonly used in decoder self-attention modules, called "causal masking":

$$M_{causal} = \begin{pmatrix} 0 & -\infty & -\infty & \cdots & -\infty \\ 0 & 0 & -\infty & \cdots & -\infty \\ 0 & 0 & 0 & \cdots & -\infty \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 0 \end{pmatrix}$$

In words, it means that each token can pay attention to itself and every token before it, but not to any token after it.
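
A NumPy sketch of how such a causal mask can be built and applied; the sequence length is arbitrary:

```python
import numpy as np

seq_len = 4
# 0 on and below the diagonal (allowed attention links), -inf strictly above it (future tokens)
causal_mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

# Applied before the softmax, as in: softmax(causal_mask + Q @ K.T / np.sqrt(d_k)) @ V
print(causal_mask)
# [[  0. -inf -inf -inf]
#  [  0.   0. -inf -inf]
#  [  0.   0.   0. -inf]
#  [  0.   0.   0.   0.]]
```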

1.1.6. Encoder

An encoder consists of an embedding layer, followed by multiple encoder layers.

Each encoder layer consists of two major components: a self-attention mechanism and a feed-forward layer.

The encoder layers are stacked. The first encoder layer takes the sequence of input vectors from the embedding layer, producing a sequence of vectors. This sequence of vectors is processed by the second encoder layer, and so on. The output from the final encoder layer is then used by the decoder.

As the encoder processes the entire input all at once, every token can attend to every other token (all-to-all attention), so there is no need for causal masking.

One encoder layer
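
A sketch of one encoder layer as a composition of the pieces above, using the post-LN arrangement described in the sublayers section; `self_attention` and `ffn` stand in for routines like the earlier sketches, and the simplified layer norm omits its learnable scale and bias:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Simplified LayerNorm over the feature dimension (learnable scale/bias omitted)."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def encoder_layer(x, self_attention, ffn):
    """x: (seq_len, d_model). All-to-all self-attention, so no causal mask is needed."""
    x = layer_norm(x + self_attention(x))  # sublayer 1: self-attention + residual + LN
    x = layer_norm(x + ffn(x))             # sublayer 2: feedforward + residual + LN
    return x
```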

1.1.7. Decoder

Each decoder layer consists of three major components: a causally masked self-attention mechanism, a cross-attention mechanism, and a feed-forward neural network.

The decoder uses an additional attention mechanism to draw relevant information from the encodings generated by the encoders. This mechanism is also called encoder-decoder attention.

Like the first encoder, the first decoder takes positional information and embeddings of the output sequence as its input, rather than encodings. The transformer must not use the current or future output to predict an output, so the output sequence must be partially masked to prevent this reverse information flow. Thus, the self-attention module in the decoder is causally masked.

The cross-attention mechanism attends to the output vectors of the encoder, which are computed before the decoder starts decoding. Schematically, we have:

$$\mathrm{CrossAttention}(H^E, H^D) = \mathrm{Attention}(H^D W^Q, H^E W^K, H^E W^V)$$

where $H^E$ is the matrix with rows being the output vectors from the encoder, and $H^D$ is the matrix with rows being the hidden vectors of the decoder layer.
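
A NumPy sketch of the cross-attention step; `H_dec` plays the role of $H^D$ and `H_enc` the role of $H^E$, with illustrative shapes:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(H_dec, H_enc, Wq, Wk, Wv):
    """Queries come from the decoder's hidden vectors; keys and values from the encoder output."""
    Q = H_dec @ Wq
    K = H_enc @ Wk
    V = H_enc @ Wv
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V
```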

The last decoder is followed by a final un-embedding layer to produce the output probabilities over the vocabulary. One of the tokens is then sampled according to this probability distribution, and the decoder can be run again to produce the next token, and so on, autoregressively generating the output text.
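
A sketch of this autoregressive loop; `decoder_forward` is a hypothetical callable standing in for the whole decoder stack plus the un-embedding layer, returning next-token probabilities:

```python
import numpy as np

def generate(decoder_forward, encoder_output, start_id, end_id, max_len=50):
    """Repeatedly sample the next token and feed it back into the decoder."""
    tokens = [start_id]
    for _ in range(max_len):
        probs = decoder_forward(tokens, encoder_output)       # distribution over the vocabulary
        next_id = int(np.random.choice(len(probs), p=probs))  # sample one token from it
        tokens.append(next_id)
        if next_id == end_id:                                 # stop at the end-of-sequence token
            break
    return tokens
```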

1.2. Full transformer architecture

1.2.1. Sublayers

Each encoder layer contains 2 sublayers: the self-attention and the feedforward network. Each decoder layer contains 3 sublayers: the causally masked self-attention, the cross-attention, and the feedforward network.

The final points of detail are the residual connections and layer normalization (LayerNorm, or LN) steps, which, while conceptually unnecessary, are needed in practice for numerical stability and convergence.

The original 2017 Transformer used the post-LN convention. It was difficult to train and required careful hyperparameter tuning and a "warm-up" in learning rate, where it starts small and gradually increases.
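
A sketch of such a warm-up schedule, following the form used in the original paper (the default $d_{model}=512$ and 4000 warm-up steps are illustrative values): the rate rises linearly during warm-up, then decays as the inverse square root of the step number.

```python
def warmup_lr(step, d_model=512, warmup_steps=4000):
    """Learning rate that starts small, grows linearly during warm-up, then decays."""
    step = max(step, 1)  # avoid 0 ** -0.5 at the very first step
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Example: the rate at step 100 is much smaller than at the end of warm-up (step 4000).
print(warmup_lr(100), warmup_lr(4000))
```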
