Attention is all you need

Paper Summary

Paper

Background

Architecture

An autoregressive (AR) encoder-decoder model is proposed. The encoder maps the input sequence to a sequence of continuous representations. The decoder then uses the encoder output to generate the model's output sequence one element at a time. Since the model is AR, it consumes the previously generated symbols as additional input when generating the next symbol.
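
As a rough sketch of this generation loop (not the paper's exact implementation), the example below uses PyTorch's built-in nn.Transformer with greedy decoding; the vocabulary size, model width, and the BOS/EOS token ids are assumptions for illustration, and positional encodings (discussed later) are omitted for brevity.

```python
# Minimal sketch: encoder-decoder Transformer with greedy autoregressive decoding.
# VOCAB, D_MODEL, BOS, EOS are illustrative assumptions, not values from the paper.
import torch
import torch.nn as nn

VOCAB, D_MODEL, BOS, EOS = 1000, 512, 1, 2

class Seq2SeqTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        self.transformer = nn.Transformer(d_model=D_MODEL, batch_first=True)
        self.proj = nn.Linear(D_MODEL, VOCAB)

    def forward(self, src, tgt):
        # Causal mask so each target position only attends to earlier positions.
        tgt_mask = self.transformer.generate_square_subsequent_mask(tgt.size(1))
        out = self.transformer(self.embed(src), self.embed(tgt), tgt_mask=tgt_mask)
        return self.proj(out)

@torch.no_grad()
def greedy_decode(model, src, max_len=20):
    # The encoder consumes the full source once; the decoder then emits one
    # token at a time, feeding previously generated tokens back in as input.
    tgt = torch.tensor([[BOS]])
    for _ in range(max_len):
        logits = model(src, tgt)                   # (1, tgt_len, VOCAB)
        next_token = logits[:, -1].argmax(-1, keepdim=True)
        tgt = torch.cat([tgt, next_token], dim=1)  # append and continue
        if next_token.item() == EOS:
            break
    return tgt

model = Seq2SeqTransformer().eval()
src = torch.randint(0, VOCAB, (1, 10))             # a dummy source sequence
print(greedy_decode(model, src))
```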

Encoder

Decoder

Attention

Multi-head attention
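
For reference, the paper defines scaled dot-product attention as softmax(QK^T / sqrt(d_k))V, and multi-head attention runs h such attentions in parallel on linearly projected queries, keys, and values before concatenating and projecting the result. The sketch below follows the base model's dimensions (d_model = 512, h = 8); it is a simplified illustration without masking or dropout.

```python
# Minimal sketch of scaled dot-product attention and a multi-head wrapper.
import math
import torch
import torch.nn as nn

def scaled_dot_product_attention(q, k, v):
    # softmax(Q K^T / sqrt(d_k)) V
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    return scores.softmax(dim=-1) @ v

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, h=8):
        super().__init__()
        self.h, self.d_k = h, d_model // h
        self.wq, self.wk, self.wv = (nn.Linear(d_model, d_model) for _ in range(3))
        self.wo = nn.Linear(d_model, d_model)

    def forward(self, q, k, v):
        B = q.size(0)
        # Project, then split d_model into h heads of size d_k each.
        split = lambda x: x.view(B, -1, self.h, self.d_k).transpose(1, 2)
        q, k, v = split(self.wq(q)), split(self.wk(k)), split(self.wv(v))
        out = scaled_dot_product_attention(q, k, v)
        # Concatenate heads back together and apply the output projection.
        out = out.transpose(1, 2).reshape(B, -1, self.h * self.d_k)
        return self.wo(out)

x = torch.randn(2, 5, 512)                   # (batch, seq_len, d_model)
print(MultiHeadAttention()(x, x, x).shape)   # self-attention: torch.Size([2, 5, 512])
```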

Layers

Positional embeddings
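
Since the model contains no recurrence or convolution, the paper adds fixed sinusoidal positional encodings to the input embeddings so the model can use token order: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A minimal sketch:

```python
# Minimal sketch of the sinusoidal positional encodings from the paper.
import torch

def sinusoidal_positional_encoding(max_len, d_model=512):
    pos = torch.arange(max_len).unsqueeze(1)           # (max_len, 1)
    i = torch.arange(0, d_model, 2)                    # even dimensions 2i
    angle = pos / (10000 ** (i / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angle)                     # even dims get sin
    pe[:, 1::2] = torch.cos(angle)                     # odd dims get cos
    return pe                                          # added to the token embeddings

print(sinusoidal_positional_encoding(50).shape)        # torch.Size([50, 512])
```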

Self-attention benefits

References
