A transformer is trained autoregressively to model text and images as a single stream of data. Their method consists of two components: a discrete VAE (dVAE) and an autoregressive transformer.
A discrete VAE is trained to compress a 256x256x3 image into a 32x32 grid of image tokens. Each token is chosen from a vocabulary of 8192 codebook vectors. Modeling pixels directly is computationally infeasible, which motivates the use of the dVAE. However, because of this compression, dVAE reconstructions tend to be blurry, since the model cannot capture high-frequency details well. The image is encoded into discrete tokens, which are then used for image reconstruction.
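The compression described above can be made concrete with a little arithmetic using the numbers from the text (8192 = 2^13, so each token carries 13 bits):

```python
# Compression achieved by the dVAE tokenization (numbers from the text).
pixels = 256 * 256 * 3   # raw pixel values per image
tokens = 32 * 32         # image tokens per image
vocab = 8192             # codebook size; each token carries log2(8192) = 13 bits

print(pixels)            # 196608
print(tokens)            # 1024
print(pixels // tokens)  # 192: each token stands in for 192 pixel values
```

This factor-of-192 reduction in sequence length is what makes transformer training over images tractable, at the cost of the high-frequency detail mentioned above.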
256 text tokens and 1024 (32x32) image tokens are concatenated, and a transformer is trained autoregressively over this combined sequence. In other words, given the text tokens, the first image token is sampled, appended to the transformer's input along with the earlier tokens, and then the second image token is sampled, and so on. This is repeated until all 1024 image tokens are generated. Starting from the 256 text tokens, the model autocompletes the remaining image tokens of the data stream, from which the generated image can be rendered.
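The sampling loop above can be sketched as follows. Note that `next_token_logits` here is a random stand-in for the real transformer's forward pass, which would condition on the whole sequence; the function name and toy caption are assumptions of this sketch, not part of the original method.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB = 8192    # image-token vocabulary size
N_IMAGE = 1024  # 32x32 image tokens to generate

def next_token_logits(sequence):
    """Stand-in for the transformer's forward pass (hypothetical:
    a real model would compute logits conditioned on `sequence`)."""
    return rng.normal(size=VOCAB)

def sample_image_tokens(text_tokens):
    """Autoregressively extend the text tokens with 1024 image tokens."""
    sequence = list(text_tokens)
    for _ in range(N_IMAGE):
        logits = next_token_logits(sequence)
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        tok = rng.choice(VOCAB, p=probs)  # sample one token...
        sequence.append(int(tok))         # ...and feed it back in
    return sequence[len(text_tokens):]

image_tokens = sample_image_tokens([5, 17, 42])  # toy "caption"
print(len(image_tokens))  # 1024
```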
The generated 1024 image tokens (these form the latent indices) are looked up in the codebook, and the resulting codebook vectors are fed to the dVAE decoder to generate the image.
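The lookup step is just indexing into the codebook. A minimal sketch, assuming a toy embedding dimension of 16 (the real dimension is larger) and random indices in place of transformer samples:

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB, DIM = 8192, 16  # toy embedding dim, an assumption of this sketch
codebook = rng.normal(size=(VOCAB, DIM))

# 1024 generated token indices (the latent indices); random here for illustration.
latent_indices = rng.integers(0, VOCAB, size=1024)

# Look up each index to get its codebook vector, giving a 32x32 grid of
# embeddings that the dVAE decoder would upsample to a 256x256x3 image.
latents = codebook[latent_indices].reshape(32, 32, DIM)
print(latents.shape)  # (32, 32, 16)
```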
By sampling new latent sequences from the output of the transformer, new images can be generated by the dVAE decoder.
Unlike VQ-VAE, which picks one codebook vector deterministically (usually the one closest in L2 distance to the encoded latent), the dVAE encoder outputs a distribution over the codebook vectors for each latent position.
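The contrast can be shown with a toy codebook. Deriving the distribution from negative distances is an assumption of this sketch (the real dVAE encoder predicts logits directly); it just illustrates "one hard pick" versus "a distribution whose mode is that pick":

```python
import numpy as np

rng = np.random.default_rng(0)

codebook = rng.normal(size=(8, 4))  # toy codebook: 8 vectors of dimension 4
z = rng.normal(size=4)              # an encoded latent

# VQ-VAE: deterministic pick of the nearest codebook vector (L2 distance).
dists = np.linalg.norm(codebook - z, axis=1)
vq_index = int(np.argmin(dists))

# dVAE: a full distribution over codebook entries (here built from negative
# distances -- an assumption of this sketch, not the paper's encoder).
logits = -dists
probs = np.exp(logits - logits.max())
probs /= probs.sum()

print(int(probs.argmax()) == vq_index)  # True: the mode matches VQ-VAE's pick
```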
Since discrete sampling is non-differentiable, the method employs the Gumbel-softmax relaxation, which relaxes the discrete sampling problem to a continuous, differentiable approximation.
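A minimal sketch of the relaxation: adding Gumbel noise to the logits and applying a temperature-scaled softmax yields a smooth weighting over codebook entries that approaches a one-hot sample as the temperature goes to zero.

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(logits, tau=1.0):
    """Continuous relaxation of sampling from softmax(logits).
    As tau -> 0 the output approaches a one-hot sample."""
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0,1) noise
    y = (logits + g) / tau
    y = np.exp(y - y.max())  # numerically stable softmax
    return y / y.sum()

logits = np.array([2.0, 0.5, -1.0, 0.0])
soft = gumbel_softmax(logits, tau=1.0)   # smooth, differentiable weights
hard = gumbel_softmax(logits, tau=0.01)  # nearly one-hot
print(soft.sum())                        # 1.0: a proper distribution
```

Because every operation is differentiable, gradients can flow from the reconstruction loss back through the encoder, which plain argmax sampling would block.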
In the first stage, an ELBO objective (the standard VAE objective) is optimized, i.e., the image reconstruction term plus a KL term on the encoding, using the Gumbel-softmax relaxation. The prior over the latents is taken to be a uniform distribution over the codebook vectors.
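With a uniform prior over K codebook entries, the KL term in the ELBO simplifies to log K minus the entropy of the encoder's distribution q, as this sketch verifies numerically:

```python
import numpy as np

K = 8192  # codebook size

def kl_to_uniform(q):
    """KL(q || Uniform(K)) = log K - H(q); zero when q is uniform,
    maximal (log K) when q is one-hot."""
    q = np.asarray(q, dtype=float)
    nz = q > 0  # skip zero entries (0 * log 0 = 0 by convention)
    entropy = -np.sum(q[nz] * np.log(q[nz]))
    return np.log(len(q)) - entropy

uniform_q = np.full(K, 1.0 / K)
print(kl_to_uniform(uniform_q))  # ~0.0

one_hot = np.zeros(K)
one_hot[0] = 1.0
print(kl_to_uniform(one_hot))    # log(8192) = 13 * log(2), about 9.01
```

This term therefore penalizes the encoder for being overconfident about a single codebook entry, encouraging use of the full vocabulary.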
In the second stage, the dVAE encoder and decoder are frozen, and the prior over the latents is learned by the autoregressive transformer described earlier, using image-text pairs.
The transformer is a decoder-only model in which each image token can attend to all of the text tokens in its self-attention layers.
During sample generation, they use their contrastive learning model CLIP, which assigns a score to a given caption-image pair indicating how well they match. Multiple images are sampled per caption, and the best match according to this score is output.
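The reranking step can be sketched as follows. The `clip_score` function here is a stand-in using cosine similarity between random embedding vectors; the real CLIP encoders are learned networks, so the function, embedding size, and candidate count are all assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def clip_score(text_emb, image_emb):
    """Stand-in for CLIP scoring: cosine similarity between embeddings
    (the real CLIP text/image encoders are learned models)."""
    return float(text_emb @ image_emb /
                 (np.linalg.norm(text_emb) * np.linalg.norm(image_emb)))

text_emb = rng.normal(size=64)                            # "caption" embedding
candidate_images = [rng.normal(size=64) for _ in range(32)]  # sampled images

scores = [clip_score(text_emb, img) for img in candidate_images]
best = int(np.argmax(scores))  # index of the best-matching image to output
print(best)
```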