World Model

Juan Vera

June 2025

World Model

The world model is a space-time factorized transformer with a hidden dimension of $C$.

A factorized transformer performs attention separately over different dimensions of the input, in this case over the temporal axis $T$ and the spatial axes $H \times W$.

Let

$$X \in \mathbb{R}^{T \times N \times H \times W \times L}$$

be the tensor of input latent vectors, where

  • $T$ is the temporal window, i.e. the number of frames in the input,
  • $N$ is the number of cameras,
  • $H$ is the height of the frame,
  • $W$ is the width of the frame,
  • $L$ is the dimension of the individual latent vector.

Assume $N = 1$. The input then simplifies to:

$$X \in \mathbb{R}^{T \times 1 \times H \times W \times L}.$$
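To make the shapes concrete, here is a minimal NumPy sketch of one factorized attention pass over a tensor of this shape. It is an illustration under several assumptions that the text above does not state: the input projection `w_in` from $L$ to $C$ is hypothetical, attention is single-head with no learned query/key/value weights, no causal mask is applied, and spatial attention is run before temporal attention. The names `self_attention` and `factorized_block` are invented for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens):
    # Single-head scaled dot-product self-attention over the second-to-last
    # axis. Queries, keys, and values are the tokens themselves (no learned
    # projections), which keeps the sketch short.
    d = tokens.shape[-1]
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ tokens


def factorized_block(x, w_in):
    # x: (T, 1, H, W, L) latents; w_in: (L, C) hypothetical input projection.
    T, N, H, W, L = x.shape
    assert N == 1, "this sketch assumes a single camera"
    h = x[:, 0] @ w_in                      # (T, H, W, C)
    C = h.shape[-1]

    # Spatial attention: each frame attends over its own H*W positions.
    s = h.reshape(T, H * W, C)
    s = s + self_attention(s)               # residual, still (T, H*W, C)

    # Temporal attention: each spatial position attends over the T frames.
    t = np.swapaxes(s, 0, 1)                # (H*W, T, C)
    t = t + self_attention(t)               # residual, still (H*W, T, C)
    return np.swapaxes(t, 0, 1).reshape(T, H, W, C)


rng = np.random.default_rng(0)
T, H, W, L, C = 4, 3, 5, 8, 16
x = rng.standard_normal((T, 1, H, W, L))
w_in = rng.standard_normal((L, C)) / np.sqrt(L)
out = factorized_block(x, w_in)
print(out.shape)  # (4, 3, 5, 16)
```

The point of the factorization is visible in the attention score shapes: the spatial step builds $T$ matrices of size $(HW) \times (HW)$ and the temporal step builds $HW$ matrices of size $T \times T$, rather than a single $(THW) \times (THW)$ matrix over all tokens jointly.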