Randomly picked samples from a FloorGenT network trained on the KTH floor plan dataset, generated with nucleus sampling (p=90%). Sampled model output is shown in blue. Left: unconditioned novel samples. Middle: partial sequence completions conditioned on the segments shown in red (the first 25 segments of randomly selected test sequences, i.e., data not seen by the network during training). Right: samples conditioned on a partial image, with the input image shown in red (a rasterization of the first 25 segments of randomly selected test sequences).
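For readers unfamiliar with nucleus (top-p) sampling, the following is a minimal NumPy sketch of the general technique, not the FloorGenT implementation. The function names `nucleus_filter` and `sample_nucleus` are hypothetical; the only value taken from the caption is p = 0.9.

```python
import numpy as np

def nucleus_filter(probs, p=0.9):
    """Return a renormalized copy of `probs` restricted to the top-p nucleus."""
    probs = np.asarray(probs, dtype=np.float64)
    order = np.argsort(-probs)               # token indices, most probable first
    cumulative = np.cumsum(probs[order])
    # Keep the smallest prefix whose cumulative probability reaches p.
    cutoff = int(np.searchsorted(cumulative, p)) + 1
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

def sample_nucleus(probs, p=0.9, rng=None):
    """Draw one token index from the nucleus of `probs`."""
    rng = np.random.default_rng() if rng is None else rng
    return int(rng.choice(len(probs), p=nucleus_filter(probs, p)))
```

With `probs = [0.5, 0.3, 0.15, 0.05]` and p = 0.9, the three most probable tokens form the nucleus and the last token can never be drawn, which is how the method suppresses the unreliable low-probability tail of the distribution.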
Overview of data flow in FloorGenT. The input is a possibly empty sequence of tokens t. Each token is embedded as the sum of three discrete embedding vectors, and the embedded sequence is the input to the first self-attention layer. For the image models, the embedded input image is the input to the cross-attention layers. When sampling, the next token is repeatedly drawn from the next-token distribution and appended to the end of the token sequence, which is then fed back into the network.
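The data flow described above can be sketched as a short autoregressive loop. This is an illustration under stated assumptions, not the authors' code: the caption does not say what the three embeddings index, so the sketch assumes token value, coordinate slot within a segment (four per segment), and segment index. The stop token, all sizes, and the `next_token_distribution` stand-in (which replaces the attention layers with a simple projection) are likewise hypothetical.

```python
import numpy as np

# Hypothetical sizes, chosen only to keep the sketch small.
VOCAB, D_MODEL, MAX_SEGMENTS = 16, 8, 8
MAX_TOKENS = 4 * MAX_SEGMENTS

_init_rng = np.random.default_rng(0)
# Three embedding tables; what each one indexes is an assumption.
value_emb = _init_rng.normal(size=(VOCAB, D_MODEL))           # token value
coord_emb = _init_rng.normal(size=(4, D_MODEL))               # slot within a segment
segment_emb = _init_rng.normal(size=(MAX_SEGMENTS, D_MODEL))  # segment index
out_proj = _init_rng.normal(size=(D_MODEL, VOCAB))

def embed(tokens):
    """Embed each token as the sum of three looked-up vectors."""
    tokens = np.asarray(tokens, dtype=np.int64)
    idx = np.arange(len(tokens))
    return value_emb[tokens] + coord_emb[idx % 4] + segment_emb[idx // 4]

def next_token_distribution(tokens):
    """Stand-in for the network: mean-pool the embeddings, project, softmax."""
    h = embed(tokens).mean(axis=0) if len(tokens) else np.zeros(D_MODEL)
    logits = h @ out_proj
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def sample_sequence(prefix=(), max_len=MAX_TOKENS, stop_token=0, rng=None):
    """Repeatedly draw the next token and append it to the sequence."""
    rng = np.random.default_rng() if rng is None else rng
    tokens = list(prefix)  # may be empty (unconditioned sampling)
    while len(tokens) < max_len:
        probs = next_token_distribution(tokens)
        token = int(rng.choice(VOCAB, p=probs))
        tokens.append(token)  # fed back in on the next iteration
        if token == stop_token:  # hypothetical end-of-sequence marker
            break
    return tokens
```

Calling `sample_sequence()` with no prefix corresponds to unconditioned sampling, while passing the tokens of an existing partial plan as `prefix` corresponds to sequence completion.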
An example of a token sequence and its corresponding drawing in the shape of an L. Note that in practice, the line segments would be sorted by their distance to some origin coordinate as described in the paper.