Parti: Pathways Autoregressive Text-to-Image Model
Parti is an autoregressive text-to-image generation model that achieves high-fidelity photorealistic image generation and supports content-rich synthesis involving complex compositions and world knowledge. It treats text-to-image generation as a sequence-to-sequence modeling problem, similar to machine translation, allowing it to benefit from advancements in large language models.
The model uses the powerful image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens and can reconstruct these image token sequences as high-quality, visually diverse images. By scaling Parti's encoder-decoder up to 20 billion parameters, consistent quality improvements are observed. It achieves a state-of-the-art zero-shot FID score of 7.23 and a finetuned FID score of 3.22 on MS-COCO.
Parti is implemented in Lingvo and scaled with GSPMD on TPU v4 hardware for both training and inference. Detailed comparisons of four scales of Parti models – 350M, 750M, 3B, and 20B – show consistent and substantial improvements in model capabilities and output image quality. Human evaluators prefer the 20B model in most cases for image realism/quality and image-text match.
The model can manage long, complex prompts that require it to accurately reflect world knowledge, compose many participants and objects with fine-grained details and interactions, and adhere to a specific image format and style. Examples of prompts and output images demonstrate how Parti responds to changes in various aspects.
PartiPrompts (P2) is a rich set of over 1600 prompts in English that can be used to measure model capabilities across various categories and challenge aspects.
However, while Parti produces high-quality outputs for a broad range of prompts, it has limitations. Images shown are often selected from a large set of examples, and the model may encode harmful stereotypes and representations. Current models like Parti are trained on large, often noisy, image-text datasets that contain biases. This leads to stereotypical representations and reflects Western biases. Models that produce photorealistic outputs pose additional risks around the creation of deepfakes and the propagation of visually-oriented misinformation.
Despite these challenges, text-to-image models like Parti open up new possibilities for creating unique and aesthetically pleasing images, enhancing human creativity and productivity. The researchers are working on bias measurement and mitigation strategies, and plan to coordinate with artists to adapt the model's capabilities to their work.