Parti: Pathways Autoregressive Text-to-Image Model
Introduction
We introduce the Pathways Autoregressive Text-to-Image model (Parti), an autoregressive text-to-image generation model that achieves high-fidelity photorealistic image generation and supports content-rich synthesis involving complex compositions and world knowledge. Recent advances with diffusion models for text-to-image generation, such as Google’s Imagen, have also shown impressive capabilities and state-of-the-art performance on research benchmarks. Parti and Imagen are complementary in exploring two different families of generative models – autoregressive and diffusion, respectively – opening exciting opportunities for combinations of these two powerful models.
Key Features
Sequence-to-Sequence Modeling
Parti treats text-to-image generation as a sequence-to-sequence modeling problem, analogous to machine translation. This allows it to benefit from advances in large language models, especially capabilities that are unlocked by scaling data and model sizes. The target outputs are sequences of image tokens instead of text tokens in another language.
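The sequence-to-sequence framing can be illustrated with a minimal sketch of autoregressive decoding: image tokens are produced one at a time, each conditioned on the text tokens and on the image tokens generated so far. This is not Parti's implementation. The function `toy_next_token_logits` is a hypothetical stand-in for a Transformer encoder-decoder, and the vocabulary size and sequence length are toy values chosen for illustration.

```python
import math
import random


def toy_next_token_logits(text_tokens, image_tokens, vocab_size):
    """Hypothetical stand-in for one decoder step (NOT a real model).

    A real system would run a Transformer encoder over the text and a
    decoder over the image tokens generated so far. Here we only derive
    repeatable pseudo-logits from the context.
    """
    context_seed = hash((tuple(text_tokens), tuple(image_tokens)))
    rng = random.Random(context_seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(vocab_size)]


def sample_from_logits(logits, rng, temperature=1.0):
    """Sample a token id from softmax(logits / temperature)."""
    scaled = [value / temperature for value in logits]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(value - max_scaled) for value in scaled]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def generate_image_tokens(text_tokens, num_image_tokens, vocab_size, seed=0):
    """Generate image tokens left to right, one token per step."""
    rng = random.Random(seed)
    image_tokens = []
    for _ in range(num_image_tokens):
        logits = toy_next_token_logits(text_tokens, image_tokens, vocab_size)
        image_tokens.append(sample_from_logits(logits, rng))
    return image_tokens


# Toy usage: three "text tokens" in, sixteen "image tokens" out.
tokens = generate_image_tokens([5, 17, 42], num_image_tokens=16, vocab_size=32)
print(tokens)
```

In a full system, the resulting token sequence would then be passed to the image tokenizer's decoder to reconstruct pixels.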
Image Tokenization
Parti uses the ViT-VQGAN image tokenizer to encode images as sequences of discrete tokens, and relies on its ability to reconstruct such token sequences as high-quality, visually diverse images.
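The idea of discrete image tokens can be sketched with the nearest-neighbor codebook lookup used in vector quantization. This is an illustration only: ViT-VQGAN learns both its encoder and its codebook, whereas the 2-dimensional codebook and patch embeddings below are hand-written, hypothetical values.

```python
def quantize(patch_embeddings, codebook):
    """Map each embedding to the index of its nearest codebook vector."""
    tokens = []
    for vector in patch_embeddings:
        # Squared Euclidean distance to every codebook entry.
        distances = [
            sum((a - b) ** 2 for a, b in zip(vector, code))
            for code in codebook
        ]
        tokens.append(distances.index(min(distances)))
    return tokens


def dequantize(tokens, codebook):
    """Look up the codebook vector for each token id."""
    return [codebook[token] for token in tokens]


# Hypothetical toy codebook with four entries (real codebooks are learned
# and far larger).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Hypothetical encoder outputs for four image patches.
patches = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.8], [1.1, 0.9]]

tokens = quantize(patches, codebook)
print(tokens)                        # [0, 1, 2, 3]
print(dequantize(tokens, codebook))  # the quantized approximation of `patches`
```

The token ids are what an autoregressive model predicts; the lookup in `dequantize` is the first step of turning predicted tokens back into an image.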
Performance Results
- Model Scaling: Output quality improves consistently as Parti’s encoder-decoder is scaled up to 20 billion parameters.
- FID Scores: Parti achieves a state-of-the-art zero-shot FID score of 7.23 and a finetuned FID score of 3.22 on MS-COCO.
- Versatility: Our analysis shows Parti is effective across a wide variety of categories and difficulty aspects on Localized Narratives and on PartiPrompts, our new holistic benchmark of 1600+ English prompts.
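For readers unfamiliar with the metric reported above, Fréchet Inception Distance (FID) compares the statistics of Inception-network features extracted from real and generated images. Lower is better. The standard definition, which this summary does not restate, is given below. Here $\mu_r, \Sigma_r$ denote the mean and covariance of the real-image features, and $\mu_g, \Sigma_g$ those of the generated-image features.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

"Zero-shot" here means the model was evaluated on MS-COCO without being trained on its training split, while "finetuned" means it was further trained on MS-COCO before evaluation.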
Model Scaling Insights
Parti is implemented in Lingvo and scaled with GSPMD on TPU v4 hardware for both training and inference, allowing us to train a 20B parameter model that achieves record performance on multiple benchmarks. We perform detailed comparisons of four scales of Parti models – 350M, 750M, 3B, and 20B – and observe consistent and substantial improvements in model capabilities and output image quality.
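A rough back-of-the-envelope calculation suggests why partitioning a model across devices matters at the larger scales. The sketch below estimates the memory needed for the weights alone at the four model sizes. The bytes-per-parameter values (2 and 4) are assumptions for illustration, not figures from this work, and the estimate ignores optimizer state, activations, and other overhead.

```python
# Parameter counts for the four Parti scales named in the text.
PARAM_COUNTS = {"350M": 350e6, "750M": 750e6, "3B": 3e9, "20B": 20e9}


def weight_memory_gb(num_params, bytes_per_param):
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


for name, num_params in PARAM_COUNTS.items():
    two_byte = weight_memory_gb(num_params, 2)   # assumed 16-bit storage
    four_byte = weight_memory_gb(num_params, 4)  # assumed 32-bit storage
    print(f"{name:>4}: {two_byte:6.1f} GB at 2 bytes/param, "
          f"{four_byte:6.1f} GB at 4 bytes/param")
```

Under these assumptions the 20B model needs tens of gigabytes for weights alone, which motivates sharding the model over many accelerator chips rather than replicating it on each one.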
Example Prompts
- A portrait photo of a kangaroo wearing an orange hoodie and blue sunglasses standing on the grass in front of the Sydney Opera House holding a sign on the chest that says Welcome Friends!
- A green sign that says "Very Deep Learning" at the edge of the Grand Canyon.
- A photo of an astronaut riding a horse in the forest with a river in front of them.
PartiPrompts Benchmark
PartiPrompts (P2) is a rich set of over 1600 prompts in English that we release as part of this work. P2 can be used to measure model capabilities across various categories and challenge aspects. For example, we created a 67-word description for Vincent van Gogh’s The Starry Night (1889) to gauge the model's performance.
Discussion and Limitations
While Parti produces high-quality outputs for a broad range of prompts, the model has limitations. We discuss these challenges with examples, current failure modes, and opportunities for future work. Notably, the model struggles with negation and complex interactions.
Responsibility and Broader Impact
Text-to-image models introduce many opportunities and risks, particularly concerning bias and safety. We recognize that Parti may encode harmful stereotypes and representations, especially given the biases present in the training data. To mitigate these risks, we have decided not to release our Parti models, code, or data for public use without further safeguards in place.
Conclusion
Parti represents a significant advancement in text-to-image generation, offering unique capabilities that enhance human creativity and productivity. We aim to continue refining the model while addressing ethical considerations and biases to ensure responsible use.
Call to Action
Explore the example prompts and the PartiPrompts benchmark to see what Parti can do. The models themselves are not publicly released, so stay tuned for updates as we work on improving this technology.