Kosmos-G: Generating Images in Context with Multimodal Large Language Models
https://xichenpan.com/kosmosg/
Kosmos-G is a model that uses a Multimodal Large Language Model (MLLM) to generate images in context from generalized vision-language inputs. It aligns the output space of the MLLM with that of CLIP, using the textual modality as an anchor. Its distinguishing capabilities are zero-shot multi-entity subject-driven generation and seamless integration with various U-Net techniques. The model has two main components: an MLLM for multimodal perception, and an AlignerNet that connects the MLLM to the image decoder. Training proceeds in three stages: pre-training the MLLM, aligning the image decoder, and instruction tuning. Kosmos-G is shown to be effective across diverse image generation tasks, and it can potentially be paired with customized image decoder variants.
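The alignment step described above can be illustrated with a toy sketch. This is not the Kosmos-G implementation: it assumes the aligner is a single linear map, that the objective is a mean squared error between the mapped MLLM output and the CLIP text embedding of the same caption, and that both encoders are replaced by small random vectors. The names `train_aligner`, `mllm_feats`, and `clip_feats` are hypothetical and chosen only for illustration.

```python
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)

def train_aligner(mllm_feats, clip_feats, lr=0.05, steps=200, seed=0):
    """Fit a linear map from stand-in MLLM features to stand-in CLIP features.

    Assumption: a linear map trained with per-sample gradient descent on MSE
    stands in for AlignerNet, whose real architecture is not described here.
    """
    rng = random.Random(seed)
    d_in, d_out = len(mllm_feats[0]), len(clip_feats[0])
    W = [[rng.uniform(-0.1, 0.1) for _ in range(d_in)] for _ in range(d_out)]
    losses = []
    for _ in range(steps):
        total = 0.0
        for x, y in zip(mllm_feats, clip_feats):
            pred = matvec(W, x)
            total += mse(pred, y)
            # Gradient of the MSE with respect to W[i][j].
            for i in range(d_out):
                g = 2.0 * (pred[i] - y[i]) / d_out
                for j in range(d_in):
                    W[i][j] -= lr * g * x[j]
        losses.append(total / len(mllm_feats))
    return W, losses

# Toy data: the "text anchor" is the same caption seen by both encoders,
# simulated here as paired vectors related by an unknown linear map.
rng = random.Random(1)
true_map = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
mllm_feats = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(8)]
clip_feats = [matvec(true_map, x) for x in mllm_feats]

W, losses = train_aligner(mllm_feats, clip_feats)
print(f"alignment loss: {losses[0]:.4f} -> {losses[-1]:.6f}")
```

The point of the sketch is only the shape of the idea: paired text goes through both encoders, and a small trainable module is fitted so that the MLLM side lands in the space the image decoder already understands. Because the image decoder itself is not modified in this sketch, the fitted map is the only piece that changes, which is the reasoning behind the claim that customized decoder variants could be swapped in.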