ELITE: Encoding Visual Concepts into Textual Embeddings for Customized Text-to-Image Generation
Code: https://github.com/csyxwei/ELITE
The study aims to improve the ability of large text-to-image models to express customized concepts without excessive computation or memory cost. The researchers propose a learning-based encoder consisting of a global and a local mapping network. The global mapping network projects the hierarchical features of a given image into multiple "new" words in the textual word-embedding space: one primary word that captures the editable concept, plus auxiliary words that absorb irrelevant disturbances. The local mapping network injects encoded patch features into the cross-attention layers to supply omitted details without sacrificing the editability of the primary concept. Compared with prior optimization-based approaches on a variety of user-defined concepts, the proposed method achieves higher-fidelity inversion and more robust editability with a significantly faster encoding process.
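The global-mapping idea above can be sketched as a set of per-layer linear projections from image features into the text encoder's word-embedding space. This is a minimal illustration, not the paper's implementation: the dimensions, the random weights, and the choice of which layer yields the primary word are all hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (illustrative only): features from L hierarchical
# layers of an image encoder, each d_img-dimensional, mapped into a
# d_txt-dimensional word-embedding space.
L, d_img, d_txt = 4, 1024, 768

def global_mapping(layer_feats, weights):
    """Project each hierarchical image feature into one 'new' word embedding.

    One projected word serves as the primary (editable) concept word; the
    remaining words act as auxiliary words that absorb irrelevant detail.
    """
    # One linear projection per layer: (d_txt, d_img) @ (d_img,) -> (d_txt,)
    return [W @ f for W, f in zip(weights, layer_feats)]

# Stand-in inputs: random features and randomly initialized projections.
layer_feats = [rng.standard_normal(d_img) for _ in range(L)]
weights = [rng.standard_normal((d_txt, d_img)) / np.sqrt(d_img)
           for _ in range(L)]

words = global_mapping(layer_feats, weights)
primary, auxiliary = words[-1], words[:-1]
# `primary` would replace a placeholder token in the prompt's embedding
# sequence at generation time; the auxiliary words help only during training.
```

In the actual method the projections are learned so that the primary word stays well-editable while the auxiliary words soak up image-specific noise; this sketch only shows the shapes involved.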