Cheetor, a Transformer-based multi-modal large language model equipped with controllable knowledge re-injection. Cheetor demonstrates strong capabilities in reasoning over complicated interleaved vision-language instructions. It can identify connections between images, infer causes and reasons, understand metaphorical implications, and comprehend absurd objects through multi-modal conversations with humans.
The paper presents InstructBLIP, a vision-language model built around instruction tuning. The authors collect 26 datasets and split them between instruction tuning and held-out zero-shot evaluation, and they introduce instruction-aware visual feature extraction, in which the visual features are conditioned on the instruction text. InstructBLIP achieves the best zero-shot performance among the compared models, surpassing BLIP-2 and Flamingo, and when fine-tuned on individual downstream tasks it reaches strong accuracy, such as 90.7% on ScienceQA IMG. Qualitative comparisons further illustrate its advantage over other multimodal models for vision-language tasks.
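For reference, InstructBLIP checkpoints can be run through the Hugging Face transformers integration; a minimal sketch, assuming the Salesforce/instructblip-vicuna-7b checkpoint and a local example image (both illustrative):

```python
# Minimal sketch of querying an InstructBLIP checkpoint via transformers.
# The checkpoint name and image file are illustrative, not from the paper.
from PIL import Image
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration

processor = InstructBlipProcessor.from_pretrained("Salesforce/instructblip-vicuna-7b")
model = InstructBlipForConditionalGeneration.from_pretrained("Salesforce/instructblip-vicuna-7b")

image = Image.open("example.jpg").convert("RGB")
prompt = "What is unusual about this image?"

# The instruction text is used both for instruction-aware visual feature
# extraction and as input to the language model.
inputs = processor(images=image, text=prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=100)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip())
```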
ImageBind, the first AI model capable of binding data from six modalities at once, without the need for explicit supervision. By recognizing the relationships between these modalities — images and video, audio, text, depth, thermal and inertial measurement units (IMUs) — this breakthrough helps advance AI by enabling machines to better analyze many different forms of information, together.
DeepFloyd IF is a text-to-image model that uses the large language model T5-XXL-1.1 as a text encoder to generate intelligible and coherent images together with legible text. The model can render text within images, achieves a high degree of photorealism, and can generate images with non-standard aspect ratios. It can also modify style, patterns, and details in images without fine-tuning. DeepFloyd IF is modular, cascaded, and works in pixel space, using diffusion models that inject random noise into data and then reverse the process to generate new samples from the noise.
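DeepFloyd IF is available through Hugging Face diffusers; the sketch below runs only the first (64x64) base stage of the cascade, and the model ID, dtype, and offloading settings are assumptions taken from typical diffusers usage rather than from this page:

```python
# Sketch of the first stage of the DeepFloyd IF cascade via diffusers.
# Stages II and III would upscale the 64x64 output to 256 and 1024 pixels.
import torch
from diffusers import DiffusionPipeline

stage_1 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16
)
stage_1.enable_model_cpu_offload()  # keeps VRAM usage manageable

prompt = 'a photo of a corgi holding a sign that reads "hello world"'
# The T5-XXL text embeddings are computed once and reused by later stages.
prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)

image = stage_1(
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    output_type="pil",
).images[0]
image.save("if_stage1.png")
```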
The article extends the idea of instruction tuning with machine-generated instruction-following data, which has shown promise in improving zero-shot capabilities in the language domain, to the multimodal setting. The authors introduce LLaVA (Large Language-and-Vision Assistant), an end-to-end trained multimodal model that connects a vision encoder to an LLM for general-purpose visual and language understanding. In early experiments the model demonstrates impressive multimodal chat abilities and yields an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on ScienceQA, the combination of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%.
MiniGPT-4, a vision-language model that aligns a frozen visual encoder with a frozen large language model (LLM) using a single projection layer. The authors trained MiniGPT-4 in two stages: the first stage performs traditional pretraining on 5 million aligned image-text pairs. To address generation issues, they then built a small, high-quality dataset of image-text pairs with the help of ChatGPT, and the second stage finetunes the model on this dataset using a conversation template to improve generation reliability and overall usability. The results show that MiniGPT-4 possesses capabilities similar to GPT-4, such as detailed image description generation and website creation from handwritten drafts, as well as other emerging capabilities, like writing stories and poems based on images and teaching users how to cook from food photos. The method is computationally efficient and highlights the potential of advanced large language models for vision-language understanding.
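As a conceptual illustration (not MiniGPT-4's actual code), the alignment boils down to a single trainable linear layer that maps the frozen visual encoder's output into the frozen LLM's embedding space; the dimensions below are assumed for illustration:

```python
# Conceptual sketch of the single projection layer; dimensions are illustrative.
import torch
import torch.nn as nn

vision_dim, llm_dim = 768, 5120            # e.g. visual feature size -> LLM hidden size (assumed)
proj = nn.Linear(vision_dim, llm_dim)      # the only module trained during alignment

visual_tokens = torch.randn(1, 32, vision_dim)   # output of the frozen visual encoder
soft_prompt = proj(visual_tokens)                # shape (1, 32, llm_dim)
# `soft_prompt` is prepended to the text token embeddings so the frozen LLM
# conditions its generation on the image.
```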
Long Stable Diffusion is a pipeline of generative models that can be used to illustrate a full story. Currently, Stable Diffusion can only take in a short prompt, but Long Stable Diffusion can generate images for a long-form text. The process involves starting with a long-form text, asking GPT-3 for several illustration ideas for the beginning, middle, and end of the story, translating the ideas to "prompt-English," and then putting them through Stable Diffusion to generate the images. The images and prompts are then dumped into a .docx file for easy copy-pasting. The purpose of this pipeline is to automate the process of generating illustrations for AI-generated stories.
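A hedged sketch of that pipeline using current libraries (the OpenAI chat API stands in for the original GPT-3 completion call, and the model names, file names, and prompt format are assumptions, not the repository's exact code):

```python
# Sketch of the pipeline: story -> GPT prompts -> Stable Diffusion -> .docx.
import torch
from diffusers import StableDiffusionPipeline
from docx import Document
from openai import OpenAI

story = open("story.txt").read()

# 1. Ask a GPT model for illustration ideas phrased as SD-style prompts.
client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{
        "role": "user",
        "content": "Give three short Stable Diffusion prompts illustrating the "
                   "beginning, middle and end of this story:\n\n" + story,
    }],
)
prompts = [line.strip("-* ") for line in resp.choices[0].message.content.splitlines() if line.strip()]

# 2. Render each prompt with Stable Diffusion.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# 3. Dump the prompts and images into a .docx for easy copy-pasting.
doc = Document()
for i, prompt in enumerate(prompts):
    image = pipe(prompt).images[0]
    path = f"illustration_{i}.png"
    image.save(path)
    doc.add_paragraph(prompt)
    doc.add_picture(path)
doc.save("illustrated_story.docx")
```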
OpenFlamingo is a new tool that helps computers learn how to understand pictures and words together.
The OpenFlamingo project aims to develop a multimodal system capable of processing and reasoning about images, videos, and text, with the ultimate goal of matching the power and versatility of GPT-4 in handling visual and text input. The project is creating an open-source version of DeepMind's Flamingo model, a large multimodal model (LMM) trained on large-scale web corpora containing interleaved text and images. OpenFlamingo implements the same architecture as Flamingo but is trained on open-source datasets; the released OpenFlamingo-9B checkpoint was trained on 5M samples from the Multimodal C4 dataset and 10M samples from LAION-2B.
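For reference, the open_flamingo package exposes a factory for building the model; the language-model path and cross-attention interval below are assumptions that must match the specific released checkpoint:

```python
# Sketch of instantiating an OpenFlamingo model with the open_flamingo package.
from open_flamingo import create_model_and_transforms

model, image_processor, tokenizer = create_model_and_transforms(
    clip_vision_encoder_path="ViT-L-14",
    clip_vision_encoder_pretrained="openai",
    lang_encoder_path="decapoda-research/llama-7b-hf",   # assumed base LM of the 9B release
    tokenizer_path="decapoda-research/llama-7b-hf",      # assumed
    cross_attn_every_n_layers=4,                         # assumed value
)
# The published OpenFlamingo weights are then loaded into `model` (e.g. from
# the Hugging Face Hub) before running interleaved image-text inference.
```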
This model is an extension of the Stable Diffusion Image Variations model that incorporates multiple CLIP image embeddings. During training, up to 5 random crops were taken from the training images, and their CLIP image embeddings were computed and concatenated to serve as the conditioning for the model. At inference time, the model can combine image embeddings from multiple images to mix their concepts, and can add text concepts through the text encoder.
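A conceptual sketch of building such conditioning with the transformers CLIP vision encoder (the model name, image files, and the hand-off to the UNet are illustrative, not this model's exact code):

```python
# Compute CLIP image embeddings for several images and stack them so a
# diffusion UNet could cross-attend to all of them at once.
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
encoder = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-large-patch14")

images = [Image.open(p) for p in ["cat.jpg", "van_gogh.jpg"]]  # hypothetical files
pixel_values = processor(images=images, return_tensors="pt").pixel_values

with torch.no_grad():
    embeds = encoder(pixel_values).image_embeds   # (n_images, 768)

# One conditioning "token" per image, concatenated along the sequence dimension.
cond = embeds.unsqueeze(0)                        # (1, n_images, 768)
# `cond` would replace the usual single-image embedding fed to the UNet's
# cross-attention; the exact hand-off depends on the pipeline implementation.
```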
The clip-retrieval package makes it easy to compute CLIP embeddings and build a CLIP retrieval system. It can quickly compute image and text embeddings, build efficient knn indices, filter data, and host these indices behind a simple Flask service, and it includes a simple web UI for querying. clip-retrieval has been used by cah-prepro to preprocess 400M image-text pairs and by other projects such as autofaiss and antarctic-captions. ClipClient allows remote querying of a clip-retrieval backend from Python. The package is installable with pip.
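A minimal sketch of remote querying with ClipClient, following the package's documented pattern (the backend URL and index name are illustrative and may change):

```python
# Query a hosted clip-retrieval backend from Python with ClipClient.
from clip_retrieval.clip_client import ClipClient

client = ClipClient(
    url="https://knn.laion.ai/knn-service",
    indice_name="laion5B-L-14",
    num_images=10,
)
results = client.query(text="an orange cat sitting on a windowsill")
for r in results[:3]:
    # Each result is a dict with fields such as the image url, caption and similarity.
    print(r.get("url"), r.get("caption"), r.get("similarity"))
```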
Large-scale diffusion neural networks represent a substantial milestone in text-to-image generation, but they remain poorly understood, lacking interpretability analyses. In this paper, we perform a text-image attribution analysis on Stable Diffusion, a recently open-sourced model. To produce pixel-level attribution maps, we upscale and aggregate cross-attention word-pixel scores in the denoising subnetwork, naming our method DAAM. We evaluate its correctness by testing its semantic segmentation ability on nouns, as well as its generalized attribution quality on all parts of speech, rated by humans. We then apply DAAM to study the role of syntax in the pixel space, characterizing head-dependent heat map interaction patterns for ten common dependency relations. Finally, we study several semantic phenomena using DAAM, with a focus on feature entanglement, where we find that cohyponyms worsen generation quality and descriptive adjectives attend too broadly. To our knowledge, we are the first to interpret large diffusion models from a visuolinguistic perspective, which enables future lines of research.
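The accompanying daam package wraps a diffusers pipeline to collect the cross-attention maps during generation; a sketch following its documented usage pattern (the model ID and prompt are illustrative, and function names may differ slightly between versions):

```python
# Generate an image and overlay a DAAM heat map for a single word.
import torch
from diffusers import StableDiffusionPipeline
from daam import trace, set_seed
from matplotlib import pyplot as plt

pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-base")
pipe = pipe.to("cuda")

prompt = "A dog runs across the field"
gen = set_seed(0)  # reproducible generation

with torch.no_grad():
    with trace(pipe) as tc:
        out = pipe(prompt, num_inference_steps=30, generator=gen)
        # Aggregate cross-attention scores over heads, layers and timesteps,
        # then pull out the attribution map for one word.
        heat_map = tc.compute_global_heat_map()
        word_map = heat_map.compute_word_heat_map("dog")
        word_map.plot_overlay(out.images[0])
        plt.show()
```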
DALL-E and DALL-E 2 are deep learning models developed by OpenAI that generate digital images from natural language descriptions. DALL-E can generate imagery in multiple styles and manipulate and rearrange objects in its images. It can infer appropriate details without specific prompts, and exhibit a broad understanding of visual and design trends. DALL-E 2 can produce "variations" of existing images and edit images to modify or expand upon them.
Versatile Diffusion (VD), the first unified multi-flow multimodal diffusion framework, as a step towards Universal Generative AI. VD can natively support image-to-text, image-variation, text-to-image, and text-variation, and can be further extended to other applications such as semantic-style disentanglement, image-text dual-guided generation, latent image-to-text-to-image editing, and more. Future versions will support more modalities such as speech, music, video and 3D.
Get an approximate text prompt, with style, matching an image. Optimized for Stable Diffusion (CLIP ViT-L/14). The resource is an adapted version of the CLIP Interrogator notebook by @pharmapsychotic, which uses OpenAI CLIP models to analyze an image's content and suggest text prompts for creating similar images. The results are combined with a BLIP caption to produce the suggested prompts.
Example:
a cat wearing a suit and tie with green eyes, a stock photo by Hanns Katz, pexels, furry art, stockphoto, creative commons attribution, quantum wavetracing
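The same approach is also available as the pip-installable clip-interrogator package by the notebook's author; a minimal sketch of its documented usage (the image file and CLIP model name are illustrative):

```python
# Produce an approximate prompt for an image with clip-interrogator.
from PIL import Image
from clip_interrogator import Config, Interrogator

image = Image.open("cat_in_suit.jpg").convert("RGB")   # hypothetical file
ci = Interrogator(Config(clip_model_name="ViT-L-14/openai"))  # matches Stable Diffusion 1.x
print(ci.interrogate(image))
```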