Curated knowledge about art and AI
14 results tagged image2text

Cheetah: Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions
https://github.com/DCDmllm/Cheetah

  • image2text
  • llm

Cheetor is a Transformer-based multi-modal large language model equipped with controllable knowledge re-injection. It demonstrates strong capabilities in reasoning over complicated interleaved vision-language instructions: it can identify connections between images, infer causes and reasons, understand metaphorical implications, and comprehend absurd objects through multi-modal conversations with humans.

2 months ago
Related:
  • MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models : MiniGPT-4, a vision-language model that aligns a frozen visual encoder with a frozen large language model (LLM) using one projection layer. The author...
  • LLaVA: Large Language and Vision Assistant : The article presents a novel approach to large multimodal language models using machine-generated instruction-following data, which has shown promise ...
  • OpenFlamingo-9B Demo : OpenFlamingo is a new tool that helps computers learn how to understand pictures and words together. The OpenFlamingo project aims to develop a multi...
  • Instructblip : The paper focuses on a vision-language model called InstructBLIP and explores the process of instruction tuning. The authors collect 26 datasets and c...
  • Clip retrieval - converting the text query to a CLIP embedding : The clip-retrieval package allows for easy computing of clip embeddings and building of a clip retrieval system. It can be used to quickly compute ima...

Instructblip
https://github.com/salesforce/LAVIS/tree/main/projects/instructblip

  • image2text
  • llm

The paper focuses on a vision-language model called InstructBLIP and explores the process of instruction tuning. The authors collect 26 datasets and categorize them for instruction tuning and zero-shot evaluation. They also introduce a method called instruction-aware visual feature extraction. The results show that InstructBLIP achieves the best performance among all models, surpassing BLIP-2 and Flamingo. When fine-tuned for specific tasks, InstructBLIP exhibits exceptional accuracy, such as 90.7% on the ScienceQA IMG task. Through qualitative comparisons, the study highlights InstructBLIP's superiority over other multimodal models, demonstrating its importance in the field of vision-language tasks.
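For readers who want to try the model, here is a hedged usage sketch built on the LAVIS library linked above; the registry names ("blip2_vicuna_instruct", "vicuna7b") and the generate() call follow the project's documentation as best recalled and should be treated as assumptions.

```python
# Sketch: running InstructBLIP through LAVIS for instruction-following captioning.
# Model name/type strings are assumptions and may differ between LAVIS releases.
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load an InstructBLIP checkpoint together with its image preprocessor.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_vicuna_instruct", model_type="vicuna7b", is_eval=True, device=device
)

raw_image = Image.open("example.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# The instruction is passed as the prompt; visual features are extracted
# conditioned on it (the paper's instruction-aware feature extraction).
answer = model.generate({"image": image, "prompt": "Describe the unusual aspects of this image."})
print(answer)
```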

Instructblip

Demo

6 months ago
Related:
  • OpenFlamingo-9B Demo : OpenFlamingo is a new tool that helps computers learn how to understand pictures and words together. The OpenFlamingo project aims to develop a multi...
  • Cheetah : Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions : Cheetor, a Transformer-based multi-modal large language model equipped with controllable knowledge re-injection. Cheetor demonstrates strong capabilit...
  • LLaVA: Large Language and Vision Assistant : The article presents a novel approach to large multimodal language models using machine-generated instruction-following data, which has shown promise ...
  • MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models : MiniGPT-4, a vision-language model that aligns a frozen visual encoder with a frozen large language model (LLM) using one projection layer. The author...
  • Stable-diffusion-webui-chatgpt-utilities: Enables use of ChatGPT directly from the UI : This is an extension for stable-diffusion-webui that enables you to use ChatGPT for prompt variations and inspiration.

ImageBind by Meta AI
https://imagebind.metademolab.com/

  • image2text
  • multimodal

ImageBind is the first AI model capable of binding data from six modalities at once, without the need for explicit supervision. By recognizing the relationships between these modalities (images and video, audio, text, depth, thermal, and inertial measurement units (IMUs)), this breakthrough helps advance AI by enabling machines to better analyze many different forms of information together.
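A minimal sketch of what "binding" looks like in practice: embed text, an image, and an audio clip into the shared space and compare them. Import paths and loader names follow the project's README as recalled and may differ between repository versions.

```python
# Sketch: embedding three modalities with ImageBind and scoring cross-modal
# similarity with a dot product. File names are placeholders.
import torch
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

device = "cuda" if torch.cuda.is_available() else "cpu"
model = imagebind_model.imagebind_huge(pretrained=True).eval().to(device)

inputs = {
    ModalityType.TEXT: data.load_and_transform_text(["a dog", "a car"], device),
    ModalityType.VISION: data.load_and_transform_vision_data(["dog.jpg"], device),
    ModalityType.AUDIO: data.load_and_transform_audio_data(["bark.wav"], device),
}

with torch.no_grad():
    embeddings = model(inputs)

# Which text best matches the image, and which best matches the audio clip?
print(torch.softmax(embeddings[ModalityType.VISION] @ embeddings[ModalityType.TEXT].T, dim=-1))
print(torch.softmax(embeddings[ModalityType.AUDIO] @ embeddings[ModalityType.TEXT].T, dim=-1))
```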

7 months ago
Related:
  • IF by DeepFloyd Lab : DeepFloyd IF is a text-to-image model that utilizes the large language model T5-XXL-1.1 as a text encoder to generates intelligible and coherent image...
  • Clip retrieval - converting the text query to a CLIP embedding : The clip-retrieval package allows for easy computing of clip embeddings and building of a clip retrieval system. It can be used to quickly compute ima...
  • Long-form text-to-images generation (GPT-3 and Stable Diffusion) : Long Stable Diffusion is a pipeline of generative models that can be used to illustrate a full story. Currently, Stable Diffusion can only take in a s...
  • Img2prompt : Get an approximate text prompt, with style, matching an image. Optimized for stable-diffusion (clip ViT-L/14)). The resource is an adapted version of ...
  • DALL·E 2 : DALL-E and DALL-E 2 are deep learning models developed by OpenAI that generate digital images from natural language descriptions. DALL-E can generate ...

IF by DeepFloyd Lab
https://github.com/deep-floyd/IF

  • stable_diffusion
  • text2image
  • image2text
  • image_generation

DeepFloyd IF is a text-to-image model that uses the large language model T5-XXL-1.1 as a text encoder to generate intelligible and coherent images together with text. The model can render legible text inside images, achieves a high degree of photorealism, and can generate images with non-standard aspect ratios. It can also modify style, patterns, and details in images without the need for fine-tuning. DeepFloyd IF is modular, cascaded, and works in pixel space, using diffusion models that inject random noise into data before reversing the process to generate new data samples from the noise.
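The cascade can be driven from Hugging Face diffusers, which hosts the IF checkpoints. The sketch below follows the diffusers documentation as best recalled; checkpoint ids and the two-stage wiring are assumptions, and a third upscaler stage is omitted for brevity.

```python
# Sketch: the cascaded, pixel-space IF pipeline. Stage I generates a small image
# from T5 prompt embeddings; Stage II upscales it, reusing the same embeddings.
import torch
from diffusers import DiffusionPipeline

stage_1 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16
).to("cuda")
stage_2 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16
).to("cuda")

prompt = 'a photo of a corgi holding a sign that reads "hello world"'
prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)

image = stage_1(prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds,
                output_type="pt").images
image = stage_2(image=image, prompt_embeds=prompt_embeds,
                negative_prompt_embeds=negative_embeds, output_type="pil").images[0]
image.save("if_sample.png")
```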

deep floyd

Demo

7 months ago
Related:
  • AUTOMATIC1111 Stable Diffusion web UI : The Stable Diffusion WebUI offers a range of features for generating and processing images, including original txt2img and img2img modes, outpainting,...
  • Paella: Simple & Efficient Text-To-Image generation : Paella is an easy-to-use text-to-image model that can turn text into pictures. It was inspired by earlier models but has simpler code for training and...
  • Long-form text-to-images generation (GPT-3 and Stable Diffusion) : Long Stable Diffusion is a pipeline of generative models that can be used to illustrate a full story. Currently, Stable Diffusion can only take in a s...
  • DALL·E 2 : DALL-E and DALL-E 2 are deep learning models developed by OpenAI that generate digital images from natural language descriptions. DALL-E can generate ...
  • ComfyUI: A stable diffusion GUI with a graph/nodes interface : ComfyUI is a powerful and modular stable diffusion GUI and backend that enables users to design and execute advanced stable diffusion pipelines using ...

LLaVA: Large Language and Vision Assistant
https://github.com/haotian-liu/LLaVA

  • image2text
  • llm

The article presents a novel approach to large multimodal language models using machine-generated instruction-following data, which has shown promise in improving zero-shot capabilities on tasks in the language domain. The authors introduce LLaVA (Large Language-and-Vision Assistant), an end-to-end trained multimodal model that combines a vision encoder and an LLM for general-purpose visual and language understanding. The model demonstrated impressive multimodal chat abilities in early experiments and yields an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on ScienceQA, the combination of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%.
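To make the "vision encoder plus LLM" idea concrete, here is a minimal PyTorch sketch of the general architecture, not the LLaVA codebase itself: patch features from a frozen vision encoder are projected into the language model's embedding space and prepended to the text embeddings. All class and dimension names are illustrative.

```python
# Minimal sketch (not the LLaVA repository): image patch features become extra
# "tokens" that the language model attends to alongside ordinary text tokens.
import torch
import torch.nn as nn

class TinyLlavaLikeModel(nn.Module):
    def __init__(self, vision_encoder, llm, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder             # e.g. a frozen CLIP ViT
        self.projector = nn.Linear(vision_dim, llm_dim)  # trained connector
        self.llm = llm                                   # decoder-only LM (HF-style)

    def forward(self, pixel_values, input_ids):
        with torch.no_grad():                            # vision tower stays frozen
            patch_feats = self.vision_encoder(pixel_values)   # (B, N, vision_dim)
        image_tokens = self.projector(patch_feats)            # (B, N, llm_dim)
        text_embeds = self.llm.get_input_embeddings()(input_ids)
        inputs_embeds = torch.cat([image_tokens, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs_embeds)
```

In LLaVA itself the vision tower is a pretrained CLIP encoder; during visual instruction tuning the projection (and later the LLM) is updated while the encoder remains frozen.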

LLaVA: Large Language and Vision Assistant

Demo

7 months ago
Related:
  • Instructblip : The paper focuses on a vision-language model called InstructBLIP and explores the process of instruction tuning. The authors collect 26 datasets and c...
  • MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models : MiniGPT-4, a vision-language model that aligns a frozen visual encoder with a frozen large language model (LLM) using one projection layer. The author...
  • Cheetah : Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions : Cheetor, a Transformer-based multi-modal large language model equipped with controllable knowledge re-injection. Cheetor demonstrates strong capabilit...
  • OpenFlamingo-9B Demo : OpenFlamingo is a new tool that helps computers learn how to understand pictures and words together. The OpenFlamingo project aims to develop a multi...
  • MiniChain: A tiny library for coding with large language models. : MiniChain is a library used to link prompts together in a sequence, with the ability to manipulate and visualize them using Gradio. Users can ensure t...

MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models
https://github.com/Vision-CAIR/MiniGPT-4

  • image2text
  • llm

MiniGPT-4 is a vision-language model that aligns a frozen visual encoder with a frozen large language model (LLM) using a single projection layer. The authors trained MiniGPT-4 in two stages, the first using 5 million aligned image-text pairs for traditional pretraining. To address generation issues, they proposed a novel approach using a small, high-quality dataset and ChatGPT to create high-quality image-text pairs. The second stage involved fine-tuning the model on this dataset with a conversation template to improve generation reliability and overall usability. The results show that MiniGPT-4 possesses capabilities similar to GPT-4, such as detailed image description generation and website creation from handwritten drafts, as well as other emerging capabilities like writing stories and poems based on images and teaching users how to cook from food photos. The method is computationally efficient and highlights the potential of advanced large language models for vision-language understanding.

MiniGPT-4

Demo: Link1 Link2 Link3 Link4 Link5 Link6 Link7

7 months ago
Related:
  • LLaVA: Large Language and Vision Assistant : The article presents a novel approach to large multimodal language models using machine-generated instruction-following data, which has shown promise ...
  • Cheetah : Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions : Cheetor, a Transformer-based multi-modal large language model equipped with controllable knowledge re-injection. Cheetor demonstrates strong capabilit...
  • Instructblip : The paper focuses on a vision-language model called InstructBLIP and explores the process of instruction tuning. The authors collect 26 datasets and c...
  • OpenFlamingo-9B Demo : OpenFlamingo is a new tool that helps computers learn how to understand pictures and words together. The OpenFlamingo project aims to develop a multi...
  • OlaGPT: Empowering LLMs With Human-like Problem-Solving Abilities : OlaGPT is a newly developed framework that enhances large language models by simulating human-like problem-solving abilities. It incorporates six cogn...

Long-form text-to-images generation (GPT-3 and Stable Diffusion)
https://github.com/sharonzhou/long_stable_diffusion

  • stable_diffusion
  • gpt
  • text2image
  • image2text

Long Stable Diffusion is a pipeline of generative models that can be used to illustrate a full story. Currently, Stable Diffusion can only take in a short prompt, but Long Stable Diffusion can generate images for a long-form text. The process involves starting with a long-form text, asking GPT-3 for several illustration ideas for the beginning, middle, and end of the story, translating the ideas to "prompt-English," and then putting them through Stable Diffusion to generate the images. The images and prompts are then dumped into a .docx file for easy copy-pasting. The purpose of this pipeline is to automate the process of generating illustrations for AI-generated stories.
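The pipeline is simple enough to sketch end to end. The code below is not the repository's actual implementation: it uses the legacy OpenAI completion API and the diffusers library, and the prompt wording, model ids, and helper names are illustrative assumptions.

```python
# Sketch of the described pipeline: ask GPT-3 for illustration ideas for a long
# text, then render each idea with Stable Diffusion and save the images.
import os

import openai
import torch
from diffusers import StableDiffusionPipeline

openai.api_key = os.environ["OPENAI_API_KEY"]

def illustration_prompts(story: str, n: int = 3) -> list[str]:
    ask = (f"Suggest {n} short, visual image prompts (one per line) illustrating "
           f"the beginning, middle, and end of this story:\n\n{story}")
    resp = openai.Completion.create(model="text-davinci-003", prompt=ask, max_tokens=200)
    return [line.strip("- ").strip() for line in resp.choices[0].text.splitlines() if line.strip()]

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

story = open("story.txt").read()
for i, prompt in enumerate(illustration_prompts(story)):
    pipe(prompt).images[0].save(f"illustration_{i}.png")
```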

long

7 months ago
Related:
  • IF by DeepFloyd Lab : DeepFloyd IF is a text-to-image model that utilizes the large language model T5-XXL-1.1 as a text encoder to generates intelligible and coherent image...
  • Null-text Inversion for Editing Real Images using Guided Diffusion Models : The paper introduces a technique for text-based editing of real images using the Stable Diffusion model. In order to modify an image using text, it mu...
  • Implementation of Paint-with-words with Stable Diffusion : The author of the article discusses implementing the "painting with word" method proposed by researchers from NVIDIA, called eDiffi, with Stable Diffu...
  • Stable Diffusion Wildcards Collection : A collection of wildcards to use with Stable Diffusion
  • Cutting Off Prompt Effect : This stable-diffusion-webui extension aims to limit the influence of certain tokens in language models by rewriting them as padding tokens. This is im...

OpenFlamingo-9B Demo
https://7164d2142d11.ngrok.app/

  • gradio
  • image2text
  • llm
  • llama

OpenFlamingo is a new tool that helps computers learn how to understand pictures and words together.

The OpenFlamingo project aims to develop a multimodal system capable of processing and reasoning about images, videos, and text, with the ultimate goal of matching the power and versatility of GPT-4 in handling visual and text input. The project is creating an open-source version of DeepMind's Flamingo model, an LMM trained on large-scale web corpora containing interleaved text and images. The OpenFlamingo model implements the same architecture as Flamingo but is trained on open-source datasets, with the released OpenFlamingo-9B checkpoint trained on 5M samples from the Multimodal C4 dataset and 10M samples from LAION-2B.

https://laion.ai/blog/open-flamingo/

7 months ago
Related:
  • oobabooga/text-generation-webui: A gradio web UI for running Large Language Models : A text generation web UI built on Gradio that can run large language models like LLaMA, llama.cpp, GPT-J, Pythia, OPT, and GALACTICA. The UI features ...
  • Koala: A Dialogue Model for Academic Research : Koala is a new model fine-tuned on freely available interaction data scraped from the web, with a specific focus on data that includes interaction wit...
  • MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models : MiniGPT-4, a vision-language model that aligns a frozen visual encoder with a frozen large language model (LLM) using one projection layer. The author...
  • Instructblip : The paper focuses on a vision-language model called InstructBLIP and explores the process of instruction tuning. The authors collect 26 datasets and c...
  • Cheetah : Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions : Cheetor, a Transformer-based multi-modal large language model equipped with controllable knowledge re-injection. Cheetor demonstrates strong capabilit...

Image Mixer
https://lambdalabs-image-mixer-demo.hf.space/?__theme=dark

  • gradio
  • image2image
  • image2text

This model is an extension of the Stable Diffusion Image Variations model that incorporates multiple CLIP image embeddings. During training, up to 5 random crops were taken from each training image, and their corresponding CLIP image embeddings were computed and concatenated to serve as the model's conditioning. At inference time, the model can combine the image embeddings from multiple images to mix their concepts, and add text concepts using the text encoder.
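A conceptual sketch of the embedding mixing, using the transformers CLIP model: compute image embeddings for several pictures and blend them into one conditioning vector. This only illustrates the mixing step; the actual Image Mixer checkpoint and its conditioning interface are not shown, and the weights and file names are placeholders.

```python
# Conceptual sketch: weighted blend of CLIP image embeddings, the kind of vector
# Image Mixer conditions on. Not the Image Mixer pipeline itself.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

images = [Image.open(p).convert("RGB") for p in ["castle.jpg", "forest.jpg"]]
weights = torch.tensor([0.6, 0.4])            # per-image mixing strengths

inputs = processor(images=images, return_tensors="pt")
with torch.no_grad():
    embeds = model.get_image_features(**inputs)           # (2, 768)
embeds = embeds / embeds.norm(dim=-1, keepdim=True)       # normalize each embedding

mixed = (weights[:, None] * embeds).sum(dim=0)            # single conditioning vector
print(mixed.shape)
```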

image mixer

7 months ago
Related:
  • UnClip Image Interpolation Pipeline : Interpolate between two images
  • Versatile Diffusion : Versatile Diffusion (VD), the first unified multi-flow multimodal diffusion framework, as a step towards Universal Generative AI. VD can natively sup...
  • OpenFlamingo-9B Demo : OpenFlamingo is a new tool that helps computers learn how to understand pictures and words together. The OpenFlamingo project aims to develop a multi...
  • CoDeF: Content Deformation Fields for Temporally Consistent Video Processing : CoDeF is a new video representation involving a canonical content field and a temporal deformation field. The canonical content field captures the sta...
  • Cones: Concept Neurons in Diffusion Models for Customized Generation : The study explores if modern deep neural networks exhibit similar patterns to human brains in responding to semantic features of presented stimuli wit...

Clip retrieval - converting the text query to a CLIP embedding
https://rom1504.github.io/clip-retrieval/?back=https%3A%2F%2Fknn.laion.ai&index=laion5B-H-14&useMclip=false

  • prompt
  • image2text
  • dataset
  • search
  • tool

The clip-retrieval package allows for easy computing of CLIP embeddings and building of a CLIP retrieval system. It can be used to quickly compute image and text embeddings, build efficient indices, filter data, and host these indices with a simple Flask service. The package also includes a simple UI for querying. clip-retrieval has been used by cah-prepro to preprocess 400M image+text pairs for the dataset, and by other projects such as autofaiss and antarctic-captions. ClipClient allows remote querying of a clip-retrieval backend from Python (see the sketch below). The package is installable with pip.
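A hedged ClipClient sketch, following the project's README as recalled; the backend URL, index name, and result keys are assumptions and may change over time.

```python
# Sketch: querying the hosted LAION index remotely with clip-retrieval's ClipClient.
from clip_retrieval.clip_client import ClipClient

client = ClipClient(
    url="https://knn.laion.ai/knn-service",
    indice_name="laion5B-L-14",
    num_images=10,
)

# The text query is embedded with CLIP and matched against the image index.
results = client.query(text="an oil painting of a lighthouse at night")
for r in results:
    print(r["similarity"], r["url"], r.get("caption"))
```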

clip

https://github.com/rom1504/clip-retrieval

7 months ago
Related:
  • Have I Been Trained? : HaveIBeenTrained is a tool that uses clip retrieval to search the largest public text-to-image datasets, Laion-5B and Laion-400M, to remove links to i...
  • Img2prompt : Get an approximate text prompt, with style, matching an image. Optimized for stable-diffusion (clip ViT-L/14)). The resource is an adapted version of ...
  • Gpt-prompt-engineer : The gpt-prompt-engineer tool is a powerful solution for prompt engineering, enabling users to experiment and find the optimal prompt for GPT-4 and GPT...
  • Random Drawing Prompt Generator : The random drawing prompt generator provides users with easy drawing ideas by generating a stream of random prompts. The generator is not based on AI ...
  • NeoPrompt Pro - Smart prompt generation tool for AI arts : NeoPrompt is a tool designed to make the creation of AI art more accessible and less time-consuming by providing a comprehensive framework for generat...

Daam: Diffusion attentive attribution maps for interpreting Stable Diffusion
https://github.com/castorini/daam

  • stable_diffusion
  • image2text
  • plugin

Large-scale diffusion neural networks represent a substantial milestone in text-to-image generation, but they remain poorly understood, lacking interpretability analyses. In this paper, we perform a text-image attribution analysis on Stable Diffusion, a recently open-sourced model. To produce pixel-level attribution maps, we upscale and aggregate cross-attention word-pixel scores in the denoising subnetwork, naming our method DAAM. We evaluate its correctness by testing its semantic segmentation ability on nouns, as well as its generalized attribution quality on all parts of speech, rated by humans. We then apply DAAM to study the role of syntax in the pixel space, characterizing head-dependent heat map interaction patterns for ten common dependency relations. Finally, we study several semantic phenomena using DAAM, with a focus on feature entanglement, where we find that cohyponyms worsen generation quality and descriptive adjectives attend too broadly. To our knowledge, we are the first to interpret large diffusion models from a visuolinguistic perspective, which enables future lines of research.
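A usage sketch with the daam package alongside diffusers, following the project's README as recalled; function names and the checkpoint id may differ between versions.

```python
# Sketch: generate with Stable Diffusion while DAAM records cross-attention,
# then overlay the per-word attribution map for one prompt token.
import torch
import matplotlib.pyplot as plt
from diffusers import StableDiffusionPipeline
from daam import trace

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a dog running across a sunny field"
with torch.no_grad(), trace(pipe) as tc:
    out = pipe(prompt, num_inference_steps=30)
    heat_map = tc.compute_global_heat_map()
    word_map = heat_map.compute_word_heat_map("dog")
    word_map.plot_overlay(out.images[0])     # attribution map for the word "dog"
plt.show()
```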

daam

7 months ago
Related:
  • The Stability Photoshop Plugin : The Stability Photoshop plugin enables users to generate and edit images using both Stable Diffusion and DALL•E 2 directly within Photoshop. The plugi...
  • Prompt translate script for AUTOMATIC1111/stable-diffusion-webui : 'Prompt translate' script for AUTOMATIC1111/stable-diffusion-webui translate prompt. This script allows you to write a query in promt query in your na...
  • ControlNet extension for AUTOMATIC1111's Stable Diffusion web UI : ControlNet, a neural network structure that adds extra conditions to diffusion models to control them. ControlNet copies the weights of neural network...
  • Tagger - Script for AUTOMATIC1111/stable-diffusion-webui : A script for AUTOMATIC1111/stable-diffusion-webui that allows users to quickly add tags from a list to their prompt. The script adds a separate textbo...
  • ControlNetMediaPipeFace : This dataset is designed to train a ControlNet with human facial expressions. It includes keypoints for pupils to allow gaze direction. Training has b...

DALL·E 2
https://openai.com/dall-e-2/

  • image2text
  • openai
  • image_generation

DALL-E and DALL-E 2 are deep learning models developed by OpenAI that generate digital images from natural language descriptions. DALL-E can generate imagery in multiple styles and manipulate and rearrange objects in its images. It can infer appropriate details without specific prompts, and exhibit a broad understanding of visual and design trends. DALL-E 2 can produce "variations" of existing images and edit images to modify or expand upon them.

7 months ago
Related:
  • IF by DeepFloyd Lab : DeepFloyd IF is a text-to-image model that utilizes the large language model T5-XXL-1.1 as a text encoder to generates intelligible and coherent image...
  • Cheetah : Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions : Cheetor, a Transformer-based multi-modal large language model equipped with controllable knowledge re-injection. Cheetor demonstrates strong capabilit...
  • UniControl: Unified Controllable Visual Generation Model : UniControl is a new generative model that combines many condition-to-image (C2I) tasks into one framework while still allowing for language prompts. T...
  • MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models : MiniGPT-4, a vision-language model that aligns a frozen visual encoder with a frozen large language model (LLM) using one projection layer. The author...
  • Daam: Diffusion attentive attribution maps for interpreting Stable Diffusion. : Large-scale diffusion neural networks represent a substantial milestone in text-to-image generation, but they remain poorly understood, lacking inter...

Versatile Diffusion
https://shi-labs-versatile-diffusion.hf.space/?__theme=light

  • gradio
  • generator
  • image2text
  • text2image

Versatile Diffusion (VD) is the first unified multi-flow multimodal diffusion framework, a step towards universal generative AI. VD can natively support image-to-text, image-variation, text-to-image, and text-variation, and can be further extended to other applications such as semantic-style disentanglement, image-text dual-guided generation, latent image-to-text-to-image editing, and more. Future versions will support more modalities such as speech, music, video, and 3D.

Xingqian Xu, Atlas Wang, Eric Zhang, Kai Wang, and Humphrey Shi [arXiv] [GitHub]

7 months ago
Related:
  • Hard Prompts Made Easy (PEZ) : The strength of modern generative models lies in their ability to be controlled through text-based prompts. Typical "hard" prompts are made from inte...
  • OpenFlamingo-9B Demo : OpenFlamingo is a new tool that helps computers learn how to understand pictures and words together. The OpenFlamingo project aims to develop a multi...
  • Long-form text-to-images generation (GPT-3 and Stable Diffusion) : Long Stable Diffusion is a pipeline of generative models that can be used to illustrate a full story. Currently, Stable Diffusion can only take in a s...
  • IF by DeepFloyd Lab : DeepFloyd IF is a text-to-image model that utilizes the large language model T5-XXL-1.1 as a text encoder to generates intelligible and coherent image...
  • 1111101000 Robots - Ben Barry / A book of 1000 paintings and illustrations of robots created by artificial intelligence. : A book of 1000 paintings and illustrations of robots created by artificial intelligence. The author generated all of the images in this book by writin...

Img2prompt
https://replicate.com/methexis-inc/img2prompt

  • image2text
  • standalone
  • prompt
  • tool
  • capturing_concepts

Get an approximate text prompt, with style, matching an image. Optimized for Stable Diffusion (CLIP ViT-L/14). The resource is an adapted version of the CLIP Interrogator notebook by @pharmapsychotic, which uses OpenAI CLIP models to analyze an image's content and suggest text prompts for creating similar images. The results are combined with a BLIP caption to produce the suggested text prompts.

Example:

a cat wearing a suit and tie with green eyes, a stock photo by Hanns Katz, pexels, furry art, stockphoto, creative commons attribution, quantum wavetracing
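For a similar result without the hosted endpoint, a hedged local sketch with pharmapsychotic's clip-interrogator package (the notebook this Replicate model adapts); the Config/Interrogator names and the CLIP model string are assumptions based on that package, not the Replicate API.

```python
# Sketch: producing an img2prompt-style prompt locally with clip-interrogator,
# which combines a BLIP caption with CLIP-ranked style and artist modifiers.
from PIL import Image
from clip_interrogator import Config, Interrogator

ci = Interrogator(Config(clip_model_name="ViT-L-14/openai"))  # matches SD 1.x CLIP

image = Image.open("cat_in_suit.jpg").convert("RGB")
prompt = ci.interrogate(image)
print(prompt)
```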

7 months ago
Related:
  • Clip retrieval - converting the text query to a CLIP embedding : The clip-retrieval package allows for easy computing of clip embeddings and building of a clip retrieval system. It can be used to quickly compute ima...
  • Random Drawing Prompt Generator : The random drawing prompt generator provides users with easy drawing ideas by generating a stream of random prompts. The generator is not based on AI ...
  • CLIP Interrogator 2.1 : Want to figure out what a good prompt might be to create new images like an existing one? The CLIP Interrogator is here to get you answers! This vers...
  • NeoPrompt Pro - Smart prompt generation tool for AI arts : NeoPrompt is a tool designed to make the creation of AI art more accessible and less time-consuming by providing a comprehensive framework for generat...
  • Gpt-prompt-engineer : The gpt-prompt-engineer tool is a powerful solution for prompt engineering, enabling users to experiment and find the optimal prompt for GPT-4 and GPT...

