SD-CN-Animation offers automated video stylization and text-to-video generation using Stable Diffusion and ControlNet, with various Stable Diffusion models serving as backbones. The project incorporates the RAFT optical flow estimation algorithm to keep animations stable and to generate occlusion masks used during frame generation. In text-to-video mode, it uses the 'FloweR' method to predict optical flow from previous frames. The ControlNet model is recommended for better results in vid2vid mode.
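As a rough illustration of the occlusion-mask idea, the sketch below estimates flow with torchvision's pretrained RAFT and flags pixels where forward and backward flow disagree. It is a simplified stand-in under assumed conventions, not the project's actual code.

```python
# Hypothetical sketch: optical flow with torchvision's RAFT, plus a crude
# occlusion mask from forward-backward disagreement (not SD-CN-Animation's code).
import torch
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

model = raft_large(weights=Raft_Large_Weights.DEFAULT).eval()

def occlusion_mask(frame_a, frame_b, threshold=1.0):
    """frame_a, frame_b: float tensors (1, 3, H, W), normalized to [-1, 1]."""
    with torch.no_grad():
        fwd = model(frame_a, frame_b)[-1]  # final flow estimate a -> b, (1, 2, H, W)
        bwd = model(frame_b, frame_a)[-1]  # final flow estimate b -> a
    # Where forward and backward flow fail to cancel out, the pixel is likely
    # occluded. (A full consistency check would warp bwd by fwd first; this
    # simplification is for brevity.)
    error = (fwd + bwd).norm(dim=1)
    return error > threshold  # boolean mask of probably-occluded pixels
```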
ImageBind is the first AI model capable of binding data from six modalities at once, without the need for explicit supervision. By learning the relationships between images and video, audio, text, depth, thermal, and inertial measurement unit (IMU) data, it helps advance AI by enabling machines to analyze many different forms of information together.
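A usage sketch following the ImageBind repository's published README; treat the module paths and loader names as assumptions if the package layout has changed since.

```python
# Embed text, an image, and audio into ImageBind's shared space, then compare
# them across modalities. File paths are placeholders.
import torch
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

device = "cuda" if torch.cuda.is_available() else "cpu"
model = imagebind_model.imagebind_huge(pretrained=True).eval().to(device)

inputs = {
    ModalityType.TEXT: data.load_and_transform_text(["a dog barking"], device),
    ModalityType.VISION: data.load_and_transform_vision_data(["dog.jpg"], device),
    ModalityType.AUDIO: data.load_and_transform_audio_data(["bark.wav"], device),
}
with torch.no_grad():
    embeddings = model(inputs)  # one embedding per modality, in a joint space

# Cross-modal similarity: text vs. audio without any paired supervision.
sim = embeddings[ModalityType.TEXT] @ embeddings[ModalityType.AUDIO].T
print(sim)
```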
UnpromptedControl is a tool for guiding StableDiffusion models in image restoration and object removal tasks. By leveraging a simple hack, it restores or removes objects without requiring user prompts, making the process more efficient. The tool uses ControlNet and StableDiffusionInpaintPipeline models to guide the inpainting process and restore the image to a more natural-looking state. The algorithm currently has limitations when processing images of people's faces and bodies.
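For context, prompt-free object removal can be approximated with diffusers' standard inpainting pipeline called with an empty prompt; the sketch below illustrates that idea only and omits the tool's ControlNet guidance.

```python
# Illustrative only: empty-prompt inpainting with diffusers. UnpromptedControl's
# actual hack additionally combines this with ControlNet.
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

image = Image.open("photo.png").convert("RGB")
mask = Image.open("object_mask.png").convert("L")  # white = region to remove

# No descriptive prompt: the model fills the masked region from image context.
result = pipe(prompt="", image=image, mask_image=mask).images[0]
result.save("object_removed.png")
```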
Promptsandbox.io is a node-based visual canvas used for creating chatbots powered by OpenAI APIs. The platform has an intuitive drag-and-drop interface for creating dynamic chains of nodes that perform specific operations as part of the workflow. It is built using React and has various features such as integration with OpenAI APIs, document upload and retrieval, support for various block types, debugging tools, a chatbot gallery, and extensibility with additional node types. Promptsandbox.io provides a seamless experience for users to work with OpenAI APIs and build more complex chatbots.
gpt-assistant is an experiment in which an autonomous GPT (Generative Pre-trained Transformer) agent is given access to a browser to perform tasks, such as adding text to a webpage or making restaurant reservations. It requires Node.js, an OpenAI API key, and a Postgres database.
DeepFloyd IF is a text-to-image model that uses the large language model T5-XXL-1.1 as a text encoder to generate intelligible, coherent images together with legible text. The model can incorporate text into images, achieves a high degree of photorealism, and can generate images with non-standard aspect ratios. It can also modify style, patterns, and details in images without fine-tuning. DeepFloyd IF is modular, cascaded, and works in pixel space, using diffusion models that inject random noise into data and then reverse the process to generate new data samples from the noise.
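A minimal sketch of running the first cascade stage through Hugging Face diffusers, assuming the model ID and keyword arguments from the diffusers documentation:

```python
# Stage I generates a small 64x64 image directly in pixel space; later cascade
# stages upscale it (stage II to 256x256, then a x4 upscaler).
import torch
from diffusers import DiffusionPipeline

stage_1 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16
).to("cuda")

prompt = 'a photo of a corgi holding a sign that reads "IF"'
prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)  # T5-XXL encoding
image = stage_1(
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    output_type="pt",
).images
```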
Inpaint Anything is a tool that inpaints images, videos, and 3D scenes, letting users remove, fill, or replace objects with just a few clicks. It leverages advanced vision models such as the Segment Anything Model (SAM), LaMa, and Stable Diffusion (SD) to achieve these tasks. With support for multiple aspect ratios and resolutions up to 2K, it offers a user-friendly interface across all three modalities. The tool continues to gain features and functionality, making it an accessible and powerful solution for users seeking advanced inpainting capabilities.
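The "few clicks" workflow starts by turning a click into a mask with SAM. A rough sketch of that first step, with a placeholder checkpoint path and click coordinates:

```python
# Click-to-mask with Meta's segment_anything package; the resulting mask is
# then handed to LaMa (remove) or Stable Diffusion inpainting (fill/replace).
import numpy as np
from PIL import Image
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

image = np.array(Image.open("photo.png").convert("RGB"))  # HxWx3 uint8
predictor.set_image(image)

masks, scores, _ = predictor.predict(
    point_coords=np.array([[420, 310]]),  # the clicked pixel (placeholder)
    point_labels=np.array([1]),           # 1 = foreground click
    multimask_output=True,
)
best_mask = masks[scores.argmax()]  # feed this to the inpainting backend
```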
WizardLM is a pre-trained language model that can follow complex instructions, built using Evol-Instruct, a method that uses language models instead of humans to automatically produce open-domain instructions of varying difficulty. WizardLM is still in development and will continue to improve through larger-scale training, more training data, and more advanced large-model training methods. To fine-tune the model, the authors used alpaca_evol_instruct_70k.json, which contains 70K instruction-following examples generated with Evol-Instruct. In human evaluation, WizardLM achieved significantly better results than the Alpaca and Vicuna-7b models on diverse user-oriented instructions, including difficult code generation, debugging, math, reasoning, complex formats, academic writing, and a wide range of disciplines. In the high-difficulty section of the human evaluation test set, WizardLM even outperforms ChatGPT, indicating significant potential for handling complex instructions.
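A toy sketch of the Evol-Instruct idea, with a paraphrased prompt template (not the paper's exact wording) and a generic `complete` callable standing in for any LLM API:

```python
# Evolve a seed instruction into progressively harder variants by repeatedly
# asking an LLM to rewrite it. `complete(prompt) -> str` is any LLM call.
EVOLVE_TEMPLATE = (
    "Rewrite the following instruction into a more complex version that a "
    "human could still understand and answer. Add one extra constraint or "
    "reasoning step. Return only the rewritten instruction.\n\n"
    "Instruction: {instruction}"
)

def evolve(instruction: str, complete, rounds: int = 3) -> list[str]:
    """Return a list of instructions of increasing difficulty."""
    pool = [instruction]
    for _ in range(rounds):
        pool.append(complete(EVOLVE_TEMPLATE.format(instruction=pool[-1])))
    return pool
```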
Hugging Face, a company and AI community that provides free open source tools for machine learning and AI apps, has released HuggingChat, an open source ChatGPT clone that anyone can use or download. The app is based on the Open Assistant conversational AI model by the Large-scale Artificial Intelligence Open Network (LAION), a global non-profit organization dedicated to democratizing ML research and its applications. HuggingChat was trained on the OpenAssistant Conversations Dataset (OASST1), a high-quality human-annotated dataset collected up to April 12, 2023 for reinforcement learning from human feedback. The dataset is the product of a worldwide crowdsourcing effort by over 13,000 volunteers.
Token Merging (ToMe) is a technique used to speed up transformers by merging redundant tokens, which helps reduce the workload for the transformer without compromising quality. The technique is applied to the underlying transformer blocks in Stable Diffusion, minimizing quality loss while preserving the speed-up and memory benefits. It works without training and can be used for any Stable Diffusion model, reducing the workload by up to 60%. ToMe for SD is not another efficient reimplementation of transformer modules, but an actual reduction of the total workload required to generate an image. The results of ToMe for SD show that it produces images similar to the originals, while being faster and using less memory, making it an efficient tool for image generation.
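The authors ship the patch as the tomesd package; a usage sketch, assuming the package's documented `apply_patch` entry point:

```python
# Patch a Stable Diffusion pipeline with Token Merging: no retraining, same
# weights, fewer tokens flowing through the transformer blocks.
import torch
import tomesd
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Merge up to 50% of tokens inside the attention blocks.
tomesd.apply_patch(pipe, ratio=0.5)

image = pipe("a photograph of an astronaut riding a horse").images[0]
image.save("astronaut.png")
```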
Safe & Stable is a user-friendly tool for converting Stable Diffusion checkpoint files (.ckpt) to the safer, more secure .safetensors format for tensor storage. The new format enhances security by preventing the execution of malicious Python code hidden in pickle files, while also improving model-loading performance on both CPUs and GPUs. The tool's graphical interface simplifies file selection and monitors conversion progress. Although the initial conversion still requires .ckpt data, future models will be distributed exclusively in the .safetensors format, eliminating the need to scan or convert potentially harmful pickle files.
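Independent of the GUI, the core conversion can be sketched in a few lines. Note that loading a .ckpt still unpickles it, so only do this with files you trust:

```python
# Load a pickle-based checkpoint and re-save its tensors as .safetensors,
# which contains no executable code and memory-maps quickly at load time.
import torch
from safetensors.torch import save_file

state = torch.load("model.ckpt", map_location="cpu")   # unpickles: trusted files only
state = state.get("state_dict", state)                 # some checkpoints nest weights
tensors = {
    k: v.contiguous() for k, v in state.items() if isinstance(v, torch.Tensor)
}
save_file(tensors, "model.safetensors")
```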
Ensuring that diffusion-model-generated images are free of undesirable content and copyrighted material is a serious concern. Previous methods for blocking such content could be easily bypassed, but this new technique fine-tunes the model weights to erase concepts permanently, without retraining the entire model. It works by using the model's own knowledge to guide the output away from the targeted concept. The authors tested the technique on erasing artistic styles and nudity from images and found it more effective at erasing targeted concepts than previous methods. For large concepts, however, there may be a trade-off between complete erasure and interference with other concepts.
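A sketch of the underlying idea as I read the paper: classifier-free guidance with a negative scale is distilled into the fine-tuned weights, so the model's prediction for the erased concept is pushed away from it. Notation here is assumed, not quoted:

```latex
% Fine-tuned weights \theta^* are trained so that, for the erased concept c,
% the noise prediction matches a negatively guided target built from the
% frozen original model \epsilon_\theta:
\[
  \epsilon_{\theta^*}(x_t, c, t) \;\leftarrow\;
  \epsilon_\theta(x_t, t)
  \;-\; \eta \,\bigl[\, \epsilon_\theta(x_t, c, t) - \epsilon_\theta(x_t, t) \,\bigr]
\]
% i.e. guidance away from c with scale \eta, baked into the weights so no
% prompt filtering is needed at inference time.
```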
Paella is an easy-to-use text-to-image model that can turn text into pictures. It was inspired by earlier models but has simpler code for training and sampling. During training, it "noises" images by randomly replacing visual elements with others from a library, and then tries to predict the original elements. During sampling, the model creates a distribution over each element and then selects one at random to build up the final image. Paella is designed to make text-to-image models more accessible to non-experts in the field by simplifying the technical components.
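A toy sketch of the training loop described above, operating on grids of quantized image tokens; the names and shapes are illustrative, not Paella's code:

```python
# Paella-style token noising: replace a random fraction of visual tokens with
# random codebook entries, then train the model to predict the originals.
import torch
import torch.nn.functional as F

def noise_tokens(tokens, t, codebook_size):
    """Replace a t-fraction of token indices with random codebook entries."""
    mask = torch.rand_like(tokens, dtype=torch.float) < t
    random_tokens = torch.randint_like(tokens, codebook_size)
    return torch.where(mask, random_tokens, tokens)

def train_step(model, tokens, codebook_size, optimizer):
    t = torch.rand(())                         # noise level for this step
    noised = noise_tokens(tokens, t, codebook_size)
    logits = model(noised, t)                  # (batch, seq, codebook_size)
    loss = F.cross_entropy(logits.flatten(0, -2), tokens.flatten())
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```

Sampling reverses this: starting from random tokens, the model repeatedly produces a distribution over the codebook for each position and samples from it, gradually building up the final image.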
LocalAI is an API that serves as a drop-in replacement for OpenAI's, supporting various models and running on consumer-grade hardware. It supports ggml-compatible models such as LLaMA, alpaca, gpt4all, vicuna, koala, gpt4all-j, and cerebras. It uses C bindings for faster inference and ships as a container image that can be run with docker-compose. The API can be used to run text generation as a service, following the OpenAI reference.
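Because LocalAI follows the OpenAI reference, the standard client can simply be pointed at the container. The port and model name below are assumptions from a typical docker-compose setup, and the snippet uses the pre-1.0 openai client API:

```python
# Talk to a local LocalAI container through the OpenAI Python client.
import openai

openai.api_base = "http://localhost:8080/v1"  # the LocalAI endpoint, not api.openai.com
openai.api_key = "not-needed"                 # LocalAI ignores the key by default

resp = openai.Completion.create(
    model="ggml-gpt4all-j",                   # any ggml model in LocalAI's models dir
    prompt="Say hello in one sentence.",
    max_tokens=32,
)
print(resp["choices"][0]["text"])
```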
MiniChain is a library used to link prompts together in a sequence, with the ability to manipulate and visualize them using Gradio. Users can ensure the prompt output matches specific criteria through the use of data classes and typed prompts. The library does not manage documents or provide tools, but suggests using the Hugging Face Datasets library for that purpose. Additionally, users can include their own backends.
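A chaining sketch loosely following the project's README; the decorator signature and the lazy `.run()` call are assumptions that may differ between MiniChain versions:

```python
# Two prompts linked in sequence; MiniChain records the chain so it can be
# visualized with Gradio.
from minichain import prompt, OpenAI

@prompt(OpenAI())
def translate(model, text):
    return model(f"Translate to English: {text}")

@prompt(OpenAI())
def summarize(model, text):
    return model(f"Summarize in one sentence: {text}")

# Prompts compose like ordinary functions.
result = summarize(translate("Un long texte en français..."))
print(result.run())
```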
The AutoGPT website offers an AI buddy that can be set up with initial roles and goals and then run without human supervision. The tool automatically leverages all available resources to achieve the goal you set. Inspired by Auto-GPT, it features internet access for information gathering and searches, and lets users save chat history, credentials, and the AI's definition directly in the browser.
The paper explores the potential for autonomous cooperation among conversational language models without relying heavily on human input. The proposed framework, named role-playing, uses inception prompting to direct chat agents toward tasks that align with human intentions while maintaining consistency. The framework produces conversational data for investigating the behaviors and capabilities of language models, yielding a valuable resource for studying conversational agents. The authors' contributions include a novel communicative agent framework, a scalable approach for investigating multi-agent systems, and an open library for further research on communicative agents.
The article presents a novel approach to large multimodal language models using machine-generated instruction-following data, which has shown promise in improving zero-shot capabilities on tasks in the language domain. The authors introduce LLaVA (Large Language-and-Vision Assistant), an end-to-end trained multimodal model that combines a vision encoder and LLM for general-purpose visual and language understanding. The model demonstrated impressive multimodal chat abilities in early experiments and yields an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, LLaVA and GPT-4 achieved a new state-of-the-art accuracy of 92.53%.
MiniGPT-4 is a vision-language model that aligns a frozen visual encoder with a frozen large language model (LLM) using a single projection layer. The authors trained MiniGPT-4 in two stages, the first using 5 million aligned image-text pairs for traditional pretraining. To address generation issues, they proposed a novel approach using a small, high-quality dataset of image-text pairs curated with ChatGPT. The second stage fine-tuned the model on this dataset with a conversation template to improve generation reliability and overall usability. The results show that MiniGPT-4 possesses capabilities similar to GPT-4, such as generating detailed image descriptions and creating websites from handwritten drafts, as well as other emergent capabilities, like writing stories and poems based on images and teaching users how to cook from food photos. The method is computationally efficient and highlights the potential of advanced large language models for vision-language understanding.
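Conceptually, the trainable part is tiny. A sketch of the single projection layer, with illustrative dimensions:

```python
# The only trained component in this design: a linear map from the frozen
# visual encoder's feature space into the frozen LLM's embedding space.
import torch.nn as nn

class VisionToLLMProjection(nn.Module):
    def __init__(self, vision_dim=768, llm_dim=4096):  # dims are illustrative
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, visual_features):
        # visual_features: (batch, num_tokens, vision_dim) from the frozen encoder.
        # Output: (batch, num_tokens, llm_dim) pseudo-word embeddings that are
        # prepended to the text embeddings fed into the frozen LLM.
        return self.proj(visual_features)
```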
Simple LLM Finetuner is a beginner-friendly interface designed to make it easy to fine-tune various language models with the LoRA method via the PEFT library. With this intuitive UI, users can manage datasets, customize parameters, train, and evaluate the model's inference capabilities. Users can paste a dataset into the UI, specify a new LoRA adapter name in the "New PEFT Adapter Name" textbox, adjust the max sequence length and batch size to fit their GPU memory, and click train. The resulting model is saved in the lora/ directory. The UI includes explanations for each parameter. The requirements are Linux or WSL and a modern NVIDIA GPU with >= 16 GB of VRAM.
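Under the hood this corresponds roughly to wrapping a base model with a PEFT LoRA adapter; the model name and hyperparameters below are illustrative defaults, not the tool's exact settings:

```python
# Attach a LoRA adapter to a causal LM so only the small adapter matrices train.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("decapoda-research/llama-7b-hf")
config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections, LLaMA naming
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # a fraction of a percent of total weights
```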