LLaVA-Plus introduces a versatile multimodal assistant that extends LLaVA with a diverse set of external tools for real-world tasks. It maintains a skill repository of pre-trained vision and vision-language models and activates the relevant tools on the fly based on the user's multimodal input. In use, a human provides an image and a task instruction, the assistant analyzes the input, selects and executes the appropriate tools, and returns the aggregated result.
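A minimal sketch of that tool-use loop, under the assumption of a planner that maps an instruction to one tool from a skill repository. All names here (SKILL_REPOSITORY, plan_tool_call, answer) are hypothetical stand-ins, not the released LLaVA-Plus API.

```python
from typing import Callable, Dict, Tuple

# Hypothetical skill repository: tool name -> callable taking (image, instruction)
SKILL_REPOSITORY: Dict[str, Callable[[str, str], str]] = {
    "detect": lambda image, text: f"bounding boxes for '{text}' in {image}",
    "segment": lambda image, text: f"masks for '{text}' in {image}",
    "caption": lambda image, text: f"caption of {image}",
}

def plan_tool_call(instruction: str) -> Tuple[str, str]:
    """Stand-in for the LLM planner: choose a tool and its arguments."""
    if "where" in instruction.lower():
        return "detect", instruction
    if "outline" in instruction.lower():
        return "segment", instruction
    return "caption", instruction

def answer(image: str, instruction: str) -> str:
    tool, args = plan_tool_call(instruction)        # 1. analyze input, select a tool
    result = SKILL_REPOSITORY[tool](image, args)    # 2. execute the tool
    return f"Based on {tool}: {result}"             # 3. aggregate into the reply

print(answer("photo.jpg", "Where is the dog?"))
```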
Composable Diffusion (CoDi) is a generative model that can produce different types of outputs (language, images, video, or audio) from various combinations of inputs. It can generate multiple outputs at the same time and is not restricted to particular input modalities. Even without training data for every combination of modalities, CoDi aligns inputs and outputs so it can generate any combination of them. It does this by building a shared multimodal conditioning space, which allows synchronized generation of intertwined modalities such as video with matching audio.
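A rough sketch of the shared-space idea, assuming each modality encoder is projected into a common conditioning space and conditions from any subset of modalities are combined before generation. The encoders and the averaging combiner below are placeholders, not CoDi's actual networks.

```python
import torch
import torch.nn as nn

DIM = 64  # size of the shared conditioning space

class ToShared(nn.Module):
    """Project a modality-specific embedding into the shared space."""
    def __init__(self, in_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, DIM)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)

text_to_shared = ToShared(32)
audio_to_shared = ToShared(48)

def combine(conditions):
    """Aggregate conditions from any subset of input modalities."""
    return torch.stack(conditions).mean(dim=0)

text_emb = torch.randn(1, 32)    # placeholder text-encoder output
audio_emb = torch.randn(1, 48)   # placeholder audio-encoder output
cond = combine([text_to_shared(text_emb), audio_to_shared(audio_emb)])
print(cond.shape)  # torch.Size([1, 64]): one condition usable by any output decoder
```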
ImageBind is the first AI model capable of binding data from six modalities at once, without the need for explicit supervision. By learning the relationships between these modalities (images and video, audio, text, depth, thermal, and inertial measurement units (IMUs)), it enables machines to analyze many different forms of information together.
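At its core, ImageBind aligns each non-image modality to the image embedding space with a contrastive objective over naturally paired data (e.g. video with its audio, images with depth). Below is a minimal sketch of such a symmetric InfoNCE loss; the encoders producing the embeddings are placeholders, not the released model.

```python
import torch
import torch.nn.functional as F

def infonce(image_emb: torch.Tensor, other_emb: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between image embeddings and another modality's embeddings."""
    img = F.normalize(image_emb, dim=-1)
    oth = F.normalize(other_emb, dim=-1)
    logits = img @ oth.t() / tau              # pairwise cosine similarities, scaled by temperature
    targets = torch.arange(len(img))          # matching pairs sit on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Toy batch: 8 paired (image, audio) embeddings from placeholder encoders
loss = infonce(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```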