LLaVA-Plus (https://llava-vl.github.io/llava-plus/)
LLaVA-Plus introduces a versatile multimodal assistant that extends LLaVA with a diverse set of external tools for real-world task fulfillment. It maintains a skill repository of pre-trained vision and vision-language models, activates the relevant tools based on the user's multimodal input, and dynamically composes their execution results. In a typical interaction, the user provides a task instruction related to an image; the assistant analyzes the input, selects and executes the appropriate tool, and returns the aggregated result.
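As a rough illustration of this plan-then-execute loop, here is a minimal Python sketch. The skill registry, tool names, and function signatures below are hypothetical stand-ins for exposition, not the actual LLaVA-Plus API; in the real system the planning step is performed by the multimodal LLM itself rather than by keyword rules.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical skill repository: tool name -> callable taking an image path and
# the user instruction, returning a textual result. In LLaVA-Plus the skills
# are pre-trained vision / vision-language models (detection, segmentation,
# captioning, editing, ...); stubs stand in for them here.
SKILLS: Dict[str, Callable[[str, str], str]] = {
    "detect_objects": lambda image, instr: f"stub: bounding boxes in {image}",
    "segment":        lambda image, instr: f"stub: masks for {image}",
    "caption":        lambda image, instr: f"stub: caption for {image}",
}

@dataclass
class ToolCall:
    name: str
    image: str

def plan(instruction: str, image: str) -> ToolCall:
    """Stand-in planner: LLaVA-Plus uses the LMM to pick a skill from the
    multimodal input; keyword matching here is only a placeholder."""
    text = instruction.lower()
    if "find" in text or "where" in text:
        return ToolCall("detect_objects", image)
    if "outline" in text or "segment" in text:
        return ToolCall("segment", image)
    return ToolCall("caption", image)

def answer(instruction: str, image: str) -> str:
    call = plan(instruction, image)                    # 1. analyze input, select a tool
    result = SKILLS[call.name](call.image, instruction)  # 2. execute the tool
    # 3. aggregate: the LMM would normally rewrite the raw tool output into a
    # natural-language reply; here we simply label and return it.
    return f"[{call.name}] {result}"

print(answer("Find the dog in this photo", "photo.jpg"))
```

The design choice worth noting is the separation of concerns: the skill repository is an extensible mapping that new tools can be registered into, while the planner only decides *which* skill to invoke, so adding a capability does not require retraining the whole assistant.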