LLaVA: Large Language and Vision Assistant
https://github.com/haotian-liu/LLaVA
The paper extends instruction tuning with machine-generated instruction-following data, an approach that has improved the zero-shot capabilities of LLMs on new tasks in the language domain, to the multimodal setting. The authors introduce LLaVA (Large Language and Vision Assistant), an end-to-end trained multimodal model that connects a vision encoder to an LLM for general-purpose visual and language understanding. In early experiments the model demonstrates impressive multimodal chat abilities and reaches an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, the combination of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%.
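The sketch below is a toy-scale illustration of the vision-encoder-plus-LLM connection described above: patch features from a vision encoder are mapped by a trainable projection into the LLM's word-embedding space and prepended to the text token embeddings. All class names, dimensions, and the toy encoder/LLM are illustrative placeholders, not the repository's actual code, which pairs a CLIP ViT-L/14 encoder with a Vicuna LLM.

```python
# Toy sketch of a LLaVA-style architecture (assumed structure, not the authors' code):
# vision encoder -> linear projection -> visual tokens prepended to text embeddings -> LLM.
import torch
import torch.nn as nn


class ToyVisionEncoder(nn.Module):
    """Stand-in for a CLIP-style ViT: maps an image to a sequence of patch features."""
    def __init__(self, num_patches=16, vision_dim=256):
        super().__init__()
        self.num_patches = num_patches
        self.proj = nn.Linear(3 * 16 * 16, vision_dim)  # fake patch embedding

    def forward(self, images):  # images: (B, 3, 64, 64)
        b = images.shape[0]
        patches = images.reshape(b, self.num_patches, -1)   # (B, num_patches, 768)
        return self.proj(patches)                            # (B, num_patches, vision_dim)


class ToyLLM(nn.Module):
    """Stand-in for a decoder-only LLM that can consume input embeddings directly."""
    def __init__(self, hidden_dim=512, vocab_size=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        layer = nn.TransformerEncoderLayer(hidden_dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, inputs_embeds):                         # (B, T, hidden_dim)
        return self.lm_head(self.blocks(inputs_embeds))       # (B, T, vocab_size)


class LlavaStyleModel(nn.Module):
    def __init__(self, vision_dim=256, hidden_dim=512):
        super().__init__()
        self.vision_encoder = ToyVisionEncoder(vision_dim=vision_dim)
        # Trainable connector between modalities; the original LLaVA paper uses
        # a simple linear projection here.
        self.projector = nn.Linear(vision_dim, hidden_dim)
        self.llm = ToyLLM(hidden_dim=hidden_dim)

    def forward(self, images, input_ids):
        visual_feats = self.vision_encoder(images)            # (B, P, vision_dim)
        visual_tokens = self.projector(visual_feats)          # (B, P, hidden_dim)
        text_tokens = self.llm.embed(input_ids)               # (B, T, hidden_dim)
        inputs_embeds = torch.cat([visual_tokens, text_tokens], dim=1)
        return self.llm(inputs_embeds)                        # logits over the vocabulary


if __name__ == "__main__":
    model = LlavaStyleModel()
    images = torch.randn(2, 3, 64, 64)
    input_ids = torch.randint(0, 1000, (2, 8))
    logits = model(images, input_ids)
    print(logits.shape)  # torch.Size([2, 24, 1000]): 16 visual + 8 text positions
```

In this reading, the vision encoder is kept frozen while the projection (and later the LLM) is trained on the machine-generated instruction-following data, which is consistent with the two-stage training the paper describes; the toy modules above exist only to make the data flow concrete.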