Cheetah: Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions
https://github.com/DCDmllm/Cheetah
Cheetor is a Transformer-based multi-modal large language model equipped with controllable knowledge re-injection. It demonstrates strong reasoning over complicated interleaved vision-language instructions: in multi-modal conversations with humans, it can identify connections between images, infer causes and effects, understand metaphorical implications, and make sense of absurd objects.