A Culturally Aware Multimodal AI Model

Vansh Kumar

Abstract
This paper introduces Vision, a novel 175-billion-parameter multimodal AI model. Unlike existing models, Vision is trained from scratch to natively understand text, images, video, and audio and to generate both text and images. Developed with a focus on incorporating Indian context, values, and culture, Vision aims to provide users with a culturally relevant AI experience. A unique security feature allows generated images to be traced back to Vision, mitigating concerns about potential misuse for misinformation. Evaluations on standard benchmarks demonstrate that Vision achieves state-of-the-art performance across a diverse range of tasks, including reasoning, mathematical problem solving, code generation, and image understanding. Furthermore, Vision exhibits remarkable proficiency in multilingual chat, supporting a wide array of global languages as well as regional Indian languages such as Hindi, Punjabi, and Marathi. We believe Vision represents a significant step toward building more inclusive and culturally relevant AI systems, with the potential to positively impact various domains in India and beyond.