The platform for code-switched audio-visual question answering (CSAVQA) supports queries in English and Hinglish, addressing the limited linguistic diversity of existing AVQA systems. To support it, a new dataset was developed that uses audio-visual cues for context-aware machine comprehension. The system is built on a transformer-based framework designed to process code-switched queries efficiently. The model architecture also applies hierarchical fusion, which prioritizes modality-specific features before combining them and thereby improves accuracy.
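As an illustration of what hierarchical fusion could look like, the following is a minimal sketch, not the architecture described above. It assumes each modality has already been encoded into a fixed-size vector, and every name in it (`attend`, `hierarchical_fusion`, the toy inputs) is hypothetical. Stage 1 weights the audio and visual features by their similarity to the question; stage 2 concatenates the resulting audio-visual summary with the question embedding.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def attend(query: Vector, candidates: Sequence[Vector]) -> Tuple[Vector, List[float]]:
    """Weight each candidate by scaled dot-product similarity to the query.

    Returns the weighted sum of the candidates and the weights themselves.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, c) / scale for c in candidates])
    fused = [
        sum(w * c[i] for w, c in zip(weights, candidates))
        for i in range(len(query))
    ]
    return fused, weights


def hierarchical_fusion(
    question: Vector, audio: Vector, visual: Vector
) -> Tuple[Vector, List[float]]:
    """Two-stage fusion (hypothetical sketch).

    Stage 1: question-guided weighting of the audio and visual features.
    Stage 2: concatenate the question with the audio-visual summary.
    """
    av_summary, modality_weights = attend(question, [audio, visual])
    joint = list(question) + av_summary  # list concatenation, length 2 * d
    return joint, modality_weights
```

With a toy question vector `[1.0, 0.0]`, audio `[1.0, 0.0]`, and visual `[0.0, 1.0]`, the audio feature receives the larger weight because it is more similar to the question. In a trained model the similarity scores would come from learned projections rather than raw dot products.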