The growth of video content on platforms such as YouTube has created a need for efficient multilingual video summarization. This paper introduces ZETA (Zonal Encoding for Transformer Adaptation), a tool that combines transformer models, retrieval-augmented generation (RAG), and graph-based keyword extraction to produce accurate, scalable summaries. ZETA uses Whisper for multilingual transcription and integrates both abstractive (e.g., PEGASUS) and extractive (e.g., BERTSUM) summarization. RAG improves speed and accuracy by retrieving the relevant transcript sections, while graph-based keyword extraction (e.g., YAKE, Gensim) highlights key topics, giving quick access to essential insights.

Keywords: Multilingual Summarization, Automatic Speech Recognition, Transformer Models, Retrieval-Augmented Generation, YouTube, RAG, Graph-Based Approach
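
To make the retrieval step concrete, the following is a minimal sketch of how relevant transcript sections might be selected before summarization. It is not ZETA's actual retriever: it assumes the transcript is already available as a list of text segments, and it swaps in a simple bag-of-words cosine similarity for whatever embedding model a real RAG pipeline would use. The function names `tokenize`, `cosine`, and `retrieve` are hypothetical and chosen for illustration only.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (Unicode-aware)."""
    return re.findall(r"\w+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counter objects)."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, segments, k=2):
    """Return the k segments most similar to the query.

    Assumption: bag-of-words similarity stands in for a learned
    embedding model; only the ranking logic is illustrated here.
    """
    query_vec = Counter(tokenize(query))
    ranked = sorted(
        segments,
        key=lambda seg: cosine(query_vec, Counter(tokenize(seg))),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    # Toy transcript segments, invented for this example.
    transcript = [
        "the model is trained on speech data",
        "today we cook pasta with tomato sauce",
        "whisper transcribes speech in many languages",
    ]
    for segment in retrieve("speech transcription languages", transcript, k=2):
        print(segment)
```

In a pipeline of the kind described, the retrieved segments, rather than the full transcript, would then be passed to the abstractive or extractive summarizer, which is where the speed benefit would come from.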