Divyatha M, Gagana C S, Harshini S, Lavitha P, Dr. Vishwesh J
Abstract
Speak2Summarize is a real-time speech processing system designed to convert lengthy audio and video content into clear, concise summaries. The system integrates automatic speech recognition, noise reduction, translation, and natural language processing to help users quickly understand long lectures, meetings, and online videos without taking notes manually. It accepts several input formats, including uploaded files, YouTube links, and live recordings, and processes them through a streamlined pipeline that extracts the audio, enhances its clarity, and converts the speech into text.
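This section does not name the ingestion tooling, so the following is only a minimal sketch of how such a stage could normalize uploaded files and YouTube links to a common audio format; the use of ffmpeg and yt-dlp, and the file names, are assumptions of this sketch.

```python
import subprocess

def extract_audio(source: str, out_wav: str = "speech.wav") -> str:
    # Normalize any local media file to mono 16 kHz WAV, the input
    # format Whisper-family models expect.
    subprocess.run(
        ["ffmpeg", "-y", "-i", source, "-ac", "1", "-ar", "16000", out_wav],
        check=True,
    )
    return out_wav

def fetch_youtube_audio(url: str) -> None:
    # Download only the audio stream of a YouTube link and convert it
    # to WAV (yt-dlp delegates the conversion to ffmpeg).
    subprocess.run(
        ["yt-dlp", "-x", "--audio-format", "wav", "-o", "speech.%(ext)s", url],
        check=True,
    )
```

Resampling to 16 kHz mono at ingestion time matters because Whisper-family models are trained on 16 kHz input, so later stages can assume a uniform format regardless of source.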
Using a Whisper-based transcription engine, the system produces accurate text even when the audio contains background noise or mixed accents. A language detection module identifies the spoken language, and, when needed, the transcript is translated before summaries are generated. Speak2Summarize offers multiple summary styles to match the user's preference, such as brief overviews, structured paragraphs, and point-wise summaries. The interface is simple and intuitive, allowing users to upload content, view the generated text, and download the summarized output as an organized report.
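For concreteness, here is a minimal sketch of the transcription and language-detection step using the faster-whisper package named in the Conclusion; the model size, quantization, and beam width shown are assumptions, not the system's published configuration.

```python
from faster_whisper import WhisperModel

# "small" with int8 quantization suits a CPU-oriented deployment;
# the model size actually used by the system is an assumption here.
model = WhisperModel("small", device="cpu", compute_type="int8")

# transcribe() returns a lazy segment generator plus metadata that
# includes the detected language and its probability.
segments, info = model.transcribe("speech.wav", beam_size=5)
print(f"Detected language: {info.language} (p={info.language_probability:.2f})")

transcript = " ".join(segment.text.strip() for segment in segments)
```

Because language identification falls out of Whisper decoding itself (info.language), a downstream translation step can be gated on the detected language, routing non-English transcripts through a translation model before summarization.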
This system reduces the time required to review long content and provides a practical solution for students, educators, professionals, and anyone who regularly works with spoken information. By automating transcription and summarization, Speak2Summarize improves productivity, enhances accessibility, and simplifies the process of understanding lengthy audio-visual material.
Keywords
Real-Time Transcription, Speech Recognition, Whisper Model, Natural Language Processing, Audio Preprocessing, Text Summarization, PDF Report Generation, Video-to-Text Conversion
Conclusion
The Speak2Summarize framework was developed to address the growing need for an integrated solution capable of converting spoken content into structured, readable, and concise text in real time. The system combines noise reduction, multilingual transcription, translation, and flexible summarization into a unified workflow supported by a user-friendly interface. Through the implementation of Faster-Whisper for speech recognition and BART-CNN for abstractive summarization, the system performs well on recordings of varying quality, duration, and language conditions. The inclusion of multiple summary formats (concise, structured, and bullet-point) enhances usability across academic, corporate, and research contexts where efficient information extraction is essential. Testing further confirms that the pipeline remains stable under noisy input, network failures, and long-duration audio, ensuring reliable operation under realistic conditions. The integrated PDF export, database storage, and retrieval mechanisms extend the system's relevance by giving users long-term access to their processed content.
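As an illustration of the summarization and export stages, the sketch below pairs the facebook/bart-large-cnn checkpoint (the usual published form of the BART-CNN model named above) with a hypothetical mapping from the three summary styles to decoding parameters and a simple fpdf2 report writer; the length settings, bullet rendering, and PDF library choice are all assumptions rather than the system's actual configuration.

```python
from fpdf import FPDF
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# Hypothetical mapping of the three summary styles to decoding lengths;
# the paper does not publish its exact generation parameters.
STYLES = {
    "concise":    dict(max_length=60,  min_length=20),
    "structured": dict(max_length=180, min_length=80),
    "bullet":     dict(max_length=120, min_length=40),
}

def summarize(transcript: str, style: str = "concise") -> str:
    summary = summarizer(transcript, do_sample=False, **STYLES[style])[0]["summary_text"]
    if style == "bullet":
        # Naive point-wise rendering: one bullet per sentence.
        return "\n".join("- " + s.strip() for s in summary.split(". ") if s.strip())
    return summary

def export_pdf(summary: str, path: str = "summary_report.pdf") -> None:
    # Minimal one-page report; fpdf2 is an assumed library choice.
    pdf = FPDF()
    pdf.add_page()
    pdf.set_font("Helvetica", size=12)
    pdf.multi_cell(0, 8, summary)
    pdf.output(path)
```

Since bart-large-cnn accepts roughly 1,024 input tokens, transcripts of long recordings would need to be chunked and summarized piecewise before a final pass, which is consistent with the minor delays for long inputs noted below.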
Although the system performs effectively, certain limitations remain. The noise-reduction module uses basic smoothing techniques and may struggle in extremely noisy or echo-rich environments. Summarization quality depends on the clarity of the transcript and may vary when the audio contains slang, overlapping speech, or multilingual switching beyond Hindi and English. Real-time processing is optimized for CPU-based systems, yet heavy workloads may introduce minor delays for longer inputs. Despite these limitations, the system proves to be a practical, accessible, and efficient tool for transforming continuous speech into meaningful written information.
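To make the noise-reduction limitation concrete, the following sketch shows a spectral-gating stage of the kind the smoothing-based module could be compared against, using the noisereduce package; the library choice is an assumption, since this section does not name the actual implementation.

```python
import noisereduce as nr
import soundfile as sf

audio, rate = sf.read("speech.wav")
if audio.ndim > 1:
    # Downmix stereo to mono before enhancement.
    audio = audio.mean(axis=1)

# Spectral gating: estimate a noise profile from the recording itself
# and attenuate time-frequency bins that fall below it.
cleaned = nr.reduce_noise(y=audio, sr=rate)
sf.write("speech_clean.wav", cleaned, rate)
```

Gating of this kind copes well with roughly stationary background noise but, as noted above, reverberant (echo-rich) rooms would call for dedicated dereverberation.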