Exploring ASR and Large Audio Language Models for Transcribing Peer Discourse in Noisy Classrooms
DOI:
https://doi.org/10.34190/icair.5.1.4277
Keywords:
Automatic Speech Recognition, Peer Dialogue, Collaborative Learning, Large Audio Language Models, Mandarin Classrooms, Educational AI
Abstract
Capturing peer discourse in real-world classrooms offers valuable insights into collaborative learning but presents significant technical and pedagogical challenges. While most existing research on Automatic Speech Recognition (ASR) systems has focused on teacher-led or online English-speaking environments, peer-to-peer dialogue in noisy, non-English-dominant, face-to-face classrooms remains underexplored. This study investigates the feasibility of using both traditional ASR systems (Whisper and Wav2Vec2) and emerging Large Audio Language Models (LALMs), including Qwen2-Audio and Ultravox, to transcribe Mandarin peer conversations recorded via students' mobile phones in authentic classroom settings. We collected over 105,715 seconds (roughly 29 hours) of audio from 38 student groups across two collaborative learning tasks in university classrooms. Manual transcriptions served as ground truth. An audio quality assessment was conducted on all recordings, and five representative samples with varied signal-to-noise ratios (SNR) and speech ratios were selected for in-depth analysis. Transcription quality was evaluated using Word Error Rate (WER), Character Error Rate (CER), and Fuzzy String Matching. Additionally, we conducted a thematic analysis of transcription errors to identify linguistic, acoustic, and task-related challenges. Results show that Whisper consistently outperforms the other models, achieving high transcription fidelity even in moderately noisy conditions. In contrast, LALMs, despite their strengths in semantic understanding, performed poorly in verbatim transcription, often generating hallucinated or irrelevant content. Importantly, task type and speech characteristics significantly influenced model performance: structured, reflective discussions yielded better results than spontaneous, technical dialogues involving numeric and English domain terms.
This study contributes a low-cost, replicable workflow for classroom audio collection and evaluation, along with a detailed taxonomy of transcription errors. We emphasise that our results are exploratory due to the limited sample size. Nevertheless, the findings highlight the current limitations of LALMs for ASR tasks and offer practical recommendations for model selection in educational contexts. Our findings support the responsible integration of ASR technologies into classroom practice, with implications for real-time feedback, collaborative learning analytics, and teacher professional development. For researchers, this work demonstrates the need to consider peer dialogue and multilingual classroom ecologies when evaluating ASR. For teachers, it offers practical recommendations for selecting transcription tools that can support real-time feedback and professional reflection. For lifelong learning, our study illustrates the potential of ASR technologies to make collaborative dialogue more visible, analysable, and actionable across diverse contexts.
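For readers unfamiliar with the evaluation metrics named in the abstract, the following is a minimal Python sketch of how WER, CER, and a fuzzy similarity score can be computed against a manual reference transcript. It is an illustration only, not the authors' evaluation pipeline: the function names are hypothetical, WER here assumes the Mandarin text has already been segmented into whitespace-separated words, and the fuzzy score uses the standard library's `difflib.SequenceMatcher` as a stand-in for whichever fuzzy string matching tool the study used.

```python
from difflib import SequenceMatcher


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of tokens or characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance normalised by the reference length."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def cer(reference, hypothesis):
    """Character Error Rate; whitespace is ignored."""
    ref = [c for c in reference if not c.isspace()]
    hyp = [c for c in hypothesis if not c.isspace()]
    return error_rate(ref, hyp)


def wer(reference, hypothesis):
    """Word Error Rate; assumes words are already separated by whitespace."""
    return error_rate(reference.split(), hypothesis.split())


def fuzzy_ratio(reference, hypothesis):
    """Similarity in [0, 1] from difflib (an assumed stand-in for fuzzy matching)."""
    return SequenceMatcher(None, reference, hypothesis).ratio()


# Hypothetical example: one word (and one character) is transcribed wrongly.
reference = "我们 先 算 平均值"
hypothesis = "我们 先 算 平均数"
print(wer(reference, hypothesis))   # 1 wrong word out of 4
print(cer(reference, hypothesis))   # 1 wrong character out of 7
print(fuzzy_ratio(reference, hypothesis))
```

Because written Mandarin has no spaces between words, WER depends on the choice of word segmenter, whereas CER does not; this is one reason character-level scoring is commonly reported alongside WER for Mandarin transcripts.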