Audio Deepfake Generation and Detection using Deep Learning for Digital Forensic Analysis
DOI:
https://doi.org/10.34190/eccws.25.1.4731
Keywords:
Audio deepfake detection, Bidirectional LSTM, Conditional GAN, Convolutional neural network, Deep learning, Speech synthesis
Abstract
Deepfake audio technology has developed rapidly and now poses serious risks in cybersecurity and digital forensic analysis. Synthetic speech can closely imitate authentic human voices, which makes both manual and automated detection difficult. This study investigates deep learning-based methods for detecting audio deepfakes generated using modern speech synthesis and voice conversion techniques. The work covers both generation and detection in order to understand how different deepfake methods affect detection performance. Multiple audio generation models, including Coqui TTS, GAN, RealNVP, VAE, and WaveNet, are used to generate realistic speech. Several deep learning detection models are evaluated, including CNN, LSTM, BiLSTM, CNN-LSTM, CNN-BiLSTM, conditional GAN, and hybrid architectures that combine convolutional and recurrent layers. Three datasets are used for training and evaluation: a self-generated deepfake dataset, a public Kaggle deepfake dataset, and the ASVspoof 2021 deepfake dataset. This combination allows evaluation under both controlled and real-world conditions. Experimental results show clear performance differences among model types. Simple sequential models, such as LSTM and BiLSTM, perform poorly when deepfake audio exhibits strong naturalistic characteristics. This issue is most evident in Coqui TTS-generated audio, which is difficult to detect because of its natural tone and smooth articulation. In comparison, hybrid models that combine convolutional and recurrent learning consistently achieve higher accuracy and stronger generalization across datasets. The Hybrid cGAN with BiLSTM achieves near-perfect detection performance and remains stable across cross-validation folds and on independent test data. These results confirm that combining spatial and temporal feature learning improves robustness against advanced deepfake attacks.
The study also introduces AudioForenX, an interactive forensic tool that integrates the best-performing models. AudioForenX enables real-time analysis, waveform visualization, and classification of audio samples as authentic or synthetic. The findings confirm that hybrid deep learning architectures provide reliable, balanced, and highly accurate detection of synthetic audio. This study contributes practical insights for digital forensics and supports the development of effective tools to counter audio deepfake threats.
License
Copyright (c) 2026 European Conference on Cyber Warfare and Security

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.