Transformer-Based File Fragment Type Classification for File Carving in Digital Forensics

Authors

DOI:

https://doi.org/10.34190/eccws.24.1.3552

Keywords:

Digital Forensics, File Carving, File Fragment Classification, Data Fragmentation, Transformer Models, Cybercrime Investigations

Abstract

The recovery and reconstruction of fragmented data is a critical challenge in digital forensics, particularly when dealing with incomplete, corrupted, or partially deleted files in large-scale cybercrime investigations. Accurate classification of file fragment types is essential for reconstructing evidence, especially in environments characterized by high levels of data fragmentation, such as cyberattacks, data breaches, and the operation of illicit (“darknet”) data centers. Traditional file carving methods often struggle to handle fragmented files efficiently, which limits their reliability in complex investigations involving large volumes of data. This paper introduces an approach to classifying file fragment types with a Transformer-based model, designed to improve the speed and accuracy of forensic investigations. Unlike traditional methods, which rely on handcrafted rules or shallow machine learning techniques, our model builds on Swin Transformer V2, a hierarchical Transformer architecture originally developed for computer vision. The model was trained to recognize complex, hierarchical patterns within raw byte sequences, enabling it to classify file fragments with high precision and reliability. We demonstrate that our model outperforms traditional methods in classification accuracy on 512-byte file blocks of the File Fragment Type dataset (FFT-75) and performs competitively on larger 4 KiB file blocks. Our approach automates the classification of fragmented data and improves the reliability and efficiency of evidence recovery. Future work will focus on optimizing the model for different file block sizes and evaluating its application to real-world fragmented data scenarios.
Beyond classification accuracy, automating the identification of file fragment formats reduces the time investigators need to recover critical evidence from fragmented data sources. This work provides a promising tool for digital forensics practitioners, advancing recovery capabilities in the face of evolving cyber threats.
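
To make the notion of a "file block" concrete, the following minimal Python sketch splits a byte string into fixed-size, non-overlapping 512-byte blocks and maps each block to a sequence of integer byte values, the kind of raw input a byte-level classifier could consume. It is an illustration only and not the authors' pipeline: the function names `iter_blocks` and `block_to_tokens` are hypothetical, and dropping a trailing partial block is an assumption made here for simplicity.

```python
from typing import Iterator, List

BLOCK_SIZE = 512  # block size named in the abstract; 4096 would give 4 KiB blocks


def iter_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> Iterator[bytes]:
    """Yield consecutive, non-overlapping blocks of exactly block_size bytes.

    Assumption for this sketch: a trailing partial block is dropped.
    """
    for start in range(0, len(data) - block_size + 1, block_size):
        yield data[start:start + block_size]


def block_to_tokens(block: bytes) -> List[int]:
    """Map a block to a list of integer byte values in the range 0..255."""
    return list(block)


if __name__ == "__main__":
    # Synthetic data standing in for a disk image; not taken from FFT-75.
    sample = bytes(range(256)) * 5  # 1280 bytes
    blocks = list(iter_blocks(sample))
    tokens = block_to_tokens(blocks[0])
    print(len(blocks), len(tokens))
```

With 1280 bytes of input, the sketch yields two full 512-byte blocks and discards the remaining 256 bytes. How the published model actually tokenizes or arranges bytes before the Transformer is not stated in the abstract.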

Author Biographies

Andrey Guzhov, German Research Center for Artificial Intelligence (DFKI)

Dr. Andrey Guzhov is a Researcher at the DFKI research department "Smart Data & Knowledge Services", specializing in audio and multimodal deep learning. His work focuses on applying AI-based methods to real-world tasks, including the development of solutions for data carving and digital forensics.

Christoph Tobias Wirth, German Research Center for Artificial Intelligence (DFKI)

Dr. Tobias Wirth is Team Lead of the "Generative & Transparent AI" group at the DFKI research department "Smart Data & Knowledge Services". He coordinates a DFKI Transfer Lab that develops AI solutions in collaboration with law enforcement agencies. His work centers on application-oriented, trusted AI in practice.

Published

2025-06-25