LLM Supply Chain Provenance: A Blockchain-based Approach

Authors

Shridhar Singh, University of KwaZulu-Natal
Luke Vorster, University of KwaZulu-Natal

DOI:

https://doi.org/10.34190/icair.4.1.3128

Keywords:

Blockchain-based data provenance, LLM Security, Generative Models

Abstract

The burgeoning size and complexity of Large Language Models (LLMs) introduce significant challenges in ensuring data integrity. The proliferation of "deepfakes" and manipulated information raises concerns about the vulnerability of LLMs to misinformation. Traditional LLM architectures often lack robust mechanisms for tracking the origin and history of training data. This opacity can leave LLMs susceptible to manipulation by malicious actors who inject biased or inaccurate data. This research proposes a novel approach integrating Blockchain Technology (BCT) within the LLM data supply chain. With its core principle of a distributed and immutable ledger, BCT offers a compelling solution to this challenge. By storing the LLM's data supply chain on a blockchain, we establish a verifiable record of data provenance. This allows the origin of each data point used to train the LLM to be traced, fostering greater transparency and trust in the model's outputs. This decentralised approach minimises the risk of single points of failure and manipulation. Additionally, the immutability of blockchain records ensures that the data provenance remains tamper-proof, further enhancing the trustworthiness of the LLM. Our approach leverages three critical features of BCT to strengthen LLM security: 1) Transaction Anonymity: While data provenance is recorded on the blockchain, the identities of data contributors can be anonymised, protecting their privacy while ensuring data integrity. 2) Decentralised Repository: Distributing the data provenance record across the blockchain network enhances the system's resilience against potential attacks. 3) Block Validation: Rigorous consensus mechanisms ensure the validity of each data point added to the LLM's data supply chain, minimising the risk of incorporating inaccurate or manipulated data into the training process. Initial experimental evaluations using simulated LLM training data on a blockchain platform demonstrate the feasibility and effectiveness of the proposed approach in enhancing data integrity. This approach has far-reaching implications for ensuring the trustworthiness of LLMs in various applications.
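The abstract does not include an implementation, but the core idea of recording each training data point's provenance on a hash-linked, tamper-evident ledger with anonymised contributor identities can be sketched in a few lines of Python. The sketch below is purely illustrative and is not the authors' system: the ProvenanceLedger and ProvenanceBlock names, the salted-hash anonymisation, and the single-process chain are assumptions standing in for a real blockchain platform and its consensus mechanism.

import hashlib
import json
import time
from dataclasses import dataclass, field


def _sha256(data: str) -> str:
    return hashlib.sha256(data.encode("utf-8")).hexdigest()


@dataclass
class ProvenanceBlock:
    # One ledger entry describing the provenance of a single training data point.
    index: int
    timestamp: float
    data_hash: str        # SHA-256 of the training data point itself
    contributor_id: str   # anonymised contributor identity (salted hash)
    source_uri: str       # where the data point was obtained
    prev_hash: str        # hash of the preceding block (the chain link)
    block_hash: str = field(init=False)

    def __post_init__(self):
        payload = json.dumps([self.index, self.timestamp, self.data_hash,
                              self.contributor_id, self.source_uri, self.prev_hash])
        self.block_hash = _sha256(payload)


class ProvenanceLedger:
    # Append-only, hash-linked record of an LLM's data supply chain.

    def __init__(self, salt: str = "example-salt"):
        self.salt = salt
        self.chain = [ProvenanceBlock(0, time.time(), _sha256("genesis"),
                                      "none", "genesis", "0" * 64)]

    def add_data_point(self, text: str, contributor: str, source_uri: str) -> ProvenanceBlock:
        # Record a new training data point; only hashes are stored on the ledger.
        prev = self.chain[-1]
        block = ProvenanceBlock(
            index=prev.index + 1,
            timestamp=time.time(),
            data_hash=_sha256(text),
            contributor_id=_sha256(self.salt + contributor),  # anonymised identity
            source_uri=source_uri,
            prev_hash=prev.block_hash,
        )
        self.chain.append(block)
        return block

    def verify(self) -> bool:
        # Recompute every block hash and check the chain links;
        # returns False if any recorded provenance entry has been altered.
        for prev, curr in zip(self.chain, self.chain[1:]):
            recomputed = ProvenanceBlock(curr.index, curr.timestamp, curr.data_hash,
                                         curr.contributor_id, curr.source_uri, curr.prev_hash)
            if curr.prev_hash != prev.block_hash or curr.block_hash != recomputed.block_hash:
                return False
        return True


if __name__ == "__main__":
    ledger = ProvenanceLedger()
    ledger.add_data_point("An example training sentence.",
                          "contributor@example.org",
                          "https://example.org/corpus/1")
    print(ledger.verify())                   # True: provenance intact
    ledger.chain[1].source_uri = "tampered"  # simulate manipulation of the record
    print(ledger.verify())                   # False: tampering detected

On an actual blockchain, the append and verification steps would be carried out by the network's consensus mechanism across many nodes rather than a single process; the sketch only shows how hash-linking makes recorded provenance tamper-evident.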

Author Biographies

Shridhar Singh, University of KwaZulu-Natal

Shridhar Singh is a student and teaching assistant in the School of Mathematics, Statistics, and Computer Science at the University of KwaZulu-Natal, South Africa. He is pursuing an MSc in Computer Science with a focus on Large Language Models. His work has featured at the ICCWS 2024 and THREAT 2023 conferences. His research areas are generative AI, blockchain, and cryptography.

Luke Vorster, University of KwaZulu-Natal

Luke Vorster has served as a lecturer and academic advisor in the School of Mathematics, Statistics, and Computer Science at the University of KwaZulu-Natal for the past two decades, where he plays a key role as academic supervisor and module coordinator. Vorster has a wealth of experience in AI, applied cryptography, and cyber-security, and keeps a keen eye on research and development trends in computer science and the IT industry.

Published

2024-12-04