Predicting Sabotaged Open-source Libraries
DOI: https://doi.org/10.34190/eccws.25.1.4601
Keywords: Vulnerabilities, Supply Chain Attack, Open Source, Machine Learning, Malicious Code
Abstract
Open-source software is free, publicly available software maintained by the open-source community. The variety of contributors creates an environment conducive to the intentional and unintentional introduction of software bugs by participating organizations. Hostile nation-states and independent hackers can exploit these attack vectors to gain access to industry and government systems. Repositories of known vulnerabilities exist, as do tools that check for vulnerable versions and analyze code, but in practice reviewers can miss issues across many repositories because of constant updates and technological advances. Hence, this research investigates an alternative, non-code-based method for identifying high-risk repositories using repository metadata and commit history. Coupled with machine learning, this method identifies at-risk repositories at rates above 60% on a dataset of 41,710 repositories. The contribution of this research is twofold. First, it presents an empirical evaluation of the viability of a non-code-based analysis approach to detecting high-risk (i.e., potentially compromised) code repositories. Second, it provides foundational research for non-code-based filtering of open-source repositories, potentially accelerating software investigations and reducing resource requirements.
License
Copyright (c) 2026 European Conference on Cyber Warfare and Security

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.