Theory-Guided Feature Selection in Cybercrime Data Science


  • Shiven Naidoo, The University of the Witwatersrand
  • Rennie Naidoo



Keywords: Behavioural Science, Feature Selection, Cybercrime Data Analytics, Cybersecurity, Theory-Guided Data Science


Cybercrime data science is significantly hampered by 'noisy' features in vast and complex datasets. Drawing on theoretical insights from the behavioural sciences, we propose a feature selection model that improves the value and interpretability of cybercrime intelligence datasets. We piloted this theory-guided feature selection approach on a subset of intelligence datafeeds provided by a global fraud and cybercrime tracking firm. In an experimental setting, the proposed social influence feature selection model significantly improved the interpretability of machine learning-based exploratory analysis and advanced visualization techniques. The model yielded rich insights into cybercriminal psychological tactics from social engineering scam data and has potential applicability in cyberthreat response and cybersecurity awareness training. Our study shows the value of an interdisciplinary, theory-guided approach to cybercrime data analytics that integrates scientific knowledge from the behavioural sciences with data science expertise. The paper concludes by suggesting avenues for future research on theory-guided feature selection that incorporates behavioural science knowledge in cybercrime data science. In future research, we intend to refine, automate, evaluate, and scale our model to assess its effectiveness in producing insights about cybercriminal activities and informing decision-making in a naturalistic, real-time setting. Specifically, we aim to automate the encoding of features, apply a wider range of machine learning tools and evaluation metrics to extract more meaningful insights into cybercriminal psychological tactics, and refine the model on larger datasets to enhance its efficiency and responsiveness to real-time cybercrime data.
We call on data scientists and cybercrime domain experts to work together to apply theory-guided feature selection to improve processes of knowledge discovery that enhance our cybersecurity capabilities.