Siamese Neural Network and Machine Learning for DGA Classification




Siamese Neural Network, DGA classification, Cybersecurity


Domain Generation Algorithms (DGA) are systems used to create immediate multiple and varying domain names. Such “artificial” domains can be then used for siting command and control servers which in turn oversee recruiting/infecting devices, and finally turning them into new resources to be exploited. In this sense, identifying DGA domain names can be crucial, to avoid cyberattacks like Phishing, Spam sending, Bitcoin mining, and many other. Usually, domain names generated by DGAs, are comprised by illegible character strings, but new “intelligent” DGAs tend to generate names using combination of words in dictionaries making its detection a challenging task. For this reason, in this work, we propose to address this problem using a combination of Machine Learning algorithms for improving the classification of DGAs domains. In particular, we propose to combine Siamese Neural Networks and traditional supervised Machine Learning algorithms in order to expand the input domain into separable n-dimensional data points and then achieve the domain classification. The proposed approach can be separated into 3 phases. In a first phase, domain names are encoded, by a one-hot encoder and a variation of this, named probabilistic one-hot encoder, which are implemented separately. Then, in the second phase, Long Short-Term Memory and Convolutional Siamese embedders are tested and compared. In particular, the first one is combined with the one-hot, while the Convolution algorithm is applied with the probabilistic one-hot encoded data. In the final step, five Machine Learning algorithms are tested using the two ways embedded data. Both embedder approaches reach very high results in terms of F1-score and Accuracy (about 91%) depending on the implemented classifier. The promising results obtained by the application of the proposed method shows that it is possible to perform DGA domain classification uniquely over the domain names, without considering external information such as DNS packets features.