Information | Free Full-Text | Rapid Forecasting of Cyber Events Using Machine Learning-Enabled Features

Recent years have seen an increase in the number and volume of data breaches due to the availability of sophisticated tools and complex attacks from groups affiliated with state actors and organised criminals. The research community and industry have been working together to come up with solutions to address these security challenges, particularly in predicting cyber events more accurately. This paper contributes to that body of knowledge and aims to forecast cyber attacks based on certain cyber events and features on the network. We utilise data-driven approaches to predict these events before they occur to help the security teams better respond to such threats.

The next part will cover intrusion detection, including a detailed overview of the existing research.

2.1. Intrusion Detection

In recent times, the escalation of cyber attacks has prompted efforts aimed at identifying and preventing these intrusions with varying degrees of success. Diverse technologies, such as Intrusion Detection Systems (IDS), Intrusion Prevention Systems (IPS), Security Information and Event Management Systems (SIEMS), firewalls and anti-virus systems have been implemented to detect attacks and notify security teams. While these tools play a pivotal role in detecting and preventing cyber attacks, they are susceptible to generating false alerts, and accurately pinpointing sophisticated attacks remains a persistent challenge [18]. To combat cyber intrusions, several methodologies have emerged, primarily classified into two categories: signature-based intrusion detection systems and anomaly-based intrusion detection systems. Signature-based detection is effective against attacks with known signatures, while anomaly-based detection excels in identifying new attack patterns. IDS are broadly categorised into three types: Network Intrusion Detection Systems (NIDS), Host Intrusion Detection Systems (HIDS) and Hybrid Intrusion Detection Systems. Among these, NIDS represent the most widely adopted category of IDS, tasked with analysing network traffic to spot anomalies. Upon detection, these systems generate security alerts that are then prioritised and addressed by the security team. Examples of NIDS include Zeek [19] and Snort [20]. Researchers have explored the use of Machine Learning (ML) and Deep Learning (DL) methodologies to enhance the detection capabilities of NIDS. ML and DL-based NIDS models typically rely on datasets and usually encompass multiple stages: (i) data preparation, (ii) training and (iii) testing. In the data preparation stage, the dataset is prepared to make it suitable for machine learning, and it is then split into training and testing portions.
Several authors have proposed NIDS models, but researchers are still working on improving detection accuracy and minimising false alarms. In [21], the authors proposed a deep learning model for network intrusion detection that utilised sparse auto-encoders. They trained the model to classify network traffic as benign or attack, but the approach was tested only on binary classification. In [22], the authors proposed a network intrusion detection model based on unsupervised autoencoders. They used a heuristic threshold to improve the detection accuracy of their proposed IDS. Reference [23] proposed an intrusion detection system using the Ensemble Core Vector Machine (CVM) approach to detect various types of attacks, including probe and DoS attacks. According to the authors, the model achieved high accuracy.
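The three stages described above (data preparation, training and testing) can be illustrated with a minimal sketch. It is not the method of any cited paper: the synthetic flow records, the two features (packets per second and mean packet size) and the nearest-centroid classifier are all assumptions chosen only to keep the example self-contained.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Data preparation: shuffle labelled rows and split them into train/test portions."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def train_centroids(train_rows):
    """Training: compute the mean feature vector (centroid) of each class."""
    sums, counts = {}, {}
    for features, label in train_rows:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict(centroids, features):
    """Assign the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

def accuracy(centroids, test_rows):
    """Testing: fraction of held-out rows classified correctly."""
    correct = sum(predict(centroids, f) == y for f, y in test_rows)
    return correct / len(test_rows)

# Hypothetical, synthetic flow records: [packets per second, mean packet size] -> label.
rng = random.Random(42)
dataset = [([rng.gauss(50, 5), rng.gauss(500, 30)], "benign") for _ in range(40)]
dataset += [([rng.gauss(900, 50), rng.gauss(60, 10)], "attack") for _ in range(40)]

train_rows, test_rows = train_test_split(dataset)
model = train_centroids(train_rows)
test_accuracy = accuracy(model, test_rows)
print(f"held-out accuracy: {test_accuracy:.2f}")
```

Real IDS datasets are far less cleanly separated than this synthetic data, so the accuracy obtained here says nothing about real-world performance.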

Host Intrusion Detection Systems (HIDS) detect anomalies in host systems and generate alerts. They are mainly installed on critical systems where security protection is essential. They also help collect indicators of compromise following suspicious activities reported by the HIDS. Examples of such activities include unauthorised access attempts and unauthorised modification of files or programs. It is good practice to correlate HIDS logs with other monitoring tools to help prioritise genuine threats. Examples of HIDS include Splunk [24] and Open Source Security Event Correlator (OSSEC) [25]. Several authors have worked on improving the accuracy of HIDS. In [26], the authors proposed a HIDS model for cloud computing. The model alerts users when suspicious activities are detected based on system call traces and classifies them using a k-nearest neighbours (KNN) classifier. In [27], the authors proposed a HIDS model for Supervisory Control and Data Acquisition (SCADA) systems. Reference [28] used a combination of Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) detection models, which led to an improved detection result.

Hybrid intrusion detection systems amalgamate two or more methods to enhance intrusion detection, diverging from conventional IDS approaches reliant on either signature-based or anomaly-based detection. Numerous researchers have introduced models in this domain. For instance, ref. [29] suggested a hybrid IDS model specifically designed to identify cyber attacks on the web. Their method combined signature-based and anomaly detection, achieving an accuracy rate of 96.7%. Similarly, ref. [30] proposed a model integrating anomaly-based and signature-based approaches to identify attacks on IoT networks. Their model encompassed three stages: traffic filtering, preprocessing and a hybrid IDS. In another instance, ref. [31] presented a hybrid IDS detection model for IoT, targeting the detection of Denial of Service (DoS) attacks and network traffic analysis. Any deviations from the standard were classified as potential attacks. Reference [32] proposed a hybrid architecture for IDS tailored for the Internet of Vehicles. Their architecture, based on Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU), merged several datasets containing DDoS attacks and car hacking incidents to assess their model’s performance. Their model achieved an overall detection accuracy of 99.5% and 99.9% for DDoS and car hacking, respectively. Lastly, in [33], the authors introduced a cyber kill chain-based hybrid IDS framework for a smart grid. They applied the cyber kill chain to identify cyber attacks at different stages of the chain.
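The idea of combining signature-based and anomaly-based detection can be sketched as follows. This is an illustrative toy, not the design of any cited system: the signature list, the single feature (payload length) and the three-standard-deviation threshold are all hypothetical choices.

```python
from statistics import mean, stdev

# Hypothetical signature list: substrings assumed to appear in known malicious requests.
SIGNATURES = ("' OR 1=1", "../../", "<script>")

def signature_match(payload):
    """Signature-based stage: flag payloads containing a known attack pattern."""
    return any(sig in payload for sig in SIGNATURES)

def fit_baseline(normal_values):
    """Anomaly-based stage (training): learn mean and standard deviation from benign traffic."""
    return mean(normal_values), stdev(normal_values)

def is_anomalous(value, baseline, z_threshold=3.0):
    """Anomaly-based stage (detection): flag values far from the benign baseline."""
    mu, sigma = baseline
    return abs(value - mu) > z_threshold * sigma

def hybrid_detect(payload, baseline):
    """Check known signatures first, then fall back to the anomaly detector."""
    if signature_match(payload):
        return "signature"
    if is_anomalous(len(payload), baseline):
        return "anomaly"
    return "benign"

# Baseline built from hypothetical benign payload lengths.
baseline = fit_baseline([20, 22, 19, 25, 21, 23, 20, 24])

for request in ("GET /index.html HTTP/1.1", "GET /../../etc/passwd", "A" * 200):
    print(hybrid_detect(request, baseline))
```

Running the signature stage first means known attacks are labelled precisely, while the anomaly stage catches unfamiliar traffic that deviates from the benign baseline.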

While recent advances have seen an increase in the deployment of machine learning and deep learning approaches for improving detection accuracy, the accuracy of these models depends on the quality of the datasets used. Some of the prominent IDS datasets include the KDD99 and NSL-KDD [34], which contain features used to differentiate normal and abnormal traffic. Other widely used datasets include the Kyoto [35], UNSW-NB15 [36], CICIDS-2017 [34] and CIC-IDS-2018 [34] datasets. Most intrusion detection research has applied machine learning to such datasets and reported classification performance metrics, such as percentage accuracy. For example, most of the work on the datasets above has used ML and DL approaches to extract features, perform feature engineering and classification, and fine-tune parameters to achieve the best accuracy results. Our work explores cyber event forecasting, which has not been widely studied in the cyber domain; forecasting is intended to complement intrusion detection, not to replace it.

Next, we will cover cyber event forecasting, predictions and related work. We will also briefly cover some of the other domains where forecasting has been applied and use it to inform our work.

2.2. Forecasting and Predictions

Researchers have recently shown interest in cyber attack forecasting, and their work is contributing to the body of knowledge. This work ranges from survey papers to machine learning models, with varying results. Most of it focuses on sentiment analysis based on social media feeds, although some studies look at other attacks, such as DoS and malware variants. In [37], the authors performed cyber attack forecasting using machine learning techniques on data breaches spanning over 12 years. They analysed the data and found that cyber attacks were increasing in frequency but not in magnitude. Reference [38] used machine learning to predict the cost of cyber breaches with the view that their work could also be used to predict premiums in cyber insurance. References [39,40] used sentiment analysis to predict cyber attacks. In Ref. [41], the authors proposed a method aimed at aiding incident responders in predicting the possible functionalities of malware post-detection. Their methodology is grounded in a probabilistic model that allows the forecast to recognise a range of capabilities and gauge the probability of each capability being executed. According to the authors, their approach not only unveils potential capabilities but also assigns weights based on the likelihood of their execution. Ref. [42] assessed the capabilities of predictive methods in the field of cybersecurity. Their proposed method aimed to identify potential attackers using network entity reputation and scoring mechanisms. Ref. [43] provided an overview of prediction and forecasting techniques employed in cybersecurity. Their focus was on predicting the intention of the attackers and anticipating potential attacks that might impact the overall security status of the network. Ref. [44] examined the present research directions concerning cyber attacks by scrutinising the data-driven methodologies utilised by researchers in this swiftly evolving domain. Additionally, the authors highlighted challenges and potential future trajectories within this field.

Time series-based techniques are widely adopted for forecasting future events. Such techniques include autoregressive time series models and models based on neural networks. Other well-known forecasting methods include ARIMA, linear regression, SMOreg, Gaussian process and multilayer perceptron. Reference [45] proposed time series-based anomaly techniques for dealing with adversarial attacks. Reference [46] reviewed time series-based anomaly detection techniques and found that no single technique outperforms the others. Reference [47] applied time series techniques to build a predictive model, which was used to detect vulnerabilities in internet browsers.

2.2.1. ARIMA

ARIMA (Autoregressive Integrated Moving Average), a statistical technique that uses time series data to predict future trends, was explored in [48]. The authors studied the data of the given parameters to improve forecasting using ARIMA and Exponential Smoothing (ETS). The two forecasting methods were compared using parameters such as pressure and humidity. Accuracy was compared using metrics such as MAE (Mean Absolute Error) and RMSE (Root Mean Square Error). ARIMA has been used for a long time, although ARIMA models have some limitations, in particular the difficulty of modelling nonlinear relationships [49]. Reference [50] used ARIMA-based forecasting to predict future cyber attacks based on historical incidents.
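For readers unfamiliar with the two error metrics mentioned above, the following minimal sketch computes them from their standard definitions: MAE is the mean of the absolute forecast errors, and RMSE is the square root of the mean squared error. The sample values are hypothetical and are not taken from [48].

```python
import math

def mae(actual, predicted):
    """Mean Absolute Error: average magnitude of the forecast errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root Mean Square Error: square root of the average squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Hypothetical observed values and forecasts.
actual = [10, 12, 15, 11]
forecast = [12, 12, 13, 15]

print(mae(actual, forecast), rmse(actual, forecast))
```

Because RMSE squares each error before averaging, it penalises large individual errors more heavily than MAE does, which is why the two are often reported together.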

The authors of [51] surveyed the prediction techniques used in cyber security and concluded that their effectiveness is linked to the context in which they are used and the research direction. Reference [52] proposed a time series technique for predicting data breaches based on the size and incident time derived from historical data. They used Seasonal Autoregressive Integrated Moving Average with eXogenous regressors (SARIMAX) and Recurrent Neural Networks (RNNs), and both models achieved good performance results. Reference [53] studied ARIMA and SARIMA models and evaluated them for long-term runoff forecasting. The results showed that the SARIMA model performed better than ARIMA at forecasting the annual runoff. ARIMA is a suitable statistical method for forecasting and only requires time series data, although the series has to be stationary or made stationary through differencing.
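The stationarity requirement is usually handled by differencing, which is the "integrated" part of ARIMA. The sketch below is a deliberately simplified illustration, assuming a hypothetical series of weekly incident counts: it differences the series once and forecasts with the mean difference, which corresponds to ARIMA(0,1,0) with drift (a random walk with drift). It is not a full ARIMA fit and omits the autoregressive and moving-average terms.

```python
def difference(series):
    """First-order differencing: removes a linear trend from the series."""
    return [later - earlier for earlier, later in zip(series, series[1:])]

def forecast_with_drift(series, steps):
    """Forecast by extending the last value with the mean of the differenced series."""
    diffs = difference(series)
    drift = sum(diffs) / len(diffs)
    last = series[-1]
    return [last + drift * h for h in range(1, steps + 1)]

# Hypothetical weekly incident counts with an upward trend (non-stationary).
weekly_incidents = [10, 12, 15, 17, 20, 22, 25]

print(difference(weekly_incidents))
print(forecast_with_drift(weekly_incidents, 3))
```

The original series trends upwards, whereas the differenced series fluctuates around a constant level, which is the property the autoregressive and moving-average components rely on.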

2.2.2. Linear Regression and SMOreg

Linear regression models the relationship between a dependent variable and one or more independent variables and evaluates the strength of their connection [54]. SMOreg offers a Support Vector Machine (SVM)-based solution for handling regression problems, excelling particularly in modelling and predicting with non-linear data [55].
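As an illustration of linear regression with a single independent variable, the sketch below fits a line by ordinary least squares and uses it to predict the next value. The data points are hypothetical, and the function names are introduced here for illustration only. SMOreg is not shown.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict_line(model, x):
    """Evaluate the fitted line at x."""
    slope, intercept = model
    return slope * x + intercept

# Hypothetical data: week number (independent) and alert count (dependent).
weeks = [1, 2, 3, 4, 5]
alerts = [3, 5, 7, 9, 11]

model = fit_line(weeks, alerts)
print(model, predict_line(model, 6))
```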

2.2.3. Deep Learning

Deep learning (DL) is a subfield of machine learning. DL models typically consist of multiple layers, including one or more hidden layers, which allow them to learn feature representations [56]. Several deep learning algorithms exist, including neural networks, Convolutional Neural Networks (CNNs), Long Short-Term Memory Networks (LSTMs) and Autoencoders.

Several authors have proposed models for detecting cyber attacks based on CNN. For example, ref. [57] proposed a CNN-based method for detecting cyber attacks in industrial control systems. Reference [58] developed a CNN-based method for detecting web attacks based on HTTP request packets. Ref. [59] utilised a deep learning approach based on CNN-LSTM to detect malware in real time, and the proposed model achieved a high accuracy of 99%.

Reference [60] proposed a DoS detection technique based on LSTM and Bayes and, according to the authors, achieved good performance. References [61,62] applied DL techniques, namely CNN and LSTM, to the CIC-IDS2018 dataset to improve intrusion detection. In [63], the authors proposed a deep learning technique for detecting cyber attacks. The proposed model used RNN, LSTM and Multilayer Perceptrons (MLPs) using a CTT and achieved an accuracy of 93% with LSTM. Reference [64] presented an IDS based on a deep auto-encoder, using the KDD-CUP’99 dataset to evaluate the performance of their model, and achieved good results.
