Generating Threat Insights Using Data Science

As cyber crime and the damage it causes keep growing, protecting assets and organizations is paramount.

As discussed in previous blogs “predicting-vulnerabilities-in-compiled-code” and “prioritizing vulnerabilities: a holistic approach”, security teams are flooded with information about vulnerabilities found in third-party applications. While struggling with human capacity limitations, they need to keep their organization safe not only from these vulnerabilities, but also from zero-days and yet-to-be-discovered vulnerabilities.

To succeed at this task, they need to analyze their organization’s data and identify threats on a regular basis. In the past, data was analyzed manually using a list of predefined rules or by applying simple statistical algorithms. Today, with the growth in data volume, it is impossible to rely on these methods. We have to rethink them and adopt data science and machine learning techniques that can handle large quantities of data.

Machine learning algorithms can analyze large datasets at a scale that manual review cannot match. Where we were once limited to a predefined list of rules, we can now train a machine learning (or deep learning) model, a function that maps an input to an output, to factor many more properties into its decision-making process. Data science tools therefore enable us to customize the threat identification process.

Customized Threat Detection

Both the volume of generated data and the number of available sources keep growing, including sources that are not cyber related. These different types of data can be leveraged to generate threat insights tailored to the organization.

The basic steps of training a machine learning model to analyze the data and identify threats are:

Data collection
Preprocessing and feature selection
Model training and evaluation

Data Collection

A machine learning model, meaning the function that maps an input to an output, is generated from the data we supply, so it is only as good as that input data. It is therefore important to put time and effort into collecting and preprocessing relevant data sources.

Potential data sets that we want to collect can be categorized into two groups:

External data - the data that is generated and published by others. For example, CVE datasets, exploit datasets, security feeds, and social networks such as Twitter.
Internal data - the data that is generated by and on the organization’s assets and tools. For example, filesystem, memory (running processes), network usage, and logs.
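
To make the two groups concrete, here is a minimal sketch of how an external vulnerability feed might be joined with an internal asset inventory. The record types, field names, and CVE identifiers are all placeholders invented for illustration, and exact version matching is a simplifying assumption; real feeds usually describe version ranges.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CveRecord:
    """External data: one entry from a public vulnerability feed (hypothetical shape)."""
    cve_id: str
    product: str
    affected_version: str
    has_public_exploit: bool


@dataclass(frozen=True)
class InstalledApp:
    """Internal data: one application found on one asset (hypothetical shape)."""
    asset: str
    product: str
    version: str


def match_vulnerable_installs(cves, installs):
    """Pair each installed app with the CVE records that affect its exact version."""
    by_product_version = {}
    for cve in cves:
        key = (cve.product, cve.affected_version)
        by_product_version.setdefault(key, []).append(cve)
    matches = []
    for app in installs:
        for cve in by_product_version.get((app.product, app.version), []):
            matches.append((app, cve))
    return matches


# Placeholder records; the CVE identifiers are made up.
cves = [
    CveRecord("CVE-0000-0001", "exampleapp", "1.2.0", True),
    CveRecord("CVE-0000-0002", "otherapp", "3.1.4", False),
]
installs = [
    InstalledApp("laptop-01", "exampleapp", "1.2.0"),
    InstalledApp("laptop-02", "exampleapp", "1.3.0"),
    InstalledApp("server-01", "otherapp", "3.1.4"),
]

matches = match_vulnerable_installs(cves, installs)
for app, cve in matches:
    print(app.asset, app.product, cve.cve_id)
```

Only the installs whose product and version appear in the feed are returned, which is the raw material for the features discussed next.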

Preprocessing and Feature Selection

After collecting all the raw data, we want to clean and analyze it so we can turn it into many different features to feed our model. Different methods will generate different features. Some can be directly generated from the data, e.g. the app's installation location in the filesystem. Others will be generated by applying algorithms that learn basic information from the data, for example how often the app is used.
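
As a rough illustration of the two kinds of features described above, the sketch below derives one feature directly from the raw data (the install location) and one by aggregating it (how often the app is used). The record layout, the observation window, and the feature names are assumptions made for this example.

```python
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical internal records: (asset, app, install path).
installs = [
    ("laptop-01", "exampleapp", "/opt/exampleapp/bin/exampleapp"),
    ("laptop-02", "exampleapp", "/home/dana/tools/exampleapp"),
]
# Hypothetical launch log: one (asset, app) entry per launch.
launch_log = [
    ("laptop-01", "exampleapp"),
    ("laptop-01", "exampleapp"),
    ("laptop-01", "exampleapp"),
    ("laptop-02", "exampleapp"),
]
OBSERVATION_DAYS = 2  # assumed length of the log window


def build_features(installs, launch_log, days):
    """Turn raw install and launch records into one feature dict per install."""
    launches = Counter(launch_log)
    features = []
    for asset, app, path in installs:
        top_dir = PurePosixPath(path).parts[1]  # "opt" or "home" in this data
        features.append({
            "asset": asset,
            "app": app,
            # Read directly from the raw data.
            "installed_in_user_dir": top_dir == "home",
            # Learned from the data by simple aggregation.
            "launches_per_day": launches[(asset, app)] / days,
        })
    return features


features = build_features(installs, launch_log, OBSERVATION_DAYS)
for row in features:
    print(row)
```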

Some features will be generated by applying machine learning algorithms that predict new information, for example the likelihood that a vulnerability will be exploited or when an exploit is likely to appear. Another example is automatically analyzing the CVE description to extract the attack type and impact. Each of these machine learning sub-studies analyzes part of the data and produces new information that feeds the full machine learning model for identifying threats.
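
To show the shape of such a sub-task, the sketch below tags a CVE-style description with attack types. It deliberately swaps the machine learning model for a hand-written keyword lookup, so it only illustrates the input and output; the pattern table and the sample description are invented for this example.

```python
import re

# Hypothetical keyword map. A trained text classifier would replace this lookup.
ATTACK_TYPE_PATTERNS = {
    "remote code execution": re.compile(
        r"remote code execution|execute arbitrary code", re.IGNORECASE
    ),
    "information disclosure": re.compile(
        r"information disclosure|obtain sensitive information", re.IGNORECASE
    ),
    "denial of service": re.compile(r"denial of service", re.IGNORECASE),
}


def tag_attack_types(description):
    """Return every attack type whose pattern appears in the description."""
    return [
        label
        for label, pattern in ATTACK_TYPE_PATTERNS.items()
        if pattern.search(description)
    ]


# Invented description written in the style of a CVE entry.
description = (
    "A buffer overflow allows remote attackers to execute arbitrary code "
    "or cause a denial of service."
)
print(tag_attack_types(description))
# → ['remote code execution', 'denial of service']
```

The resulting labels can then be encoded as additional features for the main model.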

Using our data collections, we can create a diverse feature collection that will include information about the vulnerability or zero-day:

The risk - likelihood/time of exploitation and the damage as a result of exploitation
The usage - who, how many, and which credentials are used in the vulnerable app
The possible treatments - existence of an official patch/patchless protection
The user’s operations - which threats the user chose to treat in the past
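
One way to picture this feature collection is as a single record per vulnerability, with one or more fields for each of the four groups above. The field names and value ranges below are assumptions chosen for illustration.

```python
from dataclasses import dataclass, astuple


@dataclass
class ThreatFeatures:
    """Hypothetical feature record for one vulnerability or zero-day."""
    # The risk
    exploit_likelihood: float      # 0.0 to 1.0, e.g. from a prediction sub-model
    impact_score: float            # 0.0 to 1.0
    # The usage
    install_ratio: float           # fraction of assets running the app
    runs_as_admin: bool
    # The possible treatments
    patch_available: bool
    # The user's operations
    similar_threats_treated: int   # past threats of this kind the user treated

    def to_vector(self):
        """Flatten the record into the numeric vector a model expects."""
        return [float(value) for value in astuple(self)]


example = ThreatFeatures(
    exploit_likelihood=0.8,
    impact_score=0.9,
    install_ratio=0.55,
    runs_as_admin=True,
    patch_available=True,
    similar_threats_treated=3,
)
print(example.to_vector())
# → [0.8, 0.9, 0.55, 1.0, 1.0, 3.0]
```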

Applying Feature Selection Algorithms

Our features are the input for the model we are about to train. When we have a large number of features, we may want to reduce it (dimensionality reduction) to those most useful to the model. Doing so decreases the computational cost, improves the performance of the model, and helps avoid overfitting (where the output function corresponds too closely to our training dataset and is not reliable for predicting new inputs). Feature selection methods focus on keeping the most informative features. One example is the Lasso regression algorithm, whose L1 penalty drives the coefficients of uninformative features to exactly zero.
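
The following is a minimal sketch of Lasso-based feature selection, implemented with plain coordinate descent so that it needs no external library. The feature names and the tiny dataset are made up, and the data is assumed to be centered already, so no intercept is fitted. A production setup would more likely use an established library implementation.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; values inside the band become 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iters=100):
    """Minimize (1 / 2n) * ||y - Xw||^2 + alpha * ||w||_1 one coordinate at a time."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(n_iters):
        for j in range(d):
            rho = 0.0  # correlation of feature j with the residual that ignores j
            z = 0.0    # squared norm of feature j (assumed non-zero)
            for i in range(n):
                pred_without_j = sum(X[i][k] * w[k] for k in range(d) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            w[j] = soft_threshold(rho / n, alpha) / (z / n)
    return w


# Hypothetical, already centered features for four vulnerabilities.
feature_names = ["exploit_likelihood", "install_ratio", "log_volume"]
X = [
    [1.0, 1.0, 0.5],
    [-1.0, 1.0, -0.5],
    [1.0, -1.0, -0.5],
    [-1.0, -1.0, 0.5],
]
# Made-up risk score that depends almost entirely on the first feature.
y = [2.05, -1.95, 1.95, -2.05]

weights = lasso_coordinate_descent(X, y, alpha=0.1)
selected = [name for name, weight in zip(feature_names, weights) if weight != 0.0]
print(selected)
# → ['exploit_likelihood']
```

The penalty removes the two weak features entirely rather than merely shrinking them, which is why Lasso doubles as a feature selection method.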

Model Training

Learning Types

In general, machine learning is the task of learning a function that maps an input to an output.

Supervised Learning

When the learning is based on labeled data, i.e. input-output pairs, it is referred to as supervised learning. A supervised learning algorithm analyzes and generalizes the training data to produce a function that can be used for mapping new examples (correctly determining the class labels for unseen instances). One of the main uses of supervised learning is solving classification problems, for example predicting whether a vulnerability is likely to be exploited.
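
As a small illustration of supervised classification, the sketch below trains a logistic regression with plain gradient descent on a handful of labeled examples. Logistic regression is just one possible classifier, chosen here for brevity; the two features and all training values are invented.

```python
import math


def predict_proba(weights, bias, features):
    """Estimated probability that the example belongs to the positive class."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(X, y, learning_rate=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, d = len(X), len(X[0])
    weights = [0.0] * d
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * d
        grad_b = 0.0
        for features, label in zip(X, y):
            error = predict_proba(weights, bias, features) - label
            for j in range(d):
                grad_w[j] += error * features[j]
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


# Hypothetical labeled data: [has_public_exploit, severity from 0 to 1].
X_train = [[1, 0.9], [1, 0.8], [1, 0.7], [0, 0.3], [0, 0.2], [0, 0.4]]
y_train = [1, 1, 1, 0, 0, 0]  # 1 = the vulnerability was exploited

weights, bias = train_logistic_regression(X_train, y_train)

# Score two unseen vulnerabilities.
likely = predict_proba(weights, bias, [1, 0.85])
unlikely = predict_proba(weights, bias, [0, 0.25])
print(f"likely={likely:.2f} unlikely={unlikely:.2f}")
```

The trained function outputs a likelihood for inputs it has never seen, which is the generalization step described above.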

Unsupervised Learning

In contrast to supervised learning, which uses labeled data, unsupervised learning detects patterns in a dataset that is not labeled. One of the main methods used in unsupervised learning is cluster analysis. Clustering groups a set of objects so that objects in the same cluster are more similar to each other than to those in other clusters. Recommending possible treatments is an example of a task that suits clustering.
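
Clustering can be sketched with k-means, one common cluster analysis algorithm. The example below groups unlabeled vulnerabilities by two invented features; the data, the choice of k-means, and the explicit starting centroids are all assumptions made to keep the sketch short and deterministic.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, initial_centroids, n_iters=10):
    """Alternate between assigning points to the nearest centroid and re-averaging."""
    centroids = [list(c) for c in initial_centroids]
    labels = [0] * len(points)
    for _ in range(n_iters):
        labels = [
            min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Hypothetical unlabeled data: [exploit_likelihood, patch_available].
points = [
    [0.90, 1.0], [0.80, 1.0], [0.85, 1.0],
    [0.10, 0.0], [0.20, 0.0], [0.15, 0.0],
]

labels, centroids = kmeans(points, initial_centroids=[points[0], points[3]])
print(labels)
# → [0, 0, 0, 1, 1, 1]
```

Vulnerabilities that land in the same cluster resemble one another, so a treatment that worked for one member is a reasonable candidate to recommend for the others.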

Semi-Supervised Learning

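Semi-supervised learning combines supervised and unsupervised techniques, typically training on a small amount of labeled data together with a larger amount of unlabeled data.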

Live Threat Insights

Using the features mentioned above, we can train a model that classifies the data into MITRE threat groups. Given an organization’s data, it identifies risky combinations of information in software that are associated with a MITRE threat.

Threat Examples

Without a trained model, the security team needs to analyze a high volume of data and search for threats manually. Once the machine learning model is trained, it automatically identifies the risky combinations for them.

For example:

Exploit Public-Facing Application - Identifying a vulnerable version of Firefox that is installed on more than 50% of the organization’s assets, is highly used, and uses the network.

Exploitation for Defense Evasion - Identifying an exploitable critical vulnerability in Active Directory that is installed on three assets.

Exploitation for Credential Access - Identifying a vulnerable version of Python that is running with admin credentials on twenty assets and exposes the organization to a credential theft attack.

Exploitation of Remote Services - Identifying a vulnerable version of OpenSSL that uses a vulnerable port, is installed on more than 50% of the organization's assets, exposes it to an RCE attack, and has an available patch.

Compromise Software Supply Chain - Identifying a vulnerable version of Splunk that exposes the organization to loss of data integrity in case of exploitation.

Exfiltration Tactics - Identifying a vulnerable SharePoint Server that is highly used, uses the network, and exposes the organization to sensitive information disclosure in the event of a successful attack.
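
To make the idea of a risky combination concrete, the sketch below spells out the first example above as an explicit check. It is a hand-written stand-in used only to show which features combine, not the trained model itself; the record fields and the usage threshold are assumptions.

```python
from dataclasses import dataclass


@dataclass
class AppObservation:
    """Hypothetical per-application summary built from the collected features."""
    name: str
    is_vulnerable_version: bool
    installed_assets: int
    total_assets: int
    launches_per_day: float
    uses_network: bool


HIGH_USAGE_THRESHOLD = 5.0  # assumed cut-off for "highly used"


def matches_public_facing_pattern(app):
    """Vulnerable, on more than 50% of assets, highly used, and uses the network."""
    coverage = app.installed_assets / app.total_assets
    return (
        app.is_vulnerable_version
        and coverage > 0.5
        and app.launches_per_day >= HIGH_USAGE_THRESHOLD
        and app.uses_network
    )


browser = AppObservation("browser", True, 60, 100, 12.0, True)
rare_tool = AppObservation("rare-tool", True, 3, 100, 0.2, False)

print(matches_public_facing_pattern(browser))    # → True
print(matches_public_facing_pattern(rare_tool))  # → False
```

A trained model learns combinations like this one from the data instead of having them written by hand, which is what lets it scale beyond a fixed rule list.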

Topia Customized Threat Detection

In a world where the volume of data keeps growing and past methods for analyzing data and identifying threats are no longer reliable, Topia offers a threat detection solution based on data science tools. Topia keeps track of changes in your data and automatically alerts you about new threats. To help teams better understand each threat, Topia presents some of the features that the model based its decision on.

Feel free to give it a try and see what insights you discover!
