My previous post "Source Address Classification - Feature Engineering" introduced the background of the source address classification problem and a critical part of the work - generating the features to be fed into a machine learning model. Before training a supervised classifier, we want to do clustering to explore the inherent grouping of the data to get more ground truth. By clustering, we can also test our feature set to see if it captures the differences between a resolver's traffic and a monitor's traffic.

Clustering is an unsupervised learning method, unlike supervised learning, it doesn't learn from a prior knowledge (training data), but infers the natural structure present within the data. By clustering, we expect to group source addresses with similar features, thus each group is likely to contain a certain type of sources (resolver, monitor or unknown). In this post, we'll share our experience of clustering source addresses based on their DNS queries to .nz name servers.

Data preprocessing

As mentioned in my previous blog, by removing noise and filtering out low active source addresses, our dataset was reduced to 82k samples. Each sample has 49 features, the values of which vary in different scales. For example, the feature of 'the number of unique query types per day' has values in [1, 16], while another feature, 'the number of unique domains queried per day', varies from 1 to 1M. Algorithms that rely on geometrical distances between data points, such as K-Means and Gaussian Mixture Model, are sensitive to feature scales. The features with high magnitudes could dominate the objective function and make the model unable to learn from other features.

To make each of our features weight equivalently in the clustering, we need to normalize the data so that all features are on a similar scale. There're various scalers and transformers available here, and their effects on clustering are quite subtle. I picked Standardization and Quantile Transformation as the two options included in our evaluation. Standardization is generally used in machine learning models, which standardizes each feature by shifting the mean to 0 and scaling the variance to 1. However, it's sensitive to the presence of outliers as illustrated in this example. Quantile Transformer is robust to outliers and can transform the features to follow a Normal Distribution, which is assumed by many machine learning algorithms.

Choosing algorithms

There're different types of clustering algorithms based on the methodology used and the definition of clusters. Each algorithm has its pros and cons as compared here. It's very common that an algorithm that works well on one dataset could fail on a different dataset. The more you know about your data, the easier for you to choose the right algorithm. However, most of the time, we don't know much about our data, that's why we do clustering to get some insights. In addition, all algorithms have parameters; how to pick settings for the parameters also depends on your data. Due to these traits, choosing the right clustering algorithm is a process full of experimentation and exploration. As all the practical machine learning problems, the principle is to try a most common, easier and faster algorithm before looking at more complex ones.

When choosing a clustering algorithm, we need to consider multiple factors, such as stability, scalability, performance, parameter intuition, etc. Besides, some algorithms allow us to choose the metric used to measure the similarity between data points, which could yield a different clustering result. While Euclidean Distance is often the default and most commonly used, a variety of distance metrics are available. You could get a different clustering result by choosing a different similarity metric.

Model selection

Grid search is commonly used in supervised learning for hyper-parameter tuning and model selection. But for unsupervised learning like clustering, there's no existing method that automates the model selection process. This is possibly due to the fact that there's no evaluation standard universally applied to every case.

A range of metrics exists for clustering results assessment. Most of the metrics require the knowledge of the ground truth classes of the samples, which is rarely available in practice, such as Adjust Rand Index, Homogeneity Score, and Completeness Score. A few internal metrics don't require knowledge of the ground truth classes, such as Silhouette Score that measures how well the resulting clusters are separated.

Our objective is to separate resolvers and monitors, and we have a set of samples labeled with these two types, so we can calculate the scores on these samples using their ground truth classes and predicted classes. Finally, I choose Adjust Rand Index as the criteria for selecting the best performing model on our data.

We tried K-Means, Gaussian Mixture Model, MeanShift, DBSCAN and Agglomerative Clustering with adjustments of hyper-parameters. The model with the highest Adjust Rand Index is Gaussian Mixture with 5 clusters, using feature standardization. It has Adjust Rand Index of 0.849789 (1 is the perfect match), Homogeneity Score of 0.960086 (1 means the best homogeneity of each cluster), Completeness Score of 0.671609 (1 means all members of a given class are assigned to the same cluster). The lower completeness score is due to that resolvers are separated into multiple clusters, which is acceptable as long as they are different clusters from which the monitors fall in.

Verification with ground truth

We collected some known addresses from the sources below:

  • Monitors:

    • ICANN monitoring
    • Pingdom monitoring
    • ThousandEyes monitoring
    • RIPE Atlas Probes
    • RIPE Atlas Anchors
    • AMP
  • Resolvers:

    • ISP
    • Google DNS
    • OpenDNS
    • Education & Research: Universities, research institutes, REANNZ
    • Addresses collected from RIPE Atlas probes: we ran an experiment involving RIPE Atlas probes around the world and collected the addresses of resolvers they used to send DNS queries.

Not all of them showed up in our data, and some were just not active enough and were filtered out as noise. Five clusters are predicted by the model selected with the highest score on these known addresses, as described in the last section. Now let's take a look at their distribution across these clusters. Here's a summarized table and a normalized heat map (Monitors are marked in red to distinguish them from resolvers):

We have the following observations:

  • Only 106 out of 15k monitor addresses and 2757 out of 4490 resolvers are in the table. Those missing samples are either invisible in our query traffic or removed as noise in our data cleaning step. RIPE Atlas Anchors and AMP monitors are completely absent in our test.
  • Cluster #1 captures all monitoring samples from ICANN, Pingdom and ThousandEyes, while it doesn't capture most RIPE Atlas Probes. The reason why some RIPE Atlas Probes behaved differently from other monitors could be the User-defined Measurements (UDM) with random behaviors.
  • 86% of OpenDNS resolvers also fall into Cluster #1, but referring to the numerical table above, there're only 7 samples from OpenDNS in our dataset, which makes the data less significant. But why only 7 showed up while we have 90 OpenDNS samples in total? We found that many OpenDNS addresses were removed as noise in the data cleaning due to low traffic and visibility. We need to further look into OpenDNS's specific behavior.
  • All Google DNS resolvers fall into Cluster #2, #4, well captured by the model.
  • We collected 2199 addresses of resolvers used by RIPE Atlas probes, they're dispersed over all clusters. By default, a RIPE Atlas probe acquires DNS resolver information using DHCP. It can also be configured to use a target resolver in some DNS measurements. So this sample set is likely to contain various resolvers, big and small, common and uncommon (like resolvers set up for experiments). That could be the reason for the dispersion.

We remove samples from RIPE Atlas Probes, OpenDNS, RIPE Atlas collected resolvers, whose pattern is not clear, and then aggregate the rest resolvers and monitors. We can see a clearer pattern:

Cluster #1 captures the monitors, while Cluster #2, #4 capture 97% resolvers. Cluster #0, #3 are two big clusters and we don't have many known samples fall in, so their patterns are not clear.

Next step

Our clustering model segregates the monitors and some major resolvers quite well. There're still some unclear patterns, and we need to improve our ground truth to further verify the model. Some cutting-edge techniques can be applied to visualize high dimensional data and interpret the model to help us better understand the clustering result. We'll explain that later.