Automatic close domain name variant detection using string similarities
Phishing, one of the most common social engineering attacks, aims to obtain an Internet user’s personal information through a phony website that looks like a legitimate site, or through a phishing email. Such attacks cause serious damage and losses. According to CERT NZ’s 2019 Q2 report, New Zealanders reported direct financial losses of $6.5 million from cybersecurity incidents. Among the different types of cybersecurity incidents, 45 percent are phishing and credential harvesting (445 incidents for 2019 Q1). The InternetNZ Research Team is responding to these pervasive attacks as part of its domain abuse detection project, alongside fake online store detection. We are developing an approach to automatically detect domains that are similar to popular New Zealand brands, that might be registered with non-validated registrant contact details, and that are associated with URLs and emails for malicious use.
One commonly used way to detect phishing is to create a blacklist of URLs that are classified as malicious sites. If a requested webpage is on the list, the connection is blocked. Although this approach has a low false-positive probability, a high-quality list requires constant effort to maintain, so the technique can fail to detect short-lived phishing sites. Unfortunately, the average lifespan of a phishing site is extremely short (e.g. under 16 hours as of 2016).
Another method is heuristic-based detection, which extracts and analyses features from the domain name and/or webpage and uses these features to detect or classify malicious domains and webpages. In this blog post, we introduce a close domain name variant detection approach that uses features from domain names. The model is fast to train and is a valuable tool for detecting potentially malicious domains at an early stage and for generating warnings so that the registry and Domain Name Commission staff can take further action.
There are several approaches to convincing an Internet user that a domain name or website is legitimate or looks like a popular brand. Take this URL as an example: http://paypal.com-security.active-userid.com. Although the real domain name is active-userid.com, one can make the domain look like paypal.com by adding subdomains. In our case, since the data we are using is only the domain name (e.g., domain.co.nz) part of the URL, which can only be set once at registration, our approach aims to detect the close domain name variant cases summarized in the table below:
|Type of Variant|Example|
|Character omission, permutation, substitution|payspal.co.nz|
|Combo using ‘-’|www-secure-paypal.co.nz|
|Add/change second level|paypal.ac.nz|
That is, cases where a domain name seems normal (i.e. it is not a close variant of any popular domain or brand) but is used for malicious purposes are outside the scope of this research project.
The features used to find close variants to a domain name are string similarity metrics. For any two domain names, we compare them by calculating similarity metrics in two ways:
Edit-based similarity metrics are calculated by counting the number of operations required to transform one domain name into another. Different metrics use different sets of operations. One of the most popular is Levenshtein distance, which counts single-character insertions, deletions and substitutions. For example, the Levenshtein distance for (google, goggle) is only 1. Other edit-based similarity metrics used in our approach include Jaro-Winkler, N-Gram, Longest Common Subsequence, etc. Note that these distance measures are normalized so that the corresponding similarity metrics take values in [0, 1].
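As a minimal sketch of one edit-based metric, the Python below computes Levenshtein distance and turns it into a similarity in [0, 1]. Dividing by the length of the longer string is an assumption made for illustration; the exact normalization used in our model is not described here, and the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - levenshtein(a, b) / longest
```

With this sketch, `levenshtein("google", "goggle")` returns 1, so the similarity is 1 - 1/6, roughly 0.83.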
Phonetic-based similarity metrics quantify how similar one domain name is to another in terms of pronunciation. Several phonetic algorithms have been developed to index words so that texts that sound similar receive the same index. For example, the metaphone indexes for 'iphone' and 'ifone' are both 'AFN'. To compare two phonetic indexes, we apply edit-based metrics to them to obtain numeric values and then use the mean of those values as the final metric for phonetic similarity.
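The idea of comparing phonetic indexes and averaging the scores can be sketched as follows. To stay self-contained, this example swaps in the simpler Soundex algorithm instead of Metaphone, and averages two stand-in string metrics (the `difflib` ratio and a bigram Dice coefficient). Both choices are assumptions for illustration, not the metrics used in our model.

```python
from difflib import SequenceMatcher

# Soundex digit for each consonant; vowels, y, h and w have no digit.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word: str) -> str:
    """American Soundex: first letter plus three digits, e.g. 'Robert' -> 'R163'."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    code = letters[0].upper()
    last = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate two equal digits
        digit = _CODES.get(ch, "")
        if digit and digit != last:
            code += digit
        last = digit  # a vowel resets 'last', so repeats across vowels are kept
    return (code + "000")[:4]


def dice(a: str, b: str) -> float:
    """Dice coefficient over the sets of character bigrams."""
    x = {a[i:i + 2] for i in range(len(a) - 1)}
    y = {b[i:i + 2] for i in range(len(b) - 1)}
    if not x and not y:
        return 1.0 if a == b else 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def phonetic_similarity(a: str, b: str) -> float:
    """Mean of several string metrics applied to the two phonetic codes."""
    code_a, code_b = soundex(a), soundex(b)
    scores = [SequenceMatcher(None, code_a, code_b).ratio(),
              dice(code_a, code_b)]
    return sum(scores) / len(scores)
```

Here 'iphone' and 'ifone' both map to the Soundex code 'I150', so their phonetic similarity is 1.0 even though the spellings differ.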
The following chart illustrates the close domain name variant detection process, which includes two phases: training and detection. The challenge in applying machine learning to this problem was that we did not have enough ground truth data for the positive cases (i.e. pairs of brands and their corresponding close domain variants) to train the classifier. To tackle this issue, we used dnstwist to generate close domain name variants for a set of popular .nz domains and combined them with randomly selected non-close variants to make a balanced training data set. All the domain names go through a preprocessing stage in which the top level domains and some general words (e.g. 'newzealand', 'aucklands', 'accounts') are removed. This step is essential because the shorter the string is, the easier it is to find its close string matches.
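The preprocessing step might look roughly like the sketch below. The three general words come from the examples above; the suffix list is an illustrative subset of .nz suffixes, and removing words by plain substring replacement is an assumption, not a description of our production code.

```python
# General words named in this post; a real list would be longer.
GENERAL_WORDS = ("newzealand", "aucklands", "accounts")

# Illustrative subset of .nz suffixes. The bare ".nz" comes last so that
# longer suffixes such as ".co.nz" are matched first.
NZ_SUFFIXES = (".co.nz", ".ac.nz", ".org.nz", ".net.nz", ".govt.nz", ".nz")


def preprocess(domain: str) -> str:
    """Strip the .nz suffix and general words from a domain name."""
    name = domain.lower().rstrip(".")  # tolerate a trailing root dot
    for suffix in NZ_SUFFIXES:
        if name.endswith(suffix):
            name = name[:-len(suffix)]
            break
    for word in GENERAL_WORDS:
        name = name.replace(word, "")  # assumed: plain substring removal
    return name.strip("-")  # drop hyphens left dangling at either end
```

For example, `preprocess("Paypal-Accounts.co.nz")` returns `"paypal"`, which can then be compared directly against the brand string.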
Training the classifier on artificially generated data results in a high false positive rate during the detection stage. Because the classifier is trained on positive data of highly close variants generated by dnstwist and negative data of highly different domain names, it tends to make positive predictions when it sees a domain name that is only slightly close to another. We solve this issue by feeding false positive detections back into the training stage, using them to retrain/update the classifier, and repeating this iterative process until the performance of the classifier is acceptable in the detection stage.
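The feedback loop can be illustrated with a deliberately tiny stand-in. This sketch assumes a single similarity score per domain pair and a "classifier" that is just a threshold. The real model uses many features, and `review` is a hypothetical callback standing in for manual checking of flagged domains.

```python
def fit_threshold(samples):
    """Pick the cut-off between observed scores that classifies the most
    training samples correctly. samples: list of (score, is_close_variant)."""
    scores = sorted({score for score, _ in samples})
    # Candidate cut-offs are midpoints between neighbouring scores.
    candidates = [(lo + hi) / 2 for lo, hi in zip(scores, scores[1:])] or scores
    best_threshold, best_correct = None, -1
    for threshold in candidates:
        correct = sum((score >= threshold) == label for score, label in samples)
        if correct > best_correct:  # ties keep the lowest threshold
            best_threshold, best_correct = threshold, correct
    return best_threshold


def retrain_with_feedback(training, unlabeled, review, max_rounds=10):
    """Retrain until no reviewed detection turns out to be a false positive."""
    training = list(training)
    threshold = fit_threshold(training)
    for _ in range(max_rounds):
        flagged = [score for score in unlabeled if score >= threshold]
        false_positives = [score for score in flagged if not review(score)]
        if not false_positives:
            break
        # Feed the false positives back as negative training examples.
        training.extend((score, False) for score in false_positives)
        threshold = fit_threshold(training)
    return threshold
```

In this toy setting, a model trained only on very close and very distant pairs starts with a cut-off in the wide gap between them, and each round of reviewed false positives pushes that cut-off upwards.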
Results and future work
Testing the final classifier on 2000 domain names against 200 popular New Zealand brands let us detect 9 domains, all of which are highly close variants of brands including 'itunes', 'spark', 'noelleeming', etc. This approach is currently being refined with more testing and will go into production to enable ongoing detection and reporting for use by registry and DNCL staff.
On the one hand, the performance will be continuously monitored and maintained, and the classifier will be updated/retrained regularly as more ground truth training data becomes available. On the other hand, the list of brands and keywords to check will be monitored and updated to capture seasonal trends and popular news and events. For example, words related to 'tax' and 'IRD' can be added to the list during tax payment season, when taxpayers are targeted.
A possible future improvement to the approach is to enable detection for other languages, such as Māori. IDN homograph attacks have recently attracted a lot of interest. As IDN registrations have been permitted since 2010, we look forward to enabling detection of IDN close variants in the near future.