<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[.nz Registry blog]]></title><description><![CDATA[Research, thoughts, stories and ideas from the team at the .nz Registry


We are a provider of critical Internet infrastructure and authoritative data services.






]]></description><link>https://blog.nzrs.net.nz/</link><image><url>https://blog.nzrs.net.nz/favicon.png</url><title>.nz Registry blog</title><link>https://blog.nzrs.net.nz/</link></image><generator>Ghost 1.26</generator><lastBuildDate>Tue, 17 Dec 2019 13:41:03 GMT</lastBuildDate><atom:link href="https://blog.nzrs.net.nz/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Building and deploying classifiers with Airflow and Docker]]></title><description><![CDATA[<div class="kg-card-markdown"><h1 id="classifyingfakeonlinestoreswithinnz">Classifying fake online stores within .nz</h1>
<p>InternetNZ is helping New Zealanders to harness the power of the Internet by ensuring that .nz is a safe place to do business. To play our part, the InternetNZ Research team is deploying a range of classifiers to identify malicious domains, starting with fake</p></div>]]></description><link>https://blog.nzrs.net.nz/building-and-deploying-classifiers-at-internetnz-with-airflow-and-docker/</link><guid isPermaLink="false">5ce35750f0566b00bfb0323c</guid><dc:creator><![CDATA[Gerard Barbalich (Resigned)]]></dc:creator><pubDate>Tue, 21 May 2019 02:08:00 GMT</pubDate><media:content url="https://blog.nzrs.net.nz/content/images/2019/06/airflow_docker_header_image-1.png" medium="image"/><content:encoded><![CDATA[<div class="kg-card-markdown"><h1 id="classifyingfakeonlinestoreswithinnz">Classifying fake online stores within .nz</h1>
<img src="https://blog.nzrs.net.nz/content/images/2019/06/airflow_docker_header_image-1.png" alt="Building and deploying classifiers with Airflow and Docker"><p>InternetNZ is helping New Zealanders to harness the power of the Internet by ensuring that .nz is a safe place to do business. To play our part, the InternetNZ Research team is deploying a range of classifiers to identify malicious domains, starting with fake online stores. This post details how we are using <a href="https://airflow.apache.org/">Apache Airflow</a> within <a href="https://www.docker.com/">Docker</a> containers to deploy these machine-learning trained classifiers.</p>
<h1 id="maintainingtrustinnz">Maintaining trust in .nz</h1>
<p>“Indian Love Story” wasn’t a romantic tale - rather, it was a fake sneaker website selling underpriced Nike Air Jordans. The domain has since been removed from the Domain Name System, but it was part of a growing number of fake online stores looking to benefit from the reputation of .nz. So in response, we built a classifier that identifies such fake online stores, which are then passed on to the <a href="https://dnc.org.nz/">New Zealand Domain Name Commission</a> (DNC) for investigation.</p>
<h1 id="featureexperimentation">Feature experimentation</h1>
<p>The DNC gifted us a list of domains from user-reported cases to begin the training of this tool. We verified this list and combined it with a random sample from the .nz domain registry to form our initial dataset.</p>
<p>For inspiration on features that would best distinguish these two groups, we turned to published work from similar projects - including consulting with our colleagues at <a href="https://www.sidn.nl/">SIDN</a>. These brainstormed features were then split into three broad categories for engineering and experimentation: domain-centered, registry-centered, and site-centered.</p>
<p>Domain-centered features relate to the content of the domain name, including its string length and component words. Registry-centered features relate to registrant and registry information held by the domain registry, including the number of domains a registrant holds, the tenure of a registrant, and the location of the registry. Site-centered features relate to the content on a domain, including text, images, and tags.</p>
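<p>As an illustration, a few simple domain-centered features could be computed like this (a sketch with hypothetical feature names, not our production feature set):</p>

```python
# Hypothetical domain-centered features; the names are illustrative only.
def domain_features(domain):
    name = domain.split(".")[0]          # label before the TLD suffix
    return {
        "length": len(name),
        "num_digits": sum(c.isdigit() for c in name),
        "num_hyphens": name.count("-"),
        "num_tokens": len(name.split("-")),
    }

print(domain_features("cheap-shoes4u.co.nz"))
# {'length': 13, 'num_digits': 1, 'num_hyphens': 1, 'num_tokens': 2}
```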
<p>Experimentation began with domain-centered and registry-centered features. We expanded the number of features using <a href="https://towardsdatascience.com/automated-feature-engineering-in-python-99baf11cc219">FeatureTools</a>, used <a href="https://machinelearningmastery.com/feature-selection-in-python-with-scikit-learn/">Recursive Feature Elimination</a> in combination with <a href="https://towardsdatascience.com/a-simple-example-of-pipeline-in-machine-learning-with-scikit-learn-e726ffbb6976">pipelines</a> to iterate through different feature combinations, and analysed feature importance using <a href="https://towardsdatascience.com/understanding-model-predictions-with-lime-a582fdff3a3b">LIME</a>. Unfortunately, the results with these features did not meaningfully distinguish the two groups. Further experimentation showed the simplicity of site-centered features to be key - with the best results coming from <a href="https://towardsdatascience.com/tfidf-for-piece-of-text-in-python-43feccaa74f8">Term Frequency-Inverse Document Frequency</a> (TFIDF) analysis upon the website text itself (sourced from our <a href="https://registry.internetnz.nz/dns/zone-and-web-scanning/">Web scans</a>).</p>
<h1 id="constructingapipeline">Constructing a pipeline</h1>
<p>A test classifier pipeline was constructed using Scikit-Learn's <a href="https://towardsdatascience.com/a-simple-example-of-pipeline-in-machine-learning-with-scikit-learn-e726ffbb6976">pipeline class</a> to join a <a href="https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html">TFIDF Vectoriser</a> and a <a href="https://towardsdatascience.com/the-random-forest-algorithm-d457d499ffcd">Random Forest</a> classifier. Training and testing this classifier pipeline yielded initial results of:</p>
<p><img src="https://blog.nzrs.net.nz/content/images/2019/05/01-pipeline-training-results-table.png" alt="Building and deploying classifiers with Airflow and Docker"></p>
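<p>A minimal sketch of such a pipeline follows (toy stand-in corpus and default hyperparameters; our real training data and settings differed):</p>

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

# Toy corpus standing in for scraped website text; 1 = fake online store.
texts = [
    "cheap nike air jordan huge discount free shipping",
    "replica sneakers outlet clearance sale cheap",
    "plumbing services wellington emergency repairs",
    "family law firm auckland consultations",
]
labels = [1, 1, 0, 0]

# Join the TFIDF vectoriser and the Random Forest into one pipeline,
# so vectorisation and classification are trained and applied together.
clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
])
clf.fit(texts, labels)

# predict_proba gives the probability per class for unseen website text.
probs = clf.predict_proba(["discount jordan sneakers cheap sale"])
```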
<p>The trained classifier pipeline was then tested on a fresh dataset: a random sample of new web content from the .nz domain registry. Several hundred new domains were labelled as “fake online stores”. These domains were manually classified by a team member to check accuracy, along with a random sample of equal size from the domains labelled as “not fake online stores”. Based on this manual classification, the trained classifier pipeline showed results of:</p>
<p><img src="https://blog.nzrs.net.nz/content/images/2019/05/01-pipeline-testing-results-table.png" alt="Building and deploying classifiers with Airflow and Docker"></p>
<p>We noted that the classifier showed a slight tendency towards false-positive classification; however, our team was happy with the results overall, and decided to proceed to deployment.</p>
<h1 id="usingatrainedpipelineinconjunctionwithourregistryaugmentationplatformrap">Using a trained pipeline in conjunction with our Registry Augmentation Platform (RAP)</h1>
<p>The RAP is a scalable distributed framework we have designed to collect data on domains. Our Data Engineer Asher Halliwell is leading its design, writing it in Python using the Celery framework.</p>
<p>It utilises a microservice-centered structure in combination with REST APIs to increase process modularity. This framework allows us to queue different processes as needed for individual workflow pipelines by calling the relevant API.</p>
<p>For example, the trained classifier pipeline from above is just one component of a multi-stage workflow to identify fake online stores. The entire workflow (pictured below) contains three preliminary steps before the classifier pipeline is used: gathering a list of domains to investigate, collecting content from those domains, and processing that collected content. Structuring the process in this way allows the first three APIs to be called separately for a range of different processes, not just the fake online store pipeline. While the RAP and separate APIs are designed for modularity and separation, we needed a tool to organise and orchestrate workflows in one place - for this, we have used Apache Airflow.</p>
<p><img src="https://blog.nzrs.net.nz/content/images/2019/06/RAP-archetecutre.png" alt="Building and deploying classifiers with Airflow and Docker"></p>
<h1 id="usingairflowanddockertoautomateourclassifierpipelines">Using Airflow and Docker to automate our classifier pipelines</h1>
<p>Using Apache Airflow and Docker to automate the running and reporting of classifiers is a popular trend within Data Science - with well-structured <a href="http://www.marknagelberg.com/getting-started-with-airflow-using-docker/">tutorial resources</a>.</p>
<h1 id="airflow">Airflow</h1>
<p>Airflow is an open-source workflow management system originally developed by Airbnb, allowing the automation and scheduling of scripts or workflows.</p>
<p>Airflow is a sophisticated way to schedule and run Extract-Transform-Load jobs, as it allows multiple layers of dependencies and provides a simple but effective visualisation for monitoring all your jobs (<a href="https://medium.com/@jGage718/apache-airflow-on-docker-for-complete-beginners-cf76cf7b2c9a">detailed here</a>). Our team previously decided that Airflow was a good fit for managing our data workflows, and we have already been using it to run <a href="https://blog.nzrs.net.nz/improving-data-workflows-with-airflow-and-pyspark/">PySpark in an Airflow task</a>.</p>
<h1 id="docker">Docker</h1>
<p><a href="https://docs.docker.com/get-started/">Docker</a> is a platform for developers and sysadmins to develop, pack, deploy, and run applications within containers. Using containers rather than virtual machines cuts down on performance demands drastically while maintaining the benefits of containment.</p>
<h1 id="runningairflowwithindocker">Running Airflow within Docker</h1>
<p>Layering Airflow with Docker gives us the benefits of both applications. First, we can automate, schedule, and monitor workflows within one system - Airflow. Second, as business needs dictate, we can adjust the process of that workflow, change its scheduled interval, or replicate and tweak it. Running Airflow within Docker maintains all of these advantages while adding the ability to contain, replicate, and re-deploy the entire process as needed.</p>
<p>For example, we have stored a Docker image containing the trained classifier pipeline above, as well as the Airflow Directed Acyclic Graph (DAG) and associated plugins that schedule the entire process. We can now replicate this Docker image and alter it as needed, whether to test a proposed alteration in the pipeline or to host a different classifier and associated DAG. The ease with which we can adapt and experiment with this entire pipeline was a big drawcard for our team and a large contributor to why we are using Docker.</p>
<h1 id="creatingadagtorunourclassifierpipeline">Creating a DAG to run our classifier pipeline</h1>
<p>Workflows within Airflow are built upon DAGs, which use operators to define the ordering and dependencies of the tasks within them. Each operator typically defines a single task, commonly acting as triggers or markers of status.</p>
<p>Trigger operators within Airflow initiate events, while sensor (or &quot;status&quot;) operators verify states. We applied this model of trigger and sensor operators throughout our DAG. Triggers and sensors are commonly separated because of the discrepancy in run-time between them.</p>
<p>For example, within our DAG the 'Webscan Trigger' operator triggers the initiation of a webscan for a selected list of domains. The running of this webscan may take anywhere from minutes to hours depending upon the number of domains. The 'Webscan Status' operator then periodically checks the status of the webscan until it is marked as complete. These actions are separated so that the status of the webscan can be periodically determined without repeatedly triggering it.</p>
<p>Cumulatively, the Airflow-relevant files for our classifier pipeline consist of four plugin files and one DAG file. Each plugin is responsible for a separate part of the ETL pipeline, and the DAG file calls those plugins. These tasks align with the stages outlined in the RAP section above, and so each plugin calls a unique API where appropriate. The tasks for this classifier pipeline, coordinated by this DAG, are: loading a list of target domains, triggering the Webscan upon those domains, processing the content of that Webscan, and loading the processed data to make predictions.</p>
<p>Here is an example of our DAG file:</p>
<pre><code>from airflow import DAG
from datetime import datetime, timedelta
from airflow.operators.domain_list import GetDomainList
from airflow.operators.webscan import WebscanTrigger
from airflow.sensors.webscan import WebscanStatus
from airflow.operators.webscan_process import WebscanProcessingTrigger
from airflow.sensors.webscan_process import WebscanProcessingStatus
from airflow.operators.load_and_predict import LoadAndPredict
from airflow.operators.email_operator import EmailOperator

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'provide_context': True,
    'start_date': datetime(2019, 5, 7),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
}

detection_dag = DAG(
    'fake_web_shop_detection', default_args=default_args)

get_domain_list = GetDomainList(task_id='get_domain_list', dag=detection_dag)

webscan_trigger = WebscanTrigger(task_id='webscan_start', dag=detection_dag)

webscan_completed = WebscanStatus(task_id='webscan_status', dag=detection_dag, poke_interval=2)

processing_trigger = WebscanProcessingTrigger(task_id='webscan_processing', dag=detection_dag)

processing_completed = WebscanProcessingStatus(task_id='processing_status', dag=detection_dag, poke_interval=2)

load_and_predict = LoadAndPredict(task_id='load_and_predict', dag=detection_dag)

predictions_completed = EmailOperator(task_id='predictions_completed',
                            to='#########',
                            subject='weekly fake webshops predictions completed',
                            html_content='Weekly fake webshop predictions completed.',
                            dag=detection_dag)

get_domain_list &gt;&gt; webscan_trigger &gt;&gt; webscan_completed &gt;&gt; processing_trigger &gt;&gt; processing_completed &gt;&gt; load_and_predict &gt;&gt; predictions_completed
</code></pre>
<h1 id="futurework">Future work</h1>
<p>Using Airflow and Docker has allowed our team to quickly and repeatedly deploy machine-learning trained classifiers. We will use the structure outlined here to deploy more classifiers, always aiming to solve interesting business problems in .nz. In future, we will use this structure to build tools for industry classification, and for classifying parked domains and other forms of malicious domains.</p>
</div>]]></content:encoded></item><item><title><![CDATA[DNS Flag day, the aftermath]]></title><description><![CDATA[<div class="kg-card-markdown"><p>Back in October 2018 we <a href="https://blog.nzrs.net.nz/dns-flag-day/">blogged</a> about the upcoming <a href="https://dnsflagday.net">DNS Flag Day</a>, and how it could potentially affect .nz domains.</p>
<p>As InternetNZ regularly tested .nz domains for potential failure after 1 February 2019, together with a communication campaign involving DNCL to reach those affected to implement a fix, we accumulated</p></div>]]></description><link>https://blog.nzrs.net.nz/dns-flag-day-aftermath/</link><guid isPermaLink="false">5c579fde0e3b8700bf35db75</guid><category><![CDATA[DNS]]></category><category><![CDATA[.nz]]></category><category><![CDATA[DNS Flag Day]]></category><category><![CDATA[Research]]></category><dc:creator><![CDATA[Sebastian Castro]]></dc:creator><pubDate>Thu, 07 Feb 2019 22:43:38 GMT</pubDate><media:content url="https://blog.nzrs.net.nz/content/images/2019/02/DNS_Flag_Day_hero_07.png" medium="image"/><content:encoded><![CDATA[<div class="kg-card-markdown"><img src="https://blog.nzrs.net.nz/content/images/2019/02/DNS_Flag_Day_hero_07.png" alt="DNS Flag day, the aftermath"><p>Back in October 2018 we <a href="https://blog.nzrs.net.nz/dns-flag-day/">blogged</a> about the upcoming <a href="https://dnsflagday.net">DNS Flag Day</a>, and how it could potentially affect .nz domains.</p>
<p>As InternetNZ regularly tested .nz domains for potential failure after 1 February 2019, and ran a communication campaign with DNCL to reach those affected so they could implement a fix, we accumulated data that helps us tell the story of how our community reacted to this event.</p>
<p>As the DNS Flag Day initiative also gathered support from public DNS resolvers such as Cloudflare, Google DNS and Quad9, breakage expectations changed completely. What was originally expected to be a slow rollout turned into a faster rollout with more bite.</p>
<img src="https://blog.nzrs.net.nz/content/images/2019/02/Screen-Shot-2019-01-31-at-5.00.30-PM.png" alt="DNS Flag day, the aftermath" style="width: 50%;">
<p>As a mental refresher: the DNS Flag Day test applies 9 different EDNS tests to a nameserver, expecting a valid DNS response. If a query leads to a timeout, it's considered a failure. Any other DNS response is considered acceptable, but not perfect.</p>
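<p>In sketch form, the per-test scoring reads like this (our interpretation of the criteria, not the actual test suite's code; the 'valid' flag is an illustrative stand-in for full response validation):</p>

```python
def edns_test_result(response):
    """Classify one EDNS test outcome per the Flag Day criteria.
    response: None for a timeout, otherwise a dict with a 'valid' bool."""
    if response is None:
        return "failure"          # a timeout is the only hard failure
    return "perfect" if response["valid"] else "acceptable"
```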
<p>The number of nameservers hosting .nz domains was quite stable during the collection period: 25,816 nameserver addresses were tested back in July 2018, compared to 26,650 unique addresses at the end of January 2019.</p>
<p>The figure below shows how the compliance by nameserver changed in 6 months, by showing the fraction of nameservers fully passing each test.</p>
<div>
    <a href="https://plot.ly/~secastro/167/?share_key=a31ETFR78KwgVmzDcQa6uq" target="_blank" title="Plot 167" style="display: block; text-align: center;"><img src="https://plot.ly/~secastro/167.png?share_key=a31ETFR78KwgVmzDcQa6uq" alt="DNS Flag day, the aftermath" style="max-width: 100%;width: 800px;" width="800" onerror="this.onerror=null;this.src='https://plot.ly/404.png';"></a>
    <script data-plotly="secastro:167" sharekey-plotly="a31ETFR78KwgVmzDcQa6uq" src="https://plot.ly/embed.js" async></script>
</div>
<p>In general there were improvements in all tests, mainly around the EDNS1 tests. A keen observer will notice, however, a drop in the OPTLIST test.</p>
<h2 id="cznicanddomainstatus">CZ.NIC and domain status</h2>
<p>Our colleagues from CZ.NIC wrote a <a href="https://gitlab.labs.nic.cz/knot/edns-zone-scanner/">tool</a> to analyze a minimal set of nameservers and produce a state for a given domain. Their tool assigns each domain one state before and one after the EDNS workarounds have been removed.</p>
<ul>
<li><strong>OK</strong>: All addresses for all nameservers of a domain pass the EDNS tests.</li>
<li><strong>Compatible</strong>: None of the EDNS tests produce a timeout. There might be some non-critical errors.</li>
<li><strong>High Latency</strong>: Some of the nameserver addresses generate timeouts.</li>
<li><strong>Dead</strong>: All nameserver addresses generate timeouts.</li>
</ul>
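<p>Our reading of these rules can be condensed into a small function (a sketch, not CZ.NIC's actual implementation; each nameserver address is summarised here by whether its EDNS tests timed out or produced non-critical errors):</p>

```python
def domain_state(nameservers):
    """nameservers: one dict per nameserver address, with 'timeout' and
    'error' booleans summarising its EDNS test results."""
    timeouts = [ns["timeout"] for ns in nameservers]
    if all(timeouts):
        return "Dead"            # every address times out
    if any(timeouts):
        return "High Latency"    # some addresses time out
    if any(ns["error"] for ns in nameservers):
        return "Compatible"      # no timeouts, some non-critical errors
    return "OK"                  # all addresses pass all tests
```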
<p>The plot below shows how the classification of the .nz register changed over time during the communication campaign.</p>
<div>
    <a href="https://plot.ly/~secastro/169/?share_key=kZVJsOaEIjqGfjdBVhXtln" target="_blank" title="Plot 169" style="display: block; text-align: center;"><img src="https://plot.ly/~secastro/169.png?share_key=kZVJsOaEIjqGfjdBVhXtln" alt="DNS Flag day, the aftermath" style="max-width: 100%;width: 800px;" width="800" onerror="this.onerror=null;this.src='https://plot.ly/404.png';"></a>
    <script data-plotly="secastro:169" sharekey-plotly="kZVJsOaEIjqGfjdBVhXtln" src="https://plot.ly/embed.js" async></script>
</div>
<p>The evidence of the success of our efforts is shown here: we moved from 35% of domains passing with flying colors to over 70%, and we dropped the number of <strong>Dead</strong> domains from 14% to 7%.</p>
<p>Because there are many reasons why a nameserver might not answer our queries, we focused specifically on the domains that would break due to DNS Flag Day effects - those we had been actively chasing.</p>
<div>
    <a href="https://plot.ly/~secastro/171/?share_key=V3TOcLU4SzAhJY2SiuRdAj" target="_blank" title="Plot 171" style="display: block; text-align: center;"><img src="https://plot.ly/~secastro/171.png?share_key=V3TOcLU4SzAhJY2SiuRdAj" alt="DNS Flag day, the aftermath" style="max-width: 100%;width: 800px;" width="800" onerror="this.onerror=null;this.src='https://plot.ly/404.png';"></a>
    <script data-plotly="secastro:171" sharekey-plotly="V3TOcLU4SzAhJY2SiuRdAj" src="https://plot.ly/embed.js" async></script>
</div>
<p>This is a totally different picture! We went from over 8000 affected domains (1.2% of the registry) down to 508 (0.07% of the registry). That's a great improvement, considering we also focused on popular domain names on the list, such as a few banks, Government agencies and media outlets.</p>
<h2 id="howdnsflagdaywaslivedbyothermembersofthecommunity">How other members of the community experienced DNS Flag Day</h2>
<p>During Saturday 2 February NZDT there was continuous activity on Twitter about the day. Below are a few selected links and tweets.</p>
<ul>
<li>ISC, one of the DNS Flag Day organizers and authors of the DNS Compliance test, wrote a <a href="https://www.isc.org/blogs/dns-flag-day-was-it-a-success/">blog post</a> wrapping up the day.<br>
<img src="https://blog.nzrs.net.nz/content/images/2019/02/Screen-Shot-2019-02-04-at-4.38.37-PM.png" alt="DNS Flag day, the aftermath"></li>
<li>PowerDNS <a href="https://blog.powerdns.com/2019/02/01/changes-in-the-powerdns-recursor-4-2-0/">announcement</a> of a new version of PowerDNS Recursor without the workarounds.</li>
<li>Quad9 <a href="https://quad9.net/dns-flag-day-2019/">announcement</a></li>
<li>Google DNS <a href="https://groups.google.com/d/msg/public-dns-announce/-qaRKDV9InA/CsX-2fJpBAAJ">announcement</a> and <a href="https://groups.google.com/d/msg/public-dns-announce/-qaRKDV9InA/tExCFrppAgAJ">update</a></li>
<li>Domain Pulse picking up our <a href="http://www.domainpulse.com/2019/02/01/today-is-dns-flag-day-and-globally-thousands-of-domains-likely-to-break/">media release</a></li>
<li>Thousand Eyes and their guide to <a href="https://blog.thousandeyes.com/surviving-dns-flag-day/">surviving</a> DNS Flag Day</li>
<li>And some of our colleagues in other ccTLDs produced some updates on Twitter<br>
<img alt="DNS Flag day, the aftermath" src="https://blog.nzrs.net.nz/content/images/2019/02/Screen-Shot-2019-02-04-at-4.47.33-PM.png" style="width: 48%; float: left; left: 30%;"><img alt="DNS Flag day, the aftermath" src="https://blog.nzrs.net.nz/content/images/2019/02/Screen-Shot-2019-02-04-at-4.44.16-PM.png" style="width: 48%; left: 40%;"></li>
</ul>
</div>]]></content:encoded></item><item><title><![CDATA[Improving data workflows with Airflow and PySpark]]></title><description><![CDATA[<div class="kg-card-markdown"><p>Within the Technical Research team, we have developed many data workflows in a variety of projects. These workflows normally need to run on a schedule, contain multiple tasks to execute and a network of data dependencies to manage. We have a requirement to monitor the execution of a workflow to</p></div>]]></description><link>https://blog.nzrs.net.nz/improving-data-workflows-with-airflow-and-pyspark/</link><guid isPermaLink="false">5c073afe0e3b8700bf35db4d</guid><category><![CDATA[Airflow]]></category><category><![CDATA[PySpark]]></category><category><![CDATA[Workflow management]]></category><dc:creator><![CDATA[Jing Qiao]]></dc:creator><pubDate>Sun, 16 Dec 2018 20:46:59 GMT</pubDate><media:content url="https://blog.nzrs.net.nz/content/images/2018/12/art-blur-bokeh-1165652.jpg" medium="image"/><content:encoded><![CDATA[<div class="kg-card-markdown"><img src="https://blog.nzrs.net.nz/content/images/2018/12/art-blur-bokeh-1165652.jpg" alt="Improving data workflows with Airflow and PySpark"><p>Within the Technical Research team, we have developed many data workflows in a variety of projects. These workflows normally need to run on a schedule, contain multiple tasks to execute and a network of data dependencies to manage. We have a requirement to monitor the execution of a workflow to make sure each task is successful, and when there's a failure, we can quickly locate the problem and resume the workflow later.</p>
<p>We found <a href="https://github.com/apache/incubator-airflow">Apache Airflow</a> meets our needs to manage workflows. It's an open-source platform for describing, executing and monitoring workflows, originally built by Airbnb and now widely used by many companies. This post is not meant to be an extensive tutorial for Airflow, instead, we'll take the <a href="https://blog.nzrs.net.nz/two-years-of-nz-zone-scans/">Zone Scan data processing</a> as an example, to show how Airflow improves workflow management.</p>
<p>We also wanted to speed up our big data analysis by migrating <a href="https://en.wikipedia.org/wiki/Apache_Hive">Hive</a> queries to <a href="https://spark.apache.org/">Apache Spark</a>. We'll introduce how we use <a href="http://spark.apache.org/docs/2.2.0/api/python/pyspark.html">PySpark</a> in an Airflow task to achieve this purpose.</p>
<h2 id="currentworkflowmanagement">Current workflow management</h2>
<p>We didn't have a common framework for managing workflows. Workflows created at different times by different authors were designed in different ways. For example, the Zone Scan processing used a <a href="https://en.wikipedia.org/wiki/Makefile">Makefile</a> to organize jobs and dependencies; Make is originally an automation tool for building software, and not very intuitive for people unfamiliar with it.</p>
<h2 id="migratingtoairflow">Migrating to Airflow</h2>
<p>Airflow is a modern system specifically designed for workflow management with a Web-based User Interface. We explored this by migrating the Zone Scan processing workflows to use Airflow.</p>
<p>An Airflow workflow is designed as a <a href="https://en.wikipedia.org/wiki/Directed_acyclic_graph">DAG</a> (Directed Acyclic Graph), consisting of a sequence of tasks without cycles. The structure of a DAG can be viewed on the Web UI as in the following screenshot for the <strong>portal-upload-dag</strong> (one of the workflows in the Zone Scan processing).<br>
<img src="https://blog.nzrs.net.nz/content/images/2018/12/dag_view-3.jpg" alt="Improving data workflows with Airflow and PySpark"></p>
<p><strong>portal-upload-dag</strong> is a workflow to generate reports from the Zone Scan data and upload them to the <a href="https://idp.nz/Domain-Names/nz-Zone-Scan/ep35-2s5u">Internet Data Portal</a> (IDP). We can clearly see the three main tasks and their dependencies (running in the order indicated by the arrows):</p>
<ol>
<li><strong>getdata-subdag</strong>: to extract all data needed on the reports</li>
<li><strong>prepare-task</strong>: to prepare the data in a format ready for uploading</li>
<li><strong>upload-task</strong>: to upload to the IDP.</li>
</ol>
<p>To make a complex DAG easier to maintain, a sub-DAG can be created to encapsulate a nested workflow, such as <strong>getdata-subdag</strong>. A sub-DAG can be zoomed into to show the tasks it contains. Below is the graph view after zooming into <strong>getdata-subdag</strong>:<br>
<img src="https://blog.nzrs.net.nz/content/images/2018/12/subdag_view.jpg" alt="Improving data workflows with Airflow and PySpark"></p>
<p>The status of the tasks for the latest run is indicated by colour, making it very easy to see what's happening at a glance.</p>
<p>You can interact with a task through the web UI. This is often useful when debugging a task and you want to manually run it, ignoring its dependencies. Some actions can be performed on a task instance as shown in the following screenshot:<br>
<img src="https://blog.nzrs.net.nz/content/images/2018/12/task-action-1.jpg" alt="Improving data workflows with Airflow and PySpark"></p>
<p>The Airflow UI contains many other views that cater for different needs, such as a DAG status view that spans across runs, a Gantt chart showing the order in which tasks run and which tasks take a long time (as shown in the following screenshot), and a historical task duration graph. It also allows drilling into task details for metadata and log information, which is extremely convenient for troubleshooting.<br>
<img src="https://blog.nzrs.net.nz/content/images/2018/12/gantt.jpg" alt="Improving data workflows with Airflow and PySpark"></p>
<p>Multiple workflows can be monitored in Airflow through the following view:<br>
<img src="https://blog.nzrs.net.nz/content/images/2018/12/dag_list.jpg" alt="Improving data workflows with Airflow and PySpark"></p>
<p>A list of DAGs in your environment is shown with summarized information and shortcuts to useful pages. You can see how many tasks succeeded, failed, or are currently running at a glance.</p>
<p>To use Airflow, you need to write Python scripts to describe workflows, which increases flexibility. For example, a batch of tasks can be created in a loop, and dynamic workflows can be generated in various ways. Different types of operators can be used to execute a task, such as BashOperator to run a Bash command and PythonOperator to call a Python function; there are also specific operators such as HiveOperator and S3FileTransformOperator, plus more operators built by the community. Tasks can be configured with a set of arguments, such as schedule, retries, timeout, catchup, and trigger rule.</p>
<p>Airflow also has more advanced features which make it very powerful, such as branching a workflow, hooking to external platforms and databases like Hive, S3, Postgres, HDFS, etc., running tasks in parallel locally or on a cluster with task queues such as <a href="http://www.celeryproject.org/">Celery</a>.</p>
<p>Airflow can be integrated with many well-known platforms such as Google Cloud Platform (GCP) and Amazon Web services (AWS).</p>
<h2 id="runningpysparkinanairflowtask">Running PySpark in an Airflow task</h2>
<p>We use many Hive queries running on Hadoop in our data analysis, and wanted to migrate them to Spark, a faster big data processing engine. As we use Python in most of our projects, PySpark (Spark Python API) naturally becomes our choice.</p>
<p>With the <a href="https://spark.apache.org/sql/">Spark SQL module</a> and HiveContext, we wrote python scripts to run the existing Hive queries and UDFs (User Defined Functions) on the Spark engine.</p>
<p>To embed the PySpark scripts into Airflow tasks, we used Airflow's BashOperator to run Spark's spark-submit command to launch the PySpark scripts on Spark.</p>
<p>After migrating the Zone Scan processing workflows to use Airflow and Spark, we ran some tests and verified the results. The workflows were completed much faster with expected results. Moreover, the progress of the tasks can be easily monitored, and workflows are more maintainable and manageable.</p>
<h2 id="futurework">Future work</h2>
<p>We explored Apache Airflow on the Zone Scan processing, and it proved to be a great tool to improve the current workflow management. We also succeeded in integrating PySpark scripts with Airflow tasks, which sped up our data analysis jobs.</p>
<p>We plan to use Airflow as a tool in all our projects across the team. In addition, a centralized platform can be established for all the workflows we have, which will definitely bring our workflow management to a new level.</p>
</div>]]></content:encoded></item><item><title><![CDATA[Detecting resolvers at .nz]]></title><description><![CDATA[<div class="kg-card-markdown"><p>This is a follow-up post to summarise the work of <a href="https://indico.dns-oarc.net/event/29/contributions/655/attachments/635/1042/Resolver_detection_using_machine_learning_-_OARC29-02.pdf">resolver detection presented at DNS-OARC 29</a>. We built a classifier that can tell, with certain probability, if a source address observed at .nz represents a DNS resolver or not. Started two years ago, it has been a trail-blazing task with</p></div>]]></description><link>https://blog.nzrs.net.nz/detecting-resolvers-at-nz/</link><guid isPermaLink="false">5bd63889cd17dc00bff2420d</guid><dc:creator><![CDATA[Jing Qiao]]></dc:creator><pubDate>Mon, 12 Nov 2018 20:13:19 GMT</pubDate><media:content url="https://blog.nzrs.net.nz/content/images/2019/06/banter-snaps-zgtTjXKxdDE-unsplash.jpg" medium="image"/><content:encoded><![CDATA[<div class="kg-card-markdown"><img src="https://blog.nzrs.net.nz/content/images/2019/06/banter-snaps-zgtTjXKxdDE-unsplash.jpg" alt="Detecting resolvers at .nz"><p>This is a follow-up post to summarise the work of <a href="https://indico.dns-oarc.net/event/29/contributions/655/attachments/635/1042/Resolver_detection_using_machine_learning_-_OARC29-02.pdf">resolver detection presented at DNS-OARC 29</a>. We built a classifier that can tell, with certain probability, if a source address observed at .nz represents a DNS resolver or not. Started two years ago, it has been a trail-blazing task with multiple iterations of exploration and improvement. The core work has been covered in my previous posts: <a href="https://blog.nzrs.net.nz/source-address-clustering-feature-engineering/">&quot;Source Address Classification - Feature Engineering&quot;</a> and <a href="https://blog.nzrs.net.nz/source-address-classification-clustering/">&quot;Source Address Classification - Clustering&quot;</a>. 
Here we'll summarise the final results and share the classifier's output for other DNS operators and interested parties to review.</p>
<h2 id="purpose">Purpose</h2>
<p>Our original intention for detecting resolvers was to remove noise from the DNS traffic used to calculate the domain popularity ranking we developed at .nz. As we observed, DNS traffic is noisy, containing monitoring hosts and spontaneous sources of unknown origin. We sought to consider only queries representing users' activity on the Internet.</p>
<h2 id="removingnoise">Removing noise</h2>
<p>Using our DNS expertise, we can identify and remove some likely noise to prepare a cleaner and smaller dataset for a machine learning model. We took the DNS traffic data across four weeks between 28 August 2017 and 24 September 2017 for our analysis. There were two million unique source addresses in this period, of which 27.8% only queried for one domain; 45.5% only queried for one query type; 25% only queried one of seven .nz name servers; and 65.8% sent no more than 10 queries per day. We assumed these sources either don't behave like a typical resolver or generate too little traffic to .nz for it to be representative. By removing these sources we reduced the number of addresses to 550K.</p>
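The pruning rules above can be sketched in Python on a hypothetical per-source summary; the field names and counts here are illustrative, not our actual pipeline:

```python
# Hypothetical per-source traffic summaries; field names are invented for illustration.
sources = [
    {"addr": "192.0.2.1", "domains": 1, "qtypes": 3, "servers": 5, "max_daily_queries": 200},
    {"addr": "192.0.2.2", "domains": 40, "qtypes": 1, "servers": 7, "max_daily_queries": 900},
    {"addr": "192.0.2.3", "domains": 120, "qtypes": 8, "servers": 6, "max_daily_queries": 3000},
]

def looks_like_noise(s):
    """Apply the pruning rules described above: sources querying a single
    domain, a single query type, or a single name server, or sending no
    more than 10 queries per day, are unlikely to be typical resolvers."""
    return (s["domains"] == 1
            or s["qtypes"] == 1
            or s["servers"] == 1
            or s["max_daily_queries"] <= 10)

kept = [s for s in sources if not looks_like_noise(s)]
print([s["addr"] for s in kept])  # only the source passing all four rules
```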
<p>Within the 550K source addresses, some are only active for a short time or a few days. We assumed the inactive sources don't represent the main population of .nz or are not stable IPs. We extracted the most active sources: those visible five out of seven days per week, and in at least 75% of the total hours (a threshold picked according to the plot below). This gave us 82K addresses.<br>
<img src="https://blog.nzrs.net.nz/content/images/2018/11/hours-2.svg" alt="Detecting resolvers at .nz"></p>
<p>We assumed these 82K sources can be divided into two classes: resolvers and monitoring hosts. Next we used machine learning to solve this classification problem.</p>
<h2 id="machinelearning">Machine learning</h2>
<p>In machine learning, we need a feature set and some training data to build a supervised classifier that can predict the probability of a source being a resolver. We explored some known resolvers and monitors, which showed diverse behaviors and no clear demarcation based on a single feature (see the plot below for example). The key is to find feature combinations that can help discriminate both types. We repeatedly tested clustering results following the iterative addition of new features. Once we achieved a good clustering result, we built a final classifier based on these features.<br>
<img src="https://blog.nzrs.net.nz/content/images/2018/11/rm1-2.svg" alt="Detecting resolvers at .nz"><img src="https://blog.nzrs.net.nz/content/images/2018/11/rm2-2.svg" alt="Detecting resolvers at .nz"></p>
<h3 id="featureset">Feature set</h3>
<p>We chose a four-week period for our training data so that it would include both daily and weekly patterns. We used features representing various aspects and elements of the DNS protocol.</p>
<p>For each source, we extracted the proportions of DNS flags, common query types, and response codes. For activity, we calculated the fraction of visible weekdays, days and hours.</p>
<p>Aggregated by day, we constructed time series for query count, unique query types, and unique query names. We then generated features for these time series using descriptive statistics such as mean, standard deviation and percentiles.</p>
<p>We also created timing entropy and query name entropy features. These were based on the assumption that a resolver's behaviour should be more random than a monitor, or in other words, there should be a bigger entropy in a resolver's query stream than in a monitor's. For timing entropy, we calculated the time lag between successive queries and between successive queries of the same query name and query type. For query name entropy, we use <a href="https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance">Jaro-Winkler string distance</a> to calculate the similarity of the query names between successive queries.</p>
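As a rough illustration of the timing-entropy idea, the Shannon entropy of inter-query time lags can be computed as below; the fixed-width binning scheme is an assumption for this sketch, not our exact method:

```python
import math
from collections import Counter

def timing_entropy(timestamps, bin_width=1.0):
    """Sketch of a timing-entropy feature: compute the lags between
    successive queries, bin them (bin_width seconds is an assumed choice),
    and take the Shannon entropy of the bin distribution."""
    lags = [b - a for a, b in zip(timestamps, timestamps[1:])]
    bins = Counter(int(lag // bin_width) for lag in lags)
    total = sum(bins.values())
    return -sum((n / total) * math.log2(n / total) for n in bins.values())

# A monitor querying on a fixed schedule has zero timing entropy ...
print(timing_entropy([0, 60, 120, 180, 240]))  # 0.0
# ... while irregular, resolver-like timing yields a higher value.
print(timing_entropy([0, 3, 10, 11, 30, 95]))
```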
<p>Combining the above features, we still did not get a satisfactory clustering result. We therefore came up with features to capture the variability of a query flow, on the basis that a monitor's query flow should be less variable than a resolver's. We aggregated the query rates by hour (query frequencies, numbers of unique query types and query names) together with the entropy features, and then calculated a set of variance metrics including <a href="https://en.wikipedia.org/wiki/Interquartile_range">Inter-quartile Range</a>, <a href="https://en.wikipedia.org/wiki/Quartile_coefficient_of_dispersion">Quartile Coefficient of Dispersion</a>, <a href="https://en.wikipedia.org/wiki/Mean_absolute_difference">Mean Absolute Difference</a>, <a href="https://en.wikipedia.org/wiki/Median_absolute_deviation">Median Absolute Deviation</a> and <a href="https://en.wikipedia.org/wiki/Coefficient_of_variation">Coefficient of Variation</a>.</p>
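These dispersion metrics can be sketched with NumPy as follows; the hourly counts are made up for illustration:

```python
import numpy as np

def dispersion_features(x):
    """Variance metrics over an hourly-aggregated series
    (e.g. queries per hour for one source address)."""
    x = np.asarray(x, dtype=float)
    q1, q2, q3 = np.percentile(x, [25, 50, 75])
    return {
        "iqr": q3 - q1,                           # Inter-quartile Range
        "qcd": (q3 - q1) / (q3 + q1),             # Quartile Coefficient of Dispersion
        # Mean Absolute Difference over all pairs (including self-pairs)
        "mean_abs_diff": np.abs(x[:, None] - x[None, :]).mean(),
        "mad": np.median(np.abs(x - q2)),         # Median Absolute Deviation
        "cv": x.std() / x.mean(),                 # Coefficient of Variation
    }

hourly_queries = [120, 130, 118, 500, 125, 122]  # illustrative hourly counts
print(dispersion_features(hourly_queries))
```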
<p>In total, we came up with 66 features for a given source address. We removed those with a correlation score above 0.95 to reduce the redundancy of our model. We then checked the relevance of the remaining features against the labeled samples we had. As shown in the plot below, using the <a href="http://blog.minitab.com/blog/adventures-in-statistics-2/understanding-analysis-of-variance-anova-and-the-f-test">F-test</a> and <a href="https://en.wikipedia.org/wiki/Mutual_information">Mutual Information (MI)</a>, we could see which features were relevant and which were not. Removing the irrelevant features, we finally ended up with 50 features.<br>
<img src="https://blog.nzrs.net.nz/content/images/2018/11/hf-1.svg" alt="Detecting resolvers at .nz"><img src="https://blog.nzrs.net.nz/content/images/2018/11/lf-1.svg" alt="Detecting resolvers at .nz"></p>
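A scikit-learn sketch of this two-step selection (correlation pruning, then F-test and Mutual Information scoring) on toy data; the feature names and labels are invented for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_classif, mutual_info_classif

rng = np.random.default_rng(0)
qps_mean = rng.normal(size=200)
X = pd.DataFrame({
    "qps_mean": qps_mean,                                            # invented feature
    "qps_p95": qps_mean * 0.99 + rng.normal(scale=0.01, size=200),   # near-duplicate
    "noise": rng.normal(size=200),                                   # irrelevant feature
})
y = (qps_mean > 0).astype(int)  # toy labels standing in for resolver/monitor

# Step 1: drop one feature of every pair correlated above 0.95.
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]
X_reduced = X.drop(columns=to_drop)

# Step 2: score the remaining features with the F-test and Mutual Information.
f_scores, _ = f_classif(X_reduced, y)
mi_scores = mutual_info_classif(X_reduced, y, random_state=0)
print(to_drop)  # the redundant near-duplicate gets dropped
```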
<h3 id="verifythefeaturesbyclustering">Verify the features by clustering</h3>
<p>Clustering is an approach to explore how the source addresses group together based on similar patterns. We sought to verify our feature set by checking that it clusters similar sources together and separates different ones. We tried a range of algorithms including K-Means, Gaussian Mixture Model, MeanShift, DBSCAN and Agglomerative Clustering, with hyperparameter tweaks. Evaluating the models with metrics such as the Adjusted Rand Index, homogeneity score and completeness score, we found a Gaussian Mixture Model performed best and ended up with five clusters. The plot below shows how known samples are distributed across the five clusters.<br>
<img src="https://blog.nzrs.net.nz/content/images/2018/11/hm-3.svg" alt="Detecting resolvers at .nz"></p>
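A minimal scikit-learn sketch of this evaluation approach, on synthetic 2-D data standing in for our feature vectors; the blob layout is invented:

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, completeness_score, homogeneity_score
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Toy 2-D stand-ins for the real feature vectors: two tight "monitor-like"
# blobs and one diffuse "resolver-like" blob.
X = np.vstack([
    rng.normal([0, 0], 0.2, size=(50, 2)),
    rng.normal([4, 4], 0.2, size=(50, 2)),
    rng.normal([8, 0], 1.0, size=(100, 2)),
])
labels_true = np.array([0] * 50 + [1] * 50 + [2] * 100)

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
labels_pred = gmm.predict(X)

print("ARI:", adjusted_rand_score(labels_true, labels_pred))
print("homogeneity:", homogeneity_score(labels_true, labels_pred))
print("completeness:", completeness_score(labels_true, labels_pred))
```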
<p>The number in each cell of the heatmap is the fraction of a particular type of sample, for example ICANN monitors in the first row, that falls into a cluster. We can see monitors are mostly in Cluster 1, except that many of the RIPE Atlas probes are distributed across Clusters 0 and 3. This is probably because those RIPE Atlas probes are not doing monitoring, so they behave differently from monitors. For resolvers, Google DNS and ISPs are mostly in Cluster 4, while OpenDNS's behaviour is completely different from the other resolvers. We temporarily set OpenDNS aside for its unexpected behaviour and will explore that in the future. We then aggregated the rest of the resolvers and monitors. From the heatmap below, we can see most of the monitors are in Cluster 1, and most of the resolvers are in Clusters 2 and 4.<br>
<img src="https://blog.nzrs.net.nz/content/images/2018/11/hm_tt.svg" alt="Detecting resolvers at .nz"></p>
<p>The clustering is not perfect, but it shows the power of our feature set in separating different source types correctly in general. We then used this feature set to train a classifier.</p>
<h3 id="supervisedclassifier">Supervised classifier</h3>
<p>Our training data was composed of 2515 resolvers and 106 monitors. The resolver set came from ISPs in New Zealand, Google DNS, OpenDNS, education and research organisations in New Zealand, and resolvers used by RIPE Atlas probes. The monitors came from ICANN, Pingdom, ThousandEyes, and RIPE Atlas probes and anchors. We extracted the 50 features across the four-week period in our DNS traffic for each source address to build a supervised classifier.</p>
<p>Using an <a href="http://papers.nips.cc/paper/5872-efficient-and-robust-automated-machine-learning">Automated Machine Learning</a> technique with efficient Bayesian optimisation methods, we trained an ensemble of 28 models in 10 minutes and achieved 0.991 accuracy and a 0.995 F1 score.</p>
<h2 id="predictionresult">Prediction result</h2>
<p>For each source address, we let the classifier predict the probability of it being a resolver. We then labeled those with a probability higher than 0.7 as resolvers. We tested this prediction with our domain popularity ranking algorithm, and it observably improved the ranking accuracy of some known domains.</p>
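The thresholding step can be sketched as follows; the addresses and probabilities are hypothetical classifier outputs:

```python
def label_resolvers(probabilities, threshold=0.7):
    """Label each source as 'resolver' when the classifier's predicted
    probability exceeds the threshold (0.7 in this work), else 'other'."""
    return {addr: ("resolver" if p > threshold else "other")
            for addr, p in probabilities.items()}

# Hypothetical predicted probabilities for three source addresses.
scores = {"192.0.2.1": 0.93, "192.0.2.2": 0.41, "192.0.2.3": 0.70}
print(label_resolvers(scores))
# -> {'192.0.2.1': 'resolver', '192.0.2.2': 'other', '192.0.2.3': 'other'}
```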
<p>We would like to share the prediction based on our traffic from 3 September 2018 to 30 September 2018 with anyone interested to review with their own data. You can download it <a href="https://drive.google.com/open?id=1qtxG33-dipdu7EYLI9KRBpMkkxY4Ih6X">here</a> (SHA1 checksum: 6448202ad39be55c435ba671843dac683b529bce).</p>
<h2 id="potentialuseandfuturework">Potential use and future work</h2>
<p>This work has several applications. It can improve the accuracy of domain popularity ranking. Further, with some adjustment of the feature set and training data, it can be extended to other uses of passive DNS captures; for example, measuring the adoption of new technologies such as DNSSEC-validating resolvers or QNAME minimisation in the wild. The method can also be applied to different datasets such as the root zone and other TLDs, and the results can be shared and compared.</p>
</div>]]></content:encoded></item><item><title><![CDATA[DNS Flag day]]></title><description><![CDATA[In this blog post, we introduce the upcoming "DNS Flag day", how it will affect .nz domains and how .nz compares to other ccTLDs.]]></description><link>https://blog.nzrs.net.nz/dns-flag-day/</link><guid isPermaLink="false">5ba023dc39e74300bf2ad0af</guid><category><![CDATA[DNS]]></category><category><![CDATA[Research]]></category><dc:creator><![CDATA[Sebastian Castro]]></dc:creator><pubDate>Mon, 08 Oct 2018 20:24:38 GMT</pubDate><media:content url="https://blog.nzrs.net.nz/content/images/2019/06/DNS_Flag_Day_hero_07.png" medium="image"/><content:encoded><![CDATA[<div class="kg-card-markdown"><img src="https://blog.nzrs.net.nz/content/images/2019/06/DNS_Flag_Day_hero_07.png" alt="DNS Flag day"><p>There are times when decisive action is the straightest path to success. Starting from 1 February 2019, the organisations behind open source DNS software implementations are going to deploy changes to their code that could break your domains. That day has been labeled <em>DNS Flag day</em>.</p>
<p>Do software developers want to intentionally break domains? Well, no. For years, those software developers had to include workarounds in their code to allow a few domains to work; domains using DNS software that's not standard compliant, or living behind network devices not respecting Internet standards. Those workarounds are coming to an end. If you run a domain name and want to get more information, please check <a href="https://dnsflagday.net">https://dnsflagday.net</a>, which includes an online tool for testing.</p>
<p>As the guardians of the .nz namespace, we see it as our responsibility to investigate how this change will affect .nz, and we have been collecting information about DNS standard compliance across all .nz domains for a couple of months. The research involved was presented at <a href="http://www.lacnic.net/3011/46/evento/agenda">LACNIC 30</a> and will be presented at <a href="https://indico.dns-oarc.net/event/29/">DNS-OARC 29</a> in the coming weeks.</p>
<p><strong>What do we test for</strong></p>
<p>The workarounds to be removed starting in February 2019 are related to a component of the DNS called <strong>EDNS</strong>. EDNS was created to extend the capabilities and usefulness of the DNS protocol; DNSSEC, for example, couldn't exist without EDNS.</p>
<p><a href="https://www.isc.org">ISC</a>, the organisation behind BIND, the de-facto standard DNS implementation, created a test to verify if a DNS server responds correctly to a series of queries exploring different elements of the DNS standard, including EDNS. They have been <a href="http://ednscomp.isc.org/">collecting</a> compliance data for the root zone and other domains for a while.</p>
<p><a href="https://www.nic.cz/">CZ.NIC</a>, managers of the Czech Republic ccTLD, created a tool that tests a nameserver once, independently of how many domains it hosts, allowing bulk verification of a whole namespace like .cz or .nz.</p>
<p>We are currently using the CZ.NIC tool for .nz, checking for EDNS compliance. In the future, we will extend this to full DNS compliance.</p>
<p><strong>Results</strong></p>
<p>In a coordinated effort with .CL, .CZ, .SE, .NU, .NZ, and using the public results for the root zone, we can compare how different namespaces fare on the test. The figures below are not exhaustive but are the most compelling output.</p>
<p>Our first look is at the general nameserver distribution, as one nameserver can have multiple IPv4 and/or IPv6 addresses.</p>
<div>
    <a href="https://plot.ly/~secastro/155/?share_key=PDyQoLfGe2PXewPuxLd71z" target="_blank" title="Plot 155" style="display: block; text-align: center;"><img src="https://plot.ly/~secastro/155.png?share_key=PDyQoLfGe2PXewPuxLd71z" alt="DNS Flag day" style="max-width: 100%;width: 800px;" width="800" onerror="this.onerror=null;this.src='https://plot.ly/404.png';"></a>
    <script data-plotly="secastro:155" sharekey-plotly="PDyQoLfGe2PXewPuxLd71z" src="https://plot.ly/embed.js" async></script>
</div>
<p>Although different zones have different numbers of domains, the number of servers is more or less stable, with the exception of Sweden, which has over 10k more addresses than the rest.</p>
<p><strong>Basic DNS test</strong></p>
<blockquote>
<p>dig soa ZONE @SERVER +noedns +noad +norec</p>
</blockquote>
<p>For each nameserver, we send a query to confirm it responds. In general, most of the nameservers pass this test.</p>
<div>
    <a href="https://plot.ly/~secastro/157/?share_key=p4BL7rrxnnVXxZ08BLelGp" target="_blank" title="Plot 157" style="display: block; text-align: center;"><img src="https://plot.ly/~secastro/157.png?share_key=p4BL7rrxnnVXxZ08BLelGp" alt="DNS Flag day" style="max-width: 100%;width: 800px;" width="800" onerror="this.onerror=null;this.src='https://plot.ly/404.png';"></a>
    <script data-plotly="secastro:157" sharekey-plotly="p4BL7rrxnnVXxZ08BLelGp" src="https://plot.ly/embed.js" async></script>
</div>
<p>The root zone has higher levels of correctness on this basic test because IANA imposes a set of technical checks on TLD operators. From now on, the root zone metric will be a baseline against which to compare other zones.</p>
<p>The errors &quot;NOSOA&quot; and &quot;NOAA&quot; imply the server didn't send the right response to the query, mostly due to misconfiguration.</p>
<p><strong>EDNS Test</strong></p>
<blockquote>
<p>dig soa ZONE @SERVER <strong>+edns=0</strong> +nocookie +noad +norec</p>
</blockquote>
<p>With the baseline defined, we can start showing how increasingly complex queries start producing errors.</p>
<div>
    <a href="https://plot.ly/~secastro/159/?share_key=MQbP4T5hvL2TtvW1m9pWrl" target="_blank" title="Plot 159" style="display: block; text-align: center;"><img src="https://plot.ly/~secastro/159.png?share_key=MQbP4T5hvL2TtvW1m9pWrl" alt="DNS Flag day" style="max-width: 100%;width: 800px;" width="800" onerror="this.onerror=null;this.src='https://plot.ly/404.png';"></a>
    <script data-plotly="secastro:159" sharekey-plotly="MQbP4T5hvL2TtvW1m9pWrl" src="https://plot.ly/embed.js" async></script>
</div>
<p>From this test we can start seeing the first protocol violations. To activate EDNS, a DNS query will include an OPT record, which is required to be copied in the DNS response. The NOOPT errors are servers not returning that record. The NSID errors are servers returning the NSID option when they were not asked to provide it!</p>
<p><strong>DO Test</strong></p>
<blockquote>
<p>dig soa ZONE @SERVER +edns=0 +nocookie +noad +norec <strong>+dnssec</strong></p>
</blockquote>
<p>Having working EDNS is essential for DNSSEC. The DO bit signals that a DNS client wants to receive DNSSEC-related records, like RRSIG (signatures) and DNSKEY records. While testing for DO-bit support, we start to find higher levels of failure.</p>
<div>
    <a href="https://plot.ly/~secastro/161/?share_key=ALx0fOIRzPZg2k0uIlyIGl" target="_blank" title="dns_flag_day_do_test" style="display: block; text-align: center;"><img src="https://plot.ly/~secastro/161.png?share_key=ALx0fOIRzPZg2k0uIlyIGl" alt="DNS Flag day" style="max-width: 100%;width: 800px;" width="800" onerror="this.onerror=null;this.src='https://plot.ly/404.png';"></a>
    <script data-plotly="secastro:161" sharekey-plotly="ALx0fOIRzPZg2k0uIlyIGl" src="https://plot.ly/embed.js" async></script>
</div>
<p>From the plot we can see two different stories. The Root, .SE and .NU zones have nearly 100% of nameservers answering correctly, while .NZ, .CZ and .CL have slightly less than 80%, with the other 20% failing to include the DO bit in the response as required! There are also a few nameservers that time out on the query. If there is a signed domain behind those failing servers, DNSSEC will definitely break.</p>
<p><strong>EDNS1 test</strong></p>
<blockquote>
<p>dig soa ZONE @SERVER <strong>+edns=1 +noednsneg</strong> +nocookie +noad +norec</p>
</blockquote>
<p>The EDNS1 test is quite tricky: as EDNS version 1 has not been defined yet, the only version available is EDNS0. So this test verifies whether the nameserver handles the error correctly, and whether any network device doing packet inspection understands if the query is valid or not.</p>
<div>
    <a href="https://plot.ly/~secastro/163/?share_key=cINJ3CsoGX3cwROUoTrKZC" target="_blank" title="dns_flag_day_edns1_test" style="display: block; text-align: center;"><img src="https://plot.ly/~secastro/163.png?share_key=cINJ3CsoGX3cwROUoTrKZC" alt="DNS Flag day" style="max-width: 100%;width: 800px;" width="800" onerror="this.onerror=null;this.src='https://plot.ly/404.png';"></a>
    <script data-plotly="secastro:163" sharekey-plotly="cINJ3CsoGX3cwROUoTrKZC" src="https://plot.ly/embed.js" async></script>
</div>
<p>You can see that the root zone keeps its high compliance level, but the ccTLDs in our list fall behind with roughly 50% of the nameservers passing the test. The expected response must include a BADVERS return code, no SOA record, and the OPT record signaling EDNS version 0. The <em>noerror</em> and <em>soa</em> cases represent a nameserver that didn't validate the query properly, the <em>noopt</em> case a nameserver that violated the standards by not returning the OPT record as we saw above, and the <em>badversion</em> case where a nameserver actually responded indicating it supports EDNS version 1!</p>
<p><strong>OPTLIST test</strong></p>
<blockquote>
<p>dig soa ZONE @SERVER +edns=0 +noad +norec <strong>+nsid +subnet=0.0.0.0/0 +expire +cookie=0102030405060708</strong></p>
</blockquote>
<p>The OPTLIST test is intended to explore the adoption of newer DNS options as they have been added over the years. <strong>NSID</strong>, defined 11 years ago in RFC 5001, asks the server to reply with a server identification string, useful for anycast deployments. <strong>subnet</strong> is an option defined 2 years ago for clients to signal where the original DNS query came from, useful for CDN operators. <strong>expire</strong> is defined in RFC 7314 to query the EXPIRE timer in the SOA record. <strong>cookie</strong> is defined in RFC 7873 and provides a lightweight DNS transaction security mechanism against a variety of attacks. In simple terms, this test is a gauge of how new and fresh the DNS software used on the nameservers is.</p>
<div>
    <a href="https://plot.ly/~secastro/165/?share_key=eZzuU9sg2CVcGJgEjyVQaA" target="_blank" title="dns_flag_day_optlist_test" style="display: block; text-align: center;"><img src="https://plot.ly/~secastro/165.png?share_key=eZzuU9sg2CVcGJgEjyVQaA" alt="DNS Flag day" style="max-width: 100%;width: 800px;" width="800" onerror="this.onerror=null;this.src='https://plot.ly/404.png';"></a>
    <script data-plotly="secastro:165" sharekey-plotly="eZzuU9sg2CVcGJgEjyVQaA" src="https://plot.ly/embed.js" async></script>
</div>
<p>The plot provides two views. First, which options are more commonly deployed like <strong>nsid</strong> and <strong>subnet</strong>. Second, the error cases, as there are a few nameservers failing to respond to the query (timeout) and some returning a DNS error (<strong>formerr</strong>) that means they didn't understand the query, indicating the software is a few years old.</p>
<p><strong>Why does it matter?</strong><br>
We started this article pointing out that changes will be introduced due to <strong>DNS Flag day</strong>. The deployment of these changes will cause currently functioning domains to stop working. We estimate around 1.2% of .nz domains will be broken, and we will notify those registrants and DNS operators about our findings using the Registrar Portal.</p>
<p><strong>Final words</strong><br>
The Internet is a tool for innovation and disruption, but introducing innovation in the core DNS protocols has always proven difficult. There are constant demands to be backward compatible and to avoid big changes that would break existing features. Consequently, the DNS in particular is a protocol that has stayed largely the same for many years. We will be actively guarding and investigating the level of protocol compliance within the .nz namespace and reporting back our findings. Stay tuned.</p>
</div>]]></content:encoded></item><item><title><![CDATA[Domain popularity across countries]]></title><description><![CDATA[<div class="kg-card-markdown"><p>As the DNS operator of .nz, we manage 4 of the 7 .nz nameservers ourselves and we've been collecting their DNS traffic. The other 3 are hosted by overseas providers and we only have access to the DNS traffic data from 2 of them. In total, our accessible data covers</p></div>]]></description><link>https://blog.nzrs.net.nz/domain-popularity-across-countries/</link><guid isPermaLink="false">5b6b11ef65ff1e00182f6101</guid><dc:creator><![CDATA[Jing Qiao]]></dc:creator><pubDate>Sun, 29 Jul 2018 22:20:17 GMT</pubDate><media:content url="https://blog.nzrs.net.nz/content/images/2019/06/john-moeses-bauan-OGZtQF8iC0g-unsplash.jpg" medium="image"/><content:encoded><![CDATA[<div class="kg-card-markdown"><img src="https://blog.nzrs.net.nz/content/images/2019/06/john-moeses-bauan-OGZtQF8iC0g-unsplash.jpg" alt="Domain popularity across countries"><p>As the DNS operator of .nz, we manage 4 of the 7 .nz nameservers ourselves and we've been collecting their DNS traffic. The other 3 are hosted by overseas providers and we only have access to the DNS traffic data from 2 of them. In total, our accessible data covers 6/7 .nz nameservers and about 80% of the total traffic.</p>
<p>Through this DNS traffic, we hope to find out which .nz domains are more popular than others and how domain popularity changes over time. With data from multiple locations around the world, we're able to analyze and compare .nz domain popularity in different countries, and how specific events affect that popularity, which will be explored in this post.</p>
<h2 id="algorithm">Algorithm</h2>
<p>The original algorithm we invented for domain popularity ranking was introduced <a href="https://cdn.nzrs.net.nz/88vkAeP4GA9wm/r4dXMp9Vn0EV-/Domain%20Popularity%20Ranking%20Revisited.pdf">here</a>. As it's still being tested and improved, we'll use a simplified version here, adjusted to calculate popularity per country. We use MaxMind GeoIP to map the traffic from an address to a country. By extracting the daily traffic from a specific country <strong>c</strong>, we can calculate a popularity score for a domain <strong>d</strong> as follows:</p>
<script type="text/javascript" async src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/latest.js?config=TeX-MML-AM_CHTML">
</script>
<p>$$Popularity\ Score\ (d,\ c)  =  Volume\ Fraction\ (d,\ c)\ \times\ Population\ Fraction (d,\ c)$$<br>
$$Volume\ Fraction\ (d, c) = {\sum queries (d\  from\ c) \over \sum queries (c)}$$<br>
$$Population\ Fraction\ (d, c) = {|\ a \in A: a\ asked\ for\ d\ | \over |\ A:\ sources\ from\ c\ |}$$</p>
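A direct Python translation of these formulas, applied to a toy query log from one country; the addresses and domains are invented:

```python
from collections import defaultdict

def popularity_scores(queries):
    """Compute the score defined above from (source_addr, domain) query
    records for one country: volume fraction times population fraction."""
    total_queries = len(queries)
    sources = {addr for addr, _ in queries}
    per_domain_queries = defaultdict(int)
    per_domain_sources = defaultdict(set)
    for addr, domain in queries:
        per_domain_queries[domain] += 1
        per_domain_sources[domain].add(addr)
    return {d: (per_domain_queries[d] / total_queries)        # volume fraction
               * (len(per_domain_sources[d]) / len(sources))  # population fraction
            for d in per_domain_queries}

# Toy query log: (source address, queried domain) pairs from one country.
log = [("a", "x.nz"), ("b", "x.nz"), ("a", "x.nz"), ("c", "y.nz")]
print(popularity_scores(log))  # x.nz: (3/4)*(2/3) = 0.5, y.nz: (1/4)*(1/3)
```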
<p>A domain with a high popularity score means it's popular in the DNS traffic. By ranking the popularity score in different countries, we'll see which .nz domains are popular in each country.</p>
<p>There are many factors impacting the traffic we observe for a domain. This simplified version of the algorithm doesn’t account for TTL values, resolver behaviours, and user population. The popularity score is then a proxy of popularity.</p>
<p>Next, I'm going to illustrate the analysis for web domains and email domains using data from Dec 2017 to Jan 2018.</p>
<h2 id="webdomainpopularitydec2017jan2018">Web domain popularity (Dec 2017 - Jan 2018)</h2>
<p>We calculated the popularity score for each .nz domain in the total web traffic (A or AAAA queries for a .nz domain itself or with the hostname 'www') and extracted the top domains, listed on the vertical axis of the heat map below. We want to compare the popularity ranking of these domains in the 10 countries that sent us the most traffic during the analysis period. The number in a cell represents the ranking value for the domain of that row in the country of that column, and we use the shade of purple to visualize the numerical magnitude.</p>
<p><img src="https://blog.nzrs.net.nz/content/images/2018/08/world_web.svg" alt="Domain popularity across countries"></p>
<p>We can see that many of these globally popular .nz domains, which include a cryptocurrency exchange, a social media platform, and online travel booking and hospitality services, are not ranked as high in NZ as in other countries. So which domains are most popular in NZ? Let's look at the heat map below.</p>
<p><img src="https://blog.nzrs.net.nz/content/images/2018/08/nz_web.svg" alt="Domain popularity across countries"></p>
<p>We see some of kiwis' favourite radio and news media websites, a popular local online marketplace and an ISP website at the top of the NZ list. We can also see that these NZ-popular domains are more popular in AU than in other foreign countries.</p>
<p>The difference between the popular domains in NZ and those in other countries shows that .nz domains have different recognition locally and internationally.</p>
<p>By drawing the bump chart below, we can inspect the top 10 NZ web domains' popularity ranking across time.</p>
<p><img src="https://blog.nzrs.net.nz/content/images/2018/07/bump-chart-1.png" alt="Domain popularity across countries"></p>
<p>neighbourly.co.nz is a New Zealand neighbourhood website founded in 2014; by 2017 it had 460k members, less than 9% of the country's population. Despite having a smaller user population than trademe.co.nz, neighbourly.co.nz still achieved a higher popularity score in the DNS traffic. We found that neighbourly.co.nz has a much smaller TTL value, which makes its DNS records in resolvers' caches expire faster, so they need to query us more frequently; this explains the apparent contradiction. There are some interesting patterns in the chart. For example, the ranking changes show a weekly cycle and a holiday impact for some domains like newstalkzb.co.nz and thehits.co.nz.</p>
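The TTL effect can be sketched with a back-of-the-envelope model; the numbers below are hypothetical, not measured: with perfect caching and constant demand, each resolver re-queries the authoritative servers roughly once per TTL, so query volume scales with 1/TTL.

```python
def expected_queries_per_day(ttl_seconds, n_resolvers):
    """Upper-bound sketch: with perfect caching and constant demand, each
    resolver re-queries the authoritative servers about once per TTL."""
    return n_resolvers * (86_400 / ttl_seconds)

# Hypothetical numbers: a domain with a 60 s TTL and 1,000 querying resolvers
# generates far more authoritative queries than a domain with more resolvers
# but a 1-hour TTL.
print(expected_queries_per_day(60, 1_000))     # 1,440,000 queries/day
print(expected_queries_per_day(3_600, 5_000))  # 120,000 queries/day
```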
<h2 id="emaildomainpopularitydec2017jan2018">Email domain popularity (Dec 2017 - Jan 2018)</h2>
<p>The analysis of domain popularity in email traffic (MX queries for a .nz domain) is more elusive due to the prevalence of spam. We explored the most popular email domains queried from NZ and the US. For each country, we use a box plot to show the most popular email domains queried and the ranking distribution of each domain across the analysis period.</p>
<p><img src="https://blog.nzrs.net.nz/content/images/2018/07/nz-mx.png" alt="Domain popularity across countries"></p>
<p>For NZ, half of the list consists of email services provided by ISPs, such as vodafone.co.nz, kinect.co.nz (Trustpower), xtra.co.nz (Spark), and inspire.net.nz. The rest are from Trade Me and government. From the plot, we can see that ird.govt.nz (Inland Revenue Department) shows a big fluctuation because it is used in communications that only happen on business days.</p>
<p><img src="https://blog.nzrs.net.nz/content/images/2018/07/us-mx.png" alt="Domain popularity across countries"></p>
<p>Popular email domains queried from the US include some .nz domains of the Yahoo, Hotmail and Outlook email services. I'm not sure how many people use these email domains today, but it's possible they were targeted by spammers.</p>
<h2 id="impactofevents">Impact of events</h2>
<p>Internet traffic is affected by human activities. From changes in domain popularity ranking, we expect to find some interesting stories about the human world. Here are some examples of domains whose popularity in New Zealand was impacted by big events.</p>
<p><img src="https://blog.nzrs.net.nz/content/images/2018/07/cricket-event.png" alt="Domain popularity across countries"></p>
<p>Above is the trend of the popularity ranking of the radiosport.co.nz website across Dec 2017 and Jan 2018 in five countries. All five lines went up in the same period across late December and early January. During that time, the West Indies cricket team was on tour in NZ, playing a number of games against the Black Caps. Another contributing event could be that a well-known sports journalist for Radio Sport resigned on Dec 19 and the station subsequently changed its weekend programme line-up, so the jump could also be due to an increased advertising campaign as a result.</p>
<p><img src="https://blog.nzrs.net.nz/content/images/2018/07/nzqa.png" alt="Domain popularity across countries"></p>
<p>The New Zealand Qualifications Authority, nzqa.govt.nz, spiked on the day New Zealand's secondary school examination results were published. People went to the website to check their performance, making the web domain more popular than usual.</p>
<p><img src="https://blog.nzrs.net.nz/content/images/2018/07/ird-holidy.png" alt="Domain popularity across countries"></p>
<p>The email domain of the Inland Revenue Department, ird.govt.nz, dipped over the Christmas and New Year holidays, matching its working hours well.</p>
<h2 id="conclusion">Conclusion</h2>
<p>In this post, we used our DNS traffic to analyze .nz domain popularity across different countries based on a simple algorithm. We compared the most popular domains in New Zealand and some foreign countries, and demonstrated how specific events affect the domain popularity ranking. Domain popularity analysis provides us with insight into the use of and trends in the .nz namespace. The algorithm still has room for improvement to generate more accurate results that can provide even more valuable insights. Finally, we want to thank Dave Baker for the information about real-life events that helped explain the plots.</p>
</div>]]></content:encoded></item><item><title><![CDATA[Source Address Classification - Clustering]]></title><description><![CDATA[<div class="kg-card-markdown"><p>My previous post <a href="https://blog.nzrs.net.nz/source-address-clustering-feature-engineering/">&quot;Source Address Classification - Feature Engineering&quot;</a> introduced the background of the source address classification problem and a critical part of the work - generating the features to be fed into a machine learning model. Before training a supervised classifier, we want to do clustering to</p></div>]]></description><link>https://blog.nzrs.net.nz/source-address-classification-clustering/</link><guid isPermaLink="false">5b6b11ef65ff1e00182f60ff</guid><category><![CDATA[DNS traffic analysis]]></category><category><![CDATA[Cluster analysis;]]></category><dc:creator><![CDATA[Jing Qiao]]></dc:creator><pubDate>Fri, 08 Jun 2018 00:02:30 GMT</pubDate><media:content url="https://blog.nzrs.net.nz/content/images/2018/06/SoundWaves.jpg" medium="image"/><content:encoded><![CDATA[<div class="kg-card-markdown"><img src="https://blog.nzrs.net.nz/content/images/2018/06/SoundWaves.jpg" alt="Source Address Classification - Clustering"><p>My previous post <a href="https://blog.nzrs.net.nz/source-address-clustering-feature-engineering/">&quot;Source Address Classification - Feature Engineering&quot;</a> introduced the background of the source address classification problem and a critical part of the work - generating the features to be fed into a machine learning model. Before training a supervised classifier, we want to do clustering to explore the inherent grouping of the data to get more ground truth. By clustering, we can also test our feature set to see if it captures the differences between a resolver's traffic and a monitor's traffic.</p>
<p>Clustering is an unsupervised learning method. Unlike supervised learning, it doesn't learn from prior knowledge (training data) but infers the natural structure present within the data. By clustering, we expect to group source addresses with similar features, so that each group is likely to contain a certain type of source (resolver, monitor or unknown). In this post, we'll share our experience of clustering source addresses based on their DNS queries to .nz name servers.</p>
<h2 id="datapreprocessing">Data preprocessing</h2>
<p>As mentioned in <a href="https://blog.nzrs.net.nz/source-address-clustering-feature-engineering/">my previous blog</a>, by removing noise and filtering out source addresses with low activity, our dataset was reduced to 82k samples. Each sample has 49 features, whose values vary across very different scales. For example, the feature 'the number of unique query types per day' has values in [1, 16], while another feature, 'the number of unique domains queried per day', varies from 1 to 1M. Algorithms that rely on geometrical distances between data points, such as K-Means and the Gaussian Mixture Model, are sensitive to feature scales. Features with high magnitudes can dominate the objective function and prevent the model from learning from other features.</p>
<p>To make each of our features weigh equally in the clustering, we need to normalize the data so that all features are on a similar scale. There are various scalers and transformers available <a href="http://scikit-learn.org/stable/modules/preprocessing.html">here</a>, and their effects on clustering are quite subtle. I picked Standardization and Quantile Transformation as the two options to include in our evaluation. Standardization is widely used in machine learning models: it standardizes each feature by shifting its mean to 0 and scaling its variance to 1. However, it's sensitive to the presence of outliers, as illustrated in <a href="http://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#sphx-glr-auto-examples-preprocessing-plot-all-scaling-py">this example</a>. The Quantile Transformer is robust to outliers and can transform the features to follow a normal distribution, which many machine learning algorithms assume.</p>
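<p>As a minimal sketch of this preprocessing step (using a toy two-feature matrix, not our real 49 features), the two scalers can be compared with scikit-learn as follows:</p>

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, QuantileTransformer

# Toy feature matrix: column 0 mimics 'unique query types per day' (1-16),
# column 1 mimics 'unique domains queried per day' (heavy-tailed).
rng = np.random.RandomState(0)
X = np.column_stack([
    rng.randint(1, 17, size=1000).astype(float),
    rng.lognormal(mean=5.0, sigma=2.5, size=1000),
])

# Standardization: zero mean, unit variance per feature,
# but still sensitive to the outliers in column 1.
X_std = StandardScaler().fit_transform(X)

# Quantile transform: maps each feature onto a normal distribution,
# robust to the heavy tail.
X_qt = QuantileTransformer(output_distribution="normal",
                           n_quantiles=100,
                           random_state=0).fit_transform(X)
```

<p>Both transformed matrices can then be fed to the clustering algorithms under evaluation, letting the scoring decide which transformation works better for the data.</p>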
<h2 id="choosingalgorithms">Choosing algorithms</h2>
<p>There are different types of clustering algorithms, based on the methodology used and the definition of clusters. Each algorithm has its pros and cons, as compared <a href="http://scikit-learn.org/stable/modules/clustering.html">here</a>. An algorithm that works well on one dataset can easily fail on another. The more you know about your data, the easier it is to choose the right algorithm. However, most of the time we don't know much about our data; that's why we do clustering to get some insights. In addition, all algorithms have parameters, and how to set them also depends on your data. Because of this, choosing the right clustering algorithm is a process full of experimentation and exploration. As with all practical machine learning problems, the principle is to try a common, simpler and faster algorithm before looking at more complex ones.</p>
<p>When choosing a clustering algorithm, we need to consider multiple factors, such as stability, scalability, performance, parameter intuition, etc. Besides, some algorithms allow us to choose the metric used to measure the similarity between data points. While Euclidean distance is often the default and most commonly used, a variety of <a href="http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.DistanceMetric.html">distance metrics</a> are available, and choosing a different similarity metric can yield a different clustering result.</p>
<h2 id="modelselection">Model selection</h2>
<p>Grid search is commonly used in supervised learning for hyper-parameter tuning and model selection. But for unsupervised learning like clustering, there's no existing method that automates the model selection process, possibly because there's no evaluation standard that applies universally to every case.</p>
<p>A range of metrics exists for assessing clustering results. Most of them, such as the Adjusted Rand Index, Homogeneity Score and Completeness Score, require knowledge of the ground truth classes of the samples, which is rarely available in practice. A few internal metrics don't require the ground truth classes, such as the Silhouette Score, which measures how well the resulting clusters are separated.</p>
<p>Our objective is to separate resolvers and monitors, and we have a set of samples labeled with these two types, so we can calculate the scores on these samples using their ground truth classes and predicted classes. We chose the Adjusted Rand Index as the criterion for selecting the best performing model on our data.</p>
<p>We tried K-Means, Gaussian Mixture Model, MeanShift, DBSCAN and Agglomerative Clustering with adjustments of hyper-parameters. The model with the highest Adjusted Rand Index is a Gaussian Mixture with 5 clusters, using feature standardization. It has an Adjusted Rand Index of 0.849789 (1 is a perfect match), a Homogeneity Score of 0.960086 (1 means each cluster is perfectly homogeneous), and a Completeness Score of 0.671609 (1 means all members of a given class are assigned to the same cluster). The lower completeness score is due to resolvers being split across multiple clusters, which is acceptable as long as those clusters are different from the ones the monitors fall into.</p>
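<p>The selection loop can be sketched like this, with synthetic blobs standing in for our feature matrix and the known labels used only for scoring; our real search covered more algorithms and hyper-parameter settings than shown here:</p>

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (adjusted_rand_score, completeness_score,
                             homogeneity_score)
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the scaled feature matrix; y plays the role
# of the small labeled subset (resolvers vs monitors).
X, y = make_blobs(n_samples=500, centers=5, random_state=0)

candidates = {
    "kmeans_5": KMeans(n_clusters=5, n_init=10, random_state=0),
    "gmm_5": GaussianMixture(n_components=5, random_state=0),
}

scores = {}
for name, model in candidates.items():
    labels = model.fit_predict(X)  # GaussianMixture also supports fit_predict
    scores[name] = (adjusted_rand_score(y, labels),
                    homogeneity_score(y, labels),
                    completeness_score(y, labels))

best = max(scores, key=lambda name: scores[name][0])  # highest ARI wins
```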
<h2 id="verificationwithgroundtruth">Verification with ground truth</h2>
<p>We collected some known addresses from the sources below:</p>
<ul>
<li>
<p>Monitors:</p>
<ul>
<li>ICANN monitoring</li>
<li>Pingdom monitoring</li>
<li>ThousandEyes monitoring</li>
<li>RIPE Atlas Probes</li>
<li>RIPE Atlas Anchors</li>
<li><a href="https://amp.wand.net.nz/">AMP</a></li>
</ul>
</li>
<li>
<p>Resolvers:</p>
<ul>
<li>ISP</li>
<li>Google DNS</li>
<li>OpenDNS</li>
<li>Education &amp; Research: Universities, research institutes, REANNZ</li>
<li>Addresses collected from RIPE Atlas probes: we ran an experiment involving RIPE Atlas probes around the world and collected the addresses of resolvers they used to send DNS queries.</li>
</ul>
</li>
</ul>
<p>Not all of them showed up in our data, and some were just not active enough and were filtered out as noise. The model selected in the last section predicts five clusters for these known addresses. Now let's take a look at their distribution across these clusters. Here's a summarized table and a normalized heat map (monitors are marked in red to distinguish them from resolvers):<br>
<img src="https://blog.nzrs.net.nz/content/images/2018/05/df_hm.png" alt="Source Address Classification - Clustering"><br>
<img src="https://blog.nzrs.net.nz/content/images/2018/06/hm-1.png" alt="Source Address Classification - Clustering"><br>
We have the following observations:</p>
<ul>
<li>Only 106 out of 15k monitor addresses and 2757 out of 4490 resolvers are in the table. Those missing samples are either invisible in our query traffic or removed as noise in our data cleaning step. RIPE Atlas Anchors and AMP monitors are completely absent in our test.</li>
<li>Cluster #1 captures all monitoring samples from ICANN, Pingdom and ThousandEyes, while it doesn't capture most RIPE Atlas Probes. The reason why some RIPE Atlas Probes behaved differently from other monitors could be the User-defined Measurements (UDM) with random behaviors.</li>
<li>86% of OpenDNS resolvers also fall into Cluster #1, but referring to the numerical table above, there are only 7 samples from OpenDNS in our dataset, which makes the data less significant. But why did only 7 show up when we have 90 OpenDNS samples in total? We found that many OpenDNS addresses were removed as noise in the data cleaning due to low traffic and visibility. We need to look further into OpenDNS's specific behavior.</li>
<li>All Google DNS resolvers fall into Clusters #2 and #4, well captured by the model.</li>
<li>We collected 2199 addresses of resolvers used by RIPE Atlas probes; they're dispersed over all clusters. By default, a RIPE Atlas probe acquires DNS resolver information using DHCP. It can also be configured to use a target resolver in some DNS measurements. So this sample set is likely to contain various resolvers, big and small, common and uncommon (like resolvers set up for experiments). That could be the reason for the dispersion.</li>
</ul>
<p>We removed the samples from RIPE Atlas Probes, OpenDNS and the RIPE Atlas collected resolvers, whose patterns are not clear, and then aggregated the remaining resolvers and monitors. We can see a clearer pattern:<br>
<img src="https://blog.nzrs.net.nz/content/images/2018/06/hm_t-3.png" alt="Source Address Classification - Clustering"><br>
<img src="https://blog.nzrs.net.nz/content/images/2018/06/dft.png" alt="Source Address Classification - Clustering"></p>
<p>Cluster #1 captures the monitors, while Clusters #2 and #4 capture 97% of the resolvers. Clusters #0 and #3 are two big clusters into which few of our known samples fall, so their patterns are not clear.</p>
<h2 id="nextstep">Next step</h2>
<p>Our clustering model segregates the monitors and some major resolvers quite well. There are still some unclear patterns, and we need to improve our ground truth to further verify the model. Some cutting-edge techniques can be applied to visualize high-dimensional data and interpret the model to help us better understand the clustering result. We'll explain that in a later post.</p>
</div>]]></content:encoded></item><item><title><![CDATA[Analysis of registrations  from registrant point of view]]></title><description><![CDATA[<div class="kg-card-markdown"><p><em>This is a follow-up post on registrant classification. Before reading this post, make sure to check out <a href="https://blog.nzrs.net.nz/registrant-classification/">Registrant Classification using Machine Learning</a>.</em></p>
<p>In <a href="https://blog.nzrs.net.nz/registrant-classification/">the last post</a>, we introduced the models we tried to solve the registrant classification problem. In this post, we will have a look at our final model</p></div>]]></description><link>https://blog.nzrs.net.nz/registrant-classification-ii/</link><guid isPermaLink="false">5b6b11ef65ff1e00182f60fe</guid><dc:creator><![CDATA[Huayi Jing]]></dc:creator><pubDate>Mon, 28 May 2018 03:43:42 GMT</pubDate><media:content url="https://blog.nzrs.net.nz/content/images/2019/06/nasa-Q1p7bh3SHj8-unsplash.jpg" medium="image"/><content:encoded><![CDATA[<div class="kg-card-markdown"><img src="https://blog.nzrs.net.nz/content/images/2019/06/nasa-Q1p7bh3SHj8-unsplash.jpg" alt="Analysis of registrations  from registrant point of view"><p><em>This is a follow-up post on registrant classification. Before reading this post, make sure to check out <a href="https://blog.nzrs.net.nz/registrant-classification/">Registrant Classification using Machine Learning</a>.</em></p>
<p>In <a href="https://blog.nzrs.net.nz/registrant-classification/">the last post</a>, we introduced the models we tried to solve the registrant classification problem. In this post, we will have a look at our final model and the analysis of registrations based on the classifications.</p>
<p>To achieve the best accuracy, we tried out different feature extraction methods and various models. The final model we used is an ensemble method called a stacking classifier, where a set of base learners is trained to generate predictions, and a meta classifier learns to combine these predictions into a final prediction. To make it even better, a Bayesian optimised search was applied to the classifiers to find the optimal hyper-parameters. The best accuracy we achieved is 96.7%.</p>
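<p>Schematically, the stacking setup looks like the sketch below. The base learners, the synthetic features and the omitted Bayesian hyper-parameter search are placeholders, not our production configuration:</p>

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the registrant-name feature matrix,
# with y encoding person (0) vs organisation (1).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Base learners generate predictions; a logistic-regression meta
# classifier learns how to combine them into the final prediction.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    final_estimator=LogisticRegression(),
)
stack.fit(X_tr, y_tr)
accuracy = stack.score(X_te, y_te)
```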
<p>Let’s start by looking at the whole register. Snapshots of registrant data were taken weekly from 2016-09-19 to 2018-03-19, and the registrants were classified as person or organisation. The following figure shows the change in the number of registrants of each type. The total number of registrants shows steady growth from 2016-09-19 to 2018-03-19, with a 17% increase. There are about 50% more organisation registrants than person registrants, and the ratio also shows an upward trend. There were drops due to special events, and at each drop the organisation registrants seem to drop more at first and then quickly catch up again. In March 2018, there were 596,479 registered companies in New Zealand. With 155,856 NZ-based organisation registrants on 2018-03-19, this means 26.13% of the companies in NZ have at least one domain registered.</p>
<img src="https://blog.nzrs.net.nz/content/images/2018/05/per_org_ratio-1.png" alt="Analysis of registrations  from registrant point of view" style="width: 700px;">
<p>Portfolio sizes range from 1 to more than 2600 domains. The figure below shows that most registrants are small portfolio holders with fewer than 5 domains; 67% of registrants own 30% of domains. Less than 1% of registrants have more than 50 domains.</p>
<img src="https://blog.nzrs.net.nz/content/images/2018/05/cumulative_r_domain.png" alt="Analysis of registrations  from registrant point of view" style="width: 650px;">
<p>The bar chart below illustrates that, for small portfolio groups, the split between persons and organisations is even, but for larger portfolio groups, domains are mostly held by organisations. The interesting point is that, among the top 18 portfolio holders, 6 are persons. In fact, the largest portfolio holder, who has more than 2600 domains (almost double the size of the second largest portfolio), is a person!</p>
<img src="https://blog.nzrs.net.nz/content/images/2018/05/domain_per_r.png" alt="Analysis of registrations  from registrant point of view" style="width: 650px;">
<p>An important question we wanted to answer was: do the domains owned by one type of registrant tend to live longer than those owned by the other type? In other words, do they behave differently in terms of retention? The figure below reveals that organisations own most of the domains that are older than 8 years. Taking a close look at the data on 2016-09-19 and 2018-03-19, 82.18% of the organisation-owned domains stayed active in the register, while the percentage of person-owned domains is lower. This is evidence that domains owned by different types of registrants do behave differently, and that organisation-owned domains have a higher retention rate.</p>
<img src="https://blog.nzrs.net.nz/content/images/2018/05/age_dist_2-1.png" alt="Analysis of registrations  from registrant point of view" style="width: 650px;">
<img src="https://blog.nzrs.net.nz/content/images/2018/05/sankey-1.png" alt="Analysis of registrations  from registrant point of view" style="width: 380px;">
<p>The figure below shows the retention rate of different age groups. The retention rate of organisation-owned domains is significantly higher than that of person-owned ones in the first 5 years of age; after that, person-owned domains start to catch up. This means the year of age and the registrant type should be jointly considered when modeling the retention behaviour of a domain.</p>
<img src="https://blog.nzrs.net.nz/content/images/2018/05/retention.png" alt="Analysis of registrations  from registrant point of view" style="width: 600px;">
<p>Going a bit further, let us have a look at another field of the registrant data: the email address. We expected organisations to be the first to use multiple email accounts to manage multiple domains, as they have greater capacity to do so. Counterintuitively, the figure below tells us that it is mostly person registrants who start to use multiple email accounts as their portfolio size grows.</p>
<img src="https://blog.nzrs.net.nz/content/images/2018/05/per_org_email-2.png" alt="Analysis of registrations  from registrant point of view" style="width: 800px;">
<h2 id="futureapplications">Future applications</h2>
<p>There are several interesting future applications of the registrant classifier. Because registrant names are unstandardised free text provided by the registrants themselves, they naturally contain anomalies. With the registrant classifier, we will be able to find strange names that have a low probability of occurring and check the registrations under those names for further anomaly detection. And now that we know the retention behaviour of the two types of registrants is different, we will consider it as an important factor when modelling individual domain retention. And some great news: the registrant classification is being considered as a potential feature in the <a href="https://registry.internetnz.nz/dotnz/portal/">.nz registrar portal</a>! Our registrars will be able to have better knowledge of their registrants in the future.</p>
</div>]]></content:encoded></item><item><title><![CDATA[.nz is joining fellow domain industry leaders at GDD Industry Summit]]></title><description><![CDATA[<div class="kg-card-markdown"><p>I’m currently attending the ICANN organisation’s Global Domains Division (GDD) Industry Summit being held in Vancouver, Canada. Arriving to warm weather, soaring mountains and water as the backdrop certainly made the 14hr flight worthwhile.</p>
<p>Another conference attendee, Jennifer, and I arranged to meet prior to the session</p></div>]]></description><link>https://blog.nzrs.net.nz/tracy-gddsummit/</link><guid isPermaLink="false">5b6b11ef65ff1e00182f6100</guid><dc:creator><![CDATA[Tracy Johnson (Resigned)]]></dc:creator><pubDate>Tue, 15 May 2018 18:24:00 GMT</pubDate><media:content url="https://blog.nzrs.net.nz/content/images/2018/05/crop1.png" medium="image"/><content:encoded><![CDATA[<div class="kg-card-markdown"><img src="https://blog.nzrs.net.nz/content/images/2018/05/crop1.png" alt=".nz is joining fellow domain industry leaders at GDD Industry Summit"><p>I’m currently attending the ICANN organisation’s Global Domains Division (GDD) Industry Summit being held in Vancouver, Canada. Arriving to warm weather, soaring mountains and water as the backdrop certainly made the 14hr flight worthwhile.</p>
<p>Another conference attendee, Jennifer, and I arranged to meet before the session content started. We agreed to hike up the Grouse Grind Trail, a very steep trail that climbs from 300 metres elevation to 1,100 metres. I perhaps should have read the website beforehand, as it warned that the hike is rated difficult. My current lack of fitness certainly meant my body was unprepared for what lay ahead. There are no flat sections – it is straight up. However, upon reaching the top we were rewarded with amazing views from Grouse Mountain overlooking Vancouver city. Reaching the top was certainly an accomplishment we were both proud of.</p>
<p>The GDD Industry Summit is now in full swing. It’s great to catch up with industry colleagues and customers from across the globe.  The event provides attendees the opportunity to engage and address issues of mutual interest and importance.</p>
<p>A common theme of many of the sessions is the General Data Protection Regulation (GDPR), with the implementation date of 25 May 2018 approaching. There are also a couple of sessions discussing brand and marketing which I’m looking forward to attending. It’s a fantastic opportunity to learn from fellow domain industry leaders and work together to continue to provide world-class domain services.</p>
<p>The remainder of my time will be spent in meetings with .nz authorised Registrars and potential new Registrars. The atmosphere is certainly buzzing with acquaintances reconnecting and new connections being made.</p>
<p>If you are attending the summit and would like to schedule a meeting, please email <a href="mailto:tracy@internetnz.net.nz">tracy@internetnz.net.nz</a> or at the very least congratulate me on surviving the Grouse Grind Trail!</p>
</div>]]></content:encoded></item><item><title><![CDATA[Source Address Classification - Feature Engineering]]></title><description><![CDATA[<div class="kg-card-markdown"><h2 id="problembackground">Problem background</h2>
<p>As we operate the authoritative name servers for the .nz ccTLD (Country Code Top Level Domain), we observe more than 500k unique source addresses sending DNS queries to us every day. According to DNS standards, those addresses should mainly be DNS resolvers acting on behalf of their users.</p></div>]]></description><link>https://blog.nzrs.net.nz/source-address-clustering-feature-engineering/</link><guid isPermaLink="false">5b6b11ef65ff1e00182f60fd</guid><category><![CDATA[DNS traffic analysis]]></category><category><![CDATA[Pattern recognition]]></category><category><![CDATA[Time series analysis]]></category><dc:creator><![CDATA[Jing Qiao]]></dc:creator><pubDate>Wed, 21 Mar 2018 00:15:11 GMT</pubDate><media:content url="https://blog.nzrs.net.nz/content/images/2018/03/color-shape.jpeg" medium="image"/><content:encoded><![CDATA[<div class="kg-card-markdown"><h2 id="problembackground">Problem background</h2>
<img src="https://blog.nzrs.net.nz/content/images/2018/03/color-shape.jpeg" alt="Source Address Classification - Feature Engineering"><p>As we operate the authoritative name servers for the .nz ccTLD (Country Code Top Level Domain), we observe more than 500k unique source addresses sending DNS queries to us every day. According to DNS standards, those addresses should mainly be DNS resolvers acting on behalf of their users.</p>
<p>However, the real world is more complicated. Based on our observations, they're not all DNS resolvers. We've seen some known monitoring hosts constantly probing us for uptime testing. We've also seen some addresses setting the Recursion Desired flag in their queries, indicating they're either non-resolvers or misconfigured resolvers. There are also many other addresses whose nature we're uncertain about. These non-resolvers add a lot of noise to our traffic data, which skews our view of the real use of the .nz namespace and thus undermines the accuracy of our domain popularity ranking algorithm. If we're able to differentiate the resolvers from the non-resolvers, we can assign a higher weight to the traffic from a resolver in the domain popularity ranking algorithm to improve its accuracy.</p>
<p>Unfortunately, it's not easy to identify DNS resolvers, and as far as we know, no one has tried to do this using passive data from an authoritative DNS source.</p>
<h2 id="retrospect">Retrospect</h2>
<p>Aware of this challenge, we started to build a classifier to predict whether a source address is a resolver or not. First, we tried to build a supervised classifier. We collected a sample of known resolvers and monitors for training, and derived 14 features based on one day's DNS traffic received by the name servers in NZ. Though the classifier didn't detect as many non-resolvers as we expected, we realized our training data was biased by not including representative samples of other patterns, which caused the model to overfit. So we turned to unsupervised methods to learn the inherent structure of the data and discover more hidden patterns. The efforts of training the supervised and unsupervised classifiers were presented together in <a href="https://cdn.nzrs.net.nz/ADLAnl8xL-wE3/QZRw5VB04zMOe/In%20the%20search%20of%20resolvers%20-%20DNS-OARC%2025.pdf">&quot;In the search of resolvers&quot;, OARC 25</a>.</p>
<p>The unsupervised model showed some interesting patterns and performed well by clustering together most of the known resolvers, but had a problem of mixing some monitors with resolvers in the same cluster. The reason could be that our feature set was not rich enough to differentiate some monitors from resolvers, or that one day's data was not enough for the discrimination. So we extended the data to a longer period of time, 92 days, and looked for more features that could help capture the underlying differences. We ended up with a new clustering model with 30 features, documented in <a href="https://cdn.nzrs.net.nz/Je__JbRB6wleb/ZEZb.LqQZPr9l/Understanding%20Traffic%20Sources%20Seen%20at%20.nz.pdf">&quot;Understanding traffic sources seen at .nz&quot;, OARC 27</a>.</p>
<p>Recently, we've been extending our sample set and doing a lot of feature engineering, and have made further progress in improving the unsupervised clustering. This post will focus on the feature engineering part, and a second post in the near future will explain the clustering efforts.</p>
<h2 id="featureengineeringiskey">Feature engineering is key</h2>
<p>In a machine learning practice, you need to define a set of <a href="https://en.wikipedia.org/wiki/Feature_(machine_learning)">features</a> that is fed into a machine learning algorithm to let it learn and predict. A good feature set often determines your success in practice. But the signals are often hidden in the raw data, and implicit features need to be found through human intuition and domain knowledge. That is why <a href="https://en.wikipedia.org/wiki/Feature_engineering">feature engineering</a> is considered the critical part to put effort into in order to succeed in a machine learning practice.</p>
<blockquote>
<p>&quot;Coming up with features is difficult, time-consuming, requires expert knowledge. Applied machine learning is basically feature engineering.&quot; - Andrew Ng</p>
</blockquote>
<p>You might like to take a deeper look at feature engineering in <a href="https://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/">this post</a>.</p>
<h2 id="initialsetoffeatures">Initial set of features</h2>
<p>We have DNS queries and responses captured and saved in Hadoop tables for analysis. Each query record contains most fields of a DNS query packet, such as the source IP address, timestamp, query name, query type, TTL (Time To Live), and DNS flags. For a given source address, we can build a query stream with the response status, from which we extract and derive relevant features that help characterize the query pattern and capture what distinguishes it from other types of source address.</p>
<p>You may want to refer to <a href="http://blog.nzrs.net.nz/characterization-of-popular-resolvers-from-our-point-of-view-2/">&quot;DNS in a Glimpse&quot;</a> in my former blog for more about DNS query types and DNS response codes.</p>
<p>We initially came up with 14 features, which can be categorized into 3 groups:</p>
<ul>
<li>
<p><strong>Fraction of unique query types</strong> could have a different distribution for a different type of source address. For example, a resolver is likely to send a significant amount of type A queries asking for IPv4 addresses, plus a certain amount of other query types, such as AAAA, MX, and DS for doing validation, while a monitor or tool checking whether our server is running will probably send queries of the same type again and again.</p>
</li>
<li>
<p><strong>Fraction of response codes</strong> can be useful for recognizing source addresses whose queries mostly result in NXDOMAIN (domain doesn't exist) or are REFUSED by the server, which is abnormal.</p>
</li>
<li>
<p><strong>Fraction of DNS flags</strong> can also be a useful feature. For example, a normal resolver shouldn't set the RD (Recursion Desired) bit in its queries to an authoritative name server.</p>
</li>
</ul>
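<p>To make the three groups concrete, here is a small sketch of how such per-source fractions could be derived from a hypothetical, simplified query log of (source, query type, response code, RD flag) tuples; the field names and helper are illustrative, not our production pipeline:</p>

```python
from collections import Counter, defaultdict

# Hypothetical minimal query log: (source_ip, qtype, rcode, rd_flag).
queries = [
    ("192.0.2.1", "A", "NOERROR", 0),
    ("192.0.2.1", "AAAA", "NOERROR", 0),
    ("192.0.2.1", "MX", "NXDOMAIN", 0),
    ("198.51.100.7", "A", "NOERROR", 1),
    ("198.51.100.7", "A", "NOERROR", 1),
]

def feature_fractions(records):
    """Per-source fractions of query types, response codes and the RD flag."""
    by_src = defaultdict(list)
    for src, qtype, rcode, rd in records:
        by_src[src].append((qtype, rcode, rd))
    features = {}
    for src, rows in by_src.items():
        n = len(rows)
        qtypes = Counter(q for q, _, _ in rows)
        rcodes = Counter(r for _, r, _ in rows)
        features[src] = {
            **{f"qtype_{t}": c / n for t, c in qtypes.items()},
            **{f"rcode_{r}": c / n for r, c in rcodes.items()},
            "rd_fraction": sum(rd for _, _, rd in rows) / n,
        }
    return features

feats = feature_fractions(queries)
```

<p>In this toy log, the second source sends only repeated type A queries with RD set, the kind of fingerprint we expect from a non-resolver.</p>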
<h2 id="timefactor">Time factor</h2>
<p>The dataset we're analyzing, the DNS query stream, is a <a href="https://en.wikipedia.org/wiki/Time_series">time series</a>. We need to consider temporal behavior in our feature engineering. We've been recording the data since 2012 for the .nz name servers located in New Zealand, and we recently gained access to additional locations hosted by <a href="https://cira.ca/">CIRA</a>. Our dataset now covers about 80% of the total .nz traffic.</p>
<p>We select a time window spanning 4 weeks, in which we can observe a source address's daily behavior, its weekday versus weekend pattern, and its profile over the whole period.</p>
<p>For the daily pattern, we aggregated the query data on each day and calculated the number of <strong>query types</strong>, <strong>domain names</strong>, <strong>query names</strong>, <strong>total queries</strong> and <strong>queries per domain</strong>, producing multiple time series for a source address. For time series classification and clustering, <a href="http://alexminnaar.com/time-series-classification-and-clustering-with-python.html">this article</a> contains some interesting ideas and practices for reference. The most common and simple way is to summarize each time series using descriptive statistics like the <strong>mean and percentiles</strong>.</p>
<p>For capturing the weekday versus weekend pattern, we calculated <strong>the fraction of weekdays versus weekend days among the total number of visible days</strong> for each source address.</p>
<p>For the whole period, we aggregated the 14 initial features. We also calculated the <strong>visible days and hours</strong> of each source address, which capture the activity of the source address during this period.</p>
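<p>A minimal sketch of these per-window summaries, using a simulated 28-day series and an assumed Monday start (the simulated counts are placeholders for one source's real traffic):</p>

```python
import numpy as np

# Simulated per-day query totals for one source over a 28-day window;
# day 0 is assumed to be a Monday.
rng = np.random.RandomState(1)
daily_queries = rng.poisson(lam=200, size=28)

# Summarize the daily time series with descriptive statistics.
summary = {
    "mean": daily_queries.mean(),
    "p25": np.percentile(daily_queries, 25),
    "p50": np.percentile(daily_queries, 50),
    "p75": np.percentile(daily_queries, 75),
}

# Activity over the whole period, and the weekday/weekend split
# among the visible days (days with any traffic).
visible = daily_queries > 0
weekday = np.arange(28) % 7 < 5
visible_days = int(visible.sum())
weekday_fraction = (visible & weekday).sum() / visible.sum()
```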
<h2 id="datacleaning">Data cleaning</h2>
<p>I have to mention this, as it's an important step to reduce noise before feeding the data into the classification model. Using domain knowledge, we defined some criteria to filter out addresses that are almost certainly not resolvers, and instances with too few data points to be representative. For example, source addresses that queried for the same single domain during the whole 28-day period were removed, as a DNS resolver is unlikely to query for only one domain over that long a period. Source addresses active for only one day during the period, or sending fewer than 10 queries per day, were also removed, as they didn't present enough data. By applying these conditions, we reduced our dataset from 2M to 500k unique source addresses.</p>
<p>To further remove noise, we kept only the instances for source addresses with high visibility, reducing the set to 82k unique sources. We assume a source address with high visibility over the 28-day period is very likely to be an active resolver or a monitor. Now we're going to find more features relevant to differentiating an active resolver from a monitor.</p>
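<p>The cleaning criteria above amount to a simple per-source predicate, sketched here with hypothetical field names around the thresholds mentioned in this section:</p>

```python
def keep_source(stats):
    """Return True if a source address survives the noise filters.

    `stats` is a hypothetical per-source summary over the 28-day window.
    """
    if stats["unique_domains"] <= 1:    # queried a single domain all period
        return False
    if stats["visible_days"] <= 1:      # active on only one day
        return False
    if stats["queries_per_day"] < 10:   # too little daily traffic
        return False
    return True

sources = {
    "resolver-ish": {"unique_domains": 400, "visible_days": 28,
                     "queries_per_day": 1500},
    "one-domain": {"unique_domains": 1, "visible_days": 28,
                   "queries_per_day": 50},
    "one-day": {"unique_domains": 12, "visible_days": 1,
                "queries_per_day": 40},
}
kept = [name for name, s in sources.items() if keep_source(s)]
```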
<h2 id="entropy">Entropy</h2>
<p>Inspired by <a href="https://www.oreilly.com/ideas/identifying-viral-bots-and-cyborgs-in-social-media">this article</a>, we came up with the idea of using entropy to measure the amount of information in the DNS query flow from a source address. Intuitively, a resolver, as the proxy of its real users, should have more randomness in its query flow. A monitor, on the other hand, is programmed or triggered in a fixed way and should therefore be more deterministic.</p>
<p>From this point of view, we thought of two aspects that would affect the entropy of a source address: the inter-arrival time and the similarity of query names between consecutive queries. For the inter-arrival time, we calculated <strong>the inter-arrival time between two consecutive queries</strong>, and <strong>the inter-arrival time between two consecutive instances of the same query</strong> (two queries are the same when they contain an identical query name and query type). For the query name, we calculated <strong>the similarity of the query names between two consecutive queries</strong> using the <a href="https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance">Jaro-Winkler string distance</a>.</p>
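<p>As a toy illustration of the inter-arrival-time aspect (the Jaro-Winkler part needs a third-party string-distance library and is omitted here), Shannon entropy cleanly separates a fixed probing interval from user-driven gaps; the gap values below are made up:</p>

```python
import math
from collections import Counter

def shannon_entropy(symbols):
    """Shannon entropy (bits) of a sequence of discrete symbols."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Inter-arrival times in seconds, bucketed to whole seconds.
monitor_gaps = [60, 60, 60, 60, 60, 60]           # fixed probing interval
resolver_gaps = [0, 3, 1, 27, 2, 0, 9, 1, 14, 4]  # user-driven traffic

e_monitor = shannon_entropy(monitor_gaps)    # 0.0: fully deterministic
e_resolver = shannon_entropy(resolver_gaps)  # > 0: more randomness
```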
<h2 id="variability">Variability</h2>
<p>We also came up with variability as a way to measure the difference between an active resolver and a monitor. A monitor's query flow is likely to be stable across time,  and a resolver, on behalf of its users, should be more random in its queries. We found some <strong>variance metrics</strong>, such as <a href="https://en.wikipedia.org/wiki/Interquartile_range">IQR</a>, <a href="https://en.wikipedia.org/wiki/Quartile_coefficient_of_dispersion">QCD</a>, <a href="https://en.wikipedia.org/wiki/Mean_absolute_difference">Mean Absolute Difference</a>, <a href="https://en.wikipedia.org/wiki/Median_absolute_deviation">Median Absolute Deviation</a>, <a href="https://en.wikipedia.org/wiki/Coefficient_of_variation">CV</a>, powerful to quantify the variability of a set of data.</p>
<p>We first partitioned the whole period by hour, aggregated the time series features per hour, and calculated the mean entropy in each hourly window. Then we used the aforementioned variance metrics to calculate the variability of these aggregated features across the hourly windows.</p>
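<p>As a hedged sketch (with made-up hourly counts, not our real data), two of these metrics, IQR and QCD, can be computed with the standard library alone; a stable, monitor-like series scores much lower on both than a user-driven one:</p>

```python
import statistics

def iqr_qcd(values):
    """Interquartile range and quartile coefficient of dispersion."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1, (q3 - q1) / (q3 + q1)

# Hourly query counts for two hypothetical source addresses
monitor_hourly  = [120, 118, 121, 119, 120, 122]   # stable across hours
resolver_hourly = [40, 310, 95, 500, 12, 260]      # varies with user activity

for name, series in (("monitor", monitor_hourly), ("resolver", resolver_hourly)):
    iqr, qcd = iqr_qcd(series)
    print(f"{name}: IQR={iqr:.1f} QCD={qcd:.3f}")
```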
<h2 id="featureselection">Feature selection</h2>
<p>After the feature brainstorming and exploration above, we came up with 66 features in total. Not all of them are necessarily good features, though: redundant and irrelevant features introduce noise into the dataset and can thus decrease the accuracy of the model.</p>
<p>First, we did a redundancy check on our feature set by computing the correlations among the features. Twelve features with pairwise correlation above 0.95 were removed.</p>
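<p>A minimal sketch of such a redundancy check (illustrative only; <code>uncorrelated_columns</code> is a hypothetical helper, not our actual code) keeps the first feature of each highly correlated pair:</p>

```python
import numpy as np

def uncorrelated_columns(X, threshold=0.95):
    """Indices of columns kept after dropping one of each highly correlated pair."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep, drop = [], set()
    for i in range(corr.shape[1]):
        if i in drop:
            continue
        keep.append(i)
        # drop every later column too correlated with a kept one
        drop.update(j for j in range(i + 1, corr.shape[1]) if corr[i, j] > threshold)
    return keep

rng = np.random.default_rng(42)
a = rng.normal(size=200)
b = 2 * a + 0.01 * rng.normal(size=200)   # near-duplicate of a
c = rng.normal(size=200)                  # independent feature
X = np.column_stack([a, b, c])
print(uncorrelated_columns(X))
```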
<p>Among the 82k source addresses we have to work with, there are 558 labeled resolvers and 82 monitors, which makes it possible to explore which features are most relevant to the label value. We tried <a href="http://scikit-learn.org/stable/modules/feature_selection.html">univariate feature selection</a> with different statistical tests, such as the <a href="http://blog.minitab.com/blog/adventures-in-statistics-2/understanding-analysis-of-variance-anova-and-the-f-test">F-test</a> and <a href="https://en.wikipedia.org/wiki/Mutual_information">Mutual Information (MI)</a>.<br>
<img src="https://blog.nzrs.net.nz/content/images/2018/03/test_score.png" alt="Source Address Classification - Feature Engineering"></p>
<p>As illustrated in <a href="http://scikit-learn.org/stable/auto_examples/feature_selection/plot_f_test_vs_mi.html#sphx-glr-auto-examples-feature-selection-plot-f-test-vs-mi-py">this example</a>, the mutual information method can capture any kind of statistical dependency, while the F-test captures only linear dependency. In the above plot, some features get a low score in the F-test but a high score in the MI test, indicating they have a non-linear relationship with the target.</p>
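<p>That contrast can be reproduced on synthetic data (an illustration in the spirit of the linked scikit-learn example, not our feature set): a feature related to the label only through its square scores low on the F-test but still carries mutual information.</p>

```python
import numpy as np
from sklearn.feature_selection import f_classif, mutual_info_classif

rng = np.random.default_rng(0)
n = 1000
linear = rng.normal(size=n)              # linearly related to the label
symmetric = rng.uniform(-3, 3, size=n)   # only its square matters (non-linear)
y = ((2 * linear + symmetric ** 2) > 3).astype(int)
X = np.column_stack([linear, symmetric])

f_scores, _ = f_classif(X, y)                          # linear dependency only
mi_scores = mutual_info_classif(X, y, random_state=0)  # any dependency
print("F-test:", f_scores, "MI:", mi_scores)
```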
<p>Diving into the features with high scores in both the F-test and MI, we find that features around the fraction of query types beyond A, AAAA, DS and DNSKEY, and the variability of the number of unique query types across hours, play a predominant role in demarcating the resolvers and monitors in our labeled data.</p>
<p><img src="https://blog.nzrs.net.nz/content/images/2018/03/both_high.png" alt="Source Address Classification - Feature Engineering"></p>
<p>On the other hand, features such as the fraction of visible days and hours, the fraction of weekdays versus weekends, the fraction of REFUSED response codes and the fraction of the Recursion Desired DNS flag scored very low in both tests. This is expected, as we filtered out the low-visibility instances, which makes the visibility features less meaningful. The Recursion Desired flag and REFUSED response code do not appear significant enough in our sample data to help differentiate resolvers and monitors, but could be useful features in another scenario, for example anomaly detection.</p>
<p><img src="https://blog.nzrs.net.nz/content/images/2018/03/both_low.png" alt="Source Address Classification - Feature Engineering"></p>
<p>In addition, we can try different feature selection algorithms, for example ones that evaluate the joint effect of multiple features, such as <a href="http://scikit-learn.org/stable/modules/feature_selection.html">Recursive Feature Elimination</a>. We can also rank the features based on the clustering result to help comprehend and interpret the model, which will be introduced in a future post.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Feature engineering is hard. It's where domain knowledge and creativity show their power. We put a lot of effort into it and obtained more and more good features that really improved the clustering result. Thanks to Sebastian for providing great ideas on the entropy and variance metrics. Through feature engineering, we not only found ways to improve our domain popularity ranking algorithm, but also gained a deeper understanding of our data and learned lots of techniques that further boost our expertise. In a later post I will write about the clustering model trained on these features.</p>
</div>]]></content:encoded></item><item><title><![CDATA[Registrant Classification using Machine Learning]]></title><description><![CDATA[<div class="kg-card-markdown"><p>An important task in text data mining is text classification/categorisation. The objective is to start with a training set of documents (e.g. newspaper articles) labeled with a class (e.g. ‘sports’, ‘politics’) and then to determine a classification model to assign a correct class to a new document</p></div>]]></description><link>https://blog.nzrs.net.nz/registrant-classification/</link><guid isPermaLink="false">5b6b11ef65ff1e00182f60fc</guid><dc:creator><![CDATA[Huayi Jing]]></dc:creator><pubDate>Mon, 19 Feb 2018 02:32:13 GMT</pubDate><media:content url="https://blog.nzrs.net.nz/content/images/2019/06/alex-knight-2EJCSULRwC8-unsplash.jpg" medium="image"/><content:encoded><![CDATA[<div class="kg-card-markdown"><img src="https://blog.nzrs.net.nz/content/images/2019/06/alex-knight-2EJCSULRwC8-unsplash.jpg" alt="Registrant Classification using Machine Learning"><p>An important task in text data mining is text classification/categorisation. The objective is to start with a training set of documents (e.g. newspaper articles) labeled with a class (e.g. ‘sports’, ‘politics’) and then to determine a classification model to assign a correct class to a new document automatically. In machine learning (ML), this is a standard supervised learning problem.</p>
<p>As the .nz registry, we collect some information about registrants during the registration process, including the registrant’s name, with no distinction between individuals and organisations. The registrant type is of great interest to us, since it helps us to have a better understanding of the status of the register and, together with our domain industry classification and other information, to create targeted campaigns in the future. Our objective is to have a classifier model that can automatically predict whether a name is a person or an organisation. This is a typical text classification problem. The steps for solving the problem are summarised below:</p>
<p><img src="https://blog.nzrs.net.nz/content/images/2018/02/flow-1.png" alt="Registrant Classification using Machine Learning"></p>
<p>In the beginning, we explored a probabilistic approach using Named Entity Recognition with 2,000 manually classified names. As Natural Language Processing (NLP) libraries have become more and more mature and Deep Learning models more commonly used, it is a good idea to revisit the problem with these techniques and see how we can benefit from them, which is the focus of this blog post.</p>
<h2 id="dataandpreprocessing">Data and Preprocessing</h2>
<p>We have 296,774 unique names in the data set, collected in Feb 2017. The first step is text preprocessing. The length of names varies from 1 to 18 words (including numbers and symbols). The figure below shows that 79.05% of the names are less than 4 words long. With a further look at the data, names that are more than 4 words long are mostly organisations. Short names, however, might be persons or organisations: for example, "Lei Li" versus "Job Ltd".</p>
<div>
    <a href="https://plot.ly/~linking/106/?share_key=x0zCGWIHvtPecJgpcHAuzW" target="_blank" title="Plot 106" style="display: block; text-align: center;"><img src="https://plot.ly/~linking/106.png?share_key=x0zCGWIHvtPecJgpcHAuzW" alt="Registrant Classification using Machine Learning" style="max-width: 100%;width: 600px;" width="600" onerror="this.onerror=null;this.src='https://plot.ly/404.png';"></a>
    <script data-plotly="linking:106" sharekey-plotly="x0zCGWIHvtPecJgpcHAuzW" src="https://plot.ly/embed.js" async></script>
</div>
<p>The histogram below shows the top 30 most popular words. Since most organisation names end with ‘ltd’ or ‘limited’, it is not surprising to see them as the most popular ones. Other popular words include locations (e.g. nz, New Zealand) and words indicating the service an organisation provides (e.g. solutions, trust). An interesting observation is that the most popular personal names are all male names.</p>
<iframe width="900" height="800" frameborder="0" scrolling="no" src="//plot.ly/~linking/40.embed"></iframe>
<h2 id="themodels">The Models</h2>
<p>As a benchmark, we first tried traditional machine learning models using <a href="http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html">Pipeline in Sklearn</a>, which sequentially applies some transformers and a final classifier. It is very easy to use and fast. We have 29,348 hand-classified names; the training/testing split we used was 90/10. We tried SVM, Naive Bayes and Logistic Regression as classifiers, and the accuracy was 92.0%, 91.8% and 92.1% respectively.</p>
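<p>A minimal sketch of such a pipeline follows (with a toy dataset, not the real registrant names; the exact transformers and classifier settings we used may differ):</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy training data: person names versus organisation names
names = ["jeremy ashford", "treecare ltd", "anna smith", "kiwi hosting limited",
         "john brown", "acme solutions ltd", "mary jones", "webworks nz limited"]
labels = ["person", "org", "person", "org", "person", "org", "person", "org"]

# Vectoriser feeds character n-gram TF-IDF features into the classifier
clf = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))),
    ("model", LogisticRegression()),
])
clf.fit(names, labels)
print(clf.predict(["susan taylor", "fernhill plumbing ltd"]))
```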
<p>Another method is to first transform words into vectors using <a href="https://arxiv.org/abs/1301.3781">Word2Vec</a> or <a href="https://radimrehurek.com/gensim/models/doc2vec.html">Doc2Vec</a> and then train a machine learning algorithm (i.e., a classifier) on top of them. We tried both with <a href="https://radimrehurek.com/gensim/index.html">Gensim</a>. Word2Vec is a machine learning algorithm based on a neural network that can learn the relationships between words automatically. We take the average of the word vectors in a registrant name so that we can represent a name using just one vector. Doc2Vec can represent a whole document as a vector automatically. We tried several different classifiers, including SVM, Naive Bayes, Random Forest, Logistic Regression, KNN and MLP-NN. The accuracy for Word2Vec vectors ranges from 83% (Naive Bayes) to 93.3% (MLP-NN). For Doc2Vec vectors, the accuracy ranges from 77.9% (Naive Bayes) to 90.7% (MLP-NN).</p>
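<p>Averaging word vectors into a single name vector can be sketched as follows (the embedding values are made up for illustration; in practice the lookup table would come from a trained Gensim Word2Vec model):</p>

```python
import numpy as np

# Toy embedding lookup standing in for a trained Word2Vec model's vectors
embeddings = {
    "acme":      np.array([0.9, 0.1, 0.0]),
    "solutions": np.array([0.8, 0.2, 0.1]),
    "ltd":       np.array([0.7, 0.3, 0.2]),
}

def name_vector(name, embeddings, dim=3):
    """Represent a registrant name as the mean of its word vectors."""
    vecs = [embeddings[w] for w in name.lower().split() if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

print(name_vector("Acme Ltd", embeddings))
```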
<p>Now, neural networks are not new to us anymore, since Word2Vec uses a shallow 2-layer neural network to produce word vectors. A deep neural network like a Convolutional Neural Network (CNN) has more layers and is widely used in computer vision and NLP. The one we used has an <a href="http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/">embedding layer, followed by a convolutional, max-pooling and softmax layer</a>. One can either train the word embeddings from scratch using the CNN or use a pre-trained word embedding. <a href="https://arxiv.org/abs/1510.03820">Zhang and Wallace (2015)</a> found that using a pre-trained word embedding performed better than not using one. We used Google’s <a href="https://code.google.com/archive/p/word2vec/">pre-trained word2vec embeddings</a> and implemented the CNN with <a href="https://www.tensorflow.org">Tensorflow</a>. With 71 parameter sets for grid search, the training took 5.5 hours and the best accuracy we got was 91.1%.</p>
<p>Another deep learning model we tried was the Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM). A good introduction to LSTM-RNN can be found <a href="https://iamtrask.github.io/2015/11/15/anyone-can-code-lstm/">here</a>. RNN is known to perform well on text classification problems. It is designed to learn from sequences of data where time dependency is important, which is why another application of RNN is time-series analysis. LSTM helps the RNN focus on certain parts of a sequence and ignore unimportant words. It was implemented with <a href="https://keras.io">Keras</a> and the vocabulary was trained from scratch. The best accuracy we got was 92.7% after training 20 epochs for 2.2 hours.</p>
<p>Finally, we tried <a href="https://research.fb.com/fasttext/">fastText</a>, developed by Facebook. The library is surprisingly fast in comparison to other methods achieving the same accuracy. The accuracy we got was 92.9%. The best accuracies from the different models are summarised below.</p>
<img src="https://blog.nzrs.net.nz/content/images/2018/02/accuracy-1.png" alt="Registrant Classification using Machine Learning" style="width: 600px;">
<p>Six registrant names (treated to protect registrants’ privacy) are selected to compare the predictions generated by the trained models. They represent names that (1) are probably persons, e.g., “Jeremy Ashford”; (2) have a single word in the name but are still not hard to tell, e.g., “Jacqui”; (3) have important words in the name and hence are very easy to predict, e.g., “Treecare Ltd”; (4) consist of a single compound word containing important sub-words, e.g., “Techsoft”; (5) seem to be a person name but not a common English name, e.g., “Tan”; (6) have no clear meaning, where it might be very hard even for a human to tell, e.g., “wjja”.</p>
<img src="https://blog.nzrs.net.nz/content/images/2018/02/compare.png" alt="Registrant Classification using Machine Learning" style="width: 600px;">
<h3 id="somethoughts">Some Thoughts</h3>
<p>Solving the registrant classification problem has been a good opportunity for us to learn and apply different models used for text classification. There is still room for improvement. For example, we could apply grid search to the parameters used in the traditional machine learning models and LSTM-RNN to achieve higher accuracy. Also, if the prediction errors across different models are not highly correlated, there could be a good opportunity to benefit from ensembles. We are trying these things out to boost the accuracy and will share our final model with you.</p>
</div>]]></content:encoded></item><item><title><![CDATA[Registrar Size Prediction]]></title><description><![CDATA[<div class="kg-card-markdown"><p>An interesting request we received after the <a href="https://vimeopro.com/nzrs/2017-nz-registrar-conference/video/217935980">register size prediction</a> presentation at the Registrar Conference earlier last year, was a suggestion of applying similar methodology on the data per registrar. In this blog post, we will share the methodology and insights found during the registrar size prediction modelling.</p>
<h2 id="descriptiveanalysis">Descriptive analysis</h2></div>]]></description><link>https://blog.nzrs.net.nz/registrar-size-prediction/</link><guid isPermaLink="false">5b6b11ef65ff1e00182f60fb</guid><dc:creator><![CDATA[Huayi Jing]]></dc:creator><pubDate>Mon, 15 Jan 2018 03:16:09 GMT</pubDate><media:content url="https://blog.nzrs.net.nz/content/images/2019/06/bill-mackie-if6nfbh9vna-unsplash.jpg" medium="image"/><content:encoded><![CDATA[<div class="kg-card-markdown"><img src="https://blog.nzrs.net.nz/content/images/2019/06/bill-mackie-if6nfbh9vna-unsplash.jpg" alt="Registrar Size Prediction"><p>An interesting request we received after the <a href="https://vimeopro.com/nzrs/2017-nz-registrar-conference/video/217935980">register size prediction</a> presentation at the Registrar Conference earlier last year, was a suggestion of applying similar methodology on the data per registrar. In this blog post, we will share the methodology and insights found during the registrar size prediction modelling.</p>
<h2 id="descriptiveanalysis">Descriptive analysis</h2>
<p>As of 1st August 2017, there were 89 active registrars in the .nz register. The prediction procedure is not feasible for all of them. Two features help us determine whether a registrar’s data is statistically ready for prediction: (1) the age, i.e. how long a registrar’s data has existed in the register: the more data points we have for prediction modelling, the more accurate the prediction will be; and (2) the size: the larger a registrar is, the clearer the trend we may find underlying the historical change in its size.</p>
<div>
    <a href="https://plot.ly/~linking/17/?share_key=wGGmec5443CUOM7xbAwHKI" target="_blank" title="Plot 17" style="display: block; text-align: center;"><img src="https://plot.ly/~linking/17.png?share_key=wGGmec5443CUOM7xbAwHKI" alt="Registrar Size Prediction" style="max-width: 100%;width: 600px;" width="600" onerror="this.onerror=null;this.src='https://plot.ly/404.png';"></a>
    <script data-plotly="linking:17" sharekey-plotly="wGGmec5443CUOM7xbAwHKI" src="https://plot.ly/embed.js" async></script>
</div>
<p>As shown in the above figure, the distributions of registrar age and size exhibit interesting points. Among these registrars, 40 have under 1000 active domains. The smallest registrar, although it has been recorded since June 2010, has only 7 domains. The age distribution shows the opposite pattern: 76 registrars are older than 5 years, among which 56 are older than 10 years (note that domains have been transferred between registrars now and then for different reasons, so some of the information here might not be precise). The registrars that are at least 60 months old and have at least 5000 domains are selected for prediction. That leaves us with 20 registrars.</p>
<h2 id="theprediction">The prediction</h2>
<p>The prediction procedure follows the one described in the previous <a href="http://blog.nzrs.net.nz/register-size-prediction/">blog</a>. Two assumptions are made so that the procedure can be applied: (1) although a domain might be transferred to another registrar, it is treated as a new create into that registrar. (2) different SLDs are assumed to behave similarly, so that we have enough data points for prediction.</p>
<p>Let’s first have a look at the retention behaviour. Taking Registrar A as an example, the following two figures show the retention behaviour of <em>domains registered at different periods</em> (i.e. cohorts; the retention rate is estimated using multi-cohort data). It can be seen that the drop-out rate is high in the early years and then slows down as the domains stay longer. From the heat map we can observe that, on average, the retention rate of more recent cohorts is higher, which is a good thing to know.</p>
<div>
    <a href="https://plot.ly/~linking/18/?share_key=G2k4OKpGamsmyLvlHtgGea" target="_blank" title="Plot 18" style="display: block; text-align: center;"><img src="https://plot.ly/~linking/18.png?share_key=G2k4OKpGamsmyLvlHtgGea" alt="Registrar Size Prediction" style="max-width: 100%;width: 600px;" width="600" onerror="this.onerror=null;this.src='https://plot.ly/404.png';"></a>
    <script data-plotly="linking:18" sharekey-plotly="G2k4OKpGamsmyLvlHtgGea" src="https://plot.ly/embed.js" async></script>
</div>
<img src="https://blog.nzrs.net.nz/content/images/2018/01/reg128cohorts.png" alt="Registrar Size Prediction" style="width: 600px;">
<p>The new creates forecast reveals some interesting findings as well. For the two registrars shown below, the historical new creates data shows a clear downward trend, which might indicate a change in the focus of the business. The forecast could therefore be negative for some periods, in which case it is replaced by zero.</p>
<img src="https://blog.nzrs.net.nz/content/images/2017/12/registrardecline-12.png" alt="Registrar Size Prediction" style="width: 600px;">
<p>Some registrars have extremely stationary and low-quantity new creates over time, as shown in the figure below. For such registrars, there is barely any trend or cyclic fluctuation underlying the data points.</p>
<img src="https://blog.nzrs.net.nz/content/images/2017/12/registrarstable.png" alt="Registrar Size Prediction" style="width: 600px;">
<p>At the other extreme, some registrars have new creates data that fluctuates greatly and shows no clear trend or seasonality. Accurate forecasts for such cases are hard. A further look at the reasons behind those fluctuations would help produce more reasonable forecasts.</p>
<img src="https://blog.nzrs.net.nz/content/images/2018/01/fluctuation.png" alt="Registrar Size Prediction" style="width: 600px;">
<p>To test the performance of the prediction, historical data up to May 2017 is used to make predictions for June, July and August 2017. The table below shows the MAPE (mean absolute percent error) for each registrar, in descending order by size. As mentioned before, smaller registrars are harder to predict, so it is not surprising to see some of the bottom 10 registrars with a MAPE greater than 10%. In general, our procedure generates comparatively accurate predictions for relatively big registrars.</p>
<img src="https://blog.nzrs.net.nz/content/images/2018/01/MAPE-1.png" alt="Registrar Size Prediction" style="width: 350px;">
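<p>MAPE itself is simple to compute; the registrar sizes below are made up purely for illustration, not taken from the register:</p>

```python
def mape(actual, predicted):
    """Mean absolute percent error, in percent."""
    return 100 * sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical registrar sizes for the three held-out months (Jun-Aug 2017)
actual    = [105000, 106200, 107100]
predicted = [104100, 107000, 108500]
print(f"MAPE: {mape(actual, predicted):.2f}%")
```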
<p>Finally, let’s see the prediction results for the top 20 registrars. The total size of this group is increasing over time, and larger registrars also show an increasing trend. Some registrars’ sizes decrease slightly each month; this is due to the forecasted low new creates and/or a comparatively larger number of drop-outs in certain months.</p>
<div>
    <a href="https://plot.ly/~linking/29/?share_key=6E2NYJbK7iMFM9pyMDUwGA" target="_blank" title="r1-r10" style="display: block; text-align: center;"><img src="https://plot.ly/~linking/29.png?share_key=6E2NYJbK7iMFM9pyMDUwGA" alt="Registrar Size Prediction" style="max-width: 100%;width: 600px;" width="600" onerror="this.onerror=null;this.src='https://plot.ly/404.png';"></a>
    <script data-plotly="linking:29" sharekey-plotly="6E2NYJbK7iMFM9pyMDUwGA" src="https://plot.ly/embed.js" async></script>
</div>
<div>
    <a href="https://plot.ly/~linking/22/?share_key=ufWNoYuN38TwOrPHFU6Xz3" target="_blank" title="r11-r20_2018" style="display: block; text-align: center;"><img src="https://plot.ly/~linking/22.png?share_key=ufWNoYuN38TwOrPHFU6Xz3" alt="Registrar Size Prediction" style="max-width: 100%;width: 600px;" width="600" onerror="this.onerror=null;this.src='https://plot.ly/404.png';"></a>
    <script data-plotly="linking:22" sharekey-plotly="ufWNoYuN38TwOrPHFU6Xz3" src="https://plot.ly/embed.js" async></script>
</div>
<p>Registrar size prediction is more challenging than register size prediction due to the data quality after segmentation. Nonetheless, some interesting findings surfaced along the way. Since bulk transfers of domains between registrars happen for various reasons (e.g., movement of re-sellers or large portfolio holders between registrars), a further investigation of those cases would help improve the quality of the data and the prediction. For data that is reasonably stationary, using a naive or moving-average forecasting technique might be a better choice. These could be directions for follow-on work.</p>
</div>]]></content:encoded></item><item><title><![CDATA[Scanning .nz for HTTPS support]]></title><description><![CDATA[<div class="kg-card-markdown"><p>As part of our efforts to understand the .nz namespace better, we started at the beginning of 2017 to check domains for the presence of a secure website using HTTPS, and collect information about certificates, protocol features and other valuable information in the process.</p>
<p>The collection process is straightforward:</p>
<ul>
<li>Extract</li></ul></div>]]></description><link>https://blog.nzrs.net.nz/scanning-nz-for-https-support/</link><guid isPermaLink="false">5b6b11ef65ff1e00182f60fa</guid><dc:creator><![CDATA[Sebastian Castro]]></dc:creator><pubDate>Thu, 19 Oct 2017 22:32:25 GMT</pubDate><media:content url="https://blog.nzrs.net.nz/content/images/2019/06/philipp-katzenberger-iIJrUoeRoCQ-unsplash.jpg" medium="image"/><content:encoded><![CDATA[<div class="kg-card-markdown"><img src="https://blog.nzrs.net.nz/content/images/2019/06/philipp-katzenberger-iIJrUoeRoCQ-unsplash.jpg" alt="Scanning .nz for HTTPS support"><p>As part of our efforts to understand the .nz namespace better, we started at the beginning of 2017 to check domains for the presence of a secure website using HTTPS, and collect information about certificates, protocol features and other valuable information in the process.</p>
<p>The collection process is straightforward:</p>
<ul>
<li>Extract the list of active .nz domains in the register</li>
<li>For each domain, verify if there is an A record for the host <em>www</em>.<strong>domain</strong>. If the resolution process fails, the domain won't be included.</li>
<li>Test each domain using <a href="https://github.com/nabla-c0d3/sslyze">sslyze</a>. We have a script that will test for different versions of SSL and TLS, protocol features and will collect information about the certificate chain on the site.</li>
<li>Once the collection is completed, produce aggregated counters and make them available on <a href="https://idp.nz/Domain-Names/-nz-SSL-scan-results/cmxt-74aq">IDP</a>, our Internet Data Portal.</li>
</ul>
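<p>The first two steps could be sketched as follows (a simplified illustration; <code>has_website</code> is a hypothetical helper, and the real process runs over the full register and hands surviving domains to sslyze):</p>

```python
import socket

def has_website(domain, resolve=socket.getaddrinfo):
    """True if www.<domain> resolves to at least one IPv4 address."""
    try:
        resolve(f"www.{domain}", 443, socket.AF_INET)
        return True
    except (socket.gaierror, OSError):
        return False

# Domains whose www host fails to resolve are excluded from the scan:
# scannable = [d for d in active_nz_domains if has_website(d)]
```

The resolver is injectable so the check can be exercised without network access.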
<p>We started in January 2017 and so far the process has completed five times, giving us some valuable datapoints.</p>
<h3 id="httpssupport">HTTPS support</h3>
<p>Let's start with the overall picture of how much of the .nz namespace has a secure website.</p>
<div id="https_support_vis"></div>
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/1.11.3/jquery.min.js"></script>
<script src="https://cdn.plot.ly/plotly-latest.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/moment.js/2.19.1/moment.min.js"></script>
<script>
    var dataSource = "https://idp.nz/resource/8khv-hfif.json" + 
"?Classification='Total'";

    $.getJSON(dataSource, function(data, textstatus) {
          var points = {};
          $.each(data, function(i, entry) {
               var x_point = moment(entry.date.substr(0, 10)).format('MMM YYYY');
               var y_point = +(100*(entry.count/entry.domains)).toFixed(2);
               if (entry.metric in points) {
                    points[entry.metric]['x'].push(x_point);
                    points[entry.metric]['y'].push(y_point);
               }
               else {
                   points[entry.metric] = { x: [x_point], y: [y_point] };
               }
            });

            // Convert points into something suitable for plotly
            var data = [];
            var prefColor = {
                    'Broken DNS': 'rgb(215,25,28)',
                    'No HTTPS Support': 'rgb(253,174,97)',
                    'HTTPS Support': 'rgb(171,217,233)',
                    'Invalid Certificate': 'rgb(44,123,182)'
            };

            for (var m in points) {
                data.push({
                    type: 'bar',
                    x: points[m].x,
                    y: points[m].y,
                    marker: {
                        color: prefColor[m]
                    },
                    name: m });
            }
            var https_layout = {
                autosize: true,
                title: '.nz HTTPS Support',
                yaxis: { title: '% of domains' },
                xaxis: { zeroline: true,
                        showline: false,
                        type: 'category'
                },
                barmode: 'stack',
                margin: { l: 50, r: 50, b: 20, t: 40 }
            };

            Plotly.newPlot('https_support_vis', data, https_layout);
        });
</script>
<p>On our first collection, we didn't register how many domains failed the test due to an incorrect DNS response; that category was added afterwards. Despite that detail, there is a lot to notice here. Around 14% of the domains fail the DNS test, and around 0.7% have invalid certificates, for example where the name in the certificate doesn't match the website name. The positive news is that HTTPS support grew from 44% to 47% during this year.</p>
<h3 id="protocolsupport">Protocol Support</h3>
<p>HTTPS relies on cryptographic protocols, such as SSL and TLS, to ensure privacy and authenticity. A given webserver can support multiple crypto protocols at the same time. SSL v2.0 was released in February 1995 and SSL v3.0 in 1996; TLS v1.0 was defined in January 1999 as a replacement for SSL v3.0, TLS v1.1 was published in April 2006 and TLS v1.2 in August 2008. TLS v1.3 is currently being drafted in the IETF, so we don't test for it.</p>
<div id="proto_support_vis"></div>
<p>As SSL v2.0 was deprecated and prohibited in 2011 by <a href="https://tools.ietf.org/html/rfc6176">RFC 6176</a>, and SSL v3.0 was deprecated in June 2015 by <a href="https://tools.ietf.org/html/rfc7568">RFC 7568</a>, <strong>there SHOULD NOT be any domains supporting them.</strong> Sites with those crypto protocols activated are at risk given the number of known exploitable vulnerabilities against SSL. If you are the administrator of a secure website, you can use the excellent <a href="https://www.ssllabs.com/ssltest/">Qualys SSL Site tester</a> to check for weaknesses.</p>
<h3 id="certificatepublickeys">Certificate Public Keys</h3>
<p>Crypto protocols use public key cryptography to provide privacy and authenticity. To achieve this, SSL and TLS rely on certificates issued by Certificate Authorities (CA), that authenticate the identity of the website you are visiting, and contain a cryptographic key to encrypt traffic.</p>
<p>Cryptographic keys have two main properties: the algorithm used to generate them, which can be RSA or ECC, and the key size, measured in bits.</p>
<p>As part of the testing, we collect what kind of cryptographic keys the sites with HTTPS enabled are using.</p>
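<p>Producing the aggregated counters from such scan results can be sketched like this (the records below are invented; real sslyze output is far richer):</p>

```python
from collections import Counter

# Made-up scan records; the fields mimic what a certificate check might yield
scans = [
    {"domain": "a.nz", "key_algo": "RSA", "key_size": 2048},
    {"domain": "b.nz", "key_algo": "RSA", "key_size": 4096},
    {"domain": "c.nz", "key_algo": "id-ecPublicKey", "key_size": 256},
    {"domain": "d.nz", "key_algo": "RSA", "key_size": 2048},
]
key_counts = Counter(f'{s["key_size"]} {s["key_algo"]}' for s in scans)
print(key_counts.most_common())
```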
<div id="cert_key_vis"></div>
<p>You can see 3 different key sizes for RSA keys: 1024, 2048 and 4096. A 1024-bit RSA key is considered weak and should not be used. The fact that keys of that size were visible at the beginning of this year but have now disappeared shows a healthy attitude towards security maintenance. The transition from 2048-bit keys to 4096-bit keys observed since August 2017 is also a great indication of the hygiene of the .nz namespace.</p>
<p>What about the <strong>256 id-ecPublicKey</strong> cases? Those are sites using the newer Elliptic Curve DSA (ECDSA) cryptography, which uses smaller key sizes. There are also a few cases using 384-bit ECDSA keys.</p>
<p>To make the visualisation easy to read, we omit values with less than 1% of domains, but we can find some oddities such as sites with unusual key sizes. For example, there are still a few cases with weak 512-bit RSA keys, or 3000-bit keys, or 1536-bit keys (1536 is half-way between 1024 and 2048).</p>
<h3 id="certificatesignaturealgorithms">Certificate Signature Algorithms</h3>
<p>Each SSL certificate contains a digital signature, a hash of the certificate content, signed with the issuing Certificate Authority private key. This digital signature allows the verification of the integrity of the certificate and enables browsers to verify the validity of the certificate.</p>
<p>There are a few hashing algorithms used for signatures, such as MD5, SHA-1 family and SHA-256 family, including SHA-256, SHA-384 and SHA-512.</p>
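<p>For reference, the digest sizes of these hash families are easy to inspect with Python's hashlib (illustrative only; this is not part of our collection pipeline):</p>

```python
import hashlib

# Digest sizes of the signature hash algorithms mentioned above
data = b"to-be-signed certificate bytes"
for algo in ("md5", "sha1", "sha256", "sha384", "sha512"):
    digest = hashlib.new(algo, data).hexdigest()
    print(f"{algo}: {len(digest) * 4} bits")  # 4 bits per hex character
```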
<div id="sign_vis"></div>
<p>Great news, right? Most of the sites use strong SHA-256 hashes, with RSA or ECDSA. As SHA-1 is subject to collision attacks, it's feasible to generate fraudulent certificates and trick users into using the wrong website. Firefox <a href="https://blog.mozilla.org/security/2014/09/23/phasing-out-certificates-with-sha-1-based-signature-algorithms/">now</a> warns users visiting secure websites using certificates signed with SHA-1, and NIST <a href="https://csrc.nist.gov/publications/detail/sp/800-57-part-1/rev-3/archive/2012-07-10">has recommended</a> moving away from SHA-1 since 2014. So despite the fraction of sites with weak hashes dropping from 7.1% to 4.5%, we still have reasons to be concerned.</p>
<p>As with the previous plot, for clarity we didn't include signature algorithms with less than 1% of the domains. There are still a few domains using MD5 as the hashing algorithm in their certificates; considering MD5 was deprecated back in 2013, that's certainly not good news.</p>
<h3 id="certificateauthorities">Certificate Authorities</h3>
<p>As we mentioned before, certificates are issued by Certificate Authorities, and browsers are configured to trust a set of CAs. An administrator can relatively easily set up their own CA and create certificates, but those won't be verifiable by a browser.</p>
<p>The plot below shows the top 10 issuing Certificate Authorities observed in the .nz namespace. For simplicity, we group the long tail of CAs including self-issued certificates into the 'Other' category.</p>
<div id="ca_vis"></div>
<p>The figure shows the biggest players in the certificate industry as observed in New Zealand: UserTrust Network, GoDaddy, GeoTrust, Comodo and Let's Encrypt. The interesting bit is the evolution over time: Let's Encrypt grew from 16.29% to 22.84% of the domains in nine months, and that growth came at the cost of more traditional CAs such as GoDaddy, GeoTrust and Comodo. The other CA with massive growth is UserTrust Network, which nearly tripled its market share since January 2017.</p>
<p>Let's Encrypt is an interesting story: it launched in December 2015 to issue certificates for free, with a validity of 90 days and a fully automated issuance process. Before that, most CAs were for-profit organisations, in some cases charging considerable fees to issue a certificate. <a href="https://letsencrypt.org/">Let's Encrypt</a> is supported by the Internet Security Research Group (ISRG), which includes Facebook, the Mozilla Foundation, Google Chrome, EFF and many others.</p>
<h3 id="wrapup">Wrap up</h3>
<p>We started this collection with the motivation of finding out more about trust and security in the .nz namespace and we've learned a lot in the process. The collection will carry on every other month and a future blog post could cover more details about the protocols, such as negotiation features, certificate chain length and intermediate CAs.</p>
<script> 
// Data about Protocol Support
var dataSource2 = "https://idp.nz/resource/8khv-hfif.json" + "?$where=Classification='Protocol Support'&$order=date,metric";
var dataSource3 = "https://idp.nz/resource/8khv-hfif.json" + "?$where=Classification='Certificate Public Key'&$order=date,metric";
var dataSource4 = "https://idp.nz/resource/8khv-hfif.json" + "?$where=Classification='Certificate Signature Algorithm'&$order=date,metric";
var dataSource5 = "https://idp.nz/resource/8khv-hfif.json" + "?$where=Classification='CA issuer'&$order=date,metric";
var totalSource = "https://idp.nz/resource/8khv-hfif.json" + "?$where=Classification='Total' AND Metric='HTTPS Support'";

// Get the totals first
$.getJSON(totalSource, function(total, jsonstatus) {
    var totals = {};
    $.each(total, function(i, entry) {
        var x = moment(entry.date.substr(0, 10)).format('MMM YYYY');
        totals[x] = entry.count;
    });

    $.getJSON(dataSource2, function(data, textstatus) {
        var points = {};

        $.each(data, function(i, entry) {
            var x_point = moment(entry.date.substr(0, 10)).format('MMM YYYY');
            var y_point = +(100*(entry.count/totals[x_point])).toFixed(2);
            if (x_point in points) {
                points[x_point]['x'].push(entry.metric);
                points[x_point]['y'].push(y_point);
            }
            else {
                points[x_point] = { x: [entry.metric], y: [y_point] };
            }
        });

        // Convert points into something suitable for plotly
        var data = [];
        for (var m in points) {
            data.push({ type: 'bar', x: points[m].x, y: points[m].y, name: m });
        }

        var proto_layout = {
            autosize: true,
            title: 'HTTPS Protocol Support',
            yaxis: { title: '% of domains' },
            xaxis: { zeroline: true,
                    showline: false,
                    type: 'category',
                    tickvals: ['SSL v2', 'SSL v3', 'TLS v1.0', 'TLS v1.1', 'TLS v1.2'],
                    ticktext: ['SSL 2', 'SSL 3', 'TLS 1.0', 'TLS 1.1', 'TLS 1.2']
            },
            margin: { l: 50, r: 50, b: 20, t: 40 }
        };

        Plotly.newPlot('proto_support_vis', data, proto_layout);
    });

    // Next plot
    $.getJSON(dataSource3, function(data, textstatus) {
        var points = {};

        $.each(data, function(i, entry) {
            var x_point = moment(entry.date.substr(0, 10)).format('MMM YYYY');
            var y_point = +(100*(entry.count/totals[x_point])).toFixed(2);
            if (y_point > 1.0) {
                if (x_point in points) {
                    points[x_point]['x'].push(entry.metric);
                    points[x_point]['y'].push(y_point);
                }
                else {
                    points[x_point] = {
                        x: [entry.metric],
                        y: [y_point] };
                }
            }
        });

        // Convert points into something suitable for plotly
        var data = [];
        for (var m in points) {
            data.push({ type: 'bar',
                x: points[m].y,
                y: points[m].x,
                orientation: 'h',
                name: m });
        }

        var key_layout = {
            autosize: true,
            title: 'Certificate Public Key Distribution',
            yaxis: { type: 'category' },
            xaxis: { zeroline: true,
                    showline: false,
                    title: '% of domains'
            },
            margin: { l: 150, r: 50, b: 30, t: 30 }
        };

        Plotly.newPlot('cert_key_vis', data, key_layout);
    });

    // Next Plot
    $.getJSON(dataSource4, function(data, textstatus) {
        var points = {};

        $.each(data, function(i, entry) {
            var x_point = moment(entry.date.substr(0, 10)).format('MMM YYYY');
            var y_point = +(100*(entry.count/totals[x_point])).toFixed(2);
            if (y_point > 1.0) {
                if (x_point in points) {
                    points[x_point]['x'].push(entry.metric);
                    points[x_point]['y'].push(y_point);
                }
                else {
                    points[x_point] = {
                        x: [entry.metric],
                        y: [y_point] };
                }
            }
        });

        // Convert points into something suitable for plotly
        var data = [];
        for (var m in points) {
            data.push({ type: 'bar',
                x: points[m].y,
                y: points[m].x,
                orientation: 'h',
                name: m });
        }

        var sign_layout = {
            autosize: true,
            title: 'Signature Algorithm Distribution',
            yaxis: { type: 'category' },
            xaxis: { zeroline: true,
                    showline: false,
                    title: '% of domains'
            },
            margin: { l: 180, r: 50, b: 30, t: 30 }
        };

        Plotly.newPlot('sign_vis', data, sign_layout);
    });

    // Next Plot
    $.getJSON(dataSource5, function(data, textstatus) {
        var points = {};

        $.each(data, function(i, entry) {
            var x_point = moment(entry.date.substr(0, 10)).format('MMM YYYY');
            var y_point = +(100*(entry.count/totals[x_point])).toFixed(2);
            if (y_point > 1.0) {
                if (x_point in points) {
                    points[x_point]['x'].push(entry.metric);
                    points[x_point]['y'].push(y_point);
                }
                else {
                    points[x_point] = {
                        x: [entry.metric],
                        y: [y_point] };
                }
            }
        });

        // Convert points into something suitable for plotly
        var data = [];
        for (var m in points) {
            data.push({ type: 'bar',
                x: points[m].y,
                y: points[m].x,
                orientation: 'h',
                name: m });
        }

        var ca_layout = {
            autosize: true,
            title: 'CA Issuer Distribution',
            yaxis: { type: 'category' },
            xaxis: { zeroline: true,
                    showline: false,
                    title: '% of domains'
            },
            margin: { l: 180, r: 50, b: 30, t: 30 }
        };

        Plotly.newPlot('ca_vis', data, ca_layout);
    });

});
</script>
</div>]]></content:encoded></item><item><title><![CDATA[.nz DNS traffic: Trend and Anomalies]]></title><description><![CDATA[<div class="kg-card-markdown"><p>As we run the .nz ccTLD (Country Code Top Level Domain) authoritative nameservers, we receive lots of DNS queries and answer each query with a DNS response. We capture these queries and responses, and store them in a Hadoop cluster for further analysis. Based on this data, we generate daily</p></div>]]></description><link>https://blog.nzrs.net.nz/nz-dns-traffic-trend-and-anomalies/</link><guid isPermaLink="false">5b6b11ef65ff1e00182f60f9</guid><dc:creator><![CDATA[Jing Qiao]]></dc:creator><pubDate>Wed, 14 Jun 2017 04:56:52 GMT</pubDate><media:content url="https://blog.nzrs.net.nz/content/images/2017/06/magento-statistics-1.png" medium="image"/><content:encoded><![CDATA[<div class="kg-card-markdown"><img src="https://blog.nzrs.net.nz/content/images/2017/06/magento-statistics-1.png" alt=".nz DNS traffic: Trend and Anomalies"><p>As we run the .nz ccTLD (Country Code Top Level Domain) authoritative nameservers, we receive lots of DNS queries and answer each query with a DNS response. We capture these queries and responses, and store them in a Hadoop cluster for further analysis. Based on this data, we generate daily statistics about our DNS traffic and publish it in IDP (Internet Data Portal) as <a href="https://idp.nz/Domain-Names/-nz-DNS-Statistics/5a6u-t52b">.nz DNS Statistics</a>.</p>
<p>We have a clean, continuous dataset dating back to 2015. We are now able to apply some time series analysis to explore trends. This post will show some interesting results from that analysis.</p>
<h2 id="querytrend">Query trend</h2>
<p>A DNS query contains several attributes such as the domain being queried, the query type (the type of resource related to the domain) and the set of DNS header flags. Based on the aggregated counts of each attribute, we can explore the data across a variety of dimensions.</p>
<h3 id="registereddomainsqueried">Registered Domains Queried</h3>
<p><strong>IDP Dataset:</strong> <a href="https://idp.nz/Domain-Names/Unique-registered-domains-queried/u5i8-kxxg/data">Unique registered domains queried</a></p>
<p>We see lots of domains being queried in our traffic. Not all of them are registered domains; many do not exist in our register. By filtering on the response code in the DNS response message, we can extract the registered domains that were queried. As 'the number of unique registered domains queried' depends on the register size, we normalize it by dividing by each day's register size, which can be obtained from the <a href="https://idp.nz/Domain-Names/-nz-Activity/mm2r-3dj9">.nz registration statistics</a>.</p>
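<p>The normalization step is just a per-day division. A minimal Python sketch with made-up figures (the real counts come from the IDP datasets linked above):</p>

```python
# Hypothetical daily counts of unique registered domains queried,
# and the register size on the same days (both made-up numbers).
unique_queried = {"2017-01-01": 401_210, "2017-01-02": 415_902}
register_size = {"2017-01-01": 689_000, "2017-01-02": 689_350}

# Normalised series: the fraction of the register queried on each day.
normalised = {day: unique_queried[day] / register_size[day]
              for day in unique_queried}
print(normalised)
```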
<p>Then we applied the Facebook forecasting library <a href="https://facebookincubator.github.io/prophet/">Prophet</a> to our data. Using the logistic growth trend model with a carrying capacity of 1, we obtained the following result.</p>
<p><img src="https://blog.nzrs.net.nz/content/images/2017/06/reg_dn_trend.png" alt=".nz DNS traffic: Trend and Anomalies"><br>
From the plot, we can see the activity of the .nz namespace fits an upward trend over the past two years and is predicted to keep growing in the next year.</p>
<p>Prophet is based on an additive model where non-linear trends are fit with yearly and weekly seasonality. From the components plot below, we can see the trend, weekly variation and yearly seasonality.</p>
<p><img src="https://blog.nzrs.net.nz/content/images/2017/06/componetplot.png" alt=".nz DNS traffic: Trend and Anomalies"></p>
<p>The weekly and yearly seasonality are quite interesting. As our data is in UTC time, shifted 12 hours compared to NZ time, the weekly activity actually ramps up around Thursday and then stays high until Saturday. We presume the increased activity is partly due to the business queries on Thursday/Friday, and partly due to the weekend leisure queries on Friday/Saturday.</p>
<p>In the yearly subplot, we see a peak in March, which could relate to financial planning for the year (the financial year commonly finishes in March in NZ). A decrease in the number of registered domains queried occurs across July, August and September, which could correlate with lower business activity during the winter months. Finally, the low point at Christmas time could be explained by the holiday effect.</p>
<h3 id="querytypes">Query Types</h3>
<p><strong>IDP Dataset:</strong> <a href="https://idp.nz/Domain-Names/Query-Types/sgtp-vrup/data">Query types</a></p>
<p>Query type (the type of resource being requested for a domain) indicates how a domain is used. Please refer to <a href="http://blog.nzrs.net.nz/characterization-of-popular-resolvers-from-our-point-of-view-2/">DNS in a Glimpse</a> for the definitions of query types. We explore the query volume for each major query type to see how the usage of .nz domains evolves over the years. DS and DNSKEY are the two major query types related to DNSSEC, so we show them in a separate plot as an indicator of DNSSEC deployment progress.</p>
<p>We use <a href="https://plot.ly/">Plotly</a> for interactive plotting.</p>
 <div>
    <a href="https://plot.ly/~QiaoJing/7/?share_key=uFMlM3NVULZIB3KZsNfkfK" target="_blank" title="Plot 7" style="display: block; text-align: center;"><img src="https://plot.ly/~QiaoJing/7.png?share_key=uFMlM3NVULZIB3KZsNfkfK" alt=".nz DNS traffic: Trend and Anomalies" style="max-width: 100%;width: 687px;" width="687" onerror="this.onerror=null;this.src='https://plot.ly/404.png';"></a>
    <script data-plotly="QiaoJing:7" sharekey-plotly="uFMlM3NVULZIB3KZsNfkfK" src="https://plot.ly/embed.js" async></script>
</div>
<p>We can see that A and AAAA remain the top two query types for .nz domains. Specifically,</p>
<ul>
<li><strong>Queries for A</strong> (mapping the IPv4 address for a domain) shows a steady growth with occasional spikes.</li>
<li><strong>Queries for AAAA</strong> (mapping the IPv6 address for a domain) grew strongly across 2015, dropped steeply in July 2016, and then gradually recovered. The drop in July 2016 is probably related to a fix that eliminated a large number of AAAA queries to two of the .nz nameservers, as explained in this <a href="https://nzrs.net.nz/sites/default/files/The%20hunger%20for%20AAAA.pdf">presentation</a>.</li>
<li><strong>Queries for NS</strong> (locating the name servers for a domain) had very low volumes, then jumped up in Feb 2016 and remained steady at the higher level. Extremely high volumes were seen in early 2017. These anomalies are explored later in this post.</li>
<li><strong>Queries for MX</strong> (locating the mail server for a domain) should reflect the activity of sending email to addresses within the .nz namespace, including spam. These volumes are steady, with strong seasonality at the weekly and monthly level.</li>
</ul>
<div>
    <a href="https://plot.ly/~QiaoJing/20/?share_key=uFMlM3NVULZIB3KZsNfkfK" target="_blank" title="Plot 7 copy" style="display: block; text-align: center;"><img src="https://plot.ly/~QiaoJing/20.png?share_key=uFMlM3NVULZIB3KZsNfkfK" alt=".nz DNS traffic: Trend and Anomalies" style="max-width: 100%;width: 687px;" width="687" onerror="this.onerror=null;this.src='https://plot.ly/404.png';"></a>
    <script data-plotly="QiaoJing:20" sharekey-plotly="uFMlM3NVULZIB3KZsNfkfK" src="https://plot.ly/embed.js" async></script>
</div>
<ul>
<li><strong>Queries for DS</strong> (used by validating resolvers to verify delegations) show a rising trend, which reflects the deployment progress of DNSSEC.</li>
<li><strong>Queries for DNSKEY</strong> (used to validate signed records) show a slower rising trend. These queries normally go to the nameservers of the delegated zone; as we are authoritative mainly for top and second level domains, we only see a small number of DNSKEY queries.</li>
</ul>
<h3 id="rdbit">RD bit</h3>
<p><strong>IDP Dataset:</strong> <a href="https://idp.nz/Domain-Names/RD-bit/ek7h-ijay/data">RD bit</a></p>
<p>The DNS message header contains an RD (Recursion Desired) bit. It is usually set in queries sent by end users to their resolvers. As we are authoritative for the .nz namespace, most of the queries we receive come from resolvers and should not have that bit set, which is why we don't expect to see many queries with the RD bit set, as shown in the plot below.</p>
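<p>For readers who want to check this flag themselves: the RD bit sits in the 16-bit flags field of the 12-byte DNS header (RFC 1035). A small Python sketch, assuming a raw wire-format message:</p>

```python
import struct

def rd_bit_set(message: bytes) -> bool:
    """Return True if the Recursion Desired bit is set in a raw DNS message.

    Header layout: ID (16 bits), then flags:
    QR | Opcode(4) | AA | TC | RD | RA | Z(3) | RCODE(4)  ->  RD is bit 8.
    """
    (flags,) = struct.unpack_from(">H", message, 2)
    return bool((flags >> 8) & 1)

# Minimal header: ID=0x1234, flags=0x0100 (only RD set), QDCOUNT=1, rest zero.
query_header = struct.pack(">HHHHHH", 0x1234, 0x0100, 1, 0, 0, 0)
print(rd_bit_set(query_header))  # True
```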
<div>
    <a href="https://plot.ly/~QiaoJing/19/?share_key=cY8OExR4RhYClU5UFRL3Ep" target="_blank" title="Plot 19" style="display: block; text-align: center;"><img src="https://plot.ly/~QiaoJing/19.png?share_key=cY8OExR4RhYClU5UFRL3Ep" alt=".nz DNS traffic: Trend and Anomalies" style="max-width: 100%;width: 600px;" width="600" onerror="this.onerror=null;this.src='https://plot.ly/404.png';"></a>
    <script data-plotly="QiaoJing:19" sharekey-plotly="cY8OExR4RhYClU5UFRL3Ep" src="https://plot.ly/embed.js" async></script>
</div>
<p>We can see a big jump in Feb 2016 similar to the NS queries mentioned in the previous section. We will explore this anomaly later in this blog.</p>
<h2 id="networktrend">Network trend</h2>
<p>From our traffic, we can also see the source IP addresses and the network protocols clients use to communicate with us, such as UDP or TCP, IPv4 or IPv6. This lets us explore protocol usage trends in our clients' infrastructure. In this section, we draw the comparison plots on a log scale, as the quantities compared differ greatly in magnitude.</p>
<h3 id="udpvstcp">UDP vs. TCP</h3>
<p><strong>IDP Dataset:</strong> <a href="https://idp.nz/Domain-Names/UDP-and-TCP/nqfr-qpez/data">UDP and TCP</a></p>
<p>The use of UDP and TCP in DNS is driven by message size and other factors as described in <a href="https://tools.ietf.org/html/rfc7766">RFC7766</a>:</p>
<blockquote>
<p>Most DNS [RFC1034] transactions take place over UDP [RFC768].  TCP<br>
[RFC793] is always used for full zone transfers (using AXFR) and is often used for messages whose sizes exceed the DNS protocol's<br>
original 512-byte limit.  The growing deployment of DNS Security<br>
(DNSSEC) and IPv6 has increased response sizes and therefore the use of TCP.  The need for increased TCP use has also been driven by the<br>
protection it provides against address spoofing and therefore<br>
exploitation of DNS in reflection/amplification attacks.  It is now<br>
widely used in Response Rate Limiting [RRL1] [RRL2].  Additionally,<br>
recent work on DNS privacy solutions such as [DNS-over-TLS] is<br>
another motivation to revisit DNS-over-TCP requirements.</p>
</blockquote>
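<p>The size-driven part of the quote above can be sketched in a few lines (simplified: in practice the server answers over UDP with the TC bit set and the client retries over TCP; the two-byte length prefix for DNS over TCP is specified in RFC 1035):</p>

```python
import struct

MAX_UDP_PAYLOAD = 512  # the original DNS limit, before EDNS0

def needs_tcp(response: bytes) -> bool:
    """True when a response exceeds the classic 512-byte UDP limit."""
    return len(response) > MAX_UDP_PAYLOAD

def frame_for_tcp(message: bytes) -> bytes:
    """DNS over TCP prefixes each message with a two-byte length field."""
    return struct.pack(">H", len(message)) + message

large_response = b"\x00" * 600   # e.g. a DNSSEC-signed response
print(needs_tcp(large_response))          # True
print(frame_for_tcp(large_response)[:2])  # the 16-bit length prefix
```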
<p>We compare the UDP and TCP trends in our traffic in two ways:</p>
<ul>
<li><strong>UDP vs. TCP query volume</strong></li>
<li><strong>The number of unique source addresses through UDP vs. TCP</strong></li>
</ul>
<div>
    <a href="https://plot.ly/~QiaoJing/9/?share_key=RNqbaCMNzLDT3K62Hkh4Px" target="_blank" title="Plot 9" style="display: block; text-align: center;"><img src="https://plot.ly/~QiaoJing/9.png?share_key=RNqbaCMNzLDT3K62Hkh4Px" alt=".nz DNS traffic: Trend and Anomalies" style="max-width: 100%;width: 600px;" width="600" onerror="this.onerror=null;this.src='https://plot.ly/404.png';"></a>
    <script data-plotly="QiaoJing:9" sharekey-plotly="RNqbaCMNzLDT3K62Hkh4Px" src="https://plot.ly/embed.js" async></script>
</div>
<div>
    <a href="https://plot.ly/~QiaoJing/11/?share_key=sBTK311SlyFKypPaJAmRER" target="_blank" title="Plot 11" style="display: block; text-align: center;"><img src="https://plot.ly/~QiaoJing/11.png?share_key=sBTK311SlyFKypPaJAmRER" alt=".nz DNS traffic: Trend and Anomalies" style="max-width: 100%;width: 600px;" width="600" onerror="this.onerror=null;this.src='https://plot.ly/404.png';"></a>
    <script data-plotly="QiaoJing:11" sharekey-plotly="sBTK311SlyFKypPaJAmRER" src="https://plot.ly/embed.js" async></script>
</div>
<p>From the two plots, we can see that both the query volume and the number of unique source addresses over TCP increased significantly in the first half of 2016, and then stabilized. In contrast, the query volume over UDP showed slow growth through the years, while the number of unique source addresses over UDP decreased slightly.</p>
<p>In general, we found that the total number of unique source addresses has been decreasing since late 2015. As three of our nameservers are hosted by offshore providers for which we don't capture data, we speculate that this reduction could be related to traffic moving to those offshore nameservers.</p>
<h3 id="ipv4vsipv6">IPv4 vs. IPv6</h3>
<p><strong>IDP Dataset:</strong> <a href="https://idp.nz/Domain-Names/IPv4-and-IPv6/eiys-uj9p/data">IPv4 and IPv6</a></p>
<p>We can do a similar comparison between the IPv4 and IPv6 trends, as below.</p>
<ul>
<li><strong>Query volume from IPv4 vs. IPv6 source addresses</strong></li>
<li><strong>The number of unique IPv4 vs. IPv6 source addresses</strong></li>
</ul>
<div>
    <a href="https://plot.ly/~QiaoJing/13/?share_key=bEmOkax0AhmcfB86DMs2Xr" target="_blank" title="Plot 13" style="display: block; text-align: center;"><img src="https://plot.ly/~QiaoJing/13.png?share_key=bEmOkax0AhmcfB86DMs2Xr" alt=".nz DNS traffic: Trend and Anomalies" style="max-width: 100%;width: 600px;" width="600" onerror="this.onerror=null;this.src='https://plot.ly/404.png';"></a>
    <script data-plotly="QiaoJing:13" sharekey-plotly="bEmOkax0AhmcfB86DMs2Xr" src="https://plot.ly/embed.js" async></script>
</div>
<div>
    <a href="https://plot.ly/~QiaoJing/15/?share_key=apXMCikXIfWwX96ESP6PIq" target="_blank" title="Plot 15" style="display: block; text-align: center;"><img src="https://plot.ly/~QiaoJing/15.png?share_key=apXMCikXIfWwX96ESP6PIq" alt=".nz DNS traffic: Trend and Anomalies" style="max-width: 100%;width: 600px;" width="600" onerror="this.onerror=null;this.src='https://plot.ly/404.png';"></a>
    <script data-plotly="QiaoJing:15" sharekey-plotly="apXMCikXIfWwX96ESP6PIq" src="https://plot.ly/embed.js" async></script>
</div>
<p>From 2016, both the query volume from IPv6 addresses and the number of IPv6 source addresses have grown. IPv4 query volume has grown more slowly, while the number of IPv4 source addresses has decreased since 2016. The reason for this decrease may be similar to the one suggested in the UDP/TCP analysis above.</p>
<p>We have also investigated the weekly and yearly seasonality for each metric. As there are many different patterns, here we just show two typical examples related to IPv4 queries.</p>
<ul>
<li>A weekend off-peak pattern is typical in several metrics of query volume and unique source address counts. This reduction is probably due to lower business activity during the weekend.</li>
</ul>
<p><img src="https://blog.nzrs.net.nz/content/images/2017/06/v4_week.png" alt=".nz DNS traffic: Trend and Anomalies"></p>
<ul>
<li>The annual seasonality in the query volume from IPv4 addresses shows low points during winter and the Christmas holiday.</li>
</ul>
<p><img src="https://blog.nzrs.net.nz/content/images/2017/06/v4_que_sea.png" alt=".nz DNS traffic: Trend and Anomalies"></p>
<h2 id="oneincreaseacrossmultiplemetrics">One increase across multiple metrics</h2>
<p>During the time series analysis, we found simultaneous abrupt increases across several different metrics, shown below.</p>
<div>
    <a href="https://plot.ly/~QiaoJing/17/?share_key=oC1NhtI1NoeSB7hSkmG0FQ" target="_blank" title="Plot 17" style="display: block; text-align: center;"><img src="https://plot.ly/~QiaoJing/17.png?share_key=oC1NhtI1NoeSB7hSkmG0FQ" alt=".nz DNS traffic: Trend and Anomalies" style="max-width: 100%;width: 600px;" width="600" onerror="this.onerror=null;this.src='https://plot.ly/404.png';"></a>
    <script data-plotly="QiaoJing:17" sharekey-plotly="oC1NhtI1NoeSB7hSkmG0FQ" src="https://plot.ly/embed.js" async></script>
</div>
<p>This appears very anomalous, so we analysed our raw data to find out what was happening. We located a set of source IP addresses that generated these increases. From 2016-02-04, each of these IP addresses began to send about 63k NS queries for non-existent domains every day, with the RD bit set in each query. The volume later increased to 91k and remained at that level. These queries account for a great number of unique non-existent domains being queried each and every day.</p>
<p>We have checked these IP addresses in our query log and found that none of them showed up until Feb 2016. Their sudden appearance, with such specific behaviour, has attracted our attention. We will monitor these source addresses and research further to find the cause, so we can suggest the best solution.</p>
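<p>The filtering that surfaces sources like these can be sketched in a few lines of Python; the records, field layout and threshold here are hypothetical stand-ins for our Hadoop query logs:</p>

```python
from collections import Counter

# Hypothetical log rows: (source_ip, query_type, rd_set, domain_exists).
records = [
    ("192.0.2.1", "NS", True, False),
    ("192.0.2.1", "NS", True, False),
    ("198.51.100.7", "A", False, True),
]

def suspicious_sources(rows, threshold=2):
    """Flag sources sending many RD-set NS queries for non-existent domains."""
    counts = Counter(ip for ip, qtype, rd, exists in rows
                     if qtype == "NS" and rd and not exists)
    return sorted(ip for ip, n in counts.items() if n >= threshold)

print(suspicious_sources(records))  # ['192.0.2.1']
```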
<h2 id="conclusion">Conclusion</h2>
<p>Using the daily .nz DNS statistics, we undertook some time series analysis to reveal trends and anomalies in our DNS traffic. Interesting patterns emerged: some are quite easy to explain, while others require further research. By sharing the daily DNS statistics as open data, we hope anyone who's interested can make use of it to better understand the Internet in New Zealand.</p>
</div>]]></content:encoded></item><item><title><![CDATA[Register Size Prediction]]></title><description><![CDATA[<div class="kg-card-markdown"><p>In my previous two posts, we've seen how to model <a href="http://blog.nzrs.net.nz/domain-retention-prediction/">domain retention prediction</a> and <a href="http://blog.nzrs.net.nz/time-series-analysis-of-nz-activity-data/">new creates forecasting</a>. Those are essential model components required for a register size prediction model. In this post, I'll illustrate how the prediction procedure is constructed and some key results.</p>
<h3 id="predictionprocedure">Prediction Procedure</h3>
<p>Like any population size</p></div>]]></description><link>https://blog.nzrs.net.nz/register-size-prediction/</link><guid isPermaLink="false">5b6b11ef65ff1e00182f60f8</guid><dc:creator><![CDATA[Huayi Jing]]></dc:creator><pubDate>Wed, 24 May 2017 23:27:22 GMT</pubDate><media:content url="https://blog.nzrs.net.nz/content/images/2019/06/tom-parkes-Ns-BIiW_cNU-unsplash.jpg" medium="image"/><content:encoded><![CDATA[<div class="kg-card-markdown"><img src="https://blog.nzrs.net.nz/content/images/2019/06/tom-parkes-Ns-BIiW_cNU-unsplash.jpg" alt="Register Size Prediction"><p>In my previous two posts, we've seen how to model <a href="http://blog.nzrs.net.nz/domain-retention-prediction/">domain retention prediction</a> and <a href="http://blog.nzrs.net.nz/time-series-analysis-of-nz-activity-data/">new creates forecasting</a>. Those are essential model components required for a register size prediction model. In this post, I'll illustrate how the prediction procedure is constructed and some key results.</p>
<h3 id="predictionprocedure">Prediction Procedure</h3>
<p>Like any population size prediction problem, the key to register size prediction is understanding what the &quot;births&quot; (flows in) and &quot;deaths&quot; (flows out) are. The figure below conceptualizes the register size change for each month.</p>
<img src="https://blog.nzrs.net.nz/content/images/2017/05/prediction_procedure-1.png" alt="Register Size Prediction" style="width: 750px;">
<p>Each month some existing domains stop being active and may then leave the register, while some new domains are created on the register. Hence, the calculation of the register size for month t+1 can be summarized as:</p>
<p><em>Register size (t+1)   = Register size (t) + New creates(t+1) - Dropping out (t+1)</em></p>
<p>where the drop-outs are modelled by the <a href="http://blog.nzrs.net.nz/domain-retention-prediction/">domain retention prediction</a>, and the number of new creates by the <a href="http://blog.nzrs.net.nz/time-series-analysis-of-nz-activity-data/">new creates forecasting</a>. The shifted Beta Geometric model (as used in retention modelling) gives us year-to-year retention rates; the retention rates for the remaining 11 months of a year are assumed to be 100%. This assumption is based on our observation that most domains are registered or renewed for one year. Although it means our prediction procedure overestimates the register size, we will see in the results that the errors are reasonably small.</p>
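<p>Both components can be expressed compactly. The Python sketch below uses made-up inputs (the real procedure runs per-SLD cohort with fitted parameters): the sBG year-to-year retention rate, r(t) = (beta + t - 1) / (alpha + beta + t - 1) in Fader and Hardie's formulation, and the month-to-month recurrence above:</p>

```python
def sbg_retention(alpha, beta, years):
    """Year-to-year retention rates under the shifted Beta Geometric model:
    r(t) = (beta + t - 1) / (alpha + beta + t - 1)."""
    return [(beta + t - 1) / (alpha + beta + t - 1)
            for t in range(1, years + 1)]

def project_register(size, new_creates, drop_rates):
    """Register size (t+1) = Register size (t) + New creates (t+1) - Dropping out (t+1)."""
    sizes = [size]
    for creates, rate in zip(new_creates, drop_rates):
        size = size + creates - round(size * rate)
        sizes.append(size)
    return sizes

# Made-up inputs: a 600k register, monthly new creates and drop-out rates.
print(project_register(600_000, [9_000, 9_500, 8_800], [0.010, 0.010, 0.012]))
```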
<h3 id="inputsresults">Inputs &amp; Results</h3>
<h4 id="retentionprediction">Retention Prediction</h4>
<p>The input of this step is simple. For instance, to make predictions for Jan 2017 onwards, all we need is the domains that are active in Dec 2016 and their age (i.e., how long they have been active). A sample of the input is shown below:</p>
<img src="https://blog.nzrs.net.nz/content/images/2017/05/input-sample-3.png" alt="Register Size Prediction" style="width: 450px;">
<p>A finding from the results in the <a href="http://blog.nzrs.net.nz/domain-retention-prediction/">domain retention prediction</a> is that different SLDs have different retention behaviour, so retention prediction is done separately for each group: co.nz, org.nz, net.nz, and other SLDs. We fit the model with 12 years' worth of historical data points for 12 cohorts (representing the domains created in each month of 2004), from which we calculate the 95% confidence intervals for the retention rates shown below. As different SLDs are combined into one group, the variation in their retention behaviour is bigger, which can be seen from the larger spread of the 95% confidence interval.</p>
<img src="https://blog.nzrs.net.nz/content/images/2017/05/retention-rate-2.png" alt="Register Size Prediction" style="width: 750px;">
<p>To test the accuracy of the retention prediction, the predicted retention rates were applied to domains that were active on 1st Dec 2016, predicting how many domains stay active in the register from Jan to Apr 2017. The 95% confidence interval, compared with the actual values, is shown in the following figures. Although the predictions are slightly overestimated, the errors average around 1%, showing the predictions are reasonably accurate.</p>
<img src="https://blog.nzrs.net.nz/content/images/2017/05/co.nz.no.png" alt="Register Size Prediction" style="width: 750px;">
<img src="https://blog.nzrs.net.nz/content/images/2017/05/org.no.png" alt="Register Size Prediction" style="width: 750px;">
<img src="https://blog.nzrs.net.nz/content/images/2017/05/nz.no.png" alt="Register Size Prediction" style="width: 750px;">
<img src="https://blog.nzrs.net.nz/content/images/2017/05/net.no.png" alt="Register Size Prediction" style="width: 750px;">
<img src="https://blog.nzrs.net.nz/content/images/2017/05/other.no.png" alt="Register Size Prediction" style="width: 750px;">
<h4 id="newcreates">New Creates</h4>
<p>In <a href="http://blog.nzrs.net.nz/time-series-analysis-of-nz-activity-data/">new creates forecasting</a>, I introduced how to do this with a SARIMA model, which requires parameter tuning. In Feb 2017, Facebook open sourced a Python/R package called <a href="https://research.fb.com/prophet-forecasting-at-scale/">Prophet</a> to automate the time series forecasting process. It uses an additive model, which makes it computationally efficient. For those interested in finding out more about Prophet, I recommend reading Facebook’s <a href="https://facebookincubator.github.io/prophet/static/prophet_paper_20170113.pdf">white paper</a>. Prophet is used here to forecast new creates for the different groups.</p>
<p>The following figure shows the new creates prediction for co.nz. The prediction captures the trend nicely. Looking at the test period from Jan 2016 to Apr 2017, most of the actual values (represented by red dots) fall within the 95% confidence interval. A spike occurred in Apr 2017, caused by the end of the reservation period for registration at the second level; it is understandable that the prediction didn’t capture that special event.</p>
<img src="https://blog.nzrs.net.nz/content/images/2017/05/co-forecast.png" alt="Register Size Prediction" style="width: 750px;">
<h4 id="registersize">Register size</h4>
<p>Now we are finally ready to predict the total number of active domains in the register! This is done by combining the number of domains that stay from the previous month with the number of new creates in between. The figure below shows the predicted register size from Jan to Apr 2017 (at the beginning of each month) compared with the actual values. The results are fairly satisfying: the 95% confidence interval successfully covers the actual register size for Feb and Apr, and the absolute errors are all below 1%. The underestimation in Apr was caused by the end of the reservation period for registration at the second level.</p>
<img src="https://blog.nzrs.net.nz/content/images/2017/05/total.no-1.png" alt="Register Size Prediction" style="width: 750px;">
<p>Knowing that the procedure is working, let’s check out the predictions from May 2017 up to the end of this financial year (the figure below shows the register size prediction at the beginning of each month):</p>
<img src="https://blog.nzrs.net.nz/content/images/2017/05/register.no.png" alt="Register Size Prediction" style="width: 850px;">
<h4 id="finalthoughts">Final thoughts</h4>
<p>Register size prediction is my first project since joining NZRS and I've learned a lot from it, e.g. working with Python and understanding register behaviour. The prediction procedure can be further improved by investigating the minority of domains that are registered/renewed on a monthly basis, and/or revisiting the prediction when special events occur in the future. For practitioners who intend to implement this procedure, it is important to check the availability of input data, since the models require specific details about the domains in the register.</p>
<p>Finally, I'd like to quote Niels Bohr, who said &quot;Prediction is very difficult, especially if it's about the future&quot;. Although I've been working to make the prediction accurate, I will be pleased to see faster growth in the register size. So, let's work hard to prove me wrong!</p>
</div>]]></content:encoded></item></channel></rss>