In my previous two posts, we've seen how to model domain retention prediction and new creates forecasting. Both are essential components of a register size prediction model. In this post, I'll illustrate how the prediction procedure is constructed and present some key results.
Like any population size prediction problem, the key in register size prediction is to understand what the "births" (flows in) and "deaths" (flows out) are. The figure below conceptualizes register size changes for each month.
Each month some of the existing domains stop being active, which may then leave the register. Meanwhile, some new domains are created on the register. Hence, the calculation of register size for month t+1 can be summarized as:
Register size (t+1) = Register size (t) + New creates(t+1) - Dropping out (t+1)
Here, the number of domains dropping out is modelled by the domain retention prediction, and the number of new creates is modelled by the new creates forecasting. The shifted Beta Geometric model (as used in retention modelling) gives us year-to-year retention rates. The retention rates for the remaining 11 months of a year are assumed to be 100%. This assumption is based on our observation that most domains are registered or renewed for one year. Although this means our prediction procedure overestimates the register size, we will see in the results that the errors are reasonably small.
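The month-by-month calculation above can be sketched in a few lines of Python. This is only an illustration of the recurrence, not the code used for the project: the function name and all the numbers are made up.

```python
def project_register_size(start_size, new_creates, drop_outs):
    """Apply size(t+1) = size(t) + new_creates(t+1) - drop_outs(t+1) month by month.

    new_creates and drop_outs are equal-length sequences, one entry per
    future month. Returns the projected size for each of those months.
    """
    if len(new_creates) != len(drop_outs):
        raise ValueError("new_creates and drop_outs must cover the same months")
    sizes = []
    size = start_size
    for created, dropped in zip(new_creates, drop_outs):
        size = size + created - dropped
        sizes.append(size)
    return sizes


# Made-up numbers, purely for illustration.
projected = project_register_size(1000, [30, 25, 40], [20, 35, 10])
print(projected)
```

In the actual procedure, the two input series come from the new creates forecast and the retention model respectively.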
Inputs & Results
The input of this step is simple. For instance, to make predictions for Jan 2017 onwards, all we need is the domains that are active in Dec 2016 and their age (i.e., how long they have been active). A sample of the input is shown below:
One finding from the domain retention prediction is that different SLDs have different retention behaviour. Hence retention prediction is done separately for each group: co.nz, org.nz, net.nz, and other SLDs. We fit the model to 12 years of historical data for 12 cohorts (representing the domains created in each month of 2004), from which we calculate the 95% confidence intervals for the retention rates shown below. Because the "other SLDs" group combines several different SLDs, the variation in its retention behaviour is bigger, which can be seen from the wider 95% confidence interval.
To test the accuracy of the retention prediction, the predicted retention rates are applied to domains that were active on 1st Dec 2016, predicting how many of them stay active in the register from Jan to Apr 2017. The 95% confidence interval in comparison with the actual values is shown in the following figures. Although the predictions are slightly overestimated, the errors are around 1% on average, which suggests that the predictions are reasonably accurate.
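For readers who haven't seen the retention post, here is a minimal sketch of how yearly retention rates follow from a fitted shifted Beta Geometric model. It assumes the standard parameterisation, in which the retention rate at the t-th renewal is (beta + t - 1) / (alpha + beta + t - 1). The alpha and beta values below are hypothetical, not the fitted values for any SLD group.

```python
def sbg_retention_rate(alpha, beta, year):
    """Retention rate at the `year`-th renewal (year = 1, 2, ...) under sBG."""
    if year < 1:
        raise ValueError("year must be 1 or greater")
    return (beta + year - 1) / (alpha + beta + year - 1)


def sbg_survival(alpha, beta, years):
    """Fraction of a cohort still active after each of the first `years` renewals."""
    survival = []
    alive = 1.0
    for year in range(1, years + 1):
        alive *= sbg_retention_rate(alpha, beta, year)
        survival.append(alive)
    return survival


# Hypothetical parameters, chosen only to keep the arithmetic easy to follow.
rates = [sbg_retention_rate(1.0, 3.0, year) for year in (1, 2, 3)]
survival = sbg_survival(1.0, 3.0, 3)
print(rates)
print(survival)
```

Note that the retention rate increases with age: domains that have already been renewed several times are more likely to be renewed again.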
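Applying the retention rates to the active domains can be sketched as follows. This is a simplified illustration under two assumptions of mine: age is counted in completed months, and a domain faces its renewal decision in the month its age reaches a multiple of 12. In every other month it is assumed to stay, as described earlier. The function name and the rates are hypothetical.

```python
def expected_stayers(ages_in_months, retention_by_year):
    """Expected number of domains still active one month later.

    ages_in_months: age of each active domain, in completed months.
    retention_by_year: dict mapping renewal number (1, 2, ...) to the
        yearly retention rate for that renewal.
    """
    expected = 0.0
    for age in ages_in_months:
        age_next_month = age + 1
        if age_next_month % 12 == 0:
            # The domain reaches an anniversary: apply the yearly rate.
            renewal_number = age_next_month // 12
            expected += retention_by_year[renewal_number]
        else:
            # Mid-year months are assumed to have 100% retention.
            expected += 1.0
    return expected


# Four made-up domains: one mid-year, two at their first renewal,
# one at its second renewal.
stayers = expected_stayers([3, 11, 11, 23], {1: 0.5, 2: 0.8})
print(stayers)
```

Repeating this step month after month, and ageing the domains by one month each time, gives the number of existing domains expected to stay for each future month.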
In new creates forecasting, I introduced how to do it using a SARIMA model, which needs parameter tuning. In Feb 2017, Facebook open sourced a Python/R package called Prophet to automate the time series forecasting process. It uses an additive model, which makes it computationally efficient. For those interested in finding out more about Prophet, I recommend reading Facebook’s white paper. It is used here to do new creates forecasting for the different groups.
The following figure shows the new creates prediction for co.nz. The prediction captures the trend nicely. Looking at the test period from Jan 2016 to Apr 2017, we see that most of the actual values (represented by red dots) are captured by the 95% confidence interval. A spike occurred in Apr 2017, which was caused by the end of the reservation period for registration at the second level. Hence, it is unsurprising that the prediction didn’t capture that special event.
Now we are finally ready to predict the total number of active domains in the register! This is done by adding the number of domains that stay from the previous month to the number of new creates in between. The figure below shows the predicted register size from Jan to Apr 2017 (at the beginning of each month) compared with the actual value. The results are fairly satisfactory. The 95% confidence interval successfully covers the actual register size for Feb and Apr. The absolute errors are all less than 1%. The underestimation in Apr was caused by the end of the reservation period for registration at the second level.
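A rough sketch of a monthly forecast with Prophet is shown below. It is not the exact code used here: the helper names and the sample counts are made up, and no tuning is shown. Prophet expects a data frame with a "ds" column (dates) and a "y" column (values), and its interval_width argument sets the width of the uncertainty interval it reports.

```python
import pandas as pd


def to_prophet_frame(months, counts):
    """Build the two-column frame Prophet expects: 'ds' (dates) and 'y' (values)."""
    return pd.DataFrame({"ds": pd.to_datetime(months), "y": counts})


def forecast_new_creates(history, periods, interval_width=0.95):
    """Fit Prophet on monthly history and forecast `periods` months ahead."""
    # Imported here so the helper above works without Prophet installed.
    # The package was published as 'fbprophet' before version 1.0.
    from prophet import Prophet

    model = Prophet(interval_width=interval_width)
    model.fit(history)
    # freq="MS" asks for one row per month, dated at the start of the month.
    future = model.make_future_dataframe(periods=periods, freq="MS")
    forecast = model.predict(future)
    return forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail(periods)


# Made-up monthly counts, only to show the expected input shape.
history = to_prophet_frame(
    ["2015-01-01", "2015-02-01", "2015-03-01"], [100, 120, 110]
)
print(history)
```

With several years of real monthly counts in history, forecast_new_creates(history, periods=16) would return the point forecast (yhat) and the interval bounds for the next 16 months.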
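The two checks used above, the absolute percentage error and whether the actual value falls inside the interval, are simple to compute. The sketch below uses made-up numbers, not the real register sizes.

```python
def absolute_percentage_error(predicted, actual):
    """Absolute error of a prediction, as a percentage of the actual value."""
    return abs(predicted - actual) / actual * 100.0


def within_interval(actual, lower, upper):
    """True if the actual value lies inside the predicted interval."""
    return lower <= actual <= upper


# (predicted, actual, lower bound, upper bound); illustrative values only.
checks = {
    "month 1": (1010, 1000, 1005, 1020),
    "month 2": (1015, 1012, 1008, 1025),
}
for month, (predicted, actual, lower, upper) in checks.items():
    error = absolute_percentage_error(predicted, actual)
    covered = within_interval(actual, lower, upper)
    print(month, round(error, 2), covered)
```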
Knowing that the procedure is working, let’s check out the predictions from May 2017 up to the end of this financial year (the figure below shows the register size prediction at the beginning of each month):
Register size prediction is my first project after joining NZRS and I've learned a lot from it, e.g. working with Python and understanding the register behaviour. The prediction procedure can be further improved by investigating the minority of domains that are registered or renewed on a monthly basis, and/or revisiting the prediction if special events occur in the future. For practitioners who intend to implement this procedure, it is important to check the availability of input data, since the models require specific details about the domains in the register.
Finally, I'd like to quote Niels Bohr, who said "Prediction is very difficult, especially if it's about the future". Although I've been working to make the prediction accurate, I will be pleased to see faster growth in the register size. So, let's work hard to prove me wrong!