The number of new domain creations is an important indicator of the registry activity. Using an appropriate method to predict the number of creations in the future helps us to better understand registry behaviour and plan wisely. One of the most popular forecasting methods is Time Series Modelling, which is used to find meaningful statistics and useful characteristics from time-based (years, months, days) data.

The .nz Activity data set on Internet Data Portal keeps daily aggregated registration statistics for .nz domains. This article presents a time series analysis of .nz Activity data since 2012 using SARIMAX model in state space method of Python. While the .nz Activity data contains data of different kinds of measures, this article focuses on the monthly count of newly created domains. A few transformations are made to the data so that it can be read as a time series object. The Python code can be found here.

## Visualize the data

The first step is to visualize the data to have a general understanding. It is important to discover the overall trend and/or seasonal trend so that we know which type of model to use. Note that we use data from January 2012 to December 2015 to build the model so that we can evaluate the performance of out-of-sample forecasting using data of 2016.

As Figure 1 indicates, there are some spikes in our data point. Particularly, the creations in September / October 2014 and March 2015 are higher than other months. Taking a further look at the data, it shows that: the jump in September 2012 was due to the launch of kiwi.nz; the spikes in September/October 2014 were due to registrations being allowed directly at the second level and at the end of March 2015 it was the end of the preferential registration and reservation period for second level .nz domains. Since these are all special events, they should be treated as outliers. Our observations are also supported by the boxplot of the data, see Figure 2.

The five outliers here are caused by special events and using a technique for dealing with outliers found here, we replace these outliers with the mean, which is 11396. The data after this treatment is shown in Figure 3.

As we visualize the number of creations per month, we can see that there is an upward trend and there is a seasonality to it. Figure 4 shows plots generated from the seasonal decompose function which deconstructs a time series into several components. The upward trend starts from March 2014 and the seasonality is every 12 months.

## Make it stationary

The next step is to make the time series stationary, i.e., the mean and variance of our data points should be constant over time. Why is stationarity important? Most statistical models (e.g. linear regression model) are based on the assumption that the underlying data can be stationarized, so that the statistics (e.g. mean, variance, covariance) from the past can be used to predict the future. Otherwise, considering a data series with an upward trend as an example, its mean in the future will always be underestimated if the current mean is used as a statistical descriptor.

From the decomposition results, we already know there are a trend and seasonality. Another way to test it is to use rolling statistics and *Dickey-Fuller test*. It can be seen from Figure 5 since the trend is not very strong, the rolling mean and standard deviation does not vary too much over time. The p-value is $2.99e^{-3}$ which is not high.

Now let’s see if any improvement in the stationarity can be achieved. Discussions of several data transformation methods can be found here .

One commonly used method is differencing. The *first difference* of a time series is a series of changes from a period to the next. Hence, differencing can help us to eliminate the impact of the trend. By changing the number of periods, we can get the *seasonal difference* as well. Sometimes, both of the two transformations are used and we get *the first difference of seasonal difference*. For example, the first difference of seasonal difference in March 2014 is the equal to the March-February difference in 2014 minus March-February difference in 2013, given that the length of a season is 12 months. We take the first difference and the first difference of seasonal difference of the time series and compare their results.

It can be seen from Figure 6 that, there are still some obvious signs of seasonality (see data points around November 2012 and November 2014 as an example). The plot for the first difference of the seasonal difference is more stationary. Supporting evidence can be found by looking at the smaller p-value which is $1.48e^{-9}$. We now have a rough idea about how to set the parameters for the model. Next, we will plot some charts to further assist us to find optimal parameters.

## The Seasonal ARIMA model

The Autoregressive Integrated Moving Averages (ARIMA) models are the most general class of models for forecasting a time series that can be stationarized. More information about ARIMA is available here. The forecasts are based on three parameters:

- $p$: the number of autoregressive (AR) terms, i.e., the number of lagged dependent variables.
- $d$: the number of non-seasonal differences needed to make the time series stationary.
- $q$: the number of lagged forecast errors for the moving average (MA) term.

A Seasonal ARIMA model is denoted by $(p,d,q)\times(P,D,Q)_s$, where $s$ is the number of periods per season and $(P,D,Q)$ are seasonal terms. As discussed above, we know $d=1$ since the impact of the upward trend is reduced after taking the first difference. Also, $D=1$ since the time series becomes further stationarized after taking the first difference of seasonal difference. *Autocorrelation* and *partial autocorrelation* graphs can help us with other parameters. Figure 8 shows the two plots for stationarized data.

Now we can identify our seasonal ARIMA model by observing the behaviour of the two plots, according to the rules for identifying ARIMA models . In the PACF, there’s a negative spike at lag 12 which may suggest an AR term for the seasonal part of the model. In the non-seasonal legs of PACF, there is a spike at lag 13 which suggest an AR term for the non-seasonal part of the model. Different combinations of these parameters are tried out to fit the model and the one with the lowest AIC value is selected, which is a (1,1,0)*(2,1,0,12) with AIC = 593.

## Predictions and Forecasting

With the model built, we can make predictions and forecasting based on our data. Data observed from January - August 2016 is used to test the accuracy of our model. The *predict* command can generate the *one-step-ahead* prediction and *dynamic* prediction. One-step-ahead prediction always uses observed values to predict the next in-sample value, whereas Dynamic prediction is equal to one-step-ahead prediction only up to a certain data point and then the previous predicted value will be used to make predictions.

Here the dynamic prediction is performed starting in the October 2015. The mean value of predictions and their 95% confidence interval are shown in Figure 9. It can be seen that the one-step-ahead is slightly better. However, since it uses actual observed value for each subsequent prediction, the one-step-ahead prediction can only be used for in-sample testing. When we want to do out-of-sample forecasting, the dynamic prediction must be used since the actual value is not available. As shown in Figure 9, our model can predict pretty well. The prediction error is shown in Figure 10. The Mean Squared Error of the one-step-ahead and Dynamic prediction is 58112 and 143150 respectively.

Finally, we can make the out-of-sample forecast for September 2016 - July 2017, which is shown in Figure 11. It can be seen that the forecasted value captures both the seasonal and upward trends.

## Conclusion

Through this blog we have explored the ARIMA model and the statsmodel package in Python by creating a time series forecast for .nz activity data. The principle advantage of ARIMA is the comprehensiveness of the family of models. Features behind the data can be nicely captured by identifying the model appropriately. This procedure can be easily applied to other time series data, e.g., a different ccTLD data, with similar data available.