Customer-base analysis has become increasingly important. A critical part of the analysis is the prediction of retention rate, which is defined as the proportion of customers active at the end of period $t-1$ who are still active at the end of period $t$. Retention prediction is commonly used for customer lifetime value calculation. For the .nz registry, the retention rate of domains is a crucial metric used for financial planning.

Several probabilistic models have been developed for retention prediction. In a survey paper, Peter Fader and Bruce Hardie classify a customer's relationship with a company into four types along two dimensions: contractual versus non-contractual, and continuous versus discrete (see Figure 1). The domain name industry has two relevant characteristics:

  • Whether a domain name is renewed or not can be observed from the registry's database;
  • A domain name is usually renewed for a fixed term that can range from 1 month to 120 months, with 1 year being the most common.

Hence, the domain business belongs to the contractual and discrete type according to this two-dimensional classification. One commonly used retention prediction model for this type is the shifted-Beta-Geometric (sBG) model developed by Peter Fader and Bruce Hardie in 2007. The objective of this post is to predict the retention probability of domains in the .nz registry using the sBG model.

The sBG Model

The sBG model rests on two assumptions, described here in the domain name setting:

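  • A domain name remains active with constant retention probability $1-\theta$, so the period in which it churns follows a (shifted) geometric distribution. From the definition of the survivor function, we have:

    a.  The probability of churn (the domain is not renewed at $t$):
        $$P(T=t|\theta)=\theta(1-\theta)^{t-1},\quad t=1,2,3,\ldots$$

    b.  The survivor function (the probability that the domain is still active at $t$):
        $$S(t|\theta)=(1-\theta)^{t},\quad t=1,2,3,\ldots$$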

  • Heterogeneity in $\theta$ is modeled by a Beta distribution with the pdf:
    $$f(\theta|\alpha, \beta)=\frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha,\beta)}$$
    where $B(\alpha,\beta)$ is the Beta function.

An individual domain's value of $\theta$ is unobserved (it cannot be measured from the dataset). The probability of churn and the survivor function for a randomly chosen domain are therefore obtained by taking the expectation over the Beta distribution, for example $E[P(T=t|\Theta=\theta)]$.
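To make the expectation step concrete, here is a sketch of the integration over the Beta distribution. It assumes only the geometric form implied by the constant-retention assumption and the Beta pdf given above; the notation $P(T=t|\alpha,\beta)$ and $S(t|\alpha,\beta)$ is introduced here for illustration.

$$P(T=t|\alpha,\beta)=\int_0^1 \theta(1-\theta)^{t-1}\,\frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha,\beta)}\,d\theta=\frac{B(\alpha+1,\beta+t-1)}{B(\alpha,\beta)}$$

$$S(t|\alpha,\beta)=\int_0^1 (1-\theta)^{t}\,\frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha,\beta)}\,d\theta=\frac{B(\alpha,\beta+t)}{B(\alpha,\beta)}$$

Both results follow because each integrand is an unnormalized Beta density, so the integral is itself a Beta function.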

Compared with models that assume a constant churn rate for all customers, the advantage of the sBG model is that it takes customer heterogeneity into account. Writing the retention rate as the ratio of consecutive survivor functions, $r_t = S(t)/S(t-1)$, and simplifying gives the following concise form:

$$r_t=\frac{\beta+t-1}{\alpha+\beta+t-1}$$
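As a quick numerical check of how the sBG retention rate behaves, the sketch below evaluates $r_t = (\beta+t-1)/(\alpha+\beta+t-1)$ and rebuilds the survivor function as the running product of the retention rates. The function names and the parameter values $\alpha=1$, $\beta=3$ are hypothetical, chosen only for illustration; they are not estimates from the .nz data.

```python
def sbg_retention_rate(alpha, beta, t):
    """Retention rate r_t of the sBG model for period t = 1, 2, ..."""
    return (beta + t - 1) / (alpha + beta + t - 1)


def sbg_survivor(alpha, beta, t):
    """Survivor function S(t), built as the product r_1 * r_2 * ... * r_t."""
    survival = 1.0
    for period in range(1, t + 1):
        survival *= sbg_retention_rate(alpha, beta, period)
    return survival


# Hypothetical parameter values, for illustration only (not fitted to .nz data)
alpha, beta = 1.0, 3.0
for t in range(1, 13):
    r_t = sbg_retention_rate(alpha, beta, t)
    s_t = sbg_survivor(alpha, beta, t)
    print(f"t={t:2d}  retention={r_t:.3f}  survival={s_t:.3f}")
```

With these values the retention rate starts at 0.75 and rises toward 1 while the survivor function keeps falling. This is the pattern heterogeneity produces: domains with a high $\theta$ tend to drop out early, leaving a base that is more likely to renew.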

Domain Retention Prediction

Implementing the sBG model comes down to estimating its two parameters, $\alpha$ and $\beta$. In their paper (Appendix B), Fader and Hardie show how to implement the model and compute the maximum likelihood estimates in Excel. We follow the same procedure in Python (code can be found here) to find $\hat{\alpha}$ and $\hat{\beta}$. The survival data shown in Table 1 are for the domains registered in April 2004 at three parent levels.
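The estimation step can be sketched in plain Python as follows. This is not the code linked above: it is a minimal illustration that builds the sBG log-likelihood from cohort survivor counts and maximizes it with a simple derivative-free compass search, used here in place of Excel's Solver or a library optimizer. The function names are hypothetical, and the survivor counts are synthetic (generated from $\alpha=1$, $\beta=3$), not .nz data.

```python
import math


def sbg_churn_probs(alpha, beta, horizon):
    """Return [P(T=1), ..., P(T=horizon)] using the sBG forward recursion."""
    probs = [alpha / (alpha + beta)]
    for t in range(2, horizon + 1):
        probs.append(probs[-1] * (beta + t - 2) / (alpha + beta + t - 1))
    return probs


def sbg_log_likelihood(alpha, beta, survivors):
    """Log-likelihood of one cohort; survivors[0] is the initial cohort size."""
    horizon = len(survivors) - 1
    probs = sbg_churn_probs(alpha, beta, horizon)
    ll = 0.0
    for t in range(1, horizon + 1):
        lost = survivors[t - 1] - survivors[t]  # domains not renewed at t
        ll += lost * math.log(probs[t - 1])
    # Domains still active at the end contribute S(horizon) = 1 - sum of P(T=t)
    ll += survivors[-1] * math.log(1.0 - sum(probs))
    return ll


def fit_sbg(survivors, tol=1e-6, max_iter=50_000):
    """Maximize the log-likelihood with a compass search on (log alpha, log beta)."""
    def objective(point):
        return sbg_log_likelihood(math.exp(point[0]), math.exp(point[1]), survivors)

    x = [0.0, 0.0]  # start at alpha = beta = 1; the log scale keeps both positive
    best = objective(x)
    step = 1.0
    for _ in range(max_iter):
        if step < tol:
            break
        improved = False
        for i in (0, 1):
            for delta in (step, -step):
                candidate = list(x)
                candidate[i] += delta
                value = objective(candidate)
                if value > best:
                    x, best, improved = candidate, value, True
        if not improved:
            step /= 2  # no neighbour is better: refine the search
    return math.exp(x[0]), math.exp(x[1])


# Synthetic survivor counts for years 0..7, generated from alpha=1, beta=3
synthetic_survivors = [840, 630, 504, 420, 360, 315, 280, 252]
alpha_hat, beta_hat = fit_sbg(synthetic_survivors)
print(alpha_hat, beta_hat)
```

Because the synthetic counts follow the model exactly, the estimates should land close to the generating values. With real cohort data the same functions apply, although a library optimizer would usually be a more robust choice than this hand-rolled search.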


We fit the sBG model to the first 6, 7, and 8 years of data for each parent level to compare the accuracy of the estimates. The parameter estimates are summarized in Table 2. Using these estimates, the survivor function for each parent level is extrapolated out to year 12. The model-based results are plotted alongside the actual numbers in Figure 2. The predicted survival probabilities are quite accurate, especially when more data points are used for estimation. The survival probability decreases over time, but at a diminishing rate, meaning that fewer and fewer domains fail to renew in each successive year. Another observation is that the survival probability of one parent level is higher than that of the other two. Hence, this model can help us identify groups of domains with different retention behaviour so that we can analyse the characteristics shared within each group.


Another interesting plot is the retention rate. The model-based retention rates and the actual numbers are plotted in Figure 3. Although the model does not track the actual data perfectly, it fits the data on average and captures the trend. There are two reasons for the imperfect fit. First, our data points fluctuate considerably, possibly because of special events, which makes estimation harder. Second, although the survival probability and the retention rate are closely related, the survival probability is easier to predict because its cumulative form makes it less sensitive to period-to-period variations. The retention rates increase over time, and one parent level has a higher retention rate. This agrees with what we observe directly in the data. Figure 4 shows the retention rates (the proportion of domains still active in the next year) for domains created in different periods. The mean retention rates increase with domain age, and domains older than 6 years have average retention rates above 0.90. Retention rates of newly created domains, on the other hand, have a much larger spread than those of older domains, indicating that domains are less stable early in their life.
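The difference in sensitivity between the two quantities can be illustrated with a small sketch. The function names and the survivor counts below are hypothetical; the second series simply adds a one-off bump to a single year to mimic a special event.

```python
def survival_probabilities(survivors):
    """Share of the original cohort still active at the end of each period."""
    return [count / survivors[0] for count in survivors[1:]]


def empirical_retention_rates(survivors):
    """Share of domains active at the end of t-1 that are still active at t."""
    return [survivors[t] / survivors[t - 1] for t in range(1, len(survivors))]


# Illustrative counts only (not .nz data); survivors[0] is the cohort size
baseline = [840, 630, 504, 420]
bumped = [840, 630, 520, 420]  # same cohort with a one-off bump in year 2

for label, counts in (("baseline", baseline), ("bumped", bumped)):
    print(label)
    print("  survival :", [round(s, 3) for s in survival_probabilities(counts)])
    print("  retention:", [round(r, 3) for r in empirical_retention_rates(counts)])
```

A bump in one year's count moves a single point on the survival curve, but it shifts two consecutive retention rates in opposite directions, which is one way to see why the retention plot looks noisier than the survival plot.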



In this post, the sBG model is fitted to data for domains at different parent levels. The sBG model is simple and powerful. It is interesting to see different retention behaviour among the parent levels. The model will be very helpful in identifying domains with different behaviours so that we can investigate the factors and drivers behind them.