# Domain retention prediction using DTMC

Being able to predict domain name retention / renewal is an important tool for registries to better understand the behaviour of domain names in the register. The prediction model can also be used to forecast the future value of a domain name and the income from domain renewal. In this post, we will see how the behaviour of a domain name can be modelled using a Markov Chain model.

## Memoryless Beauty: Markov Chain

Markov process, named for Andrei Markov, is a stochastic process that has *Markov property* which states that the future state of the process is independent of the past, given the present. It has been regarded as a simple and classic way to analyse complicated stochastic processes because of its 'memoryless' property.

To better understand how it works, let us first look at a very interesting example which can be naturally modelled by a *Discrete Time Markov Chain* (DTMC).

### Gambler ruin problem

Let's consider a gambler who has \$100. He bets \$1 each round of the game, and wins with probability of 0.5. He will stop playing when he either gets broke or wins \$1000. You may ask: what's the probability that he goes broke? or, in the long run, how many games will he play before he stops? Such questions can be answered by modelling it with DTMC.

The amount of money he has at the end of each game is called the *state* of the process. Here we have 1001 possible states that forms the *state space*. The transition from a state to another can be represented by a *States transition diagram* shown below. The nodes represents the states and the number over the arrows stands for the *transition probability*.

It can be seen that the states 1,2,...,999 can *communicate* with each other since one state is *reachable* from another. On the other hand, state 0 (broke) and state 1000 are absorbing since there's no escape from those states. They are also called the *closed class*. A markov chain is said to be *irreducible* if all the states communicate with each other. For the Gambler Ruin problem, it is not irreducible since we have two closed classes.

We can also list the transition probabilities in a *transition probability matrix*. Assuming we have a state space with states 1,2,...,*r*, then the transition matrix denoted by $P$ is determined by the following.

$$
P=

\begin{bmatrix}
P_{11} & P_{12} & \cdots & P_{1r} \\
P_{21} & P_{22} & \cdots &P_{2r} \\
\vdots & \vdots & \ddots & \vdots \\
P_{r1} & P_{r2} & \cdots & P_{rr} \\
\end{bmatrix}
$$

Given that we are in state $i$, the next step must be in one of the possible states. Hence, each row in the transition matrix must sum to one. That is:

$$\sum_{k=1}^{r} P_{ik} = \sum_{k=1}^{r}P(X_{m+1}=k|X_{m}=i)=1, P_{ij}\geq 0, \forall i$$

## Domain Retention Prediction

Now we are ready to investigate the domain retention prediction problem using DTMC. There are four possible states of a domain name: **Active, Pending Release, Expired** and do-not- exist. We are interested in the probability of a domain to stay in the register. Since the do-not-exist state is not reachable from other states and the transit from the do-not-exist is beyond the scope of this study, it is not considered here. The transition diagram is shown below. A simplified lifecycle diagram can be found on GetYourselfOnline.

From the diagram it can be seen that all the states communicate with each other, therefore it is an irreducible chain. Moreover, note that there is a positive probability of remaining in a state if the chain starts with that state. A beautiful thing about an irreducible markov chain is that it has a *stationary distribution*, which is the long term behaviour of the chain. The stationary distribution $\pi$ is a vector obtained by solving a set of linear equations $\pi = \pi P$.

Given the sequence data set containing states of domain names at different time points, we can estimate the transition probability matrix by using the following equation:

$$ \hat{P} _{ij} = \frac{n_{ij}}{\sum_{k=1}^{r}n_{ik}} $$

Where $n_{ij}$ denotes the number of times the chain moved from state $i$ to $j$. The denominator, $\sum_{k=1}^{r}n_{ik}$ is the total number of transitions out of state $i$. Estimating the entries like this also corresponds to the maximum likelihood estimator of the transition probability matrix.

Having these in mind, we can do predictive analysis to answer questions like:

1. Given that a domain name is active today, what is the probability that it is active / pending release / expired after 3 months, 6 months, 9 months?

2. If a new domain name is registered today, what fraction of its life will it spend in each of the three states?

## Next Steps

In this post, we have explored DTMC and its application on domain retention prediction. Some preliminary results have been obtained from a simulation using Python and a data sample from the .nz registry. Another interesting idea for predictive modelling is to use a Markov chain classifier. A portion of the factors considered included:

- Registration Information
- Status, purpose, and information of the website for the domain name
- Historical information, e.g., length of time being active in the register, website traffic, total number of renewals
- Setup indicators, e.g. the domain has a nameserver, web server or mail server in use

One main objective of the domain name retention study is to find factors mostly influencing the probability of renewal, i.e., the KPIs associated with domain names. With that knowledge, registrars can better understand their customers and provider better service to attract high-value domain registrations to increase income. A follow-up blog post will discuss results and insights in details.