Accessing the NZRS Internet Data Portal using Python and Pandas - Part 1

Introduction

The NZRS Internet data portal (IDP) makes available a collection of open data about the Internet in New Zealand. This is offered as public service and we encourage reuse of the data.

The platform offers a number of ways of exploring the data, with filter and visualisation options but ultimately people may want to pull that data out and use it in their analysis package of choice, whether that is something as simple as a spreadsheet or a more advanced toolset.

This blog post is to show some ways of using tools that NZRS use for accessing and analysing data, and how they can be used with the Internet data portal. We will be primarily using Python Pandas for its flexible and useful two dimensional 'dataframe' data structure. This blog post is not an example its actually the beginning of our internal analysis workflow when we are working with IDP data. This is part one of at least two parts and covers getting the data into Python and creating a simple pivot table.

NZRS use a variety of tools for analysis and visualisation, we use spreadsheets such as Microsoft Excel, R and various SQL implementations; we also use tools from the Python SciPy ecosystem. It's this that gives us Pandas and the more advanced data structures and analysis tools. Importantly it allows for reproducible research when combined with Jupyter notebooks. Jupyter notebooks allow us to document our analysis and make it available for review and reuse.

We tend to use the Anaconda Python distribution which you can get here, it comes with the most popular packages for science, math, engineering, and data analysis.

Extracting Data From the NZRS IDP

There are several ways of extracting data from the portal:

  • Using Microsoft OData which allows you to open refreshable data in applications
  • Using the SODA API which has libraries and SDKs for a stack of programming languages
  • Printing the data
  • Downloading the data in a number of formats, including JSON, Excel, CSV and RSS

Opening NZRS IDP Data in a Pandas Dataframe

We're going to look at the .nz registration activity. This is a summary of the .nz registry updated on a daily basis and includes state measures (e.g. count of domains) and transaction measures (e.g. number of domains registered in a given day).

Firstly we'll need a link to open. The platform provided does not provide a straightforward way to get a link, but we can copy the link location (right click > copy link location).

alt

We could filter the data before getting the link for opening but as this is a small data set there is no real reason to pre filter, we may as well put it all into a dataframe and use the functionality of the dataframe to chop it up if we need to.

Looking at the CSV link it looks something like this:

https://idp.nz/api/views/mm2r-3dj9/rows.csv?accessType=DOWNLOAD

There is no need for the download suffix in the URL so we can trim this to make it look more human readable in our code:

https://idp.nz/api/views/mm2r-3dj9/rows.csv

Before we start we want to include the necessary libraries. By convention and as habit we import Pandas and Numpy.

import pandas as pd  
import numpy as np  

While in Python we might typically open files using particular libraries for the file type, Pandas gives us a set of built in read methods that lets us read a 16 sources directly into a Pandas dataframe. This includes CSV that we are using but also SQL based tables, HTML files, JSON and the clipboard.

We can read the file as follows.

idp_url = 'https://idp.nz/api/views/mm2r-3dj9/rows.csv'

my_frame = pd.read_csv(idp_url)  

We can verify what we have read into the dataframe, we can check the columns are what we would expect to see and the frame is the size we would expect.

#check columns
my_frame.columns

>>> Index([u'Day', u'Parent Level Domain', u'Measure', u'Measure Type', u'Count'], dtype='object')

len(my_frame)

>>> 186255

Preparing the Dataframe for Further Analysis

So we now have a dataframe with data from the NZRS IDP. This is great but we're going to want to do some analysis on it. Often we'll be aggregating or summarising data at a monthly or annual level. We could apply a function to every chop or aggregation we do, but a useful thing to do is to add some additional columns to aggregate or pivot on. We are going to add the following:

  • Year
  • Month
  • Day
  • YRMO

Year, month and day are self explanatory but YRMO might look a little cryptic. YRMO is a combination of year and month and has the benefit of communicating both year and month in four characters and the chronological order is preserved in the numerical order.

On reading into the dataframe the Day field was read as a text string. We can check the type by selecting a column and row. The column by name and the row by index.

# Column = Day, second row 
print my_frame['Day'][2]

>>> 05/10/2015

# Type
print type(my_frame['Day'][2])

>>> <type 'str'>

We can see from the text string that leading 0's are used which indicates a fixed width is used for the dates, so rather than converting to a date and applying date functions we can use simple string manipulation.

Each 'column' in the data frame is actually a Numpy data series, which allows us to apply a function to the vector of the series. So we can write some functions to do this and apply them to the entire column, so we are not having to iterate over the dataframe.

# Year, month and day columns functions

def get_day(date_string):  
    return date_string[:2]

def get_month(date_string):  
    return date_string[3:5]

def get_year(date_string):  
    return date_string[6:10]

We can now apply the above functions to the dataframe and create three new columns. We could have alternatively used Python lambdas but some of us have preference for verbosity.

#Create three new columns using previously defined functions

my_frame['day'] = my_frame['Day'].apply(get_day)  
my_frame['month'] = my_frame['Day'].apply(get_month)  
my_frame['year'] = my_frame['Day'].apply(get_year)

We can now see we have an additional three columns with the names 'day', 'month' and 'year'.

my_frame.columns

>>> Index([u'Day', u'Parent Level Domain', u'Measure', u'Measure Type', u'Count', u'day', u'month', u'year'], dtype='object')

Now we can add our YRMO number to the dataframe by concatenating values in two columns.

# Add a YRMO number

my_frame['yrmo'] = my_frame['year'] + my_frame['month']

Again we can see we have the new column.

my_frame.columns

>>> Index([u'Day', u'Parent Level Domain', u'Measure', u'Measure Type', u'Count', u'day', u'month', u'year', u'yrmo'], dtype='object')

At this stage we want to get a bit closer to the data and make sure its looking like we want expect it to. We have 186,255 rows of data, which is manageable but we probably don't want to print the entire dataframe to see what is there, so we have some options.

  • Examine the columns again
  • Look at n rows from the dataframe to see what it looks like
  • Count unique values in each column
  • Look at what unique values are in columns
  • Use the describe method
## Examine and Check the Dataframe ##

# Examine columns
print my_frame.columns

>>> Index([u'Day', u'Parent Level Domain', u'Measure', u'Measure Type', u'Count', u'day', u'month', u'year', u'yrmo'], dtype='object')

# Examine rows 4 to 6
print my_frame.loc[4:6]

>>>           
Day Parent Level Domain      Measure  Measure Type  Count day month  \  
4  05/10/2015              cri.nz  Uncancelled  Transactions      0  05    10  
5  05/10/2015              cri.nz  Transferred  Transactions      0  05    10  
6  05/10/2015               co.nz      Created  Transactions    249  05    10   

   year    yrmo  
4  2015  201510  
5  2015  201510  
6  2015  201510  

# Count unique values by column

for column in my_frame.columns:  
    print column, my_frame[column].nunique()

>>> 
Day 1371  
Parent Level Domain 16  
Measure 9  
Measure Type 2  
Count 8253  
day 31  
month 12  
year 4  
yrmo 46


# Explore unique values in an individual column

print my_frame['Measure'].unique()

>>> ['Transferred' 'Renewed' 'Cancelled' 'Released' 'Uncancelled' 'Created'
 'Due For Renewal' 'Billed Domains' 'Pending Release']

print my_frame['Measure Type'].unique()

>>> ['Transactions' 'State']


print my_frame.describe()

>>> 
               Count
count  186255.000000  
mean     4259.802212  
std     39429.309969  
min         0.000000  
25%         0.000000  
50%         1.000000  
75%        35.000000  
max    481761.000000

From this we can tell a few things that will shape how we use the data. Firstly we have two types of measure, transaction and state. As we will likely be aggregating we need to treat these differently. If we were looking to aggregate by month we would need to apply different operations.

If we were to look at total transactions in a month we would have to sum them by month, if we wanted to know the state in a month we would need to take a measure at a point in time (say the first or last day of the month). It makes some sense to work with this data in separate dataframes which we can easily do.

Further Process the Dataframe for Analysis

# Create two separate frames one for transactions and one for state

transaction_frame = my_frame[my_frame['Measure Type'] == 'Transactions']  
state_frame = my_frame[my_frame['Measure Type'] == 'State']

# Check the new frames are unique
print transaction_frame['Measure Type'].unique()  
print state_frame['Measure Type'].unique()

>>>

['Transactions']
['State']

Now we can start doing some more interesting stuff with the two dataframes.

Displaying a Pivot Table For Transaction Measures

A pivot table allows for quick and easy data summarisation. If we wanted to know creates over time we could create a pivot table to summarise it. In the example below we do the following:

  • Filter the dataframe to only values where 'Measure' = 'Created'
  • Display years as columns
  • Aggregate on Parent Level Domain and Measure (though there is only one Measure)
  • Apply and aggregate function in this case 'sum'
my_pivot = pd.pivot_table(transaction_frame[transaction_frame['Measure'] == 'Created'], values='Count', index=['Measure','Parent Level Domain' ],  
                    columns=['year'], aggfunc=np.sum)


print my_pivot  
>>>


year                           2012    2013   2014   2015  
Measure Parent Level Domain  
Created ac.nz                   369     314    383    242  
        co.nz                104365  100852  92755  62798
        cri.nz                    1       0      0      0
        geek.nz                 311     217    172     63
        gen.nz                  255     126    117     60
        govt.nz                  38      21     16     13
        health.nz                 4      38      9     23
        iwi.nz                    5      11      7     11
        kiwi.nz                8026    2558   2320    574
        maori.nz                667     240    209    106
        mil.nz                    3       3      3      1
        net.nz                 7151    6331   4897   2595
        nz                      NaN     NaN  43220  61909
        org.nz                 4843    4604   4235   2501
        parliament.nz             1       1      0      1
        school.nz               299     260    306    173

There will be some follow up blog posts, with some more advanced analysis and some more interesting examples. In this post we have shown how we read data into Python ready to begin manipulating it.