The NZRS Internet data portal (IDP) makes available a collection of open data about the Internet in New Zealand. This is offered as a public service and we encourage reuse of the data.
The platform offers a number of ways of exploring the data, with filter and visualisation options, but ultimately people may want to pull the data out and use it in their analysis package of choice, whether that is something as simple as a spreadsheet or a more advanced toolset.
This blog post shows some of the ways we use the tools NZRS relies on for accessing and analysing data, and how they can be used with the Internet data portal. We will primarily use Python Pandas for its flexible and useful two-dimensional 'dataframe' data structure. This post is not a contrived example; it's actually the beginning of our internal analysis workflow when we work with IDP data. This is part one of at least two parts and covers getting the data into Python and creating a simple pivot table.
NZRS uses a variety of tools for analysis and visualisation: spreadsheets such as Microsoft Excel, R and various SQL implementations, as well as tools from the Python SciPy ecosystem. It's the SciPy ecosystem that gives us Pandas and its more advanced data structures and analysis tools. Importantly, combined with Jupyter notebooks it allows for reproducible research: Jupyter notebooks let us document our analysis and make it available for review and reuse.
We tend to use the Anaconda Python distribution, which comes with the most popular packages for science, maths, engineering and data analysis.
Extracting Data From the NZRS IDP
There are several ways of extracting data from the portal:
- Using Microsoft OData which allows you to open refreshable data in applications
- Using the SODA API which has libraries and SDKs for a stack of programming languages
- Printing the data
- Downloading the data in a number of formats, including JSON, Excel, CSV and RSS
Opening NZRS IDP Data in a Pandas Dataframe
We're going to look at the .nz registration activity. This is a summary of the .nz registry updated on a daily basis and includes state measures (e.g. count of domains) and transaction measures (e.g. number of domains registered in a given day).
Firstly we'll need a link to open. The platform does not provide a straightforward way to get one, but we can copy the link location from the export options (right click > copy link location).
We could filter the data before getting the link, but as this is a small data set there is no real reason to pre-filter; we may as well put it all into a dataframe and use the dataframe's functionality to chop it up if we need to.
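To illustrate the 'load everything, then chop it up' approach, here is a sketch on a toy frame whose column names mirror the IDP data set (the values are invented):

```python
import pandas as pd

# Toy frame with the same kind of columns as the IDP data (values invented)
toy = pd.DataFrame({
    'Parent Level Domain': ['co.nz', 'net.nz', 'co.nz', 'org.nz'],
    'Measure': ['Created', 'Created', 'Cancelled', 'Created'],
    'Count': [10, 2, 3, 1],
})

# Boolean masks filter the frame after loading, no pre-filtering needed
created = toy[toy['Measure'] == 'Created']
co_nz_created = created[created['Parent Level Domain'] == 'co.nz']

print(len(created))                     # 3 rows where Measure is 'Created'
print(co_nz_created['Count'].sum())     # 10
```

Each mask produces a new dataframe, so different cuts of the same data can live side by side without re-downloading anything.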
Looking at the CSV link, it looks something like this (Socrata-based portals typically append a download suffix such as `?accessType=DOWNLOAD`):

`https://idp.nz/api/views/mm2r-3dj9/rows.csv?accessType=DOWNLOAD`

There is no need for the download suffix in the URL, so we can trim it to make the link more readable in our code:

`https://idp.nz/api/views/mm2r-3dj9/rows.csv`
Before we start we want to import the necessary libraries. By convention, and out of habit, we import Pandas and NumPy.
```python
import pandas as pd
import numpy as np
```
While in Python we might typically open files using libraries particular to the file type, Pandas gives us a set of built-in read methods that let us read a number of sources directly into a Pandas dataframe. This includes the CSV we are using, but also SQL-based tables, HTML files, JSON and even the clipboard.
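As a small illustration of those read methods, the same toy data can be read from in-memory text in either CSV or JSON form (here via `StringIO` rather than a URL, so the example is self-contained):

```python
import io
import pandas as pd

# CSV text mimicking the shape of the IDP export (values invented)
csv_text = "Day,Measure,Count\n05/10/2015,Created,249\n05/10/2015,Cancelled,31\n"
frame_from_csv = pd.read_csv(io.StringIO(csv_text))

# The same record in JSON form
json_text = '[{"Day": "05/10/2015", "Measure": "Created", "Count": 249}]'
frame_from_json = pd.read_json(io.StringIO(json_text))

print(frame_from_csv.columns.tolist())   # ['Day', 'Measure', 'Count']
print(len(frame_from_csv), len(frame_from_json))
```

Swapping the `StringIO` object for a URL or file path is all that changes when reading from the portal itself.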
We can read the file as follows.
```python
idp_url = 'https://idp.nz/api/views/mm2r-3dj9/rows.csv'
my_frame = pd.read_csv(idp_url)
```
We can verify what we have read into the dataframe: check that the columns are what we would expect to see and that the frame is the size we would expect.
```python
# Check columns
my_frame.columns
>>> Index([u'Day', u'Parent Level Domain', u'Measure', u'Measure Type', u'Count'], dtype='object')

len(my_frame)
>>> 186255
```
Preparing the Dataframe for Further Analysis
So we now have a dataframe with data from the NZRS IDP. This is great, but we're going to want to do some analysis on it. Often we'll be aggregating or summarising data at a monthly or annual level. We could apply a function to every chop or aggregation we do, but a useful alternative is to add some additional columns to aggregate or pivot on. We are going to add the following:

- year
- month
- day
- yrmo
Year, month and day are self-explanatory, but YRMO might look a little cryptic. YRMO is a combination of year and month; it has the benefit of communicating both in six characters, and its numerical order preserves chronological order.
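The sort property is easy to check: because the strings are fixed width with the most significant part (the year) first, lexical order matches chronological order.

```python
# YRMO strings sort lexically into chronological order
yrmos = ['201601', '201512', '201403', '201511']
print(sorted(yrmos))  # ['201403', '201511', '201512', '201601']
```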
On reading into the dataframe, the Day field was read as a text string. We can check the type by selecting a single value: the column by name and the row by index.
```python
# Column = Day, second row
print my_frame['Day'][1]
>>> 05/10/2015

# Type
print type(my_frame['Day'][1])
>>> <type 'str'>
```
We can see from the text string that leading zeros are used, which indicates the dates have a fixed width, so rather than converting to a date type and applying date functions we can use simple string manipulation.
Each 'column' in the dataframe is actually a Pandas Series backed by a NumPy array, which allows us to apply a function across the whole vector at once. So we can write some small functions and apply them to the entire column, rather than iterating over the dataframe row by row.
```python
# Year, month and day column functions (dates are DD/MM/YYYY)
def get_day(date_string):
    return date_string[:2]

def get_month(date_string):
    return date_string[3:5]

def get_year(date_string):
    return date_string[6:10]
```
We can now apply the above functions to the dataframe and create three new columns. We could alternatively have used Python lambdas, but some of us have a preference for verbosity.
```python
# Create three new columns using the previously defined functions
my_frame['day'] = my_frame['Day'].apply(get_day)
my_frame['month'] = my_frame['Day'].apply(get_month)
my_frame['year'] = my_frame['Day'].apply(get_year)
```
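As an aside, pandas can also parse the dates directly and expose the parts through the `.dt` accessor; a sketch on a toy frame (note this yields integers rather than the zero-padded strings the slicing functions return):

```python
import pandas as pd

toy = pd.DataFrame({'Day': ['05/10/2015', '31/12/2014']})

# Parse DD/MM/YYYY strings into datetimes, then pull out the parts
parsed = pd.to_datetime(toy['Day'], format='%d/%m/%Y')
toy['year'] = parsed.dt.year
toy['month'] = parsed.dt.month
toy['day'] = parsed.dt.day

print(toy[['year', 'month', 'day']].values.tolist())
# [[2015, 10, 5], [2014, 12, 31]]
```

String slicing is fine for this fixed-width data, but the datetime route is safer if the format ever varies.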
We can now see we have an additional three columns with the names 'day', 'month' and 'year'.
```python
my_frame.columns
>>> Index([u'Day', u'Parent Level Domain', u'Measure', u'Measure Type', u'Count',
       u'day', u'month', u'year'], dtype='object')
```
Now we can add our YRMO number to the dataframe by concatenating values in two columns.
```python
# Add a YRMO number
my_frame['yrmo'] = my_frame['year'] + my_frame['month']
```
Again we can see we have the new column.
```python
my_frame.columns
>>> Index([u'Day', u'Parent Level Domain', u'Measure', u'Measure Type', u'Count',
       u'day', u'month', u'year', u'yrmo'], dtype='object')
```
At this stage we want to get a bit closer to the data and make sure it's looking like we expect it to. We have 186,255 rows of data, which is manageable, but we probably don't want to print the entire dataframe to see what is there, so we have some options:
- Examine the columns again
- Look at n rows from the dataframe to see what it looks like
- Count unique values in each column
- Look at what unique values are in columns
- Use the describe method
```python
## Examine and Check the Dataframe ##

# Examine columns
print my_frame.columns
>>> Index([u'Day', u'Parent Level Domain', u'Measure', u'Measure Type', u'Count',
       u'day', u'month', u'year', u'yrmo'], dtype='object')

# Examine rows 4 to 6
print my_frame.loc[4:6]
>>>           Day Parent Level Domain      Measure  Measure Type  Count day month  \
4  05/10/2015              cri.nz  Uncancelled  Transactions      0  05    10
5  05/10/2015              cri.nz  Transferred  Transactions      0  05    10
6  05/10/2015               co.nz      Created  Transactions    249  05    10

   year    yrmo
4  2015  201510
5  2015  201510
6  2015  201510

# Count unique values by column
for column in my_frame.columns:
    print column, my_frame[column].nunique()
>>> Day 1371
Parent Level Domain 16
Measure 9
Measure Type 2
Count 8253
day 31
month 12
year 4
yrmo 46

# Explore unique values in an individual column
print my_frame['Measure'].unique()
>>> ['Transferred' 'Renewed' 'Cancelled' 'Released' 'Uncancelled' 'Created'
 'Due For Renewal' 'Billed Domains' 'Pending Release']

print my_frame['Measure Type'].unique()
>>> ['Transactions' 'State']

print my_frame.describe()
>>>                Count
count  186255.000000
mean     4259.802212
std     39429.309969
min         0.000000
25%         0.000000
50%         1.000000
75%        35.000000
max    481761.000000
```
From this we can tell a few things that will shape how we use the data. Firstly, we have two types of measure: transaction and state. As we will likely be aggregating, we need to treat these differently. If we want total transactions in a month, we sum them by month; if we want the state in a month, we take a measure at a point in time (say the first or last day of the month). It therefore makes sense to work with this data in separate dataframes, which we can easily do.
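As a sketch of the point-in-time approach for state measures (toy values, taking the last day present in each month):

```python
import pandas as pd

# Toy state measure: a domain count observed on several days (values invented)
state = pd.DataFrame({
    'Day':   ['01/10/2015', '31/10/2015', '01/11/2015', '30/11/2015'],
    'yrmo':  ['201510', '201510', '201511', '201511'],
    'Count': [100, 110, 111, 120],
})

# For state measures, take the last observation in each month rather than a sum
month_end = state.groupby('yrmo')['Count'].last()
print(month_end.to_dict())  # {'201510': 110, '201511': 120}
```

Summing those rows instead would give a meaningless 210 and 231, which is exactly why the two measure types need separate treatment.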
Further Process the Dataframe for Analysis
```python
# Create two separate frames, one for transactions and one for state
transaction_frame = my_frame[my_frame['Measure Type'] == 'Transactions']
state_frame = my_frame[my_frame['Measure Type'] == 'State']

# Check the new frames each contain a single measure type
print transaction_frame['Measure Type'].unique()
print state_frame['Measure Type'].unique()
>>> ['Transactions']
['State']
```
Now we can start doing some more interesting stuff with the two dataframes.
Displaying a Pivot Table For Transaction Measures
A pivot table allows for quick and easy data summarisation. If we wanted to know creates over time, we could build a pivot table to summarise them. In the example below we do the following:
- Filter the dataframe to only values where 'Measure' = 'Created'
- Display years as columns
- Aggregate on Parent Level Domain and Measure (though there is only one Measure)
- Apply an aggregate function, in this case 'sum'
```python
my_pivot = pd.pivot_table(transaction_frame[transaction_frame['Measure'] == 'Created'],
                          values='Count',
                          index=['Measure', 'Parent Level Domain'],
                          columns=['year'],
                          aggfunc=np.sum)
print my_pivot
>>> year                           2012    2013   2014   2015
Measure Parent Level Domain
Created ac.nz                     369     314    383    242
        co.nz                  104365  100852  92755  62798
        cri.nz                      1       0      0      0
        geek.nz                   311     217    172     63
        gen.nz                    255     126    117     60
        govt.nz                    38      21     16     13
        health.nz                   4      38      9     23
        iwi.nz                      5      11      7     11
        kiwi.nz                  8026    2558   2320    574
        maori.nz                  667     240    209    106
        mil.nz                      3       3      3      1
        net.nz                   7151    6331   4897   2595
        nz                        NaN     NaN  43220  61909
        org.nz                   4843    4604   4235   2501
        parliament.nz               1       1      0      1
        school.nz                 299     260    306    173
```
There will be some follow-up blog posts with more advanced analysis and more interesting examples. In this post we have shown how we read data into Python ready to begin manipulating it.