An important task in text data mining is text classification (also called categorisation). The objective is to start with a training set of documents (e.g. newspaper articles) labelled with a class (e.g. ‘sports’, ‘politics’) and then to build a classification model that automatically assigns the correct class to a new document. In machine learning (ML), this is a standard supervised learning problem.
As the .nz registry, we collect some information about registrants during the registration process, including the registrant’s name, with no distinction between individuals and organisations. The registrant type is of great interest to us: it helps us better understand the status of the register and, together with our domain industry classification and other information, create targeted campaigns in the future. Our objective is a classifier that automatically predicts whether a name is a person or an organisation. This is a typical text classification problem, and the steps we took to solve it are summarised below:
We initially explored a probabilistic approach using Named Entity Recognition with 2,000 manually classified names. Now that Natural Language Processing (NLP) libraries have become more mature and deep learning models are in common use, it made sense to revisit the problem with these techniques and see how we can benefit from them, which is the focus of this blog post.
Data and Preprocessing
The data set, collected in Feb 2017, contains 296,774 unique names. The first step is text preprocessing. The length of names varies from 1 to 18 words (including numbers and symbols), and the figure below shows that 79.05% of the names are fewer than 4 words long. A closer look at the data shows that names longer than 4 words are mostly organisations, whereas short names can be either persons or organisations, e.g. ‘Lei Li’ vs ‘Job Ltd’.
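The word-length calculation above can be sketched in a few lines of Python (the names below are illustrative examples, not real registrant data):

```python
import re

def name_length(name: str) -> int:
    """Count whitespace-separated tokens in a registrant name;
    numbers and symbols count as tokens too."""
    return len(re.findall(r"\S+", name.lower()))

# Illustrative names only, not drawn from the register:
names = ["Lei Li", "Job Ltd", "The New Zealand Widget Company Limited"]
lengths = [name_length(n) for n in names]  # short names can be either class
```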
The histogram below shows the 30 most popular words. Since most organisation names end with ‘ltd’ or ‘limited’, it is not surprising to see them at the top. Other popular words include locations (e.g. nz, New Zealand) and words indicating the service an organisation provides (e.g. solutions, trust). An interesting observation is that the most popular person names are all male names.
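A word count like this can be produced with the standard library; the sample names here are made up for illustration:

```python
from collections import Counter

# Illustrative sample of registrant names (not real data)
names = ["Treecare Ltd", "Kiwi Solutions Limited", "Auckland Trust Ltd"]

# Tally every lower-cased word across all names
words = Counter(w for name in names for w in name.lower().split())
top30 = words.most_common(30)  # on the real data, 'ltd'/'limited' dominate
```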
As a benchmark, we first tried traditional machine learning models using the Pipeline class in Sklearn, which sequentially applies a set of transformers followed by a classifier. It is very easy to use and fast. We have 29,348 hand-classified names and used a 90/10 training/testing split. We tried SVM, Naive Bayes and Logistic Regression as classifiers; the accuracies were 92.0%, 91.8% and 92.1% respectively.
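A minimal sketch of such a pipeline, assuming TF-IDF features feeding Logistic Regression; the toy names and the choice of character n-grams are illustrative assumptions, not necessarily the exact features used:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy data; the real set had 29,348 hand-classified names
names = ["Jeremy Ashford", "Treecare Ltd", "Jacqui Smith", "Techsoft Limited"] * 10
labels = ["person", "org", "person", "org"] * 10

# 90/10 training/testing split, as in the benchmark
X_train, X_test, y_train, y_test = train_test_split(
    names, labels, test_size=0.1, random_state=42)

# Character n-grams are one way to cope with short, single-token names
clf = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))),
    ("lr", LogisticRegression()),
])
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
```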
Another method is to first transform words into vectors using Word2Vec or Doc2Vec and then train a machine learning algorithm (i.e., a classifier) on those vectors. We tried both with Gensim. Word2Vec is a neural-network-based algorithm that learns relationships between words automatically; we take the average of the word vectors in a registrant name so that each name is represented by a single vector. Doc2Vec represents a whole document as one vector directly. We tried several classifiers, including SVM, Naive Bayes, Random Forest, Logistic Regression, KNN and MLP-NN. The accuracy with Word2Vec vectors ranges from 83% (Naive Bayes) to 93.3% (MLP-NN); with Doc2Vec vectors it ranges from 77.9% (Naive Bayes) to 90.7% (MLP-NN).
Neural networks are not new to us at this point, since Word2Vec uses a shallow two-layer neural network to produce word vectors. A deep neural network such as a Convolutional Neural Network (CNN) has more layers and is widely used in computer vision and NLP. The one we used has an embedding layer, followed by a convolutional, a max-pooling and a softmax layer. One can either train the word embeddings from scratch within the CNN or use a pre-trained word embedding. Ye and Bayern (2015) found that using a pre-trained word embedding performed better than not using one. We used Google’s pre-trained word2vec embeddings and implemented the CNN with Tensorflow. With 71 parameter sets for grid search, training took 5.5 hours and the best accuracy we got was 91.1%.
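A minimal Keras (TensorFlow) sketch of that architecture follows; the vocabulary size, filter count and kernel width are illustrative assumptions, not the tuned values from our grid search:

```python
from tensorflow.keras import layers, models

VOCAB = 5000   # assumed vocabulary size (illustrative)
EMBED = 300    # Google's word2vec embeddings are 300-dimensional

# Embedding -> convolution -> max-pooling -> softmax, as described above
model = models.Sequential([
    layers.Embedding(VOCAB, EMBED),
    layers.Conv1D(128, 3, activation="relu"),   # slide over word windows
    layers.GlobalMaxPooling1D(),                # keep the strongest feature
    layers.Dense(2, activation="softmax"),      # person vs organisation
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(...) would follow, with names padded to a fixed length (max 18)
```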
Another deep learning model we tried was the Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM). A good introduction to LSTM-RNNs can be found here. RNNs are known to perform well on text classification problems: they are designed to learn from sequences of data where time dependency is important, which is why another common application of RNNs is time-series analysis. The LSTM cells help the RNN focus on certain parts of a sequence and ignore unimportant words. We implemented the model with Keras and trained the vocabulary from scratch. The best accuracy we got was 92.7% after training for 20 epochs over 2.2 hours.
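A minimal Keras sketch of such an LSTM-RNN; the vocabulary size and layer widths are illustrative assumptions:

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Embedding(5000, 128),           # embeddings trained from scratch
    layers.LSTM(64),                       # learns which parts of the sequence matter
    layers.Dense(1, activation="sigmoid"), # person vs organisation
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
# model.fit(..., epochs=20) would follow on the padded name sequences
```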
Finally, we tried fastText, developed by Facebook. The library is surprisingly fast compared with the other methods at achieving the same level of accuracy. The accuracy we got was 92.9%. The best accuracies from the different models are summarised below.
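fastText’s supervised mode reads one example per line, with the class encoded as a `__label__` prefix. A sketch of preparing that input (the names and file name here are illustrative):

```python
# Illustrative training pairs, not real registrant data
examples = [("Jeremy Ashford", "person"), ("Treecare Ltd", "org")]

# fastText expects lines like "__label__org treecare ltd"
lines = [f"__label__{label} {name.lower()}" for name, label in examples]
with open("names.train", "w") as f:
    f.write("\n".join(lines))

# Training would then be, e.g.:
#   import fasttext
#   model = fasttext.train_supervised("names.train")
```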
Six registrant names (altered to protect registrants’ privacy) were selected to compare the predictions generated by the trained models. They represent names that (1) are probably persons, e.g. “Jeremy Ashford”; (2) consist of a single word but are still not hard to tell, e.g. “Jacqui”; (3) contain telling words and hence are very easy to predict, e.g. “Treecare Ltd”; (4) are a single compound word with a telling part in it, e.g. “Techsoft”; (5) look like a person’s name but not a common English one, e.g. “Tan”; (6) have no clear meaning, so that even a human might find them very hard to classify, e.g. “wjja”.
Solving the registrant classification problem has been a good opportunity for us to learn and apply different models used for text classification. There is still room for improvement. For example, we could apply grid search to the parameters of the traditional machine learning models and the LSTM-RNN to achieve higher accuracy. Also, if the prediction errors of the different models are not highly correlated, there could be a good opportunity to benefit from ensembles. We are trying these ideas out to boost the accuracy and will share our final model with you.
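One simple form of ensemble is a majority vote over the per-model predictions; the predictions below are hypothetical:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-model predictions for one name by simple majority.
    This only helps if the models' errors are not highly correlated."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical predictions from three trained models for one name:
combined = majority_vote(["person", "org", "person"])
```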