Sunday, April 17, 2016

Distinguishing Document Types With Part-of-Speech Fractions

Text mining refers to analytical procedures that turn text data into high-quality, actionable knowledge.  It facilitates applications where human effort is minimized and decision-making ability is maximized.  How text is represented inside a computer dictates what algorithms can be applied to derive meanings and inferences.  Some ways a sentence can be represented are, in order of increasing complexity: as a sequence of words; words plus part-of-speech tags; syntactic structures (e.g., noun phrases, verb phrases); entities and relations; or logic predicates.  Incorporating higher-level structures into text analysis requires progressively more human input and becomes increasingly error-prone.

Part-of-speech (POS) tagging is one of the simplest ways to represent text.  The tagging process, however, is nontrivial.  For example, the part of speech a word represents can depend on the context in which the word is used (e.g., "Throw the book at him, Sheriff" versus "Book him, Chief").  At its inception, POS tagging was done by hand.  The field has since progressed to taggers trained with machine learning algorithms; state-of-the-art POS taggers achieve an accuracy of about 97%.

It strikes me that because some kinds of documents have very different writing styles (e.g., fictional novels versus scientific papers), they may have different POS fractions.  If so, it may be possible to distinguish various classes of documents with a vector of POS fractions alone.  I explore that possibility here.

Sample


I start by assembling a corpus of four very different document types.

Classic novels from Project Gutenberg (1,017,969 words):

Ten novels from Project Gutenberg were picked randomly from the site's list of most popular fiction.  My selection includes books like "Adventures of Huckleberry Finn", "Dracula", and "Alice's Adventures in Wonderland".  I downloaded the text in ASCII format directly from Project Gutenberg.

Astrophysics journal articles (285,785 words):

For this category, I used my PhD thesis plus 10 additional journal articles cited in my thesis.  I converted the PDF-formatted files to ASCII with the pdftotext shell command.

Yelp restaurant reviews (1,030,909 words):

This sample of 10,000 Yelp reviews comes in ASCII format from the Coursera Text Mining and Analytics course I completed in Summer 2015.

Hillary Clinton's controversial e-mails (3,015,552 words):

This data set was conveniently preprocessed and distributed as an SQLite database by Kaggle.  From the database, I pulled the 'RawText' field, which contains the raw text, including sender/recipient metadata, extracted from the officially released PDF documents.  There were 7,945 e-mails in the data set.
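As a sketch, the e-mails can be pulled from the database with Python's built-in sqlite3 module.  The file, table, and column names below follow my recollection of the Kaggle release; treat them as assumptions.

    import sqlite3

    # Hypothetical names based on the Kaggle release: the database file is
    # 'database.sqlite' and the raw text lives in the Emails.RawText column.
    conn = sqlite3.connect('database.sqlite')
    emails = [row[0] for row in conn.execute('SELECT RawText FROM Emails')]
    conn.close()
    print(len(emails))  # expect 7,945 e-mails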

POS Tagging


The POS tagging was performed using the open source MeTA (ModErn Text Analysis) toolkit. I learned to use this toolkit during the Coursera Text Mining and Analytics course.  For this exercise, I used the built-in greedy Perceptron-based POS tagger, which performs at an accuracy of about 97%.  This tagger applies a popular tagging convention for American English called the Penn Treebank tag set.  This tag set is comprehensive.  It calls for dividing some POS into subclasses (e.g., personal versus possessive pronouns), and it also accounts for symbols, foreign words, cardinal numbers, etc.
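To illustrate what Penn Treebank tagging produces, here is a minimal sketch using NLTK's averaged-perceptron tagger as a stand-in for MeTA's (both are Perceptron-based and emit Penn Treebank tags):

    import nltk

    # One-time downloads of the tokenizer and tagger models.
    nltk.download('punkt', quiet=True)
    nltk.download('averaged_perceptron_tagger', quiet=True)

    tokens = nltk.word_tokenize("Throw the book at him, Sheriff.")
    print(nltk.pos_tag(tokens))
    # e.g. [('Throw', 'VB'), ('the', 'DT'), ('book', 'NN'), ('at', 'IN'),
    #       ('him', 'PRP'), (',', ','), ('Sheriff', 'NNP'), ('.', '.')]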

Figure 1 summarizes the results from POS tagging the four document classes.  The horizontal axis includes only five POS (nouns, verbs, pronouns, adjectives, and adverbs). The fractions of other POS are either small or do not vary widely enough with document category to be interesting, and they are excluded from further analysis.  The vertical axis represents the fraction of each POS normalized to the total counts of nouns, verbs, pronouns, adjectives, and adverbs within the documents; all subclasses in the individual POS categories have been merged for this calculation.  Finally, all documents in a category were concatenated together and then fed to the POS tagger, so the fractions reported represent global fractions over the document classes.  Since N=1 in every category, there are no statistical error bars. Instead, I show the 3% error bars representing the stated accuracy of the POS tagging, which would exceed the statistical error bars on the mean POS fractions (see below).

Figure 1
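As a sketch of the normalization described above, the Penn Treebank tags can be merged into the five coarse classes and the fractions computed like this (the exact tag-to-class mapping, e.g., whether WH-pronouns are counted, is my assumption):

    from collections import Counter

    # Merge Penn Treebank subclasses into five coarse POS classes.
    COARSE = {
        'NN': 'noun', 'NNS': 'noun', 'NNP': 'noun', 'NNPS': 'noun',
        'VB': 'verb', 'VBD': 'verb', 'VBG': 'verb', 'VBN': 'verb',
        'VBP': 'verb', 'VBZ': 'verb',
        'PRP': 'pronoun', 'PRP$': 'pronoun', 'WP': 'pronoun', 'WP$': 'pronoun',
        'JJ': 'adjective', 'JJR': 'adjective', 'JJS': 'adjective',
        'RB': 'adverb', 'RBR': 'adverb', 'RBS': 'adverb',
    }

    def pos_fractions(tagged):
        """Fractions of the five POS, normalized to their combined count."""
        counts = Counter(COARSE[t] for _, t in tagged if t in COARSE)
        total = sum(counts.values())
        return {p: counts[p] / total
                for p in ('noun', 'verb', 'pronoun', 'adjective', 'adverb')}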

Figure 2 is the same, except that the global fractions are shown relative to the classic novels.

Figure 2

From Figures 1 and 2, striking differences are visible between the fractions of nouns, verbs, and pronouns.  Classic novels have a noun fraction around 30%, compared to Yelp reviews at 39%; astrophysics papers and Hillary's e-mails have much higher noun fractions, around 65%.  The verb fraction in the classic novels is comparable to the noun fraction, but in the other document classes the verb fraction is significantly (>3%) lower; verb fractions in the astrophysics papers and Hillary's e-mails are just ~15%.  It is also interesting that the fraction of pronouns varies widely across the four categories, with the astrophysics papers having a pronoun fraction of only 2.5%; this is sensible because only a few pronouns (we, our, they, their) commonly appear in scientific writing.  The fractions of adjectives vary less widely and fall in the range of ~8-12%.  The adverb fraction is comparable (~11%) in classic novels and Yelp reviews; the lower adverb fraction (~4%) in the astrophysics papers and Hillary's e-mails makes sense because the fraction of verbs is lower too.

Based on these differences, we can begin to understand how the different document classes cluster in the 5-dimensional space of POS fractions.  Classic novels have fewer nouns but more verbs and pronouns than Yelp reviews.  Scientific articles and Hillary's e-mails are seemingly hard to distinguish from each other, but compared to the novels and Yelp reviews, they have very high noun fractions and low verb, pronoun, and adverb fractions.  This all provides a glimmer of hope that machine learning could correctly distinguish short excerpts from each document class.  A casual web search finds no examples of POS fractions being used to distinguish document types, so this extra step is worth trying.

Before proceeding, let's test that the global trends in POS fraction extend to shorter excerpts from the documents.  I remeasured the POS fractions by slicing the document classes into excerpts of lengths 300, 500, 1000, and 1500 words.  Additionally, I took care to filter out punctuation using regular expressions to avoid a bias in my word counting algorithm.
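A minimal sketch of the slicing, assuming punctuation-only tokens are dropped with a regular expression before words are counted:

    import re

    def excerpts(text, length=500):
        """Slice a document into consecutive excerpts of `length` words,
        dropping punctuation-only tokens so they do not inflate the counts."""
        words = [w for w in text.split() if re.search(r'\w', w)]
        return [words[i:i + length]
                for i in range(0, len(words) - length + 1, length)]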

Figure 3 shows the result of this exercise for an excerpt length of 500 words.  The mean POS fraction is annotated on each subplot in Figure 3.  The mean fractions all agree to within 1% of the global means shown in Figure 1, and the standard errors on the mean values are <1% in all cases.  Thus, the POS mean fractions derived globally and from the shorter excerpts are robust.

Figure 3
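The quoted means and standard errors, and the scatter discussed next, can be computed per category along these lines (pos_fractions and excerpts are the sketches above; category_text is a hypothetical string holding one category's concatenated documents):

    import numpy as np
    import nltk

    # Noun fraction for every 500-word excerpt in one category.
    fracs = np.array([pos_fractions(nltk.pos_tag(e))['noun']
                      for e in excerpts(category_text, length=500)])
    mean = fracs.mean()
    sigma = fracs.std(ddof=1)           # scatter (1-sigma standard deviation)
    sem = sigma / np.sqrt(len(fracs))   # standard error on the mean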

Table 1 shows the distributions in Figure 3 have scatter (1-sigma standard deviations) between ~2% and ~14%.  The noun fractions in the astrophysics papers and Hillary e-mails have high, ~12-14%, standard deviations.  Inspecting the document excerpts shows that the very high noun fractions appear in the bibliography portions of astrophysics papers and in the headers of Hillary's e-mails.  Increasing the excerpt length above 500 words reduces, but does not eliminate, the scatter because the POS fractions fluctuate less over larger chunks of words.  This scatter will limit the accuracy of any trained classification system.


Document Classification


The goal in this section is to build a system that can correctly identify the document class of an excerpt from just a vector of the five POS fractions (noun, verb, pronoun, adjective, adverb).  This constitutes a supervised learning problem because a model must first be trained and validated with example vectors and their corresponding category labels. A support vector machine (SVM) is an appropriate supervised learning method to use here; alternatives are logistic regression and neural networks. Training an SVM on vectors of POS fractions will define boundaries in POS-fraction space where the gaps between document categories are as wide as possible.

I proceed using the SVM implementation in the scikit-learn machine learning module for Python.  Specifically, I use the Support Vector Classification (SVC) class, which has free parameters for the kernel, regularization (i.e., management of overfitting), and category weights.  For the kernel, I use a Gaussian kernel with the 'gamma' parameter set to 1/n_features (i.e., 1/5).
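In scikit-learn terms, that amounts to something like the following sketch (class_weight='balanced' anticipates the category weighting described below; the exact keyword spelling depends on the scikit-learn version):

    from sklearn.svm import SVC

    # Gaussian (RBF) kernel with gamma = 1/n_features = 1/5, and class
    # weights inversely proportional to the category frequencies.
    clf = SVC(kernel='rbf', gamma=1.0 / 5, class_weight='balanced', C=1.0)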

Following the model validation technique presented in the Coursera Machine Learning course, I randomize the data (the POS vectors and their labels) into training, cross-validation, and test subsets in a 60/20/20% split.  The document categories are of different sizes, meaning there are many more excerpts from the category with the most words (Hillary's e-mails) than from the category with the fewest words (astrophysics papers).  In all three subsets, I require the document categories to appear in proportions reflecting the relative sizes of the document corpora (i.e., 18% classic novels, 19% Yelp reviews, 5% astrophysics papers, 58% Hillary e-mails).  Likewise, I initialize the SVM with class weights inversely proportional to the category frequencies in the input data.
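With scikit-learn, a stratified 60/20/20 split can be sketched as two successive calls to train_test_split, where X stacks the per-excerpt POS-fraction vectors and y holds the category labels (both hypothetical names):

    from sklearn.model_selection import train_test_split

    # 60/20/20 split; stratify preserves the category proportions in each subset.
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, test_size=0.4, stratify=y, random_state=0)
    X_cv, X_test, y_cv, y_test = train_test_split(
        X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)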

The SVM is trained with regularization parameters ranging from 1 to 1e5.  The cross-validation set is used to pick the optimal regularization parameter, namely the one yielding the highest classification accuracy on the cross-validation set.  Using the training set classification accuracy for this decision instead would be bad practice, as it allows the training set to bias the final model.  Finally, the generalized performance of the classifier is quantified as the test set classification accuracy.
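A sketch of that selection loop, continuing from the snippets above:

    import numpy as np

    # Pick the regularization parameter C on the cross-validation set.
    best_C, best_acc = None, 0.0
    for C in np.logspace(0, 5, 6):          # C = 1, 10, ..., 1e5
        clf = SVC(kernel='rbf', gamma=1.0 / 5, class_weight='balanced', C=C)
        clf.fit(X_train, y_train)
        acc = clf.score(X_cv, y_cv)
        if acc > best_acc:
            best_C, best_acc = C, acc

    # Generalized performance: accuracy on the held-out test set.
    final = SVC(kernel='rbf', gamma=1.0 / 5, class_weight='balanced', C=best_C)
    final.fit(X_train, y_train)
    print('test accuracy:', final.score(X_test, y_test))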

The classification results are shown in Table 2 for excerpt lengths of 300, 500, 1000, and 1500 words.  The size of the training set necessarily decreases with increasing excerpt length because the same total number of words is divided into progressively larger chunks.  Despite the shrinking training set, the overall classification accuracy (A_Overall) increases with excerpt length.  As already noted for Figure 3, increasing the excerpt length reduces the scatter in the distributions of POS fractions, and it must be this effect that explains the increase in classification accuracy with excerpt length.

Table 2

Classification accuracy within each class generally increases with excerpt length too, although there are instances where it stays about the same or decreases.  Excerpts of 500 words are enough to achieve classification accuracies of ~80% or higher in the four categories; increasing the excerpt length to 1500 words yields accuracies of ~92% or higher.  The classification accuracy for the Hillary e-mails (A_Hilary) is most affected by excerpt length, probably because this corpus shows the highest scatter in its distributions of POS fractions (Figure 3, Table 1).  The astrophysics papers and the Hillary e-mails have similar POS fractions (Figures 1-2), yet the classifier does not confuse them catastrophically; this highlights the power of machine learning.
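The per-class accuracies can be read off a row-normalized confusion matrix, sketched as:

    from sklearn.metrics import confusion_matrix

    # Rows are true classes; the diagonal of the row-normalized matrix
    # gives the classification accuracy within each class.
    cm = confusion_matrix(y_test, final.predict(X_test))
    per_class_accuracy = cm.diagonal() / cm.sum(axis=1)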

Conclusion


Different kinds of documents can have markedly different POS fractions.  With machine learning, the document classes of short excerpts can be correctly identified from POS fractions alone.  Reducing a document to a vector of POS fractions effectively discards specific knowledge of the entities and relations encapsulated in the words.  It is striking that a high classification accuracy (Table 2) is still achievable after this conversion.

Document classification from POS fractions alone is a neat trick. However, I don't see it as practical. POS tags can be extra features within other powerful text mining algorithms that also directly incorporate the words in the document.  In forthcoming posts, I will explore these and other advanced text mining algorithms.