Sunday, May 15, 2016

How to Save the Earth With Word Association and Topic Modeling

Here's the situation. On the eve of your retirement from a three-letter Washington agency, ID4-esque spaceships appear over Earth's major cities. Your boss hands you a hard drive of millions of captured alien documents. Survival depends on your ability to understand the content of that hard drive. Sighing, you phone your wife and assure her the retirement steaks need not be put back in the freezer. It won't take too long to wrap up this little setback.

Why?

Because you're an expert in text mining.

The above scenario calls for text mining. Specifically, efficient algorithms are needed to distill the contents of the captured documents into a format that can be more readily digested. Then specific documents will be identified and scrutinized to inevitably uncover whatever the fatal flaw is, this time, in the alien spacecraft, software, biology, etc.

In my previous text mining post, I represented documents as vectors of part-of-speech fractions. Here, I will discuss and demonstrate text mining algorithms that operate on words while ignoring word order and grammar. This bag-of-words model of text representation allows discovery of relations between words in the documents (word association) and the discovery of the broad themes and subjects in the documents (topic modeling).

Word Association


One way to discover word associations is to consider correlated occurrences. Words that have a high co-occurrence, yet rarely appear separately, likely have a syntagmatic relationship; words related in this way can be combined into meaningful phrases (e.g., 'steak' and 'grill'). Opinion analysis is one application of syntagmatic relation mining. Given a word representing an entity (e.g., 'Yankees'), one can mine the top syntagmatically related words to summarize what people write about that entity.

Information theory provides a way to quantify syntagmatic relationships. Entropy, in the context of information theory, quantifies the randomness of a word's occurrence. In a 19th-century novel, 'the' has low entropy because it is certain to appear. The modern word 'airplane' likewise has low entropy because it will not appear. Words like 'railroad', 'horse', or 'farm' have higher entropy because their occurrence depends on the details of the novel.

The probability that a word occurs in a document can be estimated by dividing its number of occurrences by the total number of words; this is easy to compute in the bag-of-words representation. Figure 1 shows how the entropy (H) of a word w is calculated from the probability of w occurring. The random variable $X_w$ takes on values (v) indicating that word w is present (1) or absent (0). The probabilities that w does and does not appear sum to one.

Figure 1
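Written out (my rendering of the standard definition described above), the entropy of the presence/absence variable $X_w$ is

$$H(X_w) = -\sum_{v \in \{0,1\}} p(X_w = v)\,\log_2 p(X_w = v), \qquad p(X_w = 0) + p(X_w = 1) = 1.$$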

Figure 2 further defines conditional entropy as the entropy of a word given that we know whether a second word is present. By itself, the word 'horse' has high entropy in a classic novel. Given that the word 'buggy' is present in a sentence, 'horse' is very likely to occur; the conditional entropy H('horse'|'buggy') is much lower than H('horse'). In comparison, consider the conditional entropy H('horse'|'is'). The two words are not related, and the presence of 'is' does not change the randomness of 'horse'. As in these examples, it is generally true that the conditional entropy of one word given another is less than or equal to the entropy of the first word by itself.

Figure 2
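In the same notation (again my rendering of the standard definition), the conditional entropy averages the entropy of $X_{w_1}$ over the possible states of $X_{w_2}$:

$$H(X_{w_1} \mid X_{w_2}) = \sum_{u \in \{0,1\}} p(X_{w_2} = u)\, \left[ -\sum_{v \in \{0,1\}} p(X_{w_1} = v \mid X_{w_2} = u)\,\log_2 p(X_{w_1} = v \mid X_{w_2} = u) \right],$$

and it always satisfies $H(X_{w_1} \mid X_{w_2}) \le H(X_{w_1})$.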

In Figure 3, mutual information (I) is defined as the reduction in the entropy of one word when we learn whether a second word is present. The difference between H('horse') and H('horse'|'buggy') is large, so the mutual information I('horse';'buggy') is high. Note that mutual information is nonnegative, is symmetric in the two words, and is zero if the two words are completely independent (e.g., 'horse' and 'is').

Figure 3
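Equivalently (standard identity, my rendering):

$$I(X_{w_1}; X_{w_2}) = H(X_{w_1}) - H(X_{w_1} \mid X_{w_2}) = H(X_{w_2}) - H(X_{w_2} \mid X_{w_1}) \ge 0.$$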

Mutual information gives a principled way to discover syntagmatic relationships. Mining a document corpus for the word pairs with the highest mutual information broadly indicates the contents of the documents. To demonstrate, I measured mutual information on a collection of 11 books (1,000,000+ words) from Project Gutenberg using the open source ModErn Text Analysis (MeTA) toolkit. I configured MeTA to preprocess the documents by removing stop words and by trimming word endings with the Porter2 stemming algorithm. Finally, I measured the mutual information on the resulting bags of words with code I wrote as part of the Coursera Text Mining course.
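The course code isn't reproduced here, but the idea fits in a short, self-contained sketch. The function name, the toy corpus, and the 0.25 smoothing constant below are my own choices; the sketch simply scores word pairs by mutual information over presence/absence in documents.

```python
import math
from itertools import combinations

def mutual_information(docs, w1, w2):
    """I(w1; w2) over binary presence/absence in documents.
    `docs` is a list of sets of (stemmed) words; light smoothing avoids log(0)."""
    n = len(docs)
    c1 = sum(1 for d in docs if w1 in d)
    c2 = sum(1 for d in docs if w2 in d)
    c12 = sum(1 for d in docs if w1 in d and w2 in d)
    # Smoothed joint probabilities for the four presence/absence combinations.
    p = {(1, 1): (c12 + 0.25) / (n + 1),
         (1, 0): (c1 - c12 + 0.25) / (n + 1),
         (0, 1): (c2 - c12 + 0.25) / (n + 1),
         (0, 0): (n - c1 - c2 + c12 + 0.25) / (n + 1)}
    p1 = {v: p[(v, 0)] + p[(v, 1)] for v in (0, 1)}
    p2 = {u: p[(0, u)] + p[(1, u)] for u in (0, 1)}
    return sum(p[(v, u)] * math.log2(p[(v, u)] / (p1[v] * p2[u]))
               for v in (0, 1) for u in (0, 1))

# Rank all word pairs in a toy corpus by mutual information.
docs = [set(doc.lower().split()) for doc in [
    "the horse pulled the buggy", "a horse and buggy ride",
    "the horse is in the barn", "dinner is served"]]
vocab = sorted(set().union(*docs))
pairs = sorted(((mutual_information(docs, a, b), a, b)
                for a, b in combinations(vocab, 2)), reverse=True)
print(pairs[:5])
```

On a real corpus the same ranking step is applied to every pair in the (stemmed, stop-word-free) vocabulary; the top of that list is what Figure 4 shows.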

The 50 word pairs with the highest mutual information are shown in Figure 4. Many of the top word pairs are widely known fictional character names (e.g., 'holm sherlock' from The Adventures of Sherlock Holmes, 'sawyer tom' from The Adventures of Tom Sawyer, 'edward hyde' from The Strange Case of Dr. Jekyll and Mr. Hyde, 'miss woodhouse' from Emma, 'hare march' from Alice's Adventures in Wonderland). This immediately reveals that the document corpus contains classic novels (indeed, ten novels plus one autobiography). Additionally, the word pairs reveal that there is a diary belonging to a Dr. Seward (Dracula), that New York is a setting (The Autobiography of Benjamin Franklin), and that the documents mention various elderly individuals.

Figure 4: Word pairs with the highest mutual information

This simple analysis demonstrates how word association mining reveals the contents of documents. However, mutual information cannot tell us how pairs of correlated words are interrelated or which documents they are associated with.

Topic Modeling


A topic can be broadly defined as a theme or subject discussed in text data. Topics can be modeled at different levels of granularity (sentences, articles, books). Documents in a collection may discuss multiple topics in different proportions. Topic mining simultaneously discovers the topics in a document corpus and calculates the mixture of those topics in each document. This makes the challenge of characterizing and understanding a large collection of documents less daunting. It's an important tool to have when you're at bat to save the world.

Probabilistic latent semantic analysis (PLSA) is one of the most basic topic modeling algorithms. In PLSA, documents are modeled as mixtures of topics, and every document has its own set of topic mixing weights. The topics are represented by unigram language models, in which each word is generated independently. Each topic is a bag of words containing every word in the vocabulary of the corpus, with each word weighted by its probability. In a topic pertaining to sports, words like 'ball', 'team', and 'player' would have high probabilities, while obviously unrelated words like 'spaceship', 'laser', and 'shark' would have vanishingly small probabilities. The most probable words in the fitted topics are a useful summary of what the documents contain, but they do not capture every last nuance in the documents.

Figure 5 shows how the probability of a word w in a document is calculated in PLSA. The probability of w is the weighted sum of the word's probabilities in all $\theta_j$ topic models, plus its probability in an additional background language model $\theta_B$, whose proportion $\lambda_B$ is fixed. The $\pi_{d,j}$ mixing weights determine the proportion of topic j in document d.

Figure 5
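Rendered as an equation (my reconstruction from the description above, with k topics), the mixture is

$$p(w \mid d) = \lambda_B\, p(w \mid \theta_B) + (1 - \lambda_B) \sum_{j=1}^{k} \pi_{d,j}\, p(w \mid \theta_j).$$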

The goal of PLSA is to find the topic models and mixing weights that maximize the likelihood of observing all the documents in the corpus. The total number of parameters is the number of documents times the number of topics (the mixing weights) plus the number of words times the number of topics (because every word has a different probability in each topic); the parameter count grows linearly with the size of the document corpus. The log-likelihood of the documents is given by Figure 6. The sums inside the logarithms mean there is no closed-form solution to this optimization problem. The iterative expectation-maximization (EM) algorithm is a standard way to solve it numerically.

Figure 6
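In the same notation (again my reconstruction, assuming the background-model mixture from Figure 5), the log-likelihood of the corpus C is

$$\log p(C) = \sum_{d \in C} \sum_{w \in V} c(w, d)\, \log\!\left[\lambda_B\, p(w \mid \theta_B) + (1 - \lambda_B) \sum_{j=1}^{k} \pi_{d,j}\, p(w \mid \theta_j)\right],$$

where c(w, d) is the count of word w in document d and V is the vocabulary.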

As part of the Coursera Text Mining course, I had to implement the PLSA and EM algorithms. I reused this code to demonstrate topic modeling on the 11 books from Project Gutenberg. The limitations of unigram language models mean that trying to fit many topics to a novel will not elucidate its various scenes or distinct segments. I hypothesized that PLSA could at least capture a unique, dominant topic for each book. Therefore, I ran PLSA topic modeling on this corpus with k=11 topics for 200 iterations, with the background model proportion $\lambda_B$ fixed at 0.8. The algorithm ran in about 10 seconds and mostly converged by iteration 50; not bad for fitting 11 topics to 1,000,000+ words on vintage-2012 hardware.
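The course implementation itself isn't reproduced here, but a minimal NumPy sketch of PLSA with a fixed background model, fit by EM, looks roughly like the following. Function and variable names are my own, and the update equations are the standard ones implied by the mixture in Figures 5 and 6.

```python
import numpy as np

def plsa_em(counts, k, lambda_b=0.8, iters=200, seed=0):
    """Fit PLSA with a fixed background model to a (docs x words) count matrix."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    background = counts.sum(axis=0) / counts.sum()         # p(w | theta_B)
    topics = rng.random((k, n_words))
    topics /= topics.sum(axis=1, keepdims=True)            # p(w | theta_j)
    pi = np.full((n_docs, k), 1.0 / k)                     # pi_{d,j}
    for _ in range(iters):
        # E-step: hidden-variable posteriors for every (document, word) cell.
        mix = pi @ topics                                   # sum_j pi_{d,j} p(w | theta_j)
        p_topic = pi[:, :, None] * topics[None, :, :] / (mix[:, None, :] + 1e-12)
        p_not_bg = (1 - lambda_b) * mix / (lambda_b * background + (1 - lambda_b) * mix + 1e-12)
        weighted = counts * p_not_bg                        # counts credited to the topics
        # M-step: re-estimate mixing weights and topic word distributions.
        pi = np.einsum('dw,djw->dj', weighted, p_topic)
        pi /= pi.sum(axis=1, keepdims=True) + 1e-12
        topics = np.einsum('dw,djw->jw', weighted, p_topic)
        topics /= topics.sum(axis=1, keepdims=True) + 1e-12
    return pi, topics
```

With counts as an 11-by-vocabulary matrix for the books, np.argsort(topics, axis=1)[:, ::-1][:, :20] would list the 20 most probable words per topic, analogous to Figure 7, and pi holds the per-book mixing weights shown in Figure 8.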

Figure 7 shows the top 20 words, ordered by descending probability, in each of the 11 topics. There are 24 words shared between multiple topics (e.g., 'father', 'boy', 'sister', 'it'); note, however, that the positions of shared words in the lists vary because they have different probabilities in different topics. Several of the topics contain recognizable character names, indicating that certain topics are strongly associated with certain books (e.g., topic 1 for Dracula, topic 3 for The Adventures of Tom Sawyer, topic 7 for Alice's Adventures in Wonderland, topic 8 for Emma, topic 10 for Frankenstein, topic 11 for The Autobiography of Benjamin Franklin). For other topics, it is not clear from Figure 7 alone (to me, at least) which book(s) they represent.

Figure 7

Topic 6 contains references to both The Adventures of Sherlock Holmes and The Strange Case of Dr. Jekyll and Mr. Hyde, which implies that at least some topics mix content from multiple documents. The degree of mixing is indicated by the mixing weights in the PLSA algorithm. Figure 8 shows the mixing weights, which are normalized so that the weights for each document sum to one. As suspected, topic 6 is a composite of The Adventures of Sherlock Holmes and The Strange Case of Dr. Jekyll and Mr. Hyde. Topics 2 and 5 both represent Great Expectations. Every other topic is devoted to a distinct book. It is noteworthy that PLSA finds separate topics for The Adventures of Tom Sawyer and The Adventures of Huckleberry Finn, which share the same author and have overlapping characters.

Figure 8

Hillary's Emails


In this last section, I explore topic modeling in a scenario with many documents where the number of topics is unknown. A suitable corpus for this is the Hillary Clinton email data set of 7,945 emails (3+ million words) released on Kaggle. From the Kaggle database, I extracted the 'RawText' field, which contains the raw text, including sender/recipient metadata, scraped from the officially released PDF documents.
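For readers who want to reproduce this step, the extraction is a one-liner against the SQLite file in the Kaggle release. The file and table names below are assumptions about how that release is packaged, not a record of my exact commands.

```python
import sqlite3

# Assumed layout: the Kaggle release ships a SQLite file whose Emails table
# has a RawText column; adjust the path and names if your copy differs.
conn = sqlite3.connect("database.sqlite")
raw_texts = [row[0] for row in conn.execute("SELECT RawText FROM Emails")]
conn.close()
print(len(raw_texts), "emails extracted")
```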

Thinking briefly about the contents of the emails helps develop an informed text mining strategy. The email data set probably addresses multiple subjects. Further, given the nature of email, we can anticipate that each message will be narrowly focused, so a typical email does not represent all topics equally. Finally, the subjects will be highly specific; the most probable words in one topic will generally not be highly probable in other topics. These are well-justified beliefs, or priors, about how the text is structured.

The vanilla PLSA implementation I used above does not incorporate priors about the distribution or contents of the topics. PLSA worked well in the above section despite this, perhaps because there was a strong justification for choosing a specific number of topics.

Latent Dirichlet allocation (LDA) is a Bayesian version of PLSA that includes such priors. In LDA, Dirichlet priors $\alpha$ and $\beta$ influence the per-document topic distributions and the per-topic word distributions, respectively. Both $\alpha$ and $\beta$ are positive numbers; values below one favor documents dominated by a few topics and topics dominated by a few words. PLSA is equivalent to LDA with uniform Dirichlet priors (i.e., $\alpha$ and $\beta$ set to one). Computationally, LDA is more sophisticated than PLSA but, beyond incorporating priors, it has the added advantage of being less prone to overfitting because fewer parameters are fitted. LDA generally performs at least as well as PLSA.
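In the standard generative-model notation (my summary), LDA assumes

$$\phi_j \sim \mathrm{Dirichlet}(\beta), \qquad \theta_d \sim \mathrm{Dirichlet}(\alpha), \qquad z_{d,n} \sim \mathrm{Multinomial}(\theta_d), \qquad w_{d,n} \sim \mathrm{Multinomial}(\phi_{z_{d,n}}),$$

so $\theta_d$ plays the role of the PLSA mixing weights $\pi_{d,j}$ and $\phi_j$ plays the role of the topic word distributions, except that both are now drawn from priors rather than fitted as free parameters.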

Given the size and complexity of the email data set, using LDA over PLSA may yield better results. Fortunately, MeTA contains efficient implementations of LDA. I chose to run the MeTA flavor of LDA that uses collapsed variational Bayes with $\alpha$ and $\beta$ fixed to 0.3. I ran the algorithm several times to explore the effect of fixing different numbers of topics (2, 4, 6, and 8). Each run took several minutes and hundreds of iterations to converge. Increasing the number of topics required more time and iterations. The eight-topic model took about an hour and 760 iterations to finish. I then evaluated the effectiveness of the topic modeling by studying the top 50 words in each topic for each run.
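My runs used MeTA's collapsed variational Bayes implementation. For readers without MeTA, a roughly equivalent model can be fit with scikit-learn instead (a different implementation and inference scheme, so results will differ); the parameter names below are scikit-learn's, not MeTA's, and raw_texts is the list from the extraction sketch above.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Bag-of-words counts for each email body.
vectorizer = CountVectorizer(stop_words="english", max_df=0.95, min_df=5)
counts = vectorizer.fit_transform(raw_texts)

lda = LatentDirichletAllocation(n_components=8, doc_topic_prior=0.3,
                                topic_word_prior=0.3, max_iter=100, random_state=0)
doc_topics = lda.fit_transform(counts)          # per-email mixing weights

# Top 40 words per topic, analogous to Figure 9.
vocab = vectorizer.get_feature_names_out()
for j, weights in enumerate(lda.components_):
    top = [vocab[i] for i in weights.argsort()[::-1][:40]]
    print(f"topic {j + 1}: {' '.join(top)}")
```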

The results are encouraging and seem to effectively capture distinct themes. In the two-topic model, one emergent theme is email metadata; among the top words are 'subject', 'origin', 'date', 're', 'cc', days of the week, months, and email addresses (e.g., 'hdrclintonemailcom'). The second topic is less specific and seems to roll many topics regarding international affairs into one.

The four-topic model captures more granular themes: 1) email metadata, as before; 2) the Near and Middle East ('benghazi', 'libya', 'israel'); 3) foreign aid ('haiti', 'pakistan', 'usaid', 'women', 'food', 'support'); 4) politics and world leaders ('obama', 'democrat', 'labour', 'voter', 'koch').

In the six-topic model, the topic about the Near and Middle East is split so that words concerning Libya, Benghazi, and Egypt have their own topic.

Figure 9 shows the top 40 words for each topic in the eight-topic model. In my interpretation, topics 1-5 and 7 cover the Middle East, US politics, Mexico/South America/Ireland, Libya/Benghazi/Egypt, email metadata, and foreign aid, respectively. The meanings of topics 6 and 8 are less clear to me. Topic 6 ranks words like 'arriv', 'meet', 'depart', and 'hotel' highly, so it may refer to the logistics of travel or arranging meetings. Topic 8 is a mixture of words concerning international politics ('obama', 'tori', 'cameron') and military matters ('war', 'mcchrystal', 'militari'). Fitting eight topics gives a reasonable representation of the email contents, although fitting more may provide further refinement.

Figure 9

Viewing the mixing weights reveals how the fitted topics are distributed in the emails. Figure 10 shows the distributions of mixing weights, arranged by topic. Each subplot represents all 7945 emails. All topics have mixing weights spanning the range (0,1). For all topics, there are at least ~40 messages where that topic's mixing weight is above 0.8.

Figure 10

The meaning of the topic models can be further scrutinized by manually inspecting emails with particular topic weights. Inspection shows, for instance, that the topic 5 weights indicate how much message text there is relative to email metadata. Messages with high $\pi_{d,6}$ weights all contain itemized daily schedules, validating my suspicion that topic 6 concerns meeting logistics. The messages most closely related to topic 8, which I was also previously unsure about, address UK politics or UK-EU-US relations in various contexts (e.g., the Middle East, climate change, China).
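This kind of inspection is easy to automate. As a small illustration (reusing the doc_topics matrix and raw_texts list from the scikit-learn sketch above, not the MeTA output; the helper name is my own), the emails most strongly associated with a given topic can be pulled up directly:

```python
import numpy as np

def top_emails(doc_topics, raw_texts, topic, n=5):
    """Print the n emails with the largest mixing weight for a (0-based) topic."""
    order = np.argsort(doc_topics[:, topic])[::-1]
    for i in order[:n]:
        print(f"weight {doc_topics[i, topic]:.2f}\n{raw_texts[i][:300]}\n")

# e.g. top_emails(doc_topics, raw_texts, topic=7) for "topic 8" in Figure 9
```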

Conclusion


Representing text data as a bag of words enables powerful text mining algorithms that can efficiently reveal the nature of unknown documents. Word association mining uses mutual information to pick out words that appear together in phrases. Topic mining goes a step further, associating words with distinct subjects and themes. The effectiveness of PLSA and LDA topic modeling with unigram language models has been demonstrated here. LDA is a Bayesian version of PLSA that incorporates priors while also being less prone to overfitting. These advantages make LDA a pragmatic choice for scenarios with many unknown documents, such as when you are burdened with the task of saving the world.