Saturday, September 24, 2016

Graph Analysis of the Leaked Democratic National Committee Emails

Graphs are constructs used to model the pairwise relationships between objects. Graph theory, the organized study of graphs, is a rich topic with applications to many problems in the physical, biological, social, and information sciences. In this first post on graph theory, I use a graph to visualize the sender-recipient relationships within the emails stolen from the Democratic National Committee (DNC) and published in searchable form on WikiLeaks. The details of how I retrieved the emails, parsed them, and created the graph are summarized. I also quantitatively characterize the graph and discuss community detection within the network. Finally, I end with a demonstration of the R Shiny application I have deployed for playing with this dataset. The Shiny application source code plus the final graph are available on Github.

Email Extraction, Transformation, and Loading into Graph Format


The emails are stored in EML format at addresses of the form $\texttt{https://wikileaks.org/dnc-emails/get/i}$ where $\texttt{i}$ is an integer between 1 and 22,456. I downloaded the emails one at a time from WikiLeaks with $\texttt{curl}$. The downloader script I prepared is available on Github. It took about 6 hours to retrieve all 22,456 emails (1.7 GB).
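The exact script is on Github; as a rough sketch (not the script itself), the download loop amounts to something like the following, where the output file naming is my own choice:

```python
import subprocess
import time

# Fetch each message by its integer id and save it locally.
# The file naming scheme here is illustrative only.
for i in range(1, 22457):
    url = "https://wikileaks.org/dnc-emails/get/%d" % i
    subprocess.call(["curl", "--silent", "--output", "dnc_%05d.eml" % i, url])
    time.sleep(0.5)  # pause briefly between requests to be polite to the server
```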

The emails were parsed using the $\texttt{email.parser}$ package built into Python. Given a file object for an email, the $\texttt{HeaderParser}$ class returns the header data in key-value pairs. I extracted the sender, recipient(s), and subject line from the header of each email and then did some text preprocessing (e.g., converting letters to lower case, stripping white space) before writing this information into ASCII files (dnc_subject-out.tab, dnc_from-out.tab, dnc_to-out.tab) available on Github.
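For reference, a stripped-down version of that header extraction might look like this (the file name is hypothetical):

```python
from email.parser import HeaderParser

# Parse one downloaded message and pull out the header fields of interest.
with open("dnc_00001.eml") as f:
    headers = HeaderParser().parse(f)

sender = (headers.get("From") or "").strip().lower()
recipients = (headers.get("To") or "").strip().lower()
subject = (headers.get("Subject") or "").strip().lower()
```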

While the subject lines will not be used in the graph analysis, it is interesting to briefly consider them because they provide clues about the email contents. A wide variety of topics, including some surprising ones, are discussed. Figure 1 lists subject lines that caught my attention.

Figure 1: A sample of subject lines in the DNC emails.

The DNC emails are described on WikiLeaks as having come from seven email accounts. The numbers of unique email senders (1192) and recipients (1169) in this email dataset, however, are much larger than seven! The total number of unique identifiers among all senders and receivers is 1777. This includes two identifiers labeled as 'None' and 'undisclosed-recipients' representing one or more unknown email addresses. Some of the identifiers are phone numbers or generic labels like 'wireless caller'.

The details of email communication between all unique email addresses can be represented in an $N \times N$ matrix $M$ such that the value of $M_{ij}$ is the number of emails $i$ sent to $j$. If $j$ wrote back, then $M_{ji}>0$. Cells along the diagonal ($M_{ii}$) are nonzero when people email themselves. $M$ translates naturally into a directed graph. Every cell $M_{ij}$ represents a potential graph connection (or edge) from vertex $i$ to vertex $j$, which exists only if $M_{ij}>0$. If $M_{ji}>0$ as well, then a second edge runs in the opposite direction, from vertex $j$ to vertex $i$.

For simplicity, I chose to represent the DNC email network as an undirected graph. This means an edge between vertices $i$ and $j$ exists if $M_{ij}>0$ or $M_{ji}>0$. I discarded the emails where no sender (N=144) or receiver (N=223) information was available (i.e., emails having 'None' or 'undisclosed-recipients' as the sender or recipient), which removed 26 vertices from the graph.

There are various graph analysis packages in Python (e.g., graph-tool, igraph, NetworkX). I chose igraph because it is available in both R and Python. The python-igraph interface is intuitive, and it is quite fast for my purposes.

Taking advantage of igraph's support for vertex and edge attributes, I assigned each vertex a 'name' attribute to store an email address string. I also gave the edges an attribute ('dir') to represent one-way versus two-way communication and another attribute ('mag') to represent the number of emails exchanged. There are 1751 vertices and 3976 edges in the final graph, which is available on Github in GraphML format. One can immediately explore the DNC email network by importing this graph into the graph analysis tool of their choice. My Python scripts on Github (mkGraph.py, analysis.py) demonstrate how to interact with the graph.
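As a sketch of how such a graph can be assembled with python-igraph (this is not a copy of mkGraph.py; the $\texttt{pair\_counts}$ dictionary of per-pair message counts is assumed to come from the parsed headers):

```python
import igraph

# pair_counts is assumed to map (sender, recipient) -> number of emails sent.
addresses = sorted({address for pair in pair_counts for address in pair})
index = {address: k for k, address in enumerate(addresses)}

g = igraph.Graph()                       # undirected
g.add_vertices(len(addresses))
g.vs["name"] = addresses

edges, dirs, mags = [], [], []
seen = set()
for (i, j), n_ij in pair_counts.items():
    pair = tuple(sorted((i, j)))
    if pair in seen:                     # visit each unordered pair once
        continue
    seen.add(pair)
    n_ji = pair_counts.get((j, i), 0) if i != j else 0
    edges.append((index[i], index[j]))
    dirs.append(2 if n_ij > 0 and n_ji > 0 else 1)   # two-way vs. one-way
    mags.append(n_ij + n_ji)                         # total emails exchanged
g.add_edges(edges)
g.es["dir"] = dirs
g.es["mag"] = mags
g.write_graphml("dnc.graphml")
```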

Graph Visualization and Community Analysis


The first step in appreciating the complexity of the DNC email network graph is visualization. Fortunately, igraph provides a powerful but straightforward plotting interface. After some experimentation, I found that force-directed algorithms produced the most aesthetically pleasing results. Figure 2 shows the graph as drawn with the Fruchterman-Reingold force-directed layout. Each circle is a vertex representing a unique email address. Gray edges mean one-way email communication while orange edges indicate two-way communication. The center of the graph is crowded with many crossing edges. At intermediate radii are fan-like structures, or "communities", in which many nodes that share a common email connection are sparsely connected to outside nodes. At the outskirts are nodes that have only one email connection or that just email themselves, creating a loop.

Figure 2: The DNC email network plotted with a force-directed layout algorithm. Gray edges mean one-way email communication and orange edges mean two-way communication. Vertices that are slightly enlarged have degree $\geq 125$ and are discussed in the text below.
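A plot in this spirit can be produced with python-igraph (plotting requires the pycairo backend); the styling below is illustrative rather than the exact settings behind Figure 2:

```python
import igraph

g = igraph.Graph.Read_GraphML("dnc.graphml")
layout = g.layout_fruchterman_reingold()

# Color edges by the 'dir' attribute: gray = one-way, orange = two-way.
edge_colors = ["orange" if d == 2 else "gray" for d in g.es["dir"]]

igraph.plot(g, "dnc_network.png", layout=layout,
            vertex_size=4, vertex_color="black", vertex_label=None,
            edge_color=edge_colors)
```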

Next, let's consider the basic properties of the vertices and edges. Vertex degree refers to the number of edges adjacent to a vertex (note, a loop adds 2 to vertex degree). The histogram of vertex degree has a steep drop at low degree and a long tail at high degree. The majority of vertices (1127/1751, 64.4%) have degree$=1$. Significantly fewer (17.1%) have degree$=2$. About 9.7% have degree$\geq7$, and the top 5.0% have degree$\geq18$. The top three vertex degrees are 476 (comers@dnc.org), 471 (mirandal@dnc.org), and 274 (kaplanj@dnc.org).
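These degree statistics take only a few lines with python-igraph; a minimal sketch:

```python
import igraph
from collections import Counter

g = igraph.Graph.Read_GraphML("dnc.graphml")

degrees = g.degree()                 # a self-loop adds 2, as noted above
degree_histogram = Counter(degrees)

# The three most highly connected email addresses.
top_three = sorted(zip(g.vs["name"], degrees), key=lambda pair: pair[1], reverse=True)[:3]
print(top_three)
```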

Figure 2 gives the impression that most email communication is unidirectional. Indeed, 66.4% (2641/3976) of edges in Figure 2 are gray. Figure 3 gives insight into the number of emails typically exchanged (both directions combined) between individuals. The majority of email exchanges are relatively brief. About 35.9% of graph edges signify the exchange of a single message. Approximately 66.9% of graph edges represent an exchange of $\leq4$ messages. Roughly 9.9% of edges correspond to exchanges of $\geq23$ messages. The top three email exchanges involve comers@dnc.org mailing himself (512 messages), mirandal@dnc.org and paustenbachm@dnc.org mailing each other (665 messages), and dncpress@dnc.org mailing itself (1492 messages).

Figure 3: The cumulative fraction of graph edges as a function of emails sent per edge in the DNC email network graph. The bins are five emails per edge wide.

Figure 4 is a similar plot showing how the cumulative fraction of edges confined to vertices of at most a given degree increases with that maximum allowed vertex degree. The curve in Figure 4 rises smoothly at first but then begins to show step features corresponding to the community structures in Figure 2. The community structures arise because vertices of intermediate and high degree connect to many vertices that are in turn sparsely connected. A significant fraction (34.1%) of the graph's edges are connected to one or more of the five vertices with degree $\geq 125$. The legend of Figure 4 lists the persons associated with these most influential vertices, and the same five vertices are also distinguished in Figure 2 as being slightly enlarged and having vertex colors matching the lines in Figure 4.

Figure 4: The cumulative fraction of edges whose endpoints all have degree at or below a given value, as a function of that vertex degree. The vertical lines mark the five highest-degree vertices to which 34.1% of the graph's edges connect.

While Figure 4 associates the most highly connected nodes with separate communities, a more sophisticated approach is required to find all communities and determine their member nodes. The study of community structure in graphs is an active field, and there are many algorithms implemented in igraph for this purpose. From these, I chose the Louvain Modularity (LM) method. Modularity is a scalar measure of graph structure; positive modularity means vertices are more densely interconnected than expected by chance. LM partitions communities in a bottom-up, hierarchical way that maximizes the modularity of the graph. LM scales to networks with over 100 million nodes, and it is fast and effective compared to competing algorithms.
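In python-igraph, the Louvain method is exposed as $\texttt{community\_multilevel}$; a sketch of the call (not a copy of analysis.py):

```python
import igraph

g = igraph.Graph.Read_GraphML("dnc.graphml")

# community_multilevel() is python-igraph's implementation of the Louvain method.
clusters = g.community_multilevel()
print(len(clusters), "communities, modularity =", round(clusters.modularity, 2))

# clusters.membership holds the community index of every vertex, which makes
# community sizes and per-community leaders easy to tabulate.
community_sizes = sorted((len(c) for c in clusters), reverse=True)
```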

The LM algorithm finds 47 communities in the DNC email network and reports a modularity of 0.6 (on a scale of [-1, 1]). Most (36/47) of the communities have just one or two members, and these small communities are the isolated vertices in the graph outskirts. The other 11 communities have between 5 and 429 members. Figure 5 redraws the graph with color coding for the vertices in the 11 largest communities. As intuitively expected, the largest fan-like structures are separate communities. There are also multiple communities coexisting in the center of the graph where it was hard to see structure in Figure 2.

Figure 5: An alternate rendering of Figure 2 with color coding for the 11 largest communities as discovered by the Louvain Modularity algorithm. The black vertices at the outskirts represent small communities with only one or two members. The enlarged nodes have the highest vertex degree within their communities.

Table 1 lists the 11 largest communities in descending order of the number of community members. The second column names the person with the most email connections (i.e., highest vertex degree) in each community; these people may perhaps be considered as "community leaders", and they appear as enlarged nodes in Figure 5. The third column of Table 1 indicates the color of the community in Figure 5. The five vertices with degree $\geq125$ highlighted in Figure 4 turn out to be community leaders in Table 1. The number of nodes per community roughly, but not perfectly, increases with the vertex degree of the community leader.


This graph analysis has demonstrated how to find substructure in a graph and further how to identify the most prolific individuals within network communities.

Interactive Shiny Web Application


Shiny is a web application framework for the R programming language. It is powerful while still being simple to use, and its attractive widgets make for nice-looking applications. I have designed a Shiny application that provides interactive visualization of the DNC email network. This application incorporates the R version of igraph and uses the qgraph package for plotting. My approach was to prototype all the intended functionality with python-igraph first, because it is more intuitive to use, and then port it to the R version. It is worth noting that the igraph R interface differs slightly from python-igraph. I also highly recommend qgraph over the R base graphics for plotting graphs.

The layout of the application consists of a sidebar panel with widgets and a main panel area where the graphs are drawn. The sidebar (Figure 6) consists of two drop-down menu ($\texttt{selectInput}$) widgets and one checkbox group ($\texttt{checkboxGroupInput}$) widget.

Figure 6: The sidebar panel in the Shiny application contains drop-down menus and checkboxes that control three separate plots.

Graph A shows the subgraph where all vertices have degree $\leq$ the value in the first drop-down menu (i.e., it's like an interactive representation of Figure 4). Gradually building up the graph by raising the value in the drop-down menu is useful for viewing small-scale structure that is obscured in the full graph. Similarly, Graph B shows the subgraph where all vertices have degree $\geq$ the value in the second drop-down menu; this serves to isolate the relationships between the most highly connected nodes. Graph C allows the user to select via checkboxes a subset of the 50 vertices with the highest degree. All email connections of the selected vertices are shown, allowing one to see the email network for specific individuals. Each subgraph plot is annotated with the number of vertices, edges, and the modularity. Figures 7a and 7b show examples of the graphs produced in the application.

Figure 7a: Sample output from the R Shiny web application. Graph A shows vertices with degree $\leq 28$. Similarly, Graph B shows vertices with degree $\geq 274$.
Figure 7b: Graph C shows the email network of comers@dnc.org, allenz@dnc.org, houghtonk@dnc.org, and brinsterj@dnc.org.
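Although the deployed application is written in R, the same three filters can be expressed in python-igraph along these lines (a sketch, not the application code):

```python
import igraph

g = igraph.Graph.Read_GraphML("dnc.graphml")

# Graph A: keep only vertices whose degree is at most some cutoff.
low_degree = [v.index for v in g.vs.select(_degree_le=28)]
graph_a = g.induced_subgraph(low_degree)

# Graph B: keep only the most highly connected vertices.
high_degree = [v.index for v in g.vs.select(_degree_ge=274)]
graph_b = g.induced_subgraph(high_degree)

# Graph C: one address plus every vertex it exchanges email with.
v = g.vs.find(name="comers@dnc.org")
graph_c = g.induced_subgraph([v.index] + g.neighbors(v))
```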

This application has been deployed on shinyapps.io. The source code is available on Github.

Summary


In this post, I have provided a simple analysis of the email network inferred from the leaked DNC emails. Transforming the sender-recipient information into a network graph is a straightforward process once the emails are in hand. Analyzing the substructure of the resulting graph reveals 11 communities within the DNC email network containing 5-429 individuals. Scrutinizing the community membership also reveals the most prolific participants, or leaders, within each community. Finally, I have demonstrated the utility of the R Shiny framework for data science web applications with an example that is sure to provide some amusement this election season.

Sunday, July 31, 2016

Naive Bayes Text Classification of Classic Literature and Job Advertisements

Categorization of documents into predefined categories is an important aspect of text mining and analytics. Text classification facilitates knowledge discovery about internal (e.g., content) and external (e.g., author) document attributes. A multipurpose classification system could distinguish the product families addressed in a set of online reviews while further characterizing the opinion and sentiment of each author. Humans can manually classify text according to sets of rules, but this process will become tedious and potentially inconsistent for all but the simplest documents.

Automating text classification is an exercise in machine learning. Given a training set of text documents with human-annotated categories, text classification algorithms learn by adjusting weights on discriminative text features to minimize the training error. In general, this process is not fully automatic because human participation is needed to prepare and label the training data and to define the features used by the algorithms.

A variety of machine learning algorithms can be used for text classification (e.g., naive Bayes, logistic regression, support vector machine, k-nearest neighbors). The methods tend to differ in how the training errors are measured, and in how they combine features together (e.g., linearly versus non-linearly). Classification performance depends more on having good text features than on the details of the algorithm, so there is no clearly dominant classification algorithm.

In this post, I will discuss the naive Bayes algorithm and demonstrate its performance on classic literature and job advertisements. I find naive Bayes classifies passages from books with very high accuracy (>98% on average). Classification of job advertisements works well (91% average accuracy for tech jobs) and has potential utility in human resources (HR) departments and on job aggregator websites.

The Naive Bayes Algorithm


Generative models learn how the data "looks" in each category. Naive Bayes in particular learns the conditional probabilities of words in each text category, and then picks the most likely category based on a document's words.

Consider a document $d$ having category $\theta_i$. Bayes' theorem gives the identity in Equation 1. It says the probability of category $\theta_i$ given $d$, $p(\theta_i|d)$ (i.e., the posterior), is proportional to the probability of obtaining category $\theta_i$, $p(\theta_i)$ (i.e., the prior), times the likelihood $p(d|\theta_i)$ of the document being generated given category $\theta_i$.

Equation 1: $p(\theta_i|d) = \frac{p(\theta_i)\,p(d|\theta_i)}{p(d)} \propto p(\theta_i)\,p(d|\theta_i)$

Finding the correct category of a document is a matter of maximizing the posterior probability. Naive Bayes maximizes the product $p(\theta_i)\,p(d|\theta_i)$. In contrast, discriminative classifiers like logistic regression, support vector machines, and k-nearest neighbors calculate the posterior directly.

Equation 2 further breaks down the naive Bayes algorithm. Making the naive assumption that words in documents are independently generated (which works well in many applications), the likelihood $p(d|\theta_i)$ can be rewritten in terms of a product over all words in the global document corpus vocabulary $\left( \prod_{w \in V} \, p(w|\theta_i)^{c(w,d)} \right)$. This introduces another conditional probability, $p(w|\theta_i)$, representing the probability that a word $w$ is drawn from category $\theta_i$. Finally, the logarithm of the posterior probability is used in practice to preserve numerical precision because the calculation involves the multiplication of many small probabilities.

Equation 2: $\log\left[p(\theta_i)\,p(d|\theta_i)\right] = \log p(\theta_i) + \sum_{w \in V} c(w,d)\,\log p(w|\theta_i)$

A naive Bayes classifier requires a training set in which documents are labeled with their corresponding categories (Equation 3).

Equation 3

From the training set, the algorithm calculates $p(\theta_i)$ and $p(w|\theta_i)$ for each category. The number of documents need not be the same in each category as $p(\theta_i)$ accounts for unequal category probabilities; $p(\theta_i)$ is equal to the number of documents in category i divided by the total number of documents in the training set (Equation 4).

Equation 4: $p(\theta_i) = \frac{N_i}{N}$, where $N_i$ is the number of training documents in category $i$ and $N$ is the total number of training documents

The text categories in naive Bayes are represented as bags of words in which the words are weighted by probability of appearance. These conditional word probabilities uniquely define each text category. Given the word probabilities for a category as calculated by naive Bayes, one could in principle generate documents in that category; this is why naive Bayes is considered a generative model. Equation 5 defines the conditional probability $p(w|\theta_i)$. The numerator is the count of word $w$ in the training documents of category $i$. The denominator is the total number of words in those same documents.

Equation 5: $p(w|\theta_i) = \frac{c(w, T_i)}{\sum_{w' \in V} c(w', T_i)}$, where $T_i$ denotes the training documents of category $i$ and $c(w, T_i)$ is the count of word $w$ within them
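Putting Equations 2, 4, and 5 together, a bare-bones naive Bayes classifier fits in a few lines of Python. The sketch below adds add-one smoothing (my own choice, not discussed above) so that a word never seen in a category does not produce $\log 0$, and it simply ignores test words outside the training vocabulary:

```python
import math
from collections import Counter

def train_naive_bayes(docs, labels):
    """Estimate log p(theta_i) and log p(w|theta_i) from tokenized training documents."""
    categories = set(labels)
    docs_per_cat = Counter(labels)
    word_counts = {c: Counter() for c in categories}
    for tokens, c in zip(docs, labels):
        word_counts[c].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}

    log_prior = {c: math.log(docs_per_cat[c] / len(docs)) for c in categories}      # Equation 4
    log_cond = {}
    for c in categories:
        total = sum(word_counts[c].values()) + len(vocab)                           # add-one smoothing
        log_cond[c] = {w: math.log((word_counts[c][w] + 1) / total) for w in vocab} # Equation 5
    return log_prior, log_cond

def classify(tokens, log_prior, log_cond):
    """Pick the category maximizing log p(theta_i) + sum_w c(w,d) log p(w|theta_i) (Equation 2)."""
    def score(c):
        return log_prior[c] + sum(log_cond[c].get(w, 0.0) for w in tokens)
    return max(log_prior, key=score)
```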

The categories in text classification are analogous to the topic models in topic modeling because topic models are likewise represented with probability-weighted bags of words. Compared with topic modeling, however, text categorization with naive Bayes is a simpler problem. In topic modeling, the topic-specific conditional word probabilities and mixing weights for each document-topic combination are fitted parameters. In naive Bayes text categorization, only the conditional word probabilities for the k categories are tracked, and the categories are predefined in the training set.

Classifying Book Excerpts


In this section, I test how well the naive Bayes algorithm can match short excerpts of text with the books they come from. This test is simple in part because no manual classification is needed to construct the training set, yet it is versatile because the length of the excerpts is readily changeable. Using excerpts of longer word count reduces the training sample size, allowing study of how the tradeoff between excerpt length and training set size affects performance.

My test uses 10 classic novels plus one autobiography (1,000,000+ words total) from Project Gutenberg. I have used this dataset in my previous text mining posts. I run separate trials in which the books are each partitioned into excerpts of length 250, 400, or 550 words. These particular lengths were chosen to match typical job advertisements (the 25th, 50th, and 75th percentiles of the job advertisements I use below are 276, 404, and 563 words, respectively). Table 1 shows the number of excerpts provided by each book for the different excerpt lengths. After the excerpts are created, they are converted to lower case, filtered for stop words, and word endings are trimmed with the Porter2 stemming algorithm.
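MeTA handles this preprocessing through its configuration file; an equivalent standalone sketch with NLTK (my own substitution, not the MeTA pipeline) would look roughly like this:

```python
import re
from nltk.corpus import stopwords        # requires the NLTK 'stopwords' corpus to be downloaded
from nltk.stem import SnowballStemmer    # the English Snowball stemmer implements Porter2

STOP_WORDS = set(stopwords.words("english"))
STEMMER = SnowballStemmer("english")

def preprocess(text):
    """Lower-case the text, drop stop words, and stem the remaining tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [STEMMER.stem(t) for t in tokens if t not in STOP_WORDS]
```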


Text classification is performed on all 11 books simultaneously with the implementation of naive Bayes in the open source MeTA (ModErn Text Analysis) toolkit. For each excerpt length (250, 400, and 550 words), a classifier is trained and tested with 5-fold cross-validation: the input is divided equally into five folds, a classifier is trained on four folds and tested on the remaining one, and the process iterates so that each fold is tested against a classifier trained on the other four. The average performance across all five iterations is a robust measure of classifier performance because every excerpt is used in the evaluation phase.
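MeTA runs the cross-validation itself; for readers who prefer scikit-learn, an equivalent 5-fold evaluation could be set up roughly as follows ($\texttt{excerpts}$ and $\texttt{labels}$ are hypothetical lists of preprocessed excerpt strings and their book names):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB

X = CountVectorizer().fit_transform(excerpts)     # bag-of-words count matrix
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(MultinomialNB(), X, labels, cv=folds)
print("mean accuracy:", scores.mean())
```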

The results are surprisingly good for an excerpt length of 250 words, with an overall classification accuracy of 98.6% based on 4490 predictions. Accuracy, however, is not the only useful metric of machine learning performance. Table 2 shows the confusion matrix in which the rows represent the actual class of the excerpts while the columns represent the predicted class. Each row shows the fraction of excerpts from a given book that the classifier assigned to each possible category. Cells along the diagonal indicate the rate of true positives for each book, a metric referred to as recall in the context of machine learning. The recall for each book varies between 93.3% and 100%. The three books with the lowest recall are Dracula (97.7%), Strange Case of Dr. Jekyll and Mr. Hyde (94.2%), and The Adventures of Tom Sawyer (93.3%).


Excerpts from Dracula are confused with several books including Alice in Wonderland (0.15%, 1/657), Autobiography of Benjamin Franklin (0.46%, 3/657), Emma (0.15%, 1/657), Great Expectations (0.30%, 2/657), Grimms' Fairy Tales (0.15%, 1/657), Strange Case of Dr. Jekyll and Mr. Hyde (0.15%, 1/657), and The Adventures of Sherlock Holmes (0.91%, 6/657). In contrast, excerpts from Strange Case of Dr. Jekyll and Mr. Hyde are confused only with The Adventures of Sherlock Holmes at a rate of 5.8% (6/104).

Excerpts from The Adventures of Tom Sawyer are mistaken for Frankenstein (1.3%, 4/298), Great Expectations (0.3%, 1/298), and Adventures of Huckleberry Finn (5.0%, 15/298). It is perhaps not surprising the algorithm has difficulty separating the two Mark Twain novels given they share overlapping characters, vocabularies, and writing styles. However, the confusion is one-sided in that only 0.8% (4/471) of excerpts from Adventures of Huckleberry Finn are misidentified as belonging to The Adventures of Tom Sawyer. This is probably because the Adventures of Huckleberry Finn category, being based on a longer book (471 versus 298 equal-size excerpts), has a higher prior probability than the category for The Adventures of Tom Sawyer.

Precision, the fraction of correct output category declarations, is another useful metric to describe classification performance. Whereas recall is normalized in terms of the true number belonging to a category, precision is normalized in terms of the number of times a category was selected. High precision means few false positives, and vice versa. Table 3 shows the precision (and recall for comparison) for excerpt length 250 words. Precision and recall are similar in most cases. The Adventures of Tom Sawyer has the largest disparity in precision and recall (98.6% versus 93.3%). The recall is lower because parts of The Adventures of Tom Sawyer are confused with other books, particularly Adventures of Huckleberry Finn. In turn, Adventures of Huckleberry Finn has relatively low (96.9%) precision because of the false positives from The Adventures of Tom Sawyer.


Increasing the excerpt word count to 400 raises the overall classification accuracy (to 99.3% for 2812 predictions) and improves the precision and recall for each book. This performance improvement occurs despite there being fewer excerpts in each category (Table 1). Note, the number of words presented to the classifier for each book is not changing. The books are still being presented in full to the classifier, just in fewer, larger chunks. Likewise, the classifier is still being trained on 80% of the book and is being tested on the other 20% (via 5-fold cross-validation). The extra words per excerpt apparently give the algorithm a stronger hint of the correct category.

Performance is better still when the excerpt word count is 550. The overall accuracy is 99.5% on 2045 predictions. The precision and recall in each category (Table 4) are better than 95%, and three books have both precision and recall equal to 100%. Some confusion between The Adventures of Tom Sawyer and Adventures of Huckleberry Finn still exists, however.


Text classification works very well on book excerpts. The 11 books were written by 10 different authors. Each author has a unique writing style and vocabulary, and naive Bayes seems able to distinguish these differences with the $p(w|\theta_i)$ conditional probabilities it calculates.

Classifying Job Advertisements


Here, I apply naive Bayes text classification to the more challenging, and practical, scenario of job advertisement classification. A job advertisement (or posting) should sell the company and position to a qualified job seeker. Ideally for hiring managers, the advertisements are sexier than the drier, more technical job descriptions maintained internally by HR, although many job postings are uninspired. Of concern is that some jobs seem to have arbitrary or silly job titles, and certain job titles are used inconsistently by companies (e.g., software engineer versus web developer). Given that these problems are bound to be reflected in job advertisements, I want to know whether naive Bayes can learn the difference between various job roles from typical job postings available on the web. If so, there are sure to be applications within HR departments and on job aggregator websites.

I assemble job advertisements relevant to four job categories: DevOps (a relatively new job role that intersects traditional software development, quality assurance, and information technology operations roles), software engineering, web development, and data science. In principle, these roles imply different expertise. In practice, however, jobs in these categories can share some overlapping technical skills (e.g., programming languages, databases, user interface development). For example, web development can involve building the user interfaces of websites, software engineering can involve the user interfaces of desktop or mobile applications, and data science can involve creating dashboards. Additionally, some software engineering applications incorporate tools also used by data scientists, like machine learning algorithms, databases, and Hadoop. Commonalities like these blur the boundaries between different job roles and potentially make job classification less certain. I deliberately built my case study to be challenging in this way.

Job postings in the four categories were retrieved from a job aggregator service. Specifically, I scraped the results of separate searches for "software engineer", "web developer", "DevOps", and "data scientist" in various major cities. From the accumulated results, I inspected the job titles for relevance and manually removed nonrelevant job postings. Duplicates in the remaining set of viable jobs were identified and removed. This is an important step to prevent copies of the same postings from simultaneously appearing in the training and test datasets. Collapsing the job postings into vector counts of words allowed duplicates to be efficiently identified with vector operations.
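One way to carry out that dedupe step, sketched here with scikit-learn and NumPy ($\texttt{postings}$ is a hypothetical list of raw posting strings, and this is not necessarily the exact procedure I used):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Represent each posting as a vector of word counts; identical vectors are duplicates.
counts = CountVectorizer().fit_transform(postings).toarray()
_, first_rows = np.unique(counts, axis=0, return_index=True)
postings = [postings[i] for i in sorted(first_rows)]
```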

Before running the classifier, I enforced specific criteria on the postings to ensure the separate categories were consistently defined. The words "DevOps" and "web" were required to appear in the job titles of postings in the "DevOps" and "Web Developer" classes, respectively. Likewise, job titles in the "Data Science" category all include the word "data", and almost all have the word "scientist" too (the remaining few say "engineer" or "analyst" in lieu of "scientist"). Software engineering is a more diverse field, and the related titles are not as uniform as in the previous three categories. In addition to titles specifying "software engineer", I included in the "Software Engineer" group titles like "software developer", "Java developer", "C++ coding", "general programmer", "game programmer", and "mobile development evangelist", among others. Postings with titles that obviously prioritized web development (e.g., full stack web developer) were not allowed. All jobs specializing in software quality assurance were excluded to further restrict the scope of the Software Engineer category. Application of these criteria yielded 111 to 764 jobs per category. Table 5 shows the final number of postings in each category.


As a sanity check on the consistency of my label annotations, I ran the text classifier on the job titles. Classification was performed as in the previous section using five-fold cross-validation. The confusion matrix in Table 6 indicates the classifications are 100% correct, implying my job-labeling scheme is self-consistent. Classification of job descriptions is not expected to work so perfectly, but this test shows that inconsistent labeling on my part is not a source of error.


Classification of the job description portion of the advertisements yields an overall accuracy of 91%, based on 1680 predictions. This is encouraging considering I chose a hard scenario (tech jobs). Table 7 shows the confusion matrix. The recall for the DevOps, Software Engineer, and Web Developer categories ranges from 84.4% to 86%. In contrast, recall in the Data Science category is 98.4% because this category is more easily distinguishable from the others. About 12.8% (14/111) of DevOps jobs were confused for Software Engineer jobs, but only 1.8% (2/111) were confused for Web Developer jobs. Software Engineer jobs were called DevOps, Web Developer, and Data Science jobs at rates of 2.7% (16/590), 10.7% (63/590), and 2.2% (13/590), respectively. Few Web Developer jobs were considered DevOps (0.45%, 1/222) or Data Science (0.9%, 2/222), yet 12.7% (28/222) were designated Software Engineer. Among Data Science jobs, 0.92% (7/764) were classified as Software Engineer, and 0.66% (5/764) were deemed Web Developer.


Table 8 shows the precision (with recall for comparison) in each category. Precision is comparable to recall in the DevOps (~85%) and Data Science (~98%) classes. There is a disparity, however, for Software Engineer and Web Developer. Software Engineer has precision greater than recall (0.91 versus 0.84). A classification of Software Engineer has a 91% chance of being correct, yet, due in large part to confusion with the Web Developer class, the classifier correctly identifies only 84.4% of all Software Engineer jobs. Web Developer has rather low precision (0.73) because the number (63) of Software Engineer jobs called Web Developer is a sizeable fraction of the number of correct Web Developer predictions (191).


In this experiment, the confusion between DevOps, Software Engineer, and Web Developer jobs, related roles with overlapping skills, is of order 10%. One component of this error could be mismatches between the job descriptions and their provided job titles, the latter of which I used as the basis for my own labeling. Reinspection of the misclassified job descriptions may indicate that, in some cases, the predicted classification is actually more appropriate for the job description than the label I assigned based on the provided job title. In that case, the error rate of this job classification experiment is lower than stated. I do not have the expertise to assess this possibility, however.

Conclusion


In this post, I have applied naive Bayes text classification to classic literature and job advertisements. Book excerpts represent an ideal scenario because there is no uncertainty on the correct classification of text. In this experiment, the overall accuracy exceeds 98% on excerpts of length 250 words. Increasing the size of the excerpts raises the accuracy despite shrinking the size of the training set.

Classification of job advertisements is a harder problem because of similarities in different job functions and also because job titles are not always applied correctly. Classification of four related technical roles yielded an overall accuracy of 91%. There was modest confusion, of order 10%, between job roles. If this experiment were repeated using unrelated job categories (e.g., administrative, engineering, finance, sales), the accuracy would likely be even higher.

The automated classification of job postings has potential applications. HR departments could classify their own job descriptions relative to a database of company/industry jobs in order to select the correct job title. Job aggregator websites might classify their job postings and make such value-added results searchable to users. Finally, employee resumes could be classified by a model trained on job descriptions to assess how well a candidate fits with the available roles.

Sunday, May 15, 2016

How to Save the Earth With Word Association and Topic Modeling

Here's the situation. On the eve of your retirement from a three-letter Washington agency, ID4-esque spaceships appear over Earth's major cities. Your boss hands you a hard drive of millions of captured alien documents. Survival depends on your ability to understand the content of that hard drive. Sighing, you phone your wife and assure her the retirement steaks need not be put back in the freezer. It won't take too long to wrap up this little setback.

Why?

Because you're an expert in text mining.

The above scenario calls for text mining. Specifically, efficient algorithms are needed to distill the contents of the captured documents into a format that can be more readily digested. Then specific documents will be identified and scrutinized to inevitably uncover whatever the fatal flaw is, this time, in the alien spacecraft, software, biology, etc.

In my previous text mining post, I represented documents as vectors of part-of-speech fractions. Here, I will discuss and demonstrate text mining algorithms that operate on words while ignoring word order and grammar. This bag-of-words model of text representation allows discovery of relations between words in the documents (word association) and the discovery of the broad themes and subjects in the documents (topic modeling).

Word Association


One way to discover word associations is to consider correlated occurrences. Words that have a high co-occurrence, yet rarely appear separately, likely have a syntagmatic relationship; words related in this way can be combined into meaningful phrases (e.g., 'steak' and 'grill'). Opinion analysis is one application of syntagmatic relation mining. Given a word representing an entity (e.g., 'Yankees'), one can mine the top syntagmatically related words to summarize what people write about that entity.

Information theory provides a way to quantify syntagmatic relationships. Entropy, in the context of information theory, represents the randomness of a word. In a 19th-century novel, 'the' has a low entropy because it is certain to appear. The modern word 'airplane' likewise has a low entropy because it is certain not to appear. Words like 'railroad', 'horse', or 'farm' have higher entropy because their occurrence depends on the details of the novel.

The probability that a word occurs in a document can be estimated by counting the number of occurrences divided by the total number of words; this is easy in the bag-of-words representation. Figure 1 shows how entropy (H) is calculated from the probability that word w occurs. The variable $X_w$ takes on values (v) to indicate word w is present (1) or absent (0). The probabilities that w appears or not add to one.

Figure 1: $H(X_w) = -\sum_{v \in \{0,1\}} p(X_w = v)\,\log_2 p(X_w = v)$

Figure 2 further defines conditional entropy as the entropy of a word given we know whether a second word is present or not. By itself, the word 'horse' has a high entropy in a classic novel. Given that the word 'buggy' is present in a sentence, 'horse' is very likely to occur; the conditional entropy H('horse'|'buggy') is much lower than H('horse'). In comparison, consider the conditional entropy H('horse'|'is'). The two words are not related and the presence of 'is' does not change the randomness of 'horse'. As in these examples, it is generally true that the conditional entropy of one word given another is less than or equal to the entropy of the first word by itself.

Figure 2: $H(X_{w_1}|X_{w_2}) = \sum_{u \in \{0,1\}} p(X_{w_2} = u)\,H(X_{w_1}|X_{w_2} = u)$

In Figure 3, mutual information (I) is defined as the reduction in the entropy of one word given knowledge of whether a second word is present or not. The difference between H('horse') and H('horse'|'buggy') is large, so the mutual information I('horse';'buggy') is high. Note that mutual information is nonnegative, is symmetric in the two words, and is zero if two words are completely independent (e.g., 'horse' and 'is').

Figure 3: $I(X_{w_1};X_{w_2}) = H(X_{w_1}) - H(X_{w_1}|X_{w_2})$

Mutual information gives a principled way to discover syntagmatic relationships. Mining a document corpus for the word pairs with the highest mutual information broadly indicates the contents of the documents. To demonstrate, I measure mutual information on a collection of 11 books (1,000,000+ words) from Project Gutenberg using the open source ModErn Text Analysis (MeTA) toolkit. I configured MeTA to preprocess the documents by removing stop words and by trimming word endings with the Porter2 stemming algorithm. Finally, the mutual information is measured on the resulting bag of words with code I wrote as part of the Coursera Text Mining course.
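The course code is not reproduced here, but the core calculation is compact. A minimal sketch, assuming a boolean segment-by-word presence matrix rather than whatever internal representation that code uses:

```python
import numpy as np

def mutual_information(presence, i, j):
    """I(Xw1; Xw2) from a (segments x vocabulary) boolean presence matrix.

    Equivalent to H(Xw1) - H(Xw1 | Xw2), summed over the four present/absent combinations.
    """
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = np.mean((presence[:, i] == a) & (presence[:, j] == b))
            p_a = np.mean(presence[:, i] == a)
            p_b = np.mean(presence[:, j] == b)
            if p_ab > 0:                       # a zero joint probability contributes nothing
                mi += p_ab * np.log2(p_ab / (p_a * p_b))
    return mi
```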

The 50 word pairs with the highest mutual information are shown in Figure 4. Many of the top word pairs are widely known fictional character names (e.g., 'holm sherlock' from The Adventures of Sherlock Holmes, 'sawyer tom' from The Adventures of Tom Sawyer, 'edward hyde' from The Strange Case of Dr. Jekyll and Mr. Hyde, 'miss woodhouse' from Emma, 'hare march' from Alice's Adventures in Wonderland). This immediately reveals the document corpus contains classic novels (indeed, ten novels plus one autobiography). Additionally, the word pairs reveal a diary belonging to a Dr. Seward (Dracula), that New York is a setting (The Autobiography of Benjamin Franklin), and that the documents mention various elderly individuals.

Figure 4: Word pairs with the highest mutual information

This simple analysis demonstrates how word association mining reveals the contents of documents. However, mutual information cannot tell us how pairs of correlated words are interrelated or which documents they are associated with.

Topic Modeling


A topic can be broadly defined as a theme or subject discussed in text data. Topics can be mined at different levels of granularity (sentences, articles, books). Documents in a collection may discuss multiple topics in different proportions. Topic mining simultaneously discovers topics in a document corpus and calculates the mixture of the different topics in each document. This makes the challenge of characterizing and understanding a large collection of documents less daunting. It's an important tool to have when you're at bat to save the world.

Probabilistic latent semantic analysis (PLSA) is one of the most basic topic modeling algorithms. In PLSA, documents are modeled as mixtures of topics. Every document has a different set of topic mixing weights. The topics are represented by unigram language models in which the probability of each word is independent. Each topic is a bag of words, containing every word in the vocabulary of the corpus, where each word is weighted by its probability. In a topic pertaining to sports, words like 'ball', 'team', and 'player' would have high probability, while obviously unrelated words like 'spaceship', 'laser', and 'shark' would have infinitesimally small probabilities. The most probable words in the fitted topics are a useful summary of what the documents contain, but they do not detail every last nuance in the documents.

Figure 5 demonstrates how the probability of a word w in a document is calculated in PLSA. The probability of w is the weighted sum of the word's probabilities in all $\theta_j$ topic models, plus an additional background language model $\theta_B$, whose proportion $\lambda_B$ is fixed. The $\pi_{d,j}$ mixing weights determine the proportion of topic j in document d.

Figure 5: $p(w|d) = \lambda_B\,p(w|\theta_B) + (1 - \lambda_B)\sum_{j=1}^{k} \pi_{d,j}\,p(w|\theta_j)$

The goal of PLSA is to find the topic models and mixing weights that maximize the likelihood of observing all the documents in the corpus. The total number of parameters tracked is equal to the number of documents times the number of topics (i.e., the number of mixing weights) plus the number of words times the number of topics (because every word has a different probability in each topic); the number of parameters grows linearly with the size of the document corpus. The log-likelihood of the documents is given by Figure 6. The sum inside the logarithm means the maximum cannot be found in closed form. The iterative expectation-maximization (EM) algorithm is a viable method for solving the optimization.

Figure 6: $\log p(C) = \sum_{d \in C} \sum_{w \in V} c(w,d)\,\log\left[\lambda_B\,p(w|\theta_B) + (1 - \lambda_B)\sum_{j=1}^{k} \pi_{d,j}\,p(w|\theta_j)\right]$

As part of the Coursera Text Mining course, I had to implement the PLSA + EM algorithms. I reuse this code to demonstrate topic modeling on the 11 books from Project Gutenberg. The limitations of unigram language models mean that trying to fit many topics to a novel will not elucidate the various scenes or distinct segments in the book. I hypothesized that PLSA can at least capture a unique, dominant topic from each book. Therefore, I ran PLSA topic modeling on this corpus with k=11 topics for 200 iterations, with the background model proportion fixed at 80%. The algorithm ran in about 10 seconds and mostly converged by iteration 50; not bad for fitting 11 topics to 1,000,000+ words on vintage-2012 hardware.
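That implementation is not reproduced here, but the EM updates are compact enough to sketch. The condensed rewrite below is my own (not the course code); it fixes the background model to the corpus-wide word frequencies and adds a small constant to guard against division by zero:

```python
import numpy as np

def plsa_em(C, k, lam_b=0.8, n_iter=200, seed=0):
    """Fit a k-topic PLSA model with a fixed background distribution.

    C is a dense (documents x vocabulary) matrix of word counts. Returns the
    topic-word probabilities theta (k x V) and the document-topic mixing
    weights pi (D x k).
    """
    rng = np.random.default_rng(seed)
    D, V = C.shape
    p_b = C.sum(axis=0) / C.sum()                     # background model: corpus word frequencies
    theta = rng.random((k, V))
    theta /= theta.sum(axis=1, keepdims=True)
    pi = np.full((D, k), 1.0 / k)
    eps = 1e-12

    for _ in range(n_iter):
        new_theta = np.zeros_like(theta)
        for d in range(D):                            # E-step, one document at a time
            mix = pi[d] @ theta                       # sum_j pi_dj * p(w|theta_j), shape (V,)
            p_bg = lam_b * p_b / (lam_b * p_b + (1 - lam_b) * mix + eps)
            p_z = (pi[d][:, None] * theta) / (mix + eps)         # topic responsibilities, (k, V)
            weighted = C[d] * (1 - p_bg) * p_z                   # expected topic word counts
            pi[d] = weighted.sum(axis=1)                         # M-step for the mixing weights
            pi[d] /= pi[d].sum() + eps
            new_theta += weighted
        theta = new_theta / (new_theta.sum(axis=1, keepdims=True) + eps)   # M-step for the topics
    return theta, pi
```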

Figure 7 shows the top 20 words, ordered by descending probability, in each of the 11 topics. There are 24 words shared between multiple topics (e.g., 'father', 'boy', 'sister', 'it'); note, however, that the positions of shared words in the lists vary because they have different probabilities in different topics. Several of the topics contain recognizable character names, indicating that certain topics are strongly associated with certain books (e.g., topic 1 for Dracula, topic 3 for The Adventures of Tom Sawyer, topic 7 for Alice's Adventures in Wonderland, topic 8 for Emma, topic 10 for Frankenstein, topic 11 for The Autobiography of Benjamin Franklin). For other topics, it is not clear from Figure 7 alone (to me, at least) which book(s) they represent.

Figure 7

Topic 6 contains references to both The Adventures of Sherlock Holmes and The Strange Case of Dr. Jekyll and Mr. Hyde. This implies at least some mixing of books within topics. The degree of topic mixing is indicated by the mixing weights in the PLSA algorithm. Figure 8 shows the mixing weights, which are normalized so that summing over a fixed document yields one. As suspected, Topic 6 is a composite of The Adventures of Sherlock Holmes and The Strange Case of Dr. Jekyll and Mr. Hyde. Topics 2 and 5 represent Great Expectations. All other topics are devoted to a distinct book. It is noteworthy that PLSA finds separate topics for The Adventures of Tom Sawyer and Adventures of Huckleberry Finn, which share the same author and have overlapping characters.

Figure 8

Hillary's Emails


In this last section, I explore topic modeling in a scenario with many documents where the number of topics is unknown. A suitable corpus to explore this matter is the Hillary Clinton email data set of 7945 emails (3+ million words) made available by Kaggle. From the Kaggle database, I extracted the 'RawText' field, which represents the raw text, including sender/recipient metadata, scraped from the officially released PDF documents.

Thinking briefly about the contents of the emails helps develop an informed text mining strategy. The email data set probably addresses multiple subjects. Further, given the nature of email, we can anticipate each message will be narrowly focused so that a typical email does not represent all topics equally. Finally, the subjects will be highly specific; the most probable words in a topic will in general not be highly probable in other topics. These are well-justified beliefs, or priors, about how the text is structured.

The vanilla PLSA implementation I used above does not incorporate priors about the distribution or contents of the topics. PLSA worked well in the above section despite this, perhaps because there was a strong justification for choosing a specific number of topics.

Latent Dirichlet allocation (LDA) is a Bayesian version of PLSA that includes such priors. In LDA, Dirichlet priors $\alpha$ and $\beta$ influence the per-document topic distributions and the per-topic word distributions, respectively. Both parameters are positive; values below one favor documents dominated by a few topics and topics dominated by a few words. PLSA is equivalent to LDA with uniform Dirichlet prior distributions (i.e., $\alpha$ and $\beta$ set to one). Computationally, LDA is more sophisticated than PLSA, but, apart from priors, it has the added advantage that it is less prone to overfitting because fewer parameters are fitted. LDA performs at least as well as PLSA.

Given the size and complexity of the email data set, using LDA over PLSA may yield better results. Fortunately, MeTA contains efficient implementations of LDA. I chose to run the MeTA flavor of LDA that uses collapsed variational Bayes with $\alpha$ and $\beta$ fixed to 0.3. I ran the algorithm several times to explore the effect of fixing different numbers of topics (2, 4, 6, and 8). Each run took several minutes and hundreds of iterations to converge. Increasing the number of topics required more time and iterations. The eight-topic model took about an hour and 760 iterations to finish. I then evaluated the effectiveness of the topic modeling by studying the top 50 words in each topic for each run.
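MeTA's LDA tools are driven by a configuration file, so there is no code to quote inline. For readers without MeTA, a roughly comparable run with gensim (which uses online variational Bayes rather than the collapsed variational Bayes flavor used here, so results will differ) might look like this; $\texttt{tokenized\_emails}$ is a hypothetical list of token lists after preprocessing:

```python
from gensim import corpora, models

dictionary = corpora.Dictionary(tokenized_emails)
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_emails]

lda = models.LdaModel(bow_corpus, id2word=dictionary, num_topics=8,
                      alpha=0.3, eta=0.3, passes=10, random_state=0)

# Print the top 40 words of each fitted topic.
for topic_id, terms in lda.show_topics(num_topics=8, num_words=40, formatted=False):
    print(topic_id, [word for word, _ in terms])
```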

The results are encouraging and seem to effectively capture distinct themes. In the two-topic model, one emergent theme is email metadata; among the top words are 'subject', 'origin', 'date', 're', 'cc', days of the week, months, and email addresses (e.g., 'hdrclintonemailcom'). The second topic is less specific and seems to roll many topics regarding international affairs into one.

The four-topic model captures more granular themes: 1) email metadata, as before; 2) the Near and Middle East ('benghazi', 'libya', 'israel'); 3) foreign aid ('haiti', 'pakistan', 'usaid', 'women', 'food', 'support'); 4) politics and world leaders ('obama', 'democrat', 'labour', 'voter', 'koch').

In the six-topic model, the topic about the Near and Middle East is split so that words concerning Libya, Benghazi, and Egypt have their own topic.

Figure 9 shows the top 40 words for each topic in the eight-topic model. In my interpretation, topics 1-5 and 7 cover the Middle East, US politics, Mexico/South America/Ireland, Libya/Benghazi/Egypt, email metadata, and foreign aid, respectively. The meanings of topics 6 and 8 are less clear to me. Topic 6 ranks highly words like 'arriv', 'meet', 'depart', and 'hotel', so it may refer to the logistics of travel or arranging meetings. Topic 8 is a mixture of words concerning international politics ('obama', 'tori', 'cameron') and military matters ('war', 'mcchrystal', 'militari'). Fitting eight topics gives a reasonable representation of the email contents, although fitting more may provide further refinement.

Figure 9

Viewing the mixing weights reveals how the fitted topics are distributed in the emails. Figure 10 shows the distributions of mixing weights, arranged by topic. Each subplot represents all 7945 emails. All topics have mixing weights spanning the range (0,1). For all topics, there are at least ~40 messages where that topic's mixing weight is above 0.8.

Figure 10

The meaning of the topic models can be further scrutinized by manually inspecting emails with particular topic weights. Inspection shows, for instance, that the topic 5 weights indicate how much message text there is relative to email metadata. Messages with high $\pi_{d,6}$ weights all contain itemized daily schedules, validating my suspicion that topic 6 concerns meeting logistics. The messages most closely related to topic 8, which I was also previously unsure about, address UK politics or UK-EU-US relations in various contexts (e.g., the Middle East, climate change, China).

Conclusion


Representing text data as a bag of words facilitates powerful text mining algorithms that can efficiently reveal the nature of unknown documents. Word association mining uses mutual information to pick out words appearing in phrases together. Topic mining goes a step further to associate words with distinct subjects and themes. The effectiveness of PLSA and LDA topic modeling with unigram language models has been demonstrated here. LDA is a Bayesian version of PLSA that incorporates priors while also being less prone to overfitting. These advantages make LDA a pragmatic choice for scenarios with many unknown documents, such as when you are burdened with the task of saving the world.

Sunday, April 17, 2016

Distinguishing Document Types With Part-of-Speech Fractions

Text mining refers to analytical procedures that turn text data into high-quality, actionable knowledge.  It facilitates applications where human effort is minimized and decision-making ability is maximized.  How text is represented inside a computer dictates what algorithms can be applied to derive meanings and inferences.  Some ways a sentence can be represented are, in order of increasing complexity, as a sequence of words, words plus part-of-speech tags, syntactic structures (e.g., noun phrases, verb phrases), entities and relations, or logic predicates.  Incorporating higher-level structures into text analysis requires progressively more human input and is prone to higher inaccuracy.

Part-of-speech (POS) tagging is one of the simplest ways to represent text.  The tagging process, however, is nontrivial.  For example, the part of speech a word represents can depend on the context in which the word is used (e.g., "Throw the book at him, Sheriff" versus "Book him, Chief").  At its inception, POS tagging was done by hand.  The field has since progressed to taggers trained automatically with machine learning algorithms.  State-of-the-art POS taggers have an accuracy of about 97%.

It strikes me that because some kinds of documents have very different writing styles (e.g., fictional novels versus scientific papers), they may have different POS fractions.  If so, it is perhaps possible that one could distinguish various classes of documents with a vector of POS fractions alone.  I explore this issue here.

Sample


I start by assembling a corpus of four very different document types.

Classic novels from Project Gutenberg (1,017,969 words):

Ten novels from Project Gutenberg were picked randomly from the site's list of most popular fiction.  My selection includes books like "Adventures of Huckleberry Finn", "Dracula", and "Alice's Adventures in Wonderland".  I downloaded the text in ASCII format directly from Project Gutenberg.

Astrophysics journal articles (285,785 words):

For this category, I used my PhD thesis plus 10 additional journal articles cited in my thesis.  I converted the PDF-formatted files to ASCII with the pdftotext shell command.

Yelp restaurant reviews (1,030,909 words):

This sample of 10,000 Yelp reviews comes in ASCII format from the Coursera Text Mining and Analytics course I completed in Summer 2015.

Hillary Clinton's controversial e-mails (3,015,552 words):

This data was conveniently preprocessed and distributed as an SQLite database by Kaggle. From the database, I pulled the 'RawText' field, which represents the raw text, including sender/recipient metadata, extracted from the officially released PDF documents.  There were 7945 e-mails in the data set.

POS Tagging


The POS tagging was performed using the open source MeTA (ModErn Text Analysis) toolkit. I learned to use this toolkit during the Coursera Text Mining and Analytics course.  For this exercise, I used the built-in greedy Perceptron-based POS tagger, which performs at an accuracy of about 97%.  This tagger applies a popular tagging convention for American English called the Penn Treebank tag set.  This tag set is comprehensive.  It calls for dividing some POS into subclasses (e.g., personal versus possessive pronouns), and it also accounts for symbols, foreign words, cardinal numbers, etc.
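MeTA drives its tagger through a configuration file.  As an illustration of the same measurement done with NLTK's perceptron tagger (which also uses the Penn Treebank tag set and has comparable accuracy, but is not what I used here), the five fractions analyzed below can be computed like this; the grouping of tag subclasses is my own:

```python
import nltk                      # requires the 'punkt' and 'averaged_perceptron_tagger' data
from collections import Counter

# Penn Treebank tag prefixes, merged into the five coarse POS groups used below.
POS_GROUPS = {"noun": ("NN",), "verb": ("VB",), "pronoun": ("PRP", "WP"),
              "adjective": ("JJ",), "adverb": ("RB", "WRB")}

def pos_fractions(text):
    """Fractions of nouns, verbs, pronouns, adjectives, and adverbs in a text."""
    tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(text))]
    counts = Counter()
    for group, prefixes in POS_GROUPS.items():
        counts[group] = sum(1 for tag in tags if tag.startswith(prefixes))
    total = sum(counts.values())
    return {group: counts[group] / total for group in POS_GROUPS}
```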

Figure 1 summarizes the results from POS tagging the four document classes.  The horizontal axis includes only five POS (nouns, verbs, pronouns, adjectives, and adverbs). The fractions of other POS are either small or do not vary widely enough with document category to be interesting, and they are excluded from further analysis.  The vertical axis represents the fraction of each POS normalized to the total counts of nouns, verbs, pronouns, adjectives, and adverbs within the documents; all subclasses in the individual POS categories have been merged for this calculation.  Finally, all documents in a category were concatenated together and then fed to the POS tagger, so the fractions reported represent global fractions over the document classes.  Since N=1 in every category, there are no statistical error bars. Instead, I show the 3% error bars representing the stated accuracy of the POS tagging, which would exceed the statistical error bars on the mean POS fractions (see below).

Figure 1

Figure 2 is the same, except that the global fractions are shown relative to the classic novels.

Figure 2

From Figures 1 and 2, striking differences are visible between the fractions of nouns, verbs, and pronouns.  Classic novels have a noun fraction around 30%, compared to Yelp reviews at 39%. Astrophysics papers and Hillary's e-mails have much higher noun fractions around 65%.  Verb fraction in the classic novels is comparable to the noun fraction, but in the other document classes, verb fraction is significantly (>3%) lower; verb fractions in the astrophysics papers and Hillary's e-mails are just ~15%.  It is also interesting that the fraction of pronouns varies widely across the four categories, with the astrophysics papers having a pronoun fraction of only 2.5%; this is sensible as only a few pronouns (we, our, they, their) commonly appear in scientific writing.  The fractions of adjectives vary less widely and fall in the range of ~8-12%.  The adverb fraction is comparable (~11%) in classic novels and Yelp reviews; the lower adverb fraction (~4%) in the astrophysics papers and Hillary's e-mails makes sense because the fraction of verbs is lower too.

Based on these differences, we can begin to understand how the different document classes cluster together in the 5-dimensional space of the POS fractions. Classic novels have fewer nouns but more verbs and pronouns than Yelp reviews.  Scientific articles and Hillary's e-mails are seemingly hard to distinguish from each other, but compared to the novels and Yelp reviews, they have very high noun fractions and low verb, pronoun, and adverb fractions. This all provides a glimmer of hope that machine learning could be applied to correctly distinguish short excerpts from each document class.  A casual web search finds no examples of POS fractions being used to distinguish document types, so this extra step is worth trying.

Before proceeding, let's test that the global trends in POS fraction extend to shorter excerpts from the documents.  I remeasured the POS fractions by slicing the document classes into excerpts of lengths 300, 500, 1000, and 1500 words.  Additionally, I took care to filter out punctuation using regular expressions to avoid a bias in my word counting algorithm.

Figure 3 shows the result of this exercise for an excerpt length of 500 words.  The mean POS fraction is annotated on each subplot in Figure 3.  The mean fractions all agree to within 1% of the global means shown in Figure 1, and the standard errors on the mean values are <1% in all cases.  Thus, the POS mean fractions derived globally and from the shorter excerpts are robust.

Figure 3

Table 1 shows the distributions in Figure 3 have scatter (1 sigma standard deviations) between ~2 and ~14%.  The noun fractions in the astrophysics papers and Hillary e-mails have high ~12-14% standard deviations.  Inspecting the document excerpts shows the very high noun fractions appear in the bibliography portions of astrophysics papers and in the headers of Hillary's e-mails.  Increasing the excerpt length above 500 words reduces, but does not eliminate, the scatter because the POS fractions fluctuate less over larger chunks of words.  This scatter is going to limit the accuracy of any trained classification system.


Document Classification


The goal in this section is to build a system that can correctly identify the document class of an excerpt from just a vector of the five POS fractions (noun, verb, pronoun, adjective, adverb).  This constitutes a supervised learning problem because a model must first be trained and validated with example vectors and their corresponding category labels. A support vector machine (SVM) is an appropriate supervised learning method to use here; alternatives are logistic regression and neural networks. Training an SVM on vectors of POS fractions will define boundaries in POS-fraction space where the gaps between document categories are as wide as possible.

I proceed using the SVM implementation in the scikit-learn machine learning module for Python. Specifically, I use the Support Vector Classification (SVC) class, which has free parameters for the kernel, regularization (i.e., management of overfitting), and category weights. For the kernel, I use a Gaussian (RBF) kernel with the 'gamma' parameter set to 1/n_features (i.e., 1/5).

Following the model validation technique presented in the Coursera Machine Learning course, I randomize the data (the POS vectors and their labels) into training, cross-validation, and test subsets in a 60/20/20% split.  The document categories are of different sizes, and this means there are many more excerpts from the category with the most words (Hillary's e-mails) than the category with the fewest words (astrophysics papers). In all three subsets, I require the document categories to appear in proportions reflecting the relative sizes of the document corpora (i.e., 18% classic novels, 19% Yelp reviews, 5% astrophysics papers, 58% Hillary e-mails).  Likewise, I initialize the SVM with class weights inversely proportional to the category frequencies in the input data.

The SVM is trained with regularization parameters ranging from 1 to 1e5.  The cross-validation set is used to pick the optimal regularization parameter, which is the one yielding the highest classification accuracy on the cross-validation set.  Using instead the training set classification accuracy for this decision is bad practice, as it allows the training set to bias the final model. Finally, the generalized performance of the classifier is quantified as the test set classification accuracy.
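A sketch of this procedure with scikit-learn, where $\texttt{X}$ is the array of five POS fractions per excerpt and $\texttt{y}$ holds the category labels (both hypothetical names), and $\texttt{class\_weight='balanced'}$ plays the role of the inverse-frequency class weights:

```python
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# 60/20/20 split into training, cross-validation, and test sets, stratified by category.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4,
                                                    stratify=y, random_state=0)
X_cv, X_test, y_cv, y_test = train_test_split(X_rest, y_rest, test_size=0.5,
                                              stratify=y_rest, random_state=0)

best_model, best_cv_accuracy = None, 0.0
for C in [1, 10, 100, 1e3, 1e4, 1e5]:
    model = SVC(kernel="rbf", gamma=1.0 / X_train.shape[1], C=C,
                class_weight="balanced").fit(X_train, y_train)
    accuracy = model.score(X_cv, y_cv)       # pick the regularization on the cross-validation set
    if accuracy > best_cv_accuracy:
        best_model, best_cv_accuracy = model, accuracy

print("test accuracy:", best_model.score(X_test, y_test))
```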

The classification results are shown in Table 2 for excerpt lengths of 300, 500, 1000, and 1500 words. The size of the training set decreases with increasing excerpt length because the same total number of words is divided into progressively larger blurbs. Despite the shrinking training set, the overall classification accuracy (A_Overall) increases with excerpt length.  As already noted concerning Figure 3, increasing the excerpt length reduces the scatter in the distributions of POS fractions, and it must be this effect that explains the increase in classification accuracy with excerpt length.


Classification accuracy within each class generally increases with excerpt length too, although there are instances where it stays about the same or decreases. Excerpts of 500 words are enough to achieve classification accuracies of ~80% or higher in the four categories; increasing excerpt length to 1500 words yields accuracies of ~92% or higher.  The classification accuracy for the Hillary e-mails (A_Hilary) is most affected by excerpt length, probably because this corpus shows the highest scatter in its distributions of POS fractions (Figure 3, Table 1). The astrophysics papers and the Hillary e-mails have similar POS fractions (Figures 1-2), yet the classifier is not confusing them catastrophically; this highlights the power of machine learning.

Conclusion


Different kinds of documents can contain disparate POS fractions.  With machine learning, the document classes of short excerpts can be correctly identified from POS fractions alone.  Reducing a document to a vector of POS fractions effectively discards specific knowledge of the entities and relations encapsulated in the words. It is striking that a high classification accuracy (Table 2) is still achievable after this conversion.

Document classification from POS fractions alone is a neat trick. However, I don't see it as practical. POS tags can be extra features within other powerful text mining algorithms that also directly incorporate the words in the document.  In forthcoming posts, I will explore these and other advanced text mining algorithms.