Monday, September 15, 2014

Web Crawling For Job Postings

Overview of Web Crawling

Web crawling is the practice of systematically navigating and parsing websites to gather information. It gives anyone, including data scientists, the means to forge new data sets from information scattered across the web. A large-scale application of web crawling is web indexing by search engine companies. Indeed.com also uses web crawling to collect job postings. Other companies view web crawlers as hostile. LinkedIn, for example, considers any web crawling a violation of its Terms of Service (although it can still be done).

For startups, web crawling can be useful in niche applications and in product research and development. However, building a startup's main product or service around content scraped from the website of a large and potentially hostile company is not smart business.

How Web Crawling Works

Websites are coded in the markup language HTML. The information rendered in a web browser is embedded within nested elements marked by bracketed "tags", so raw HTML source is not always easy to make sense of by eye. Web crawling exploits this well-defined structure to provide the means (e.g., selection policies) of locating, interacting with, and extracting information from the relevant web page elements. The Firefox extension Firebug is an invaluable tool for identifying the relevant HTML tags when building selection policies for a web crawler.
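To make "selection policy" concrete, here is a minimal sketch of one applied to a made-up HTML fragment with lxml's XPath support; the tag and class names are purely illustrative, not taken from any real site.

    # Extracting fields from an HTML fragment with an XPath selection
    # policy; the tags and class names are made up for illustration.
    from lxml import html

    snippet = """
    <div class="result">
      <h2 class="jobtitle"><a href="/viewjob?jk=123">Data Scientist</a></h2>
      <span class="company">Example Corp</span>
    </div>
    """

    tree = html.fromstring(snippet)
    title = tree.xpath('//h2[@class="jobtitle"]/a/text()')[0]
    company = tree.xpath('//span[@class="company"]/text()')[0]
    print(title, "-", company)  # Data Scientist - Example Corp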

Scrapy is a popular web crawling framework implemented in Python. However, Scrapy alone struggles with certain content, such as dynamic search results rendered with JavaScript. Pairing Scrapy with the Selenium WebDriver makes for a solution that can crawl anything displayed in a standard web browser.
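As a rough illustration of the pairing (a sketch, not the post's actual implementation): Selenium drives a real browser to execute the JavaScript, and the rendered HTML is handed back to Scrapy-style selectors. The spider name and CSS selector below are placeholders.

    # Sketch of the Scrapy + Selenium pairing: the browser renders any
    # JavaScript, then ordinary selectors parse the rendered DOM.
    import scrapy
    from scrapy.http import HtmlResponse
    from selenium import webdriver

    class RenderedSpider(scrapy.Spider):
        name = "rendered"                       # placeholder spider name
        start_urls = ["http://www.indeed.com"]

        def __init__(self, *args, **kwargs):
            super(RenderedSpider, self).__init__(*args, **kwargs)
            self.driver = webdriver.Firefox()

        def parse(self, response):
            # Let a real browser execute the page's JavaScript first.
            self.driver.get(response.url)
            rendered = HtmlResponse(url=response.url,
                                    body=self.driver.page_source,
                                    encoding="utf-8")
            # Selection policies now work on the rendered page.
            for title in rendered.css("h2 a::text").extract():
                yield {"title": title}

        def closed(self, reason):
            self.driver.quit()  # shut the browser down with the spider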

Crawling Indeed.com For Job Postings

Here, I discuss the application of web crawling to mining job postings on Indeed.com. I prepared a crawler using Scrapy plus the Selenium WebDriver; the final implementation is available on GitHub. The web crawler performs the following steps in sequence (a sketch of steps 1-3 appears below):
  1. Navigate to indeed.com.
  2. Fill in the search form for "what" (as in what job?) and "where" (as in located where?).
  3. Execute the search by hitting the "Find Jobs" button.
  4. Navigate the search results and, for each job listing, record the position title, company, location, and a URL linking to Indeed's full record of that job.
The indeed.com homepage.
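A minimal Selenium sketch of steps 1-3 follows; the element IDs "what" and "where" mirror the form labels but are assumptions, not verified against Indeed's 2014 markup.

    # Rough sketch of steps 1-3 with the Selenium WebDriver. The element
    # IDs are assumptions about Indeed's markup.
    from selenium import webdriver

    driver = webdriver.Firefox()
    driver.get("http://www.indeed.com")          # step 1: navigate

    # Step 2: fill in the "what" and "where" search fields.
    what = driver.find_element_by_id("what")
    where = driver.find_element_by_id("where")
    what.send_keys("data science scientist")
    where.clear()            # the field may be pre-filled with a location
    where.send_keys("usa")

    # Step 3: submit the search form, equivalent to hitting "Find Jobs".
    what.submit()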

There is one interesting limitation of Indeed's interface worth pointing out. I illustrate with a sample search for "data science scientist" in the "usa". These search terms capture jobs for data scientists, analysts, academic and industrial researchers, and engineers. Result page 1 shows jobs 1-10 out of 7,177.

The first result page shows results 1-10.

With 7,177 results, there should be 718 result pages, yet the interface only shows the first 100 result pages, meaning only the first 1,000 search results are accessible by scraping.

Only the first 100 search result pages are accessible in practice!

The obvious workaround is to "divide and conquer" by breaking the USA-wide search into separate searches for the 50 states plus Washington D.C. This is straightforward to do with a bash script (a Python sketch of the same loop appears after the tallies below). If the search results were equally distributed across the 51 geographic regions, there would be about 141 results per search, well below the limit of 1,000. Below are the regions with the top 10 and bottom 10 numbers of search results for the "data science scientist" search performed on August 30, 2014.

Top 10 regions by number of search results:
  1. CA (1792)
  2. MA (502)
  3. NY (427)
  4. D.C. (416)
  5. MD (304)
  6. WA (260)
  7. VA (240)
  8. PA (231)
  9. TX (225)
  10. IL (187)
Bottom 10 regions by number of search results:
  1. SC (11)
  2. AR (11)
  3. VT (11)
  4. MT (11)
  5. ND (8)
  6. WY (7)
  7. RI (7)
  8. MS (7)
  9. NJ (0)
  10. LA (0)
The search results are not uniformly distributed across the USA.  California is the one region where the number of search results is above 1,000.  Retrieving all relevant job listings for California given these search terms would require parsing the state into subregions.
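A minimal Python rendering of the divide-and-conquer loop (the post used bash; the spider name "indeed" and its "what"/"where" arguments are assumptions about the implementation):

    # Run one crawl per region so no single search exceeds the
    # 1,000-result cap; one output file per region.
    import subprocess

    REGIONS = ["AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "DC"]
    # ...truncated; the real list holds all 50 states plus Washington D.C.

    for region in REGIONS:
        subprocess.call(["scrapy", "crawl", "indeed",
                         "-a", "what=data science scientist",
                         "-a", "where=%s" % region,
                         "-o", "results_%s.json" % region])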

Sunday, January 26, 2014

Gender Classification With Machine Learning

In this post, I discuss a machine learning experiment I conducted to classify faces as male or female from pictures. This is an interesting problem because, while faces vary greatly from one to another, there are clear differences between masculine and feminine facial characteristics. A successful solution to this problem is relevant to many applications, including human-computer interaction, security, and marketing research.

Sample Selection and Preprocessing


For this experiment, I used grayscale digital photographs of the students, staff, and faculty of the Astronomy Department at the University of Texas at Austin. This 176-photo data set contains a reasonable mixture of males and females (65% male, 35% female) and represents a wide range of ages, from undergraduates to senior faculty. The pictures were also taken with the same camera and a uniform configuration. Additionally, many of the faces in this collection contain the confounding factors of facial hair and/or glasses. A detailed characterization of the sample is provided below.
The images are high-resolution (1050x1500 pixels) and capture each person from the chest up in a "school-portrait" style. I preprocessed the images to extract just the face region using my own interactive tool built with python+matplotlib. The tool worked by displaying the original image and then extracting a 500x500 pixel slice centered at a specified point. This point was roughly each person's nose, although the exact position varied depending on the orientation of the face. The choice of the 500x500 pixel slice was determined through experimentation to provide the best results. This tool is demonstrated below for my own picture in the collection.


The resulting 500x500 pixel images contain 250,000 pixels each. This is far too large for my vintage-2004 computer hardware to handle, so I resized the images to a manageable 50x50 pixels.
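The interactive tool itself is not reproduced here, but a minimal sketch of the crop-and-resize idea with matplotlib and 2014-era SciPy (scipy.misc.imread/imresize, which require PIL) might look like this:

    # Click roughly on the nose, crop a 500x500 slice around that point,
    # then shrink to 50x50. A sketch, not the author's actual tool.
    import matplotlib.pyplot as plt
    from scipy.misc import imread, imresize

    img = imread("portrait.png", flatten=True)  # flatten=True -> grayscale

    plt.imshow(img, cmap="gray")
    (x, y), = plt.ginput(1)   # one interactive click, roughly on the nose
    plt.close()

    row, col = int(y), int(x)
    face = img[row - 250:row + 250, col - 250:col + 250]  # 500x500 slice
    face_small = imresize(face, (50, 50))                 # 50x50 pixels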

Finally, I randomly shuffle these images into three subsets: a training set, a cross validation set, and a test set, in a 60:20:20% proportion, respectively, with the constraint that the percentage of males and females in each set must match that of the entire parent sample (i.e., 65% male, 35% female).  The training and cross validation sets are used to train and tune the machine learning algorithm, and the test set is used to make the final evaluation of gender classification performance.
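A numpy sketch of such a stratified shuffle, assuming `is_male` is a boolean label array over the 176 images:

    # Shuffle each gender separately so every subset inherits the parent
    # sample's 65%/35% male/female mix.
    import numpy as np

    def stratified_indices(is_male, seed=0):
        rng = np.random.RandomState(seed)
        train, cval, test = [], [], []
        for mask in (is_male, ~is_male):   # males first, then females
            idx = np.flatnonzero(mask)
            rng.shuffle(idx)
            n_train = int(0.6 * len(idx))
            n_cval = int(0.2 * len(idx))
            train.extend(idx[:n_train])
            cval.extend(idx[n_train:n_train + n_cval])
            test.extend(idx[n_train + n_cval:])
        return np.array(train), np.array(cval), np.array(test)

    # Usage: train_idx, cval_idx, test_idx = stratified_indices(is_male),
    # then X_train = images[train_idx], and likewise for the other sets.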

Dimensionality Reduction


Now we have images of a practical size. However, there is still a problem: each image is an array of 2,500 grayscale intensity values. These raw values are probably not helpful because skin tone alone is independent of gender; I am not convinced the pixel values themselves can convey anything about masculine or feminine facial characteristics, at least without more advanced analysis of the pixel intensity distributions.

Therefore, I adopt the approach of running a principal component analysis (PCA) to reduce each image from 2,500 pixels to a smaller number of features, K. PCA finds the optimal set of K vectors such that the projection of the 2,500-pixel images onto these vectors minimizes the projection error. I determine these vectors from the training set alone. The precise number K to use is somewhat arbitrary; I choose the first 86 vectors, which retain "99% of the variance" in the training set images (meaning the images can be reconstructed from the first 86 principal components while losing only very little detail). The cross validation and test sets are projected onto these same 86 vectors determined from the training set. Thus, the dimensionality of this problem has been compressed by 97%, from 2,500 pixels to 86 features!
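The post does not say which PCA implementation was used; a bare numpy sketch of the procedure, assuming X_train, X_cval, and X_test hold the flattened 50x50 images (2,500 pixels each) from the split above, could be:

    # Fit PCA on the training set only, keep enough components to retain
    # 99% of the variance, then project all three sets onto that basis.
    import numpy as np

    def fit_pca(X_train, keep=0.99):
        mu = X_train.mean(axis=0)
        U, s, Vt = np.linalg.svd(X_train - mu, full_matrices=False)
        explained = s ** 2 / np.sum(s ** 2)
        K = int(np.searchsorted(np.cumsum(explained), keep)) + 1  # 86 here
        return mu, Vt[:K]

    mu, basis = fit_pca(X_train)            # X_train: (n_train, 2500)
    Z_train = (X_train - mu).dot(basis.T)   # (n_train, 86)
    Z_cval = (X_cval - mu).dot(basis.T)     # same basis for the other sets
    Z_test = (X_test - mu).dot(basis.T)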

The Artificial Neural Network (ANN)


ANNs are machine learning algorithms that imitate the functioning of the human brain. A network comprises computational nodes arranged into input, "hidden", and output layers. I chose the simplest model for this experiment: a network with an input layer, one hidden layer, and an output layer. The input layer consists of 86 nodes, one for each component of the image projections, plus one bias unit carrying a fixed value of 1, for a total of 87 nodes. The output layer has two nodes that denote the male and female gender classifications. The number of nodes in the hidden layer is again somewhat arbitrary, yet intuitively, the hidden layer should have fewer nodes than the input layer but more than the output layer. I specified the hidden layer to have 10 nodes plus one bias unit. The implementation of this network was based on the code I wrote as part of successfully completing the Coursera Machine Learning course. A visual representation of the network is shown below.



The value in the k-th node of layer j (for j = 2, 3, i.e., the hidden and output layers) is calculated by applying the sigmoid function to a weighted sum of the previous layer's values:

    a_k^(j) = g( Θ^(j-1) a^(j-1) )_k,   where g(z) = 1 / (1 + e^(-z)),

and the Θ values represent weights that must be trained before the ANN has any predictive power. Aside from the weights, the ANN carries one additional free parameter (Λ) to control regularization, which can be adjusted as needed to prevent overfitting or underfitting of the training data.
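In numpy, the feed-forward pass described by this formula can be sketched as follows; Theta1 and Theta2 are the weight matrices learned in the next section.

    # Feed-forward pass for one image projection; Theta1 (10x87) and
    # Theta2 (2x11) are the weight matrices to be learned by training.
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def feed_forward(z_image, Theta1, Theta2):
        a1 = np.concatenate(([1.0], z_image))  # 86 components + bias = 87
        a2 = np.concatenate(([1.0], sigmoid(Theta1.dot(a1))))  # 10 + bias
        a3 = sigmoid(Theta2.dot(a2))           # two output nodes
        return a3  # predicted gender = np.argmax(a3)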

Training Methodology and Results


Using the training set data, I optimized the weights in the ANN with the backpropagation algorithm by minimizing a merit function known as the "cost". Tuning the regularization parameter requires the cross validation set. For several values of Λ spanning 0 (no regularization) to 200 (high regularization), the ANN weights were trained on the training set, and each resulting set of weights was used to calculate the cost on both the training and cross validation sets. The classification accuracy (percentage of correctly classified faces) obtained from this exercise ranged from 71.6-100% on the training set and 64.7-85.3% on the cross validation set. The best Λ corresponds to the minimum cost for the cross validation set; in this case, the best solution comes from training the network with Λ=8. The figure below shows how the cost varies with Λ.
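A sketch of that sweep follows; the intermediate Λ values are illustrative, and train_ann/cost_fn are hypothetical stand-ins for the author's Coursera-based training and cost routines.

    # Train at each Lambda and keep the weights that minimize the cross
    # validation cost; train_ann and cost_fn are hypothetical helpers.
    lambdas = [0, 0.01, 0.1, 1, 2, 4, 8, 16, 32, 64, 128, 200]

    best = None
    for lam in lambdas:
        weights = train_ann(Z_train, y_train, lam)
        cv_cost = cost_fn(Z_cval, y_cval, weights, lam)
        if best is None or cv_cost < best[0]:
            best = (cv_cost, lam, weights)

    best_cost, best_lambda, best_weights = best  # best_lambda = 8 here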

With the ANN now fully trained, the final evaluation of performance is made with the test set, which consists of 26 males and 14 females. A total of 37/40 (92.5%) faces were correctly classified as male or female, comparable to the accuracy achieved by human classifiers. Of the three misclassifications, two were false positives (males classified as females), and the other was a false negative (a female classified as male). Taking "female" as the positive class, that works out to a precision of 13/15 (87%) and a recall of 13/14 (93%), so the ANN delivers high precision and recall along with its high accuracy.