Monday, September 15, 2014

Web Crawling For Job Postings

Overview of Web Crawling

Web crawling is the practice of systematically navigating and parsing websites to gather information. It makes it possible for anyone, including data scientists, to build new data sets from information scattered across the web. A large-scale application of web crawling is web indexing by search engine companies. Indeed.com also uses web crawling to collect job postings. Other companies view web crawlers as hostile. LinkedIn, for example, considers any web crawling a violation of its Terms of Service (although it can still be done).

For startups, web crawling can be useful in niche applications and in product research and development. However, building a startup's main product or service on content scraped from the website of a large and potentially hostile company is not smart business.

How Web Crawling Works

Websites are coded in the markup language HTML. The information visualized in a web browser is embedded in a series of bracketed "tags", so HTML source code is not always easy to make sense of by eye. Web crawling exploits the well-structured nature of HTML to provide the means (e.g., selection policies) of locating, interacting with, and extracting information from the relevant web page elements. The Firefox extension Firebug is an invaluable tool for identifying the relevant HTML tags when building selection policies for a web crawler.
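
To make this concrete, below is a minimal sketch of a selection policy: an XPath expression applied to a fragment of HTML with Scrapy's Selector. The tag and class names are hypothetical, chosen only for illustration.

  from scrapy.selector import Selector

  # A hypothetical fragment of a search results page.
  html = '<div class="result"><h2 class="jobtitle">Data Scientist</h2></div>'

  # The XPath expression is the selection policy: it locates the <h2>
  # element carrying the job title and extracts its text.
  titles = Selector(text=html).xpath('//h2[@class="jobtitle"]/text()').extract()
  print(titles)  # ['Data Scientist']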

Scrapy is a popular web crawling framework implemented in Python. Scrapy alone, however, struggles with content rendered by JavaScript, such as dynamic search results. Pairing Scrapy with the Selenium WebDriver yields a solution that can crawl anything displayed in a standard web browser.
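
As a sketch of the pairing, assuming the target page builds its results with JavaScript: Selenium drives a real browser to render the page, and Scrapy's Selector then parses the rendered HTML. The URL and XPath below are placeholders.

  from selenium import webdriver
  from scrapy.selector import Selector

  driver = webdriver.Firefox()             # a real browser executes the JavaScript
  driver.get('http://example.com/search')  # placeholder URL
  sel = Selector(text=driver.page_source)  # hand the rendered HTML to Scrapy
  results = sel.xpath('//div[@class="result"]')  # placeholder selection policy
  driver.quit()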

Crawling Indeed.com For Job Postings

Here, I discuss the application of web crawling to mining job postings on Indeed.com. I prepared a crawler using Scrapy plus the Selenium WebDriver. The final implementation is available on GitHub. The web crawler performs the following steps in sequence (a code sketch appears below the list):
  1. Navigate to indeed.com.
  2. Fill in the search form for "what" (as in what job?) and "where" (as in located where?).
  3. Execute the search by hitting the "Find Jobs" button.
  4. Navigate the search results and, for each job listing, record the position title, company, location, and a URL linking to Indeed's full record of that job.
The indeed.com homepage.
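
Below is a sketch of steps 1 through 3 with the Selenium WebDriver. The element ids 'what', 'where', and 'fj' are assumptions about Indeed's homepage markup, the sort of thing Firebug is used to verify.

  from selenium import webdriver

  driver = webdriver.Firefox()
  driver.get('http://www.indeed.com')  # step 1: navigate to indeed.com

  # Step 2: fill in the search form. The ids 'what' and 'where' are
  # assumed from inspecting the page with Firebug.
  driver.find_element_by_id('what').send_keys('data science scientist')
  where = driver.find_element_by_id('where')
  where.clear()  # the "where" box may be pre-filled with a location
  where.send_keys('usa')

  # Step 3: hit the "Find Jobs" button (the id 'fj' is an assumption).
  driver.find_element_by_id('fj').click()

  # Step 4 then parses driver.page_source for each job listing.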

There is one interesting limitation of Indeed's interface worth pointing out. I illustrate with a sample search for "data science scientist" in the "usa". These search terms capture jobs for data scientists, analysts, academic and industrial researchers, and engineers. Result page 1 shows jobs 1-10 out of 7,177.

The first result page shows results 1-10.

With 7,177 results, there should be 718 result pages, yet the interface only shows the first 100 result pages, meaning only the first 1,000 search results are accessible by scraping.

Only the first 100 search result pages are accessible in practice!

The obvious workaround is to "divide and conquer" by breaking the USA-wide search into 51 separate searches, one for each of the 50 states plus Washington D.C. This is straightforward to do with a bash script (a sketch of the same loop appears after the results below). If the search results were equally distributed across the 51 geographic regions, each search would return about 140 results, well below the limit of 1,000. Below are the regions with the top 10 and bottom 10 numbers of search results for the "data science scientist" search performed on August 30, 2014.

Top 10 regions by number of search results:
  1. CA (1792)
  2. MA (502)
  3. NY (427)
  4. D.C. (416)
  5. MD (304)
  6. WA (260)
  7. VA (240)
  8. PA (231)
  9. TX (225)
  10. IL (187)
Bottom 10 regions by number of search results:
  1. SC (11)
  2. AR (11)
  3. VT (11)
  4. MT (11)
  5. ND (8)
  6. WY (7)
  7. RI (7)
  8. MS (7)
  9. NJ (0)
  10. LA (0)
The search results are not uniformly distributed across the USA. California is the one region where the number of search results exceeds 1,000. Retrieving all relevant job listings for California given these search terms would require splitting the state into subregions.
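
For reference, here is the divide-and-conquer loop sketched in Python rather than bash; the spider name 'indeed' and the 'what'/'where' arguments are hypothetical, standing in for however the crawler is actually invoked.

  import subprocess

  # The 50 states plus Washington D.C.
  REGIONS = """AL AK AZ AR CA CO CT DE DC FL GA HI ID IL IN IA KS KY LA
  ME MD MA MI MN MS MO MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC SD
  TN TX UT VT VA WA WV WI WY""".split()

  # Run one search per region, staying under the 1,000-result limit.
  for region in REGIONS:
      subprocess.call(['scrapy', 'crawl', 'indeed',
                       '-a', 'what=data science scientist',
                       '-a', 'where=' + region])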
