The business model of Hired is to pair job candidates with companies. Hired has used its proprietary dataset to examine salaries for software engineers in major cities around the US. Particularly interesting to me is the discussion of how salary scales (or not) with cost of living.
I wanted to know whether it was possible to perform an independent salary versus cost-of-living analysis for many job titles simultaneously using freely available online salary tools. Major online salary tools include Payscale and Salary.com. Basic usage of these websites involves specifying a job title, plus additional modifiers like location, experience, and company size. Both services ultimately present a distribution of salaries given the specified information. Payscale asks for a lot of information with every query, which makes the interface too cumbersome for efficiently extracting salary information for the 100 largest US cities times multiple job titles.
Salary.com is more amenable to this task. Their Salary Wizard only requires a valid job title and city-state combination. It returns a graph of salary distributions for the job title and location. The documentation says localized salary distributions are based on national average salary distributions that are scaled by a 'geographic salary equivalent factor' analogous to a cost-of-living correction.
Queries made through Salary.com produce predictable URLs of the form:
http://swz.salary.com/SalaryWizard/[Job]-Salary-Details-[city]-[state].aspx
Examples:
http://swz.salary.com/SalaryWizard/Web-Designer-Salary-Details-Alexandria-VA.aspx
http://swz.salary.com/SalaryWizard/Web-Designer-Salary-Details-San-Jose-CA.aspx
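As a rough illustration, URLs of this form could be generated with a small helper like the one below. This is only a sketch: the function name build_url is hypothetical, and it assumes that spaces in job titles and city names simply become hyphens, as in the San Jose example. Titles or cities with other punctuation may need different handling.

```python
def build_url(job, city, state):
    """Build a Salary Wizard URL following the pattern shown above."""
    # Assumption: spaces in the job title and city become hyphens,
    # as in the "San-Jose" example; other punctuation is not handled.
    job_slug = "-".join(job.split())
    city_slug = "-".join(city.split())
    return ("http://swz.salary.com/SalaryWizard/"
            f"{job_slug}-Salary-Details-{city_slug}-{state}.aspx")


# Every job title crossed with every (city, state) pair.
jobs = ["Web Designer"]
cities = [("Alexandria", "VA"), ("San Jose", "CA")]
urls = [build_url(job, city, state) for job in jobs for city, state in cities]

for url in urls:
    print(url)
```

With the two cities above, the loop reproduces the two example URLs.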
Note that the salary distribution plot is drawn in HTML and is not a static image. This means the salary numbers in the plot are all scrapable. You just need to inspect the HTML source code to understand the structure of the tags encoding the desired numbers. The median salary, for example, is stored in the HTML tag with id='basemid'.
Knowing this, I used Python to generate and scrape URLs for the 100 largest US cities times 4 arbitrary job categories (data warehouse specialist, high school teacher, orthodontist, and surgeon). For each request, the web page was pulled down with Python urllib2, and the HTML source code was parsed with BeautifulSoup to extract salary numbers for the 25th, 50th (i.e., median), and 75th percentiles. Scraping all 400 pages took a few minutes, whereas pulling the data down by hand would have taken me several hours. This is an example of web scraping without web crawling.
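The extraction step can be sketched roughly as follows. To keep the example free of third-party dependencies, it swaps in the standard library's html.parser in place of BeautifulSoup, so it is not the code I actually ran. The class TextById, the function extract_salary, and the sample HTML snippet with its dollar figure are all made up for illustration; only the id 'basemid' comes from the real page.

```python
from html.parser import HTMLParser


class TextById(HTMLParser):
    """Collect the text inside the first element with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0       # > 0 while inside the target element
        self.chunks = []     # text fragments found inside the target
        self.done = False    # True once the target element has closed

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth > 0:
            self.depth += 1  # nested tag inside the target
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_salary(html, element_id="basemid"):
    """Return the salary in the element as an int, or None if absent."""
    parser = TextById(element_id)
    parser.feed(html)
    # Assumption: the value is formatted like "$71,234" (whole dollars).
    digits = "".join(parser.chunks).replace("$", "").replace(",", "").strip()
    return int(digits) if digits else None


# Made-up snippet for illustration; not copied from a real page.
sample = '<div><span id="basemid">$71,234</span></div>'
print(extract_salary(sample))
```

In Python 3 the page source could be fetched with urllib.request.urlopen(url).read().decode() before being passed to extract_salary. The depth counter assumes nested tags inside the target element are properly closed, so an unclosed tag such as a bare br would throw it off.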
A convenient way to view the salary information is a 2D map with cities color coded by salary. Figure 1 is one example I made for the median salary of data warehouse specialists using Basemap. Note that salaries are higher in cities known to be more expensive, like San Francisco and New York.
Figure 1: Median salary for data warehouse specialists in major cities across the US.
We can further attempt to understand the geographic salary equivalent factor applied by Salary.com to generate localized salary distributions. Those corrections are proprietary, but we can pick a city to normalize against and see how geographic corrections compare for different cities and job titles. Figure 2 shows median salary for data warehouse specialists relative to the median salary for that job in Austin, TX. Austin is a reasonable standard because it is neither the most expensive nor cheapest city.
Figure 2: Salaries relative to the median data warehouse specialist salary paid in Austin, TX.
Repeating this exercise for the other three job titles (high school teacher, orthodontist, and surgeon) yields exactly the same result. All localized salaries reported by Salary.com scale relative to Austin in precisely the same way; Salary.com has assigned an industry-independent correction factor to every locale. They assume that, for any given job, the median salary across the US ranges between about 80% and 127% of the median salary in Austin.
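The normalization and the cross-job comparison amount to something like the sketch below. The salary numbers are placeholders chosen for illustration, not scraped values, and "City X", "City Y", and the function names are hypothetical.

```python
import math


def relative_to(reference_city, salaries):
    """Divide every city's median salary by the reference city's."""
    ref = salaries[reference_city]
    return {city: value / ref for city, value in salaries.items()}


def same_scaling(rel_a, rel_b, tol=1e-3):
    """True if two jobs have matching relative salaries in every city."""
    return all(math.isclose(rel_a[c], rel_b[c], rel_tol=tol) for c in rel_a)


# Placeholder medians for two jobs (NOT real Salary.com data).
job_a = {"Austin, TX": 100000, "City X": 127000, "City Y": 80000}
job_b = {"Austin, TX": 50000, "City X": 63500, "City Y": 40000}

rel_a = relative_to("Austin, TX", job_a)
rel_b = relative_to("Austin, TX", job_b)

# If the correction is job-independent, the two sets of ratios agree.
print(same_scaling(rel_a, rel_b))
```

The tolerance is there because scraped salaries are rounded to whole dollars, so ratios for different jobs would only match approximately.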
Next, let's consider how the Salary.com geographic correction factors relate to cost-of-living indexes. I evaluate three different cost-of-living measures: consumer price index, consumer price + rent index, and the money.cnn.com cost-of-living calculator. Numbeo.com has published a free list of year-2015 consumer price indexes for major cities. These indexes were available for 53% of the cities in my study. The numbeo.com consumer price index used here reflects consumer prices for groceries, restaurants, transportation, and utilities, but not rent or mortgage.
The left panel of Figure 3 compares the Salary.com geographic correction with the consumer price index. All values are normalized to Austin, TX. The red error bars represent the 25th and 75th percentiles of the Salary.com salary distributions. The geographic corrections seem reasonably consistent with the consumer price indexes, with poorer agreement at the low and high ends of the numbeo.com axis.
Figure 3
The relative prices of consumer goods and rent can be disparate in certain locations, such as Austin. Fortunately, numbeo.com also publishes a consumer price + rent index. The middle panel of Figure 3 compares the same geographic correction against this consumer price + rent index. Here the Salary.com geographic adjustments do not agree as well. Taking this comparison at face value suggests that salaries in cheaper places to live (lower consumer price + rent index) enable a higher standard of living compared to more expensive cities.
Finally, let's compare the Salary.com geographic correction with the cost-of-living differences inferred from the money.cnn.com cost-of-living calculator. I assume this is the most reliable of the three cost-of-living measures because it includes groceries, housing, utilities, transportation, and health care. With this tool, I was able to measure the living expenses relative to Austin for 69% of my cities. The right panel of Figure 3 shows the Salary.com geographic correction tends not to scale with living costs for cities more than 20% more expensive than Austin (e.g., Boston, Chicago, LA, San Francisco, Portland, etc.).
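One simple way to quantify these comparisons is to divide the salary correction by the cost-of-living factor, with both normalized to Austin. The sketch below does this with placeholder numbers, not values from my data. The function name and the interpretation of the ratio as a rough "salary goes further than in Austin" measure are my own simplifications.

```python
def salary_to_cost_ratio(salary_factor, cost_factor):
    """Ratio of relative salary to relative living cost (Austin = 1.0).

    A value below 1 means the salary correction does not keep up with
    living costs relative to Austin; above 1 means it more than keeps up.
    """
    return salary_factor / cost_factor


# Placeholder (salary factor, cost-of-living factor) pairs, Austin = 1.0.
# These are illustrative numbers, NOT measured values.
cities = {
    "Austin, TX": (1.00, 1.00),
    "Expensive city": (1.20, 1.50),
    "Cheap city": (0.90, 0.80),
}

ratios = {name: salary_to_cost_ratio(s, c) for name, (s, c) in cities.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}")
```

A city sitting on the one-to-one line in Figure 3 would have a ratio of 1 under this measure.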
In summary, I scraped Salary.com for localized salary distributions in major US cities. Renormalizing the salaries relative to Austin reveals the geographic correction factors that were applied to national average salary distributions. Salary.com assumes compensation for all jobs in a local market scales by the same industry-independent geographic correction, which may or may not reflect reality. If the Salary.com geographic correction accurately reflects real geographic differences in salaries, then the compensation paid by employers does not on average scale with the cost of living in the most expensive US cities.