Monday, December 28, 2015

The Geographic Distribution of US Salaries and Living Costs

Hired recently released a State of US Salaries Report.  The goal of the report is to empower job candidates by helping them understand their market value.  Reports of this kind are important in light of the deficiencies shared by existing salary data sources, where data may be skewed by self-reporting and where factors like cost of living, experience, and company size may not be considered.

Hired's business model is to pair job candidates with companies, and the company has used its proprietary dataset to examine salaries for software engineers in major cities around the US. Particularly interesting to me is the discussion of how salary scales (or does not scale) with cost of living.

I wanted to know whether it is possible to perform an independent salary versus cost-of-living analysis for many job titles simultaneously using freely available online salary tools. Major online salary tools include Payscale and Salary.com.  Basic usage of these websites involves specifying a job title plus additional modifiers like location, experience, and company size. Both services ultimately present a distribution of salaries given the specified information. Payscale asks for a lot of information with every query, which makes the interface too cumbersome for efficiently extracting salary information for the 100 largest US cities crossed with multiple job titles.

Salary.com is more amenable to this task.  Their Salary Wizard requires only a valid job title and city-state combination, and it returns a graph of the salary distribution for that job title and location. The documentation says localized salary distributions are based on national average salary distributions scaled by a 'geographic salary equivalent factor', which is analogous to a cost-of-living correction.

Queries to Salary.com produce predictable URLs of the form:
http://swz.salary.com/SalaryWizard/[Job]-Salary-Details-[city]-[state].aspx

Examples:
http://swz.salary.com/SalaryWizard/Web-Designer-Salary-Details-Alexandria-VA.aspx
http://swz.salary.com/SalaryWizard/Web-Designer-Salary-Details-San-Jose-CA.aspx
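
As a rough sketch of how these URLs can be generated in Python (the short city list here is a stand-in for the 100 largest US cities used in the actual study):

    # Build Salary.com Salary Wizard URLs from a job title and (city, state) pairs.
    TEMPLATE = 'http://swz.salary.com/SalaryWizard/{job}-Salary-Details-{city}-{state}.aspx'

    def build_url(job_title, city, state):
        # Spaces in the job title and city name become hyphens in the URL.
        return TEMPLATE.format(job=job_title.replace(' ', '-'),
                               city=city.replace(' ', '-'),
                               state=state)

    cities = [('Alexandria', 'VA'), ('San Jose', 'CA'), ('Austin', 'TX')]
    urls = [build_url('Web Designer', city, state) for city, state in cities]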

Note that the salary distribution plot is drawn in HTML and is not a static image, which means the salary numbers in the plot are all scrapable.  You just need to inspect the HTML source to understand the structure of the tags encoding the desired numbers. The median salary, for example, is stored in the tag with id='basemid'.

Knowing this, I used Python to generate and scrape URLs for the 100 largest US cities crossed with four arbitrary job titles (data warehouse specialist, high school teacher, orthodontist, and surgeon). For each request, the web page was pulled down with Python's urllib2, and the HTML source was parsed with BeautifulSoup to extract the 25th, 50th (i.e., median), and 75th percentile salaries.  Scraping all 400 pages took a few minutes, whereas pulling the data down by hand would have taken me several hours.  This is an example of web scraping without web crawling.
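
Here is a minimal sketch of the scraping step with urllib2 and BeautifulSoup. Only the id='basemid' tag is confirmed above; the tags holding the 25th and 75th percentile values have to be identified from the page source in the same way, and the example URL simply follows the template shown earlier:

    import urllib2
    from bs4 import BeautifulSoup

    def scrape_median_salary(url):
        # Pull the page down and parse the HTML source.
        html = urllib2.urlopen(url).read()
        soup = BeautifulSoup(html, 'html.parser')
        # The median salary lives in the tag with id='basemid'; the 25th and 75th
        # percentile values sit in analogous tags whose ids can be read off the
        # page source the same way.
        text = soup.find(id='basemid').get_text()
        # Strip the dollar sign and thousands separators, e.g. '$50,000' -> 50000.0
        return float(text.replace('$', '').replace(',', ''))

    median = scrape_median_salary(
        'http://swz.salary.com/SalaryWizard/Web-Designer-Salary-Details-Austin-TX.aspx')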

A convenient way to view the salary information is a 2D map with cities color-coded by salary. Figure 1 is one example I made for the median salary of data warehouse specialists using Basemap. Note that salaries are higher in cities known to be more expensive, like San Francisco and New York.

Figure 1: Median salary for data warehouse specialists in major cities across the US.
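
A minimal sketch of how a map like Figure 1 can be drawn with Basemap, assuming the city coordinates and scraped medians have already been collected (the three cities and salary values below are placeholders, not scraped results):

    import matplotlib.pyplot as plt
    from mpl_toolkits.basemap import Basemap

    lons = [-97.74, -122.42, -74.01]          # Austin, San Francisco, New York
    lats = [30.27, 37.77, 40.71]
    salaries = [90000, 120000, 115000]        # placeholder numbers, not scraped values

    m = Basemap(projection='merc', llcrnrlat=24, urcrnrlat=50,
                llcrnrlon=-125, urcrnrlon=-66, resolution='l')
    m.drawcoastlines()
    m.drawcountries()
    m.drawstates()
    x, y = m(lons, lats)                      # project lon/lat into map coordinates
    sc = m.scatter(x, y, c=salaries, s=80, zorder=5)
    plt.colorbar(sc, label='Median salary ($)')
    plt.title('Median salary by city')
    plt.show()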

We can also probe the geographic salary equivalent factor that Salary.com applies to generate localized salary distributions.  The corrections themselves are proprietary, but we can pick a city to normalize against and see how the geographic corrections compare across cities and job titles. Figure 2 shows the median salary for data warehouse specialists relative to the median salary for that job in Austin, TX. Austin is a reasonable standard because it is neither the most expensive nor the cheapest city.

Figure 2: Salaries relative to the median data warehouse specialist salary paid in Austin, TX.
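
The renormalization itself is just a division by the Austin value. A minimal sketch, assuming the scraped medians for one job title are stored in a dict keyed by city (the values shown are placeholders):

    # medians maps 'City-ST' -> scraped median salary for one job title.
    medians = {'Austin-TX': 90000.0, 'San-Francisco-CA': 120000.0}   # placeholder values
    austin = medians['Austin-TX']
    relative = {city: salary / austin for city, salary in medians.items()}
    # relative[city] is the Salary.com geographic factor for that city vs. Austin.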

Repeating this exercise for the other three job titles (high school teacher, orthodontist, and surgeon) yields exactly the same result.  All localized salaries reported by Salary.com scale relative to Austin in precisely the same way; Salary.com has assigned an industry-independent correction factor to every locale. In effect, they assume that the median salary for any job in a given US city falls between about 80% and 127% of the corresponding median salary in Austin.

Next, let's consider how the Salary.com geographic correction factors relate to cost-of-living indexes. I evaluate three different cost-of-living measures: the consumer price index, the consumer price + rent index, and the money.cnn.com cost-of-living calculator. Numbeo.com publishes a free list of year-2015 consumer price indexes for major cities; these indexes were available for 53% of the cities in my study. The numbeo.com consumer price index used here reflects consumer prices for groceries, restaurants, transportation, and utilities, but not rent or mortgage.

The left panel of Figure 3 compares the Salary.com geographic correction with the consumer price index.  All values are normalized to Austin, TX.  The red error bars mark the 25th and 75th percentiles of the Salary.com salary distributions.  The geographic corrections are reasonably consistent with the consumer price indexes, with poorer agreement at the low and high ends of the numbeo.com axis.

Figure 3
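
A minimal sketch of one panel of Figure 3, assuming the Austin-normalized quantities have already been assembled into arrays (the values below are placeholders and the variable names are my own):

    import numpy as np
    import matplotlib.pyplot as plt

    # One entry per city, all normalized to Austin, TX:
    # cpi      - numbeo.com consumer price index ratio
    # med      - Salary.com median-salary ratio (the geographic correction)
    # q25, q75 - ratios built from the 25th and 75th percentile salaries
    cpi = np.array([1.00, 1.15, 1.25])        # placeholder values
    med = np.array([1.00, 1.10, 1.20])
    q25 = np.array([0.90, 1.00, 1.08])
    q75 = np.array([1.10, 1.22, 1.33])

    # Asymmetric error bars spanning the 25th-75th percentile range.
    yerr = np.vstack([med - q25, q75 - med])
    plt.errorbar(cpi, med, yerr=yerr, fmt='o', color='red')
    plt.plot([0.8, 1.4], [0.8, 1.4], 'k--')   # 1:1 reference line
    plt.xlabel('Consumer price index relative to Austin')
    plt.ylabel('Salary.com correction relative to Austin')
    plt.show()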

The relative prices of consumer goods and rent can be quite different in certain locations, such as Austin. Fortunately, numbeo.com also publishes a consumer price + rent index, which the middle panel of Figure 3 compares against the same geographic correction.  Here the Salary.com geographic adjustments do not agree as well with the cost-of-living measure.  Taking this comparison at face value suggests that salaries in cheaper places to live (lower consumer price + rent index) afford a higher standard of living than salaries in more expensive cities.

Finally, let's compare the Salary.com geographic correction with the cost-of-living differences inferred from the money.cnn.com cost-of-living calculator. I assume this is the most reliable of the three cost-of-living measures because it includes groceries, housing, utilities, transportation, and health care. With this tool, I was able to measure living expenses relative to Austin for 69% of my cities. The right panel of Figure 3 shows that the Salary.com geographic correction tends not to scale with living costs for cities more than 20% more expensive than Austin (e.g., Boston, Chicago, LA, San Francisco, and Portland).

In summary, I scraped Salary.com for localized salary distributions in major US cities. Renormalizing the salaries relative to Austin reveals the geographic correction factors that were applied to national average salary distributions. Salary.com assumes compensation for all jobs in a local market scales by the same industry-independent geographic correction, which may or may not reflect reality.  If the Salary.com geographic correction accurately reflects real geographic differences in salaries, then the compensation paid by employers does not, on average, keep pace with the cost of living in the most expensive US cities.





Sunday, December 27, 2015

Does Congressional Approval Rating Vary Across States in the Union?

This post is adapted from a statistics project I completed as part of the Data Analysis and Statistical Inference course on Coursera in April 2015. 



Introduction

The purpose of this report is to explore how Congressional approval rating varies by US state. Nationwide, the vast majority of Americans disapprove of Congress. I want to explore whether this varies by state and see whether any specific groups of states disapprove of Congress more than the rest of the country. This question should be of interest to all Americans, since we all benefit from a Congress that meets the needs of the electorate.

Data
The data for this exercise come from the American National Election Studies (ANES). ANES topics cover voting behavior and elections, together with questions on public opinion and attitudes of the electorate. In all Time Series studies, an interview is completed just after the election (the Post-election or “Post” interview); in years with Presidential elections, an interview is also completed just before the election (the Pre-election or “Pre” interview). Thus, every “case” in the data set is a person who was questioned before and after the 2012 election.

This data set constitutes a retrospective observational study. The two variables I use are categorical: the US state where the individual resides (and voted), and their Congressional approval rating (“Approve” versus “Disapprove”). Since this is an observational study with random sampling but not random assignment, the results should generalize to people across the various US states, though non-response bias is one possible source of bias. Because retrospective observational studies do not allow for random assignment, this study cannot reveal a causal relationship (e.g., that living in Maryland makes people hate Congress more).


Exploratory Data Analysis

The data set contains 6300 respondents who have given a Congressional approval rating. Of these respondents, only 1643 (26.1%) actually approve of Congress. This is an alarming statistic by itself.
The number of respondents in each state varies from 3 to 745 with a median of 82. The two states with fewer than 5 respondents are Alaska and Wyoming; these states will be excluded from the inference study below.
Figure 1 below shows the percentage of approval ratings by US state, with bars ordered alphabetically by state abbreviation. The approval percentage by state varies all the way from 8.7% (MT) to 42.2% (NM). There is clearly a lot of variation around the 26.1% overall average, which is the justification for looking for differences between states.
Figure 1
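
A minimal pandas sketch of this tabulation, assuming the ANES responses sit in a table with one row per respondent; the file name and column names here are hypothetical:

    import pandas as pd

    # One row per respondent, with hypothetical columns 'state' (two-letter
    # abbreviation) and 'approval' ('Approve'/'Disapprove').
    anes = pd.read_csv('anes_2012.csv')       # hypothetical file name

    overall = (anes['approval'] == 'Approve').mean()          # ~0.261 in the real data
    by_state = (anes.groupby('state')['approval']
                    .apply(lambda s: (s == 'Approve').mean()))
    counts = anes['state'].value_counts()                     # respondents per state

    by_state.sort_index().plot(kind='bar', figsize=(12, 4))   # Figure 1-style bar chart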


Inference

Now it is time to make an inference as to whether or not there are real differences in approval rating between states. The null hypothesis is that approval rating does not vary by state; that is, each state's approval rating equals the 26.1% overall approval rating. My alternative hypothesis is that there is a difference, and that the actual approval rating in a given state can differ from this overall average.
The appropriate test to use is the chi-square test of independence. The test requires that the samples be independent. This requirement is satisfied because:
  • the survey used random sampling,
  • the number of people sampled is <10% of the population,
  • each case in the survey only contributes to one cell in a contingency table of approval rating versus state.
The sample-size requirement means that every valid state must have at least 5 approval ratings and at least 5 disapproval ratings. Three states have fewer than 5 of both ratings (AK, HI, WY). Eight more states have fewer than 5 approval ratings but at least 5 disapproval ratings (DC, ID, ME, MT, ND, NH, SD, VT); in some cases the number of disapproval ratings is 10 or more (ID, ME, MT, NH). These states are all discarded as well, leaving 40 valid states to consider.
Given the overall approval rating of 26.1%, I calculate the expected number of approvals and disapprovals for each state from the number of responses received in that state. I then compute the chi-square statistic by summing, over the approval and disapproval cells of every state, the squared difference between the actual and expected counts divided by the expected count. The resulting chi-square statistic is 99.96, with 39 degrees of freedom given that there are 40 valid states.
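
A minimal sketch of this calculation, continuing from the DataFrame in the sketch above and mirroring the expected-count formula just described:

    import numpy as np
    import pandas as pd

    # Contingency table of state x approval (rows: states; columns: Approve, Disapprove).
    table = pd.crosstab(anes['state'], anes['approval'])

    # Keep only states with at least 5 responses in both cells, per the rule above.
    table = table[(table >= 5).all(axis=1)]

    # Expected counts: each state's response total split by the overall approval rate.
    p_approve = (anes['approval'] == 'Approve').mean()          # ~0.261
    n_state = table.sum(axis=1).values.astype(float)
    expected = np.column_stack([n_state * p_approve,
                                n_state * (1.0 - p_approve)])

    # Chi-square statistic: sum of (observed - expected)^2 / expected over all cells.
    chi_sq = ((table.values - expected) ** 2 / expected).sum()  # ~99.96 in the real data
    dof = len(table) - 1                                        # 40 states - 1 = 39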
The appropriate chi-square distribution is visualized in Figure 2.
Figure 2
The area under the curve beyond 99.96 is vanishingly small; the corresponding p-value is 2.96e-07, well below the standard significance level of 0.05.  There is no question that there is statistically significant variation in approval rating between the states, and the result would not have changed if the excluded states were folded back in.
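
A minimal check of this tail area with scipy, using the chi-square distribution with 39 degrees of freedom:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import chi2

    # Survival function = area under the chi-square(39) density beyond the statistic.
    p_value = chi2.sf(99.96, df=39)          # should reproduce the reported p-value of ~3e-07

    x = np.linspace(0, 120, 500)
    plt.plot(x, chi2.pdf(x, df=39))          # the density shown in Figure 2
    plt.axvline(99.96, color='red')          # the observed statistic
    plt.show()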

Which states contribute most to the chi-square? The table of state, chi-square contribution, and approval rating is listed below, in order of decreasing contribution.

State  Chi-square  Approval rate
MD  11.83  0.1154
TX  8.89  0.2948
CA  8.27  0.2763
SC  7.51  0.3462
NM  6.37  0.3649
LA  5.07  0.3388
VA  4.58  0.1774
TN  3.66  0.1835
MS  3.21  0.3846
FL  2.97  0.2854
MA  2.80  0.3209
CT  2.36  0.1875
CO  2.35  0.1954
KY  2.30  0.1928
OR  2.20  0.2000
NY  1.90  0.2889
IA  1.85  0.1944
UT  1.70  0.1923
AL  1.69  0.3214
AZ  1.68  0.2143
NV  1.57  0.1935
OH  1.54  0.2469
WA  1.48  0.2805
KS  1.45  0.1714
NC  1.10  0.2797
IL  1.03  0.2672
OK  1.02  0.2273
MN  1.00  0.2273
IN  0.93  0.2627
DE  0.88  0.3333
PA  0.85  0.2580
RI  0.77  0.2857
MI  0.59  0.2568
AR  0.51  0.2143
NE  0.49  0.2750
WV  0.38  0.2162
WI  0.31  0.2481
GA  0.30  0.2616
NJ  0.29  0.2532
MO  0.28  0.2523
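
A minimal sketch of how this ranking can be produced from the observed and expected counts defined in the chi-square sketch above:

    import pandas as pd

    # Per-state contribution: sum the (observed - expected)^2 / expected terms
    # across the Approve and Disapprove cells for each state.
    contrib = ((table.values - expected) ** 2 / expected).sum(axis=1)
    summary = pd.DataFrame({'chi_sq': contrib,
                            'approval': table['Approve'] / table.sum(axis=1)},
                           index=table.index)
    print(summary.sort_values('chi_sq', ascending=False))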


Conclusion

I have found strong evidence of statistically significant state-by-state variation in Congressional approval rating: the approval rating in a given state is not simply the overall average approval rating. Rather, there are states with approval ratings significantly above (e.g., NM) and below (e.g., MD) this average. This raises the question of why approval ratings in some states sit so far above or below the overall average. Obvious variables to consider are markers of economic success, like statewide employment rates, income, and health care coverage.

Citation
The American National Election Studies (ANES). The ANES 2012 Time Series Study [dataset]. Stanford University and the University of Michigan [producers].

These materials are based on work supported by the National Science Foundation under grants SES-0937727 and SES-0937715, Stanford University, and the University of Michigan.
Any opinions, findings and conclusions or recommendations expressed in these materials are those of the author(s) and do not necessarily reflect the views of the funding organizations.
Link to data: http://bit.ly/dasi_anes_data