Quantitative Methods: April 2015

Wednesday, April 29, 2015

Regression Analysis

Part I:

This portion of the lab introduces a scenario in which a study was conducted on crime rates and poverty in a town. The press latches onto a statistic claiming that as the number of kids eating free lunch increases, so does crime. The objective of this section is to determine whether this claim has any validity, and to determine what percentage of kids on free lunch would correspond to a crime rate of 79.7.

A table was provided with two columns, one for each variable: Percent Free Lunch and Crime Rate. SPSS was used to perform linear regression analysis to better understand the relationship between them, with percent free lunch as the independent variable and crime rate as the dependent variable. The output was as follows:

The above images show the SPSS output for a linear regression analysis on the percent free lunch and crime rate variables. The significance value at the bottom right of the bottom table (.005) is below the .05 threshold used at the 95% confidence level, so the null hypothesis that there is no linear relationship between the two variables should be rejected in favor of the alternative hypothesis, which states that there is a linear relationship between the variables. However, with such a low R Square statistic, this relationship is not very strong, and its predictive value is very limited.

The coefficients from the above output can be put into the regression equation y = a + bx.
Doing this results in the equation: y = 21.819 + 1.685x
To calculate the percentage of kids getting free lunch that corresponds to a crime rate of 79.7, the equation can be rearranged as follows:

79.7 = 21.819 + 1.685x
x = (79.7 - 21.819)/1.685

x = 34.35%

With a crime rate of 79.7, the percentage of kids getting free lunch would be about 34%. It is important to keep in mind, though, that since the R Square statistic is very low, this value cannot be relied on to be very accurate!
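The rearrangement of the regression equation can be checked with a few lines of Python. This is only a minimal sketch: the intercept (21.819) and slope (1.685) are the coefficients quoted from the SPSS output, while the function names are illustrative choices that do not come from the lab.

```python
def predict_crime_rate(pct_free_lunch, intercept=21.819, slope=1.685):
    """Crime rate predicted by the fitted line y = a + b*x."""
    return intercept + slope * pct_free_lunch


def pct_free_lunch_for(crime_rate, intercept=21.819, slope=1.685):
    """Invert y = a + b*x to solve for x = (y - a) / b."""
    return (crime_rate - intercept) / slope


x = pct_free_lunch_for(79.7)
print(f"Percent free lunch for a crime rate of 79.7: {x:.2f}")
# Plugging x back into the line should recover the crime rate we started from.
print(f"Predicted crime rate at that percentage: {predict_crime_rate(x):.1f}")
```

Plugging the result back into the forward equation is a quick way to catch arithmetic slips when solving by hand.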

Part II: Spatial Auto-correlation

Introduction:

This portion of the exercise included a scenario in which the UW System has requested an analysis of enrollment with respect to a number of variables. These include distance from each school, percent with a bachelor's degree, and median household income. This analysis can be done by running linear regression on these variables in SPSS. The output from the regression analyses, combined with spatial representations of the models' residuals, can provide insight into why students choose the schools they go to.

Methods:

Data was provided by the professor in the form of a spreadsheet including enrollment data from all UW schools and some census statistics. The next step was to create a new column of population divided by distance. This is a way to normalize the data so that a county's enrollment is not overstated simply because it has a larger population. Conversely, counties right next to a school won't be unreasonably favored simply because more students tend to attend closer colleges.
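The same column could be built outside of Excel. The sketch below is an assumption-laden illustration: the county names, populations, and distances are invented, the distance unit (miles) is a guess, and the lab does not say how a county at zero distance was handled, so the function simply refuses that case.

```python
def population_per_distance(population, distance):
    """Population divided by distance from the school."""
    if distance <= 0:
        # The lab does not describe how a zero distance was treated,
        # so this sketch raises instead of guessing.
        raise ValueError("distance must be positive")
    return population / distance


# Hypothetical counties; these values are NOT from the lab spreadsheet.
counties = [
    {"name": "County A", "population": 100_000, "distance_miles": 20.0},
    {"name": "County B", "population": 500_000, "distance_miles": 200.0},
]

for county in counties:
    county["pop_per_distance"] = population_per_distance(
        county["population"], county["distance_miles"]
    )
    print(county["name"], county["pop_per_distance"])
```

In this toy example the smaller but closer county ends up with the larger value, which is the effect the normalization is meant to capture.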

After this column was created and calculated in Excel, regression analysis was performed on these variables in relation to two schools: UW-Eau Claire and UW-Oshkosh.

Eau Claire:

This image shows the results of the regression analysis with median household income as the independent variable and UWEC enrollment as the dependent variable. With a significance of .104, which is above .05, you would fail to reject the null hypothesis, which states that there is no linear relationship between the two variables.

This image shows the results with percent with a bachelor's degree as the independent variable and Eau Claire enrollment as the dependent variable. This significance is below .05, so you would reject the null hypothesis (the same one as in the example above) in favor of the alternative hypothesis, which states that there is a linear relationship between the two variables. In this output, the R Square value is very low, so the explanatory power of the model is limited.

This image shows the results from regression with population per distance as the independent variable and UWEC enrollment as the dependent variable. The significance is .000, so you would reject the null hypothesis in favor of the alternative hypothesis in the same manner as in the above output. The difference here is that the R Square value is much higher. The value of .945 means that the model explains about 94.5% of the variation in enrollment between counties. This is a very strong, positive relationship.
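To make the R Square figure more concrete, here is a minimal sketch of how a simple linear regression and its R Square can be computed by hand in Python. The data values are invented for illustration and are not the UW enrollment data, and the function names are hypothetical. SPSS reports the same quantities in its Coefficients and Model Summary tables.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(xs, ys, intercept, slope):
    """Share of the variation in y that the fitted line explains."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Invented example data (not from the lab).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]

intercept, slope = fit_line(xs, ys)
r2 = r_squared(xs, ys, intercept, slope)
print(f"y = {intercept:.3f} + {slope:.3f}x, R Square = {r2:.3f}")
```

An R Square near 1 means the points sit close to the fitted line, while a value near 0 means the line explains very little of the variation, even if the slope is statistically significant.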

Oshkosh:

This image shows the results from regression performed on median household income and UW-Oshkosh's enrollment. This relationship is statistically significant, but the R Square value is low.

This image shows the results from regression analysis performed with percent with a bachelor's degree as the explanatory variable and Oshkosh enrollment as the dependent variable. Again, the relationship is statistically significant, but the R Square value is low, so the model isn't very useful.

This image shows the results from regression performed with Population per distance as the independent variable. It is statistically significant with a value of .000 and the R square value is very high.

With all of these regression analyses considered, it is reasonable to conclude based on their R Square values that the only variables worth pursuing in this case study are the population / distance variables for each school. Most of the other variables are statistically significant (median household income for UWEC, at .104, is not), but their R Square values are so low that they have little explanatory power.

The next step was to re-run the regression analysis, this time saving the residuals. A residual is the difference between the enrollment actually observed for a county and the enrollment the model predicts, so it represents what the model doesn't account for. These residuals were appended to the spreadsheet of data, so each county is given a residual value. High positive residual values essentially indicate outliers in the dataset, with values much higher than the model predicts. Conversely, negative residual values indicate lower values than the model predicts. These values were mapped using ArcMap, and the results are shown below.
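The residual calculation itself is simple enough to sketch in Python. Everything in the example is hypothetical: the intercept, slope, and county values are invented, and the lab does not state whether raw or standardized residuals were mapped. The standardized version shown here (residual divided by the standard error of the estimate) is one common definition, included as an assumption.

```python
import math


def residuals(observed, predicted):
    """Residual = observed value minus the value the model predicts."""
    return [obs - pred for obs, pred in zip(observed, predicted)]


def standardize(resids, n_predictors=1):
    """Divide each residual by the standard error of the estimate."""
    dof = len(resids) - n_predictors - 1
    std_err = math.sqrt(sum(r * r for r in resids) / dof)
    return [r / std_err for r in resids]


# Hypothetical fitted line and county data (not from the lab).
intercept, slope = 10.0, 0.02
pop_per_distance = [500.0, 1000.0, 1500.0, 2000.0]
enrollment = [23.0, 27.0, 37.0, 53.0]

predicted = [intercept + slope * x for x in pop_per_distance]
resids = residuals(enrollment, predicted)
z_resids = standardize(resids)

for x, r, z in zip(pop_per_distance, resids, z_resids):
    # Positive: more students than predicted. Negative: fewer than predicted.
    print(f"pop/dist={x:.0f}  residual={r:+.1f}  standardized={z:+.2f}")
```

Joining a column like `z_resids` back to the county table is what makes it possible to map the values county by county.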

Results:

This image shows the residuals from the regression of each county's UWEC enrollment on population normalized by the county's distance from UWEC.

This image shows the residuals from the regression of each county's UW-Oshkosh enrollment on population normalized by the county's distance from UW-Oshkosh.

Discussion:

The results of these maps show some interesting relationships. Because residuals show outliers in the datasets, the dark red areas in each map show areas that produce more students than would be expected based on their population and distance from each school. On the other hand, the light yellow areas show areas that send fewer students than the model predicts. Exploring the reasons why these high or low outliers are present can help with the UW-System's goal of analyzing students' decisions on colleges.

Most notable in the first map for Eau Claire are Marathon, Dane, Brown, and Waukesha counties. These are the counties whose residual values are from 1.31 to 2.99. There are a number of counties that have very low residual values as well (or they can be seen as high negative values). This means that the model predicted a value much higher than the one actually present. The highest values in this map seem to be in areas that have high populations: Wausau, Green Bay, Madison, and some of Milwaukee have these high values. It is no surprise that places with high populations will send more students, and that when they are farther away, they send fewer. But these counties send uncharacteristically high numbers of students. Though there are no definite answers as to why this is, there are a number of possibilities. Wausau doesn't have a four-year school, and UW-Eau Claire is highly regarded by teachers and community members there. UW-Eau Claire also happens to be the closest four-year school, even though it is about 1.5 hours away. Though Madison and Green Bay are a little bit farther, they still send very high numbers of students. UW-Green Bay is in Brown County, and may be seen as a lesser school than UW-Eau Claire. Students from Dane County may be coming to Eau Claire as an alternative to UW-Madison, which has strict admission requirements. Also, students may be interested in going to a smaller school than UW-Madison.

An interesting comparison is between Waukesha County and Milwaukee County. Milwaukee County has the lowest residual value of all, and is right next to a county with one of the highest residual values. This could be due to Milwaukee's very high population, in conjunction with its own four-year university and its distance from Eau Claire. Waukesha County, on the other hand, has a much lower population while being about the same distance from Eau Claire. It is likely that the two counties have very similar enrollment numbers, and that population is the difference.

For the second map with UW-Oshkosh, the two highest values are Outagamie and Fond du Lac counties. UW-Oshkosh is in Winnebago County, and it has counties with high populations on either side. Because they are so close and have high populations, they exceed the model's expected values. It is also interesting to note the areas with low enrollment relative to their population and distance. Eau Claire, La Crosse, Madison, Green Bay, and Milwaukee are all in counties with low residuals. They all have universities in them, so this is no coincidence. Milwaukee again has the largest negative residual of any county. This could be due again to its high population and industrial nature in conjunction with its own university.

Conclusion:

Regression analysis is a powerful tool for examining relationships between variables, and mapping residual values can express discrepancies in data very effectively. In this case, examining population/distance and enrollment showed a strong relationship, and a model with high explanatory power describing that relationship. As population per distance increases, so does enrollment. Mapping the residuals from this model allows for a spatial representation of outliers in the dataset, and provides important information to the UW-System about its students' university selections.

Wednesday, April 8, 2015

Correlation and Spatial Autocorrelation

Part I: Correlation

a.

The first part of this lab was to determine whether there was a correlation between distance and sound level in this dataset.

First, I used Excel to create a scatter plot of the data.

This is a scatter plot of the dataset including trend line.

Next, I went into SPSS to calculate the Pearson correlation.

The results of the Pearson Correlation test

The above chart shows that there is indeed a correlation between the two variables, distance and sound level. The negative value, shown both in the chart and in the downward slope on the graph, indicates that the direction is negative, so when distance increases, sound level decreases. The Pearson Correlation statistic is -.896, indicating that there is a very high strength associated with this relationship. (Rounded to -.90, because strength is categorized to just two decimal places.)

Hypothesis testing for this data goes as follows:

1) State Null Hypothesis: There is no association/linear relationship between distance and decibel level

2) State Alternative Hypothesis: There is a significant association or linear relationship between them

3) This will be tested with Pearson Correlation test

4) α = .01 (99% confidence level) with two tails

5) Calculate Test Statistic: r=-.896

6) Make decision about hypotheses: Reject null hypothesis in favor of the alternative hypothesis. There is a high to very high negative linear relationship between distance and sound level.
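The steps above can also be sketched in Python for readers without SPSS. The distance and sound readings below are invented for illustration (the lab's dataset is not reproduced here), and the function names are hypothetical. The t statistic shown is the standard way to test whether r differs from zero; it would be compared against a two-tailed critical value at α = .01 with n - 2 degrees of freedom, whereas SPSS reports the significance directly.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """Test statistic for H0: no linear relationship (n - 2 degrees of freedom)."""
    return r * math.sqrt((n - 2) / (1 - r * r))


# Hypothetical readings: distance from a source and measured sound level.
distance = [10, 20, 30, 40, 50]
sound_level = [100, 92, 85, 80, 70]

r = pearson_r(distance, sound_level)
t = t_statistic(r, len(distance))
print(f"r = {r:.3f}, t = {t:.2f}, degrees of freedom = {len(distance) - 2}")
```

A negative r with a large absolute t value corresponds to the decision in step 6: reject the null hypothesis and conclude there is a negative linear relationship.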

b.

This mini-exercise included creating a correlation matrix for a number of variables in Milwaukee County, WI. This was done in SPSS, and the results are shown and interpreted below:

Correlation matrix for variables: percent white, percent black, percent Hispanic, percent with no high school diploma, percent with a bachelor's degree, percent below the poverty line, and percent that walk to work.
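For readers without SPSS, a matrix of this kind can be sketched in plain Python by computing the Pearson correlation for every pair of columns. The tract values below are invented purely for illustration and are not the Milwaukee County data; the column names and function names are likewise hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(columns):
    """Nested dict holding r for every pair of named columns."""
    names = list(columns)
    return {a: {b: pearson_r(columns[a], columns[b]) for b in names} for a in names}


# Invented census-tract percentages (not the lab's data).
tracts = {
    "pct_white": [80.0, 60.0, 30.0, 10.0],
    "pct_black": [10.0, 25.0, 55.0, 80.0],
    "pct_below_poverty": [5.0, 12.0, 25.0, 35.0],
}

matrix = correlation_matrix(tracts)
for row_name, row in matrix.items():
    print(row_name, {col: round(value, 3) for col, value in row.items()})
```

The diagonal is always 1 and the matrix is symmetric, so only the values on one side of the diagonal need to be interpreted.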

I'll start with percent white as it relates to other variables. First, it has a high, negative relationship to the percent black variable. This is partly expected, as they are two separate percentages competing for space in a census tract. Even with this in mind, such a high strength relationship is more than just a coincidence. This means that where there are lots of white people, there are very few black people and vice versa. The same is true for the relationship between percent white and percent Hispanic, but the strength is much lower. This low of a number barely indicates a relationship. It could just mean that where there is a high percentage of white people in a given area, there isn't enough statistical space for a high percentage of Hispanics, as mentioned above. Also, there is a moderate negative correlation between percentage of white people and percent of people with no high school diploma. This means that in areas with lots of white people, there are lower levels of people without diplomas. Next, percentage of white people has a moderate positive relationship with the bachelor's degree variable, so areas with lots of whites also have many people with college degrees. There is a high, negative linear relationship between percentage of white people and percent below the poverty line, so areas with high percentages of whites have few poor people. There is no significant correlation between percent white and percent walking.

Next, I'll reference percent black in relation to other variables. There is a significant, negative correlation between percent black and percent Hispanic, but it is very low strength, as it was between whites and Hispanics. Next, there is a positive, low strength linear relationship between percentage black and those without high school diplomas. In other words, areas with high percentages of blacks have high percentages of high school dropouts. There is a moderate strength, negative relationship between percentage of blacks and percentage with a bachelor's degree. There is another moderate relationship between percent black and percent below the poverty line, this time positive. This means that areas with lots of black people have statistically higher percentages in poverty. Again, percent walking doesn't have a relationship.

Next, percent Hispanic. Relationships with other races have already been discussed, so first I'll note the strong association between Hispanics and lack of high school diplomas. Percent Hispanic has a low strength, negative association with percentage receiving bachelor's degrees, and a low strength positive linear relationship with percent below the poverty line. This means that areas with high levels of Hispanics have high percentages of those without high school diplomas, low percentages of bachelor's degrees, and somewhat higher poverty. There is no correlation with percent walking to work.

Next, the percentage of those with no high school diploma has a moderate strength negative relationship with those with bachelor's degrees and a moderate strength positive one with those below the poverty line. This is logical, as it is expected that high school dropouts won't continue to college, and will make less money. Again, there is no statistical relationship with those who walk to work.

Percentage of people with a bachelor's degree has a moderate strength negative relationship with percentage below the poverty line, so areas with lots of bachelor's degrees have lower poverty, statistically.

Finally, the percentage of those below the poverty line does have a significant positive relationship with percentage of people walking to work (even though it is low strength). This means areas of high poverty have high percentages of people walking to work, which makes sense.

The general patterns here are unfortunate, but logical. The correlations present indicate that areas with high percentages of white people have low percentages of black and Hispanic people, and less prevalence of poverty along with variables associated with it, like lack of high school diplomas. This picture looks quite different for members of minorities. The data is almost flipped for percent black and percent Hispanic: areas with high percentages of these minorities have high levels of poverty, low percentages of high school graduates, and logically, low percentages of college graduates.

These patterns reflect inequalities that are very real throughout the United States. Using a correlation matrix validates these often subtle inequalities, and demonstrates them in a way that can be scientifically studied. It shows the shortcomings of our social systems here in the US, as close to home as in Milwaukee.