Wednesday, April 29, 2015

Regression Analysis

Part I: 


This portion of the lab introduces a scenario in which a study is conducted on crime rates and poverty in a town. The press latches onto a statistic that claims that as the number of kids eating free lunch increases, so does crime. The objective of this section is to determine whether this claim has any validity, and to determine what percentage of kids on free lunch would yield a crime rate of 79.7.

A table was provided with two columns, one for each variable: Percent Free Lunch and Crime Rate. SPSS was used to perform linear regression analysis to better understand the relationship between them, with percent free lunch as the independent variable and crime rate as the dependent variable. The output was returned as follows:

The above images show the SPSS output for a linear regression analysis on the percent free lunch and crime rate variables. The significance value on the bottom right of the bottom table (.005) is below the .05 threshold for a 95% confidence level, so the null hypothesis that there is no linear relationship between the two variables should be rejected in favor of the alternative hypothesis, which states that there is a linear relationship between them. However, with such a low R Square statistic, this relationship is not very strong, and its predictive value is very limited.
The coefficients from the above output can be put into the regression equation y = a + bx.
Doing this results in the equation: y = 21.819 + 1.685x
To calculate the percentage of kids getting free lunch that corresponds to a crime rate of 79.7, the equation can be rearranged as follows:
79.7 = 21.819 + 1.685x
x = (79.7 - 21.819)/1.685
x = 34.35%
With a crime rate of 79.7, the percentage of kids getting free lunch would be about 34%. It is important to keep in mind, though, that since the R Square statistic is very low, this value cannot be relied on to be very accurate.
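The rearrangement above can be checked with a few lines of Python. This is only a sketch: the intercept and slope are the values quoted from the SPSS output, and the function names are made up for illustration.

```python
# Coefficients as quoted from the SPSS output in this post.
INTERCEPT = 21.819  # a
SLOPE = 1.685       # b


def predict_crime_rate(pct_free_lunch):
    """Forward prediction from the fitted line: y = a + b*x."""
    return INTERCEPT + SLOPE * pct_free_lunch


def free_lunch_for_crime_rate(crime_rate):
    """Invert the fitted line to solve for x: x = (y - a) / b."""
    return (crime_rate - INTERCEPT) / SLOPE


x = free_lunch_for_crime_rate(79.7)
print(round(x, 2))  # percent free lunch implied by a crime rate of 79.7
```

Plugging the result back into `predict_crime_rate` should return 79.7, which is a quick way to confirm the algebra.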

Part II: Spatial Auto-correlation

Introduction:


This portion of the exercise included a scenario in which the UW System has requested an analysis of enrollment with respect to a number of variables. These include distance from each school, percent with a bachelor's degree, and median household income. This analysis can be done using linear regression analysis on these variables with the help of SPSS statistics editor. The output from the regression analyses combined with spatial representations of the models' residuals can provide input into why students choose the schools they go to.

Methods:


Data was provided by the professor in the form of a spreadsheet including enrollment data from all UW schools, and some census statistics. The next step was to create a new column as population divided by distance. This is a way to normalize the data to ensure that just because a county has a higher population, the number of students enrolled won't be off the charts. Conversely, counties right next to a school won't be unreasonably favored (because more students will attend closer colleges).
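As a rough illustration of the normalization step (which was done in Excel for this lab), the sketch below computes the same population-divided-by-distance column in Python. The county names, populations, and distances are invented, and the distance units are an assumption.

```python
# Hypothetical county rows; names, populations, and distances are made up.
counties = [
    {"county": "A", "population": 90000, "distance": 30.0},
    {"county": "B", "population": 500000, "distance": 250.0},
    {"county": "C", "population": 20000, "distance": 10.0},
]


def population_per_distance(population, distance):
    """Normalize a county's population by its distance from the school."""
    return population / distance


for row in counties:
    row["pop_per_dist"] = population_per_distance(row["population"], row["distance"])
    print(row["county"], row["pop_per_dist"])
```

In this made-up example, the large but distant county B ends up with the same normalized value as the small but nearby county C, which is the balancing effect described above.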

After this column was created and calculated in Excel, regression analysis was performed on these variables in relation to two schools: UW-Eau Claire and UW-Oshkosh.

Eau Claire:


This image shows the results of the regression analysis with median household income as the independent variable, and UWEC enrollment as the dependent variable. With a significance of .104, you would fail to reject the null hypothesis which states that there is not a linear relationship between the two variables.

This image shows the results with Percent with a Bachelor's degree as the independent variable and Eau Claire enrollment as the dependent. This significance is below .05, so you would reject the null hypothesis (same as example above) in favor of the alternative hypothesis which states that there is a linear relationship between the two variables. In this output, the R square value is very low, so the explanatory power of the model is not ideal.

This image shows the results from regression with population per distance as the independent variable, and UWEC enrollment as the dependent. The significance is .000, so you would reject the null hypothesis in favor of the alternative hypothesis in the same manner as the above output. The difference here is that the R Square value is much higher. The value of .945 means that the model explains about 94.5% of the variation in enrollment. This is a very strong, positive relationship.

Oshkosh:


This image shows the results from regression performed on median household income and UW-Oshkosh's enrollment. This relationship is statistically significant, but the R Square value is low.

This image shows the results from regression analysis performed with percent with a bachelor's degree as the explanatory variable and Oshkosh enrollment as the dependent variable. Again, the relationship is statistically significant, but the R Square value is low, so the model isn't very useful.

This image shows the results from regression performed with Population per distance as the independent variable. It is statistically significant with a value of .000 and the R square value is very high. 
With all of these regression analyses in hand, it is safe to assume based on their R Square values that the only variables worth pursuing in this case study are the population / distance variables for each school. Most of the other variables are statistically significant (median household income for UWEC was not), but their R Square values are so low that they have little explanatory power.

The next step was to re-run the regression analysis, this time saving the residuals. A residual is the difference between an observed value and the value the model predicts for it, so it represents the part of the data that the model doesn't account for. These residuals were appended to the spreadsheet of data, so each county is given a residual value. High positive residual values indicate counties with values much higher than the model predicts. Conversely, negative residual values indicate lower values than the model predicts. These values were mapped using ArcMap, and the results are shown below.
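To make the idea of a residual concrete, here is a minimal sketch of a simple linear fit and its residuals in plain Python. The data values are hypothetical, not the lab's data, and dividing by the standard error of the estimate is only one common way to standardize residuals; it is assumed here rather than taken from the SPSS settings used in the lab.

```python
from math import sqrt


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


def residuals(xs, ys):
    """Observed minus predicted value for each observation."""
    a, b = fit_line(xs, ys)
    return [y - (a + b * x) for x, y in zip(xs, ys)]


def standardized_residuals(xs, ys):
    """Residuals divided by the standard error of the estimate."""
    res = residuals(xs, ys)
    se = sqrt(sum(e ** 2 for e in res) / (len(xs) - 2))
    return [e / se for e in res]


# Hypothetical values, not the lab's data.
pop_per_dist = [100.0, 200.0, 300.0, 400.0, 500.0]
enrollment = [12.0, 19.0, 33.0, 38.0, 52.0]

for x, e in zip(pop_per_dist, standardized_residuals(pop_per_dist, enrollment)):
    label = "more students than predicted" if e > 0 else "fewer students than predicted"
    print(x, round(e, 2), label)
```

Positive values correspond to the counties that would be shaded darkest on a residual map, and negative values to the lightest.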

Results:


This image shows the residuals from the regression of UWEC enrollment from each county on population normalized by each county's distance from UWEC.

This image shows the residuals from the regression of UW-Oshkosh enrollment from each county on population normalized by each county's distance from UW-Oshkosh.

Discussion:


The results of these maps show some interesting relationships. Because residuals show outliers in the datasets, the dark red areas in each map show areas that produce more students than would be expected based on their population and distance from each school. On the other hand, the light yellow areas show areas that send fewer students than the model predicts. Exploring the reasons why these high or low outliers are present can help to approach the UW System's goal of analyzing students' decisions on colleges.

Most notable in the first map for Eau Claire are Marathon, Dane, Brown, and Waukesha counties. These are the counties whose residual values range from 1.31 to 2.99. There are a number of counties that have very low residual values as well (they can also be seen as large negative values). This means that the model predicted a value much higher than the one actually observed. The highest values in this map seem to be in areas that have high populations: Wausau, Green Bay, Madison, and part of the Milwaukee area. It is no surprise that places with high populations send more students, and that they send fewer when they are farther away. But these counties send uncharacteristically high numbers of students. Though there are no definite answers as to why this is, there are a number of possibilities. Wausau doesn't have a four-year school, and UW-Eau Claire is highly regarded by teachers and community members there. It also happens to be the closest four-year school, even though it is about 1.5 hours away. Though Madison and Green Bay are a little bit farther, they still send very high numbers of students. UW-Green Bay is in Brown County, and may be seen as a lesser school than UW-Eau Claire. Students from Dane County may be coming to Eau Claire as an alternative to UW-Madison, which has strict admission requirements. Also, students may be interested in going to a smaller school than UW-Madison. An interesting comparison is between Waukesha County and Milwaukee County. Milwaukee County has the lowest residual value of all, and is right next to a county with one of the highest residual values. This could be due to Milwaukee's very high population, in conjunction with its own four-year university and its distance from Eau Claire. Waukesha County, on the other hand, has a much lower population while being about the same distance from Eau Claire. It is likely that the two counties have very similar enrollment numbers, and that population is the difference.

For the second map with UW-Oshkosh, the two highest values are Outagamie and Fond du Lac counties. UW-Oshkosh is in Winnebago County, which has counties with high populations on either side. Because they are so close and have high populations, they exceed the model's expected values. It is also interesting to note the areas with low enrollment relative to how populated or far away they are. Eau Claire, La Crosse, Madison, Green Bay and Milwaukee are all in counties with low residuals. They all have universities in them, so this is no coincidence. Milwaukee County again has the largest negative residual of all counties. This could be due again to its high population and industrial nature in conjunction with its own university.

Conclusion:


Regression analysis is a powerful tool for examining relationships between variables, and mapping residual values can express discrepancies in data very effectively. In this case, examining population/distance and enrollment showed a strong relationship, and a model with high explanatory power describing it. As population per distance increases, so does enrollment. Mapping the residuals from this model allows for a spatial representation of outliers in the dataset, and provides important information to the UW System about its students' university selections.


Wednesday, April 8, 2015

Correlation and Spatial Autocorrelation

Part I: Correlation

a.

The first part of this lab was to determine whether this dataset had correlation between distance and sound level. 
First, I used Excel to create a scatter plot of the data.

This is a scatter plot of the dataset including trend line.
Next, I went into SPSS to calculate the Pearson correlation.

The results of the Pearson Correlation test
The above chart shows that there is indeed a correlation between the two variables, distance and sound level. The negative value, shown both in the chart and in the downward slope on the graph, indicates that the direction is negative, so when distance increases, sound level decreases. The Pearson Correlation statistic is -.896, indicating that there is a very high strength associated with this relationship. (This rounds to -.9 for the purpose of categorizing strength.)

Hypothesis testing for this data goes as follows:
1) State Null Hypothesis: There is no association/linear relationship between distance and decibel level
2) State Alternative Hypothesis: There is a significant association or linear relationship between them
3) This will be tested with the Pearson Correlation test
4) α = .01 (99% confidence level) with two tails
5) Calculate Test Statistic: r=-.896
6) Make decision about hypotheses: Reject null hypothesis in favor of the alternative hypothesis. There is a high to very high negative linear relationship between distance and sound level.
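
For readers without SPSS, the Pearson statistic used in step 5 can be sketched in plain Python. The distance and sound values below are hypothetical stand-ins, not the lab's dataset; the t conversion is the standard one for testing a correlation with n - 2 degrees of freedom.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def t_from_r(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * sqrt((n - 2) / (1 - r ** 2))


# Hypothetical measurements, not the lab's data.
distance = [10.0, 20.0, 30.0, 40.0, 50.0]
sound_level = [95.0, 90.0, 80.0, 78.0, 65.0]

r = pearson_r(distance, sound_level)
print(round(r, 3), round(t_from_r(r, len(distance)), 3))
```

A negative r with a large absolute t value would lead to the same kind of decision as in step 6.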

b.

This mini-exercise included creating a correlation matrix for a number of variables in Milwaukee County, WI. This was done in SPSS, and the results are shown and interpreted below:

Correlation matrix for variables: percent white, percent black, percent Hispanic, percent with no high school diploma, percent with a bachelor's degree, percent below the poverty line, and percent that walk to work.

I'll start with percent white as it relates to the other variables. First, it has a high, negative relationship with the percent black variable. This is logical, as they are two separate percentages competing for space in a census tract. Even with this in mind, such a high-strength relationship is more than just a coincidence. It means that where there are lots of white people, there are very few black people, and vice versa. The same is true for the relationship between percent white and percent Hispanic, but the strength is much lower. A number this low barely indicates a relationship. It could just mean that where there is a high percentage of white people in a given area, there isn't enough statistical space for a high percentage of Hispanics, as mentioned above. There is also a moderate negative correlation between the percentage of white people and the percentage of people with no high school diploma. This means that in areas with lots of white people, there are fewer people without diplomas. Next, the percentage of white people has a moderate positive relationship with the bachelor's degree variable, so areas with lots of white people also have many people with college degrees. There is a high, negative linear relationship between the percentage of white people and the percentage below the poverty line, so areas with high percentages of white people have few poor people. There is no significant correlation between percent white and percent walking.

Next, I'll look at percent black in relation to the other variables. There is a significant, negative correlation between percent black and percent Hispanic, but it is very low strength, as it was between percent white and percent Hispanic. Next, there is a positive, low-strength linear relationship between percent black and the percentage without high school diplomas. In other words, areas with high percentages of black people have higher percentages of people without high school diplomas. There is a moderate-strength, negative relationship between percent black and the percentage with a bachelor's degree. There is another moderate relationship between percent black and percent below the poverty line, this time positive. This means that areas with lots of black people have a statistically higher percentage in poverty. Again, percent walking doesn't have a relationship.

Next, percent Hispanic. Relationships with the other race variables have already been discussed, so first I'll note the strong association between percent Hispanic and lack of high school diplomas. Percent Hispanic has a low-strength, negative association with the percentage receiving bachelor's degrees, and a low-strength positive linear relationship with percent below the poverty line. This means that areas with high percentages of Hispanics have high percentages of people without high school diplomas, lower percentages of bachelor's degrees, and somewhat higher poverty. There is no correlation with percent walking to work.

Next, the percentage of those with no high school diploma has a moderate-strength negative relationship with the percentage with bachelor's degrees and a moderate-strength positive one with the percentage below the poverty line. This is logical, as it is expected that people without high school diplomas won't continue to college, and will make less money. Again, there is no statistical relationship with the percentage who walk to work.

The percentage of people with a bachelor's degree has a moderate-strength negative relationship with the percentage below the poverty line, so areas with lots of bachelor's degrees have lower poverty, statistically.

Finally, the percentage of those below the poverty line does have a significant positive relationship with the percentage of people walking to work (even though it is low strength). This means areas of high poverty have higher percentages of people walking to work, which makes sense.

The general patterns here are unfortunate, but logical. The correlations present indicate that areas with high percentages of white people have low percentages of black and hispanic people, and less prevalence of poverty along with variables associated with it like lack of high school diplomas. This picture looks quite different for members of minorities. The data is almost flipped for percent black and Hispanic people: areas with high percentages of these minorities have high levels of poverty, low percentages of high school graduates, and logically, low percentages of college graduates.

These patterns reflect inequalities that are very real throughout the United States. Using a correlation matrix validates these often subtle inequalities, and demonstrates them in a way that can be scientifically studied. It shows the shortcomings of our social systems here in the US, as close to home as in Milwaukee.


Part II: Spatial Auto-correlation

Introduction:

This exercise includes a scenario in which students were assigned the role of analyzing election data for the 1980 and 2008 Presidential Elections in the state of Texas. The Texas Election Commission needs assistance in analyzing voter patterns for democratic and Hispanic voter data. The goal is to be able to tell the governor whether the election patterns have changed or not in these 20+ years.  This can be done using a number of statistical tools, and with the assistance of ArcGIS, GeoDa and SPSS software.

The exercise uses correlation, which tests relationships among variables. It yields strength and direction of possible linear relationships present between variables, but does NOT imply causation. This exercise also relies on spatial auto-correlation, which tests correlation of a single variable over space. This can be a powerful tool for studying clustering patterns.

Methods:


First, data needed to be downloaded. Data on Hispanic population, and a Texas shapefile were downloaded using the U.S. Census Bureau's fact finder application. This application is often difficult to navigate, and acquiring data was time consuming. Once all the data was downloaded, Esri ArcGIS was used to join the Hispanic data, and another provided Texas voter data table to the Texas shapefile using the GEO_ID field. 

Once all of the data was properly compiled and joined, it was imported into GeoDa, which is a geostatistical program useful for performing spatial auto-correlation. In order to do this, a weights file had to be created before using the Moran's I statistic and LISA cluster maps. Moran's I is a test that compares each area's value with the values of its neighbors, ultimately assigning the data a value between -1 and +1. Values near +1 indicate strong clustering of similar values, values near -1 indicate "anti-clustering" (neighbors with dissimilar values), and values near 0 indicate no spatial pattern. LISA cluster maps are essentially a mapped version of this, denoting alike neighboring areas and unlike neighboring areas. The cluster maps and Moran's I results are shown below in the Results section.
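
The global Moran's I statistic described above can be sketched in plain Python. This sketch assumes a simple binary weights matrix (1 for neighbors, 0 otherwise) on four made-up areas arranged in a row; GeoDa's weights files and any row-standardization it applies are not reproduced here.

```python
def morans_i(values, weights):
    """Global Moran's I for a list of values and a square weights matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_sum = sum(sum(row) for row in weights)
    # Cross-products of deviations for every pair of areas, weighted by adjacency.
    num = sum(weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / w_sum) * (num / den)


# Hypothetical example: four areas in a row, each adjacent to its immediate neighbors.
chain_weights = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

clustered = [1.0, 1.0, 5.0, 5.0]    # similar values sit next to each other
alternating = [1.0, 5.0, 1.0, 5.0]  # every neighbor is dissimilar

print(morans_i(clustered, chain_weights))
print(morans_i(alternating, chain_weights))
```

The clustered arrangement produces a positive value and the alternating arrangement a negative one, matching the interpretation of the sign given above.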


Results:


Variable I: Percent Hispanic as a percentage of total population. The map below shows areas of clustering with high percentages of Hispanics versus areas with low percentages. As the legend notes, areas with high percentages of Hispanics surrounded by other areas of this nature are displayed in red, whereas areas with low percentages surrounded by other low-percentage areas are displayed in blue. Low-percentage areas surrounded by high-percentage areas are shown in light blue, and areas with high percentages surrounded by low percentages are shown in light red. All subsequent LISA clustering maps will have this color scheme.

LISA clustering map for percent Hispanic by county

Legend

Moran's I test output. This relatively high value of .78 indicates strong clustering: counties with high percentages of Hispanics tend to neighbor other high-percentage counties, and low-percentage counties tend to neighbor other low-percentage counties.

Variable II: Percent Democratic Vote 1980

LISA clustering map of percent democratic vote in 1980. The southern area of the state seems to have significant clustering, along with various other areas in the East. It looks like San Antonio is an outlier here, the light red county in the south central portion of the map. As an urban area surrounded by rural, conservative areas, it is no surprise that it is identified this way.  
Legend

This variable's Moran's I value was considerably lower, but it still indicates a moderate positive spatial auto-correlation.

Variable III: Percent Democratic Vote 2008
LISA cluster map of counties by percentage of democratic vote in 2008. It appears as though there has been some change since 1980. Namely, it appears as though there is more significant clustering present. 


This Moran's I test yielded a higher value than in 1980, which affirms my previous judgement about increased clustering.

Variable IV: 1980 Percent Voter Turnout

This image shows the cluster map for 1980 voter turnout percentages. There is clustering of counties with low percentages in the southern part of the state, which could be due to the high percentages of Hispanics in that area.


Legend
This Moran's I value is rather low, but still notes the presence of some clustering.

Variable V: 2008 Percent Voter Turnout

Percent voter turnout cluster map for 2008. Areas with clustering of high turnout seem to have shifted to the west/central part of the state since 1980.  

Legend

This is the lowest Moran's I yet, and indicates less clustering of voter turnout than in 1980.


Discussion:


The results of these LISA and Moran's I tests do provide some relevant information to the question posed by the Texas Election Commission. The Moran's I tests indicate that clustering exists for all five variables tested. The LISA maps are useful in demonstrating this spatially. There seems to be similar clustering patterns between some of these variables. For example, it looks like areas clustered with high Hispanic populations also have high percentages of democratic vote. Also, these areas seem to have low voter turnout. With these possible relationships in mind, further research is necessary. I performed a Pearson's Correlation analysis on the variables to shed some light into the possibility of linear relationships between these variables. The results are shown below. 

This correlation matrix shows the five variables above in the order they were displayed. There are a number of associations here. Generally, areas with high percentages of Hispanics have a high percent Democratic vote. These areas also have low voter turnout in each year, 1980 and 2008. There are positive relationships as well between the variables that have entries in both 1980 and 2008. This makes sense, as these areas would most likely not change drastically in that time frame. Another interesting correlation is the negative relationship between the Democratic vote and voter turnout. This means that areas with high percentages of Democratic voters have low percentages of turnout.
After studying these correlations, it is safe to assume that they are all related. The correlation between Hispanics and the Democratic vote points to the fact that Hispanics often vote Democratic. This can be attributed to the Democratic party's commitment to social equity programs, favorable immigration policies, and even some progress in the way of reform. With this in mind, areas with high Hispanic populations also have low turnout. This can be explained by many Hispanic people's unwillingness or inability to vote, with many of them being undocumented. Additionally, it is a familiar phenomenon that areas with a high Democratic vote also have low turnout. Republicans usually have better turnout due to voter restriction laws designed to attack groups that might normally vote Democratic.

All of these variables lead to the conclusion that the Hispanic population does indeed affect the voting patterns in Texas. Although there is no data available for the Hispanic population in 1980, it is safe to assume that their influx has increased the Democratic vote in the state of Texas. Clustering is present particularly along the border with Mexico. This information could be useful for the TEC to refine voter advertisement techniques, increase/decrease accessibility for Hispanic voters, or simply better understand the demographic makeup of their state.

Wednesday, April 1, 2015

Z tests, T tests and Chi squared tests

Part I: Tests




1.
Fill out the chart above!
α = Significance level
z or t = is it a z or t test
z or t Value = Critical Value

A.
α = .1
Use Z test
z vals = 1.64, -1.64

B.
α=.05
Use t-test
t-vals = 2.59, -2.59

C.
α=.05
Use z-test
z-val = 1.64

D.
α=.01
Use z test
z vals = 2.57, -2.57

E.
α=.2
Use z-test
z-val = .84

F.
α=.01
use t-test
t val = 2.82

G.
α= .01
use t-test
t-vals = 3.33, -3.33


2. A Department of the interior in Washington D.C. estimates that the number of particular invasive species in a certain county (Bucks County) should number as follows (averages based on data from the whole state of Pennsylvania) per acre: Asian-Long Horned Beetle, 4; Emerald Ash Borer Beetle, 10; and Golden Nematode, 75.  A survey of 50 fields had the following results: (10 pts)

Species                    Mean (μ)   Std. Dev. (σ)
Asian-Long Horned Beetle   3.2        0.73
Emerald Ash Borer Beetle   11.7       1.3
Golden Nematode            77         5.71

a. Test the hypothesis for each of these products.  Assume that each are 2 tailed with a Confidence Level of 95% *Use the appropriate test
b. Be sure to present the null and alternative hypotheses for each as well as conclusions


Asian-Long Horned Beetle
1) State Null Hypothesis: There is no difference between the Asian-Long Horned Beetle population in Bucks county compared to the expected values from Pennsylvania
2) State Alternative Hypothesis: There is a difference between the sampled population in Bucks County and the expected figure.
3) This hypothesis test will use a z-test because the sample size is above 30 (50)
4) α = .05 (95% confidence level)
5) Calculate Test Statistic: Z-value = -7.7491, well outside the critical values (-1.96, 1.96)
6) Make decision about hypotheses: Reject null hypothesis in favor of the alternative hypothesis. The Asian Long Horned Beetle population is significantly lower than the expected value

Emerald Ash Borer Beetle
1) State Null Hypothesis: There is no difference between the Emerald Ash Borer Beetle population in Bucks county compared to the expected values from Pennsylvania
2) State Alternative Hypothesis: There is a difference between the sampled population in Bucks County and the expected figure.
3) This hypothesis test will use a z-test because the sample size is above 30 (50)
4) α = .05 (95% confidence level)
5) Calculate Test Statistic: Z-value = 9.2468, which is well outside the critical values (-1.96, 1.96)
6) Make decision about hypotheses: Reject null hypothesis in favor of the alternative hypothesis. The Emerald Ash Borer Beetle population is significantly higher than the expected value

Golden Nematode
1) State Null Hypothesis: There is no difference between the Golden Nematode population in Bucks county compared to the expected values from Pennsylvania
2) State Alternative Hypothesis: There is a difference between the sampled population in Bucks County and the expected figure.
3) This hypothesis test will use a z-test because the sample size is above 30 (50)
4) α = .05 (95% confidence level)
5) Calculate Test Statistic: Z-value = 2.48, which is outside the critical values (-1.96, 1.96)
6) Make decision about hypotheses: Reject null hypothesis in favor of the alternative hypothesis. The Golden Nematode population is significantly higher than the expected value
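
The three test statistics can be reproduced from the survey table with a short script. This is a sketch using the means, standard deviations, and sample size given in the problem; the function and variable names are chosen for illustration only.

```python
from math import sqrt

N_FIELDS = 50
Z_CRIT = 1.96  # two-tailed critical value at alpha = .05


def z_statistic(sample_mean, hypothesized_mean, sd, n):
    """One-sample z statistic: (sample mean - expected mean) / standard error."""
    return (sample_mean - hypothesized_mean) / (sd / sqrt(n))


# species: (sample mean, expected mean per acre, standard deviation)
surveys = {
    "Asian Long-Horned Beetle": (3.2, 4.0, 0.73),
    "Emerald Ash Borer Beetle": (11.7, 10.0, 1.3),
    "Golden Nematode": (77.0, 75.0, 5.71),
}

results = {}
for species, (mean, expected, sd) in surveys.items():
    z = z_statistic(mean, expected, sd, N_FIELDS)
    direction = "higher" if z > 0 else "lower"
    results[species] = (z, abs(z) > Z_CRIT, direction)
    print(species, round(z, 4), "reject null:", abs(z) > Z_CRIT, direction)
```

The sign of each z value shows the direction of the difference: negative means the sample is below the expected figure, positive means it is above.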

c. What can be ascertained pertaining to the findings about these invasive species in Bucks County?

With these results in mind, it is safe to assume that Bucks County has an invasive species problem, with two of the three species exceeding their expected presence. The Asian Long-Horned Beetle population was significantly lower than expected, though.




3. An exhaustive survey of all users of a wilderness park taken in 1960 revealed that the average number of persons per party was 2.1.  In a random sample of 25 parties in 1985, the average was 3.4 persons with a standard deviation of 1.32 (one tailed test, 95% Con. Level) 

a. Test the hypothesis that the number of people per party has changed in the intervening years.  (State null and alternative hypotheses)
b. What is the corresponding probability value

a.
1) State Null Hypothesis: There is no difference between the average number of persons per party between the years of 1960 and 1985
2) State Alternative Hypothesis: The average number of persons per party is greater in 1985 than in 1960.
3) This hypothesis test will use a t-test because the sample size is below 30 (25)
4) α = .05 (95% confidence level), one-tailed.
5) Calculate Test Statistic: t-value = 4.92424, well outside the critical value of 1.711 (24 degrees of freedom)
6) Make decision about hypotheses: Reject null hypothesis in favor of the alternative hypothesis. Persons per party increased significantly between 1960 and 1985

b. The corresponding probability value (p-value) is far below .001 for t = 4.92 with 24 degrees of freedom, so a result this extreme would be very unlikely if the null hypothesis were true.
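
The t statistic for the park survey can be recomputed as follows. This is a sketch: the inputs come from the problem statement, and the critical value is hard-coded from a standard t table rather than computed.

```python
from math import sqrt

n = 25
sample_mean = 3.4         # 1985 sample average
sample_sd = 1.32
hypothesized_mean = 2.1   # 1960 survey average

standard_error = sample_sd / sqrt(n)
t_value = (sample_mean - hypothesized_mean) / standard_error
degrees_of_freedom = n - 1

T_CRIT = 1.711  # one-tailed, alpha = .05, df = 24, read from a t table
reject_null = t_value > T_CRIT
print(round(t_value, 3), degrees_of_freedom, reject_null)
```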

Part II. 'Up North' Study

Introduction: 

This exercise was designed to introduce students to the concepts associated with Chi Square testing, providing a scenario in which we were asked to assume the role of a research consultant on the concept of 'up north.' This involves using Chi Square testing and a series of maps to examine a number of variables. WI SCORP (Statewide Comprehensive Outdoor Recreation Plan) data provides information on recreation variables for each county in Wisconsin. Chi Squared tests are used to evaluate the relationships between these variables as they apply to Northern and Southern counties.

Methods:

First, I downloaded Wisconsin Counties data. This was available already on our UW-Eau Claire ArcGIS drive, so I imported it into a new geodatabase. Next, I had to determine how to define North vs. South for these counties. We were advised to use Highway 29 as a reference, so I imported a Major Roads dataset, queried for Hwy 29, and created a new feature class for it. I overlaid this on my counties dataset, then selected the counties I wanted to categorize. 

Next, I added the SCORP table to the GIS, and joined it to my counties feature class. I then added a field to the joined feature class specifying North vs. South for each county. This is shown below in the Results section. Next, I selected a number of variables that I thought could demonstrate the difference between northern and southern Wisconsin. I chose number of Bike Trails, ATV Trails and Acreage of Parks. For each variable, I created a new field and used the select by attribute tool along with the field calculator to populate my new field. I divided the range of values for each variable into four, then subtracted the fourth from the total to yield a linear breaks categorization from 1-4. I exported the feature class's table as a dBASE Table for use in SPSS to calculate statistics. 

I used SPSS' crosstabs statistics to calculate Chi Squared Tests for each variable. The results of these tests are shown below in the results section. 
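
For reference, the Pearson Chi Square statistic that a crosstab produces can be sketched in plain Python. The counts below are invented for illustration and are not the SCORP data; the critical value is the standard table value for 3 degrees of freedom at α = .05.

```python
def chi_square(observed):
    """Pearson chi-square statistic and degrees of freedom for a crosstab."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count if region and category were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (obs - expected) ** 2 / expected
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return stat, df


# Hypothetical crosstab: rows are North and South, columns are categories 1-4.
crosstab = [
    [10, 8, 5, 3],   # North (made-up counts)
    [30, 10, 4, 2],  # South (made-up counts)
]

CHI2_CRIT_DF3 = 7.815  # alpha = .05, df = 3, from a chi-square table

stat, df = chi_square(crosstab)
print(round(stat, 3), df, "reject null:", stat > CHI2_CRIT_DF3)
```

Comparing the statistic with the critical value is equivalent to comparing the p-value SPSS reports with .05.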

Results:


This is the final map created using ESRI ArcGIS. The classification method used affects the display significantly, as shown by the two trail maps. 
Chi Square Test for distribution of Parks. 

Crosstabulation for Parks Distribution.
Chi Square Test for Bike Trails

Crosstabulation for Bike Trails
Chi Square Test for ATV Trails

Crosstabulation for ATV Trails

Discussion:


The results of the Chi Square Tests are not what I anticipated. Refer to the 'Asymp. Sig. Pearson Chi-Square' number: this is the p-value, which is to be compared with the significance level of .05 (95% confidence level). As you can see, only the ATV Trails data set's value is below this. This means that you would reject the null hypothesis, which states that Northern and Southern WI's distributions of ATV Trails are statistically identical, in favor of the alternative hypothesis, which states that there is indeed a difference. For bike trails and park acreage, though, there is no statistically significant difference because the p-value is above .05, so you would fail to reject the null hypothesis.

With this in mind, it becomes obvious that further research into the idea of 'up north' is necessary, and more pertinent variables should be selected to highlight the differences between northern and southern WI. Another interesting thought is that while I expected bike trails to be more prevalent in northern WI, they really were more prevalent in the southern part of the state. In fact, this difference was almost significant, with a p-value of just .127.

Another important lesson from this exercise is that it is important to select a break method that accurately reflects the data. I used a linear break method, resulting in many counties with the same values, and little variation. Since linear break method does not account for outliers, the counties with high numbers of bike/ATV trails stood out and were basically the only members of their range of values. Perhaps in the future, a natural breaks method or standard deviation method would provide a more informative visualization.