The following is part of a on-going collection of Jupyter notebooks. The goal being to have a library of notebooks as an introduction to Mathematics and technology. These were all created by Gavin Waters. If you use these notebooks, be nice and throw up a credit somewhere on your page.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
%matplotlib inline
Linear regression is a process that trys to model the interaction/relationshiop between two sets of data. This is done by creating a linear equation that best "fits" the relationship. I am not going to talk about how linear regression works, that is not the goal of this notebook.
But what I will do is create two sets and find a linear regression between them.
First create a random set of numbers and then the same number of points to relate these numbers to. For the purpose of example I am going to artificially create a relationship between the two sets of points.
YY = np.random.random(20)
X = np.linspace(0,10,20)
Y = 2*YY+X
plt.scatter(X,Y)
<matplotlib.collections.PathCollection at 0x11746d908>
the Stats package has quit a lot of tools you can use, for our example we will just call the linregess function to find everything we need to know about the linear regression bewtween these two sets.
LinReg = stats.linregress(X,Y)
print(LinReg)
LinReg[0]
LinregressResult(slope=0.97413179952641371, intercept=1.1956568957765246, rvalue=0.98487751106379229, pvalue=3.7190843392567215e-15, stderr=0.040390361541795157)
0.97413179952641371
You can use this information to plot the regression line onto the plot.
plt.scatter(X,Y)
plt.plot(X,LinReg[0]*X+LinReg[1])
[<matplotlib.lines.Line2D at 0x1150706a0>]
sns.regplot(X,Y,ci=None)
<matplotlib.axes._subplots.AxesSubplot at 0x117aa7710>
Notice there is a "ci" term. That stands for the confidence interval for the regression plot.
sns.regplot(X,Y,ci=78)
<matplotlib.axes._subplots.AxesSubplot at 0x117cca358>
We can plot a scatter plot of the residuals
sns.residplot(X,Y)
<matplotlib.axes._subplots.AxesSubplot at 0x117cfb7f0>
Because of the way we artificially created our example, this should be virutally the same as our original YY data set.
plt.scatter(X,YY)
<matplotlib.collections.PathCollection at 0x118146be0>
sns.jointplot(X,Y, kind="kde")
<seaborn.axisgrid.JointGrid at 0x1181510f0>
sns.kdeplot(X,Y,shade=True)
<matplotlib.axes._subplots.AxesSubplot at 0x118411f98>
For this example I am going to go out into the world and downlaod some data. I then store my data in the same directory or a subfolder of the directory I am in. I found the following:
Community Health Status Indicators (CHSI) to Combat Obesity, Heart Disease and Cancer.
Governmental data set from U.S. Department of Health & Human Services
Obtained from data.gov https://catalog.data.gov/
"ols" Stands for Ordinary Least Squares, which is the method used to fit our linear function to the data.
from pandas.stats.api import ols
dataY = pd.read_excel('chsi_dataset/CHSI DataSet.xls',sheetname="DEMOGRAPHICS")
This file came in lots of formats, I just used the excel to pull everything and then looked at the sheet I wanted. Unfortunately/fortunately when you look at large data sets you get alot of stuff. Below i am trying to figure out what I am looking at.
dataY.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 3141 entries, 0 to 3140 Data columns (total 44 columns): State_FIPS_Code 3141 non-null int64 County_FIPS_Code 3141 non-null int64 CHSI_County_Name 3141 non-null object CHSI_State_Name 3141 non-null object CHSI_State_Abbr 3141 non-null object Strata_ID_Number 3141 non-null int64 Strata_Determining_Factors 3141 non-null object Number_Counties 3141 non-null int64 Population_Size 3141 non-null int64 Min_Population_Size 3141 non-null int64 Max_Population_Size 3141 non-null int64 Population_Density 3141 non-null int64 Min_Population_Density 3141 non-null int64 Max_Population_Density 3141 non-null int64 Poverty 3141 non-null float64 Min_Poverty 3141 non-null float64 Max_Poverty 3141 non-null float64 Age_19_Under 3141 non-null float64 Min_Age_19_Under 3141 non-null float64 Max_Age_19_Under 3141 non-null float64 Age_19_64 3141 non-null float64 Min_Age_19_64 3141 non-null float64 Max_Age_19_65 3141 non-null float64 Age_65_84 3141 non-null float64 Min_Age_65_84 3141 non-null float64 Max_Age_65_85 3141 non-null float64 Age_85_and_Over 3141 non-null float64 Min_Age_85_and_Over 3141 non-null float64 Max_Age_85_and_Over 3141 non-null float64 White 3141 non-null float64 Min_White 3141 non-null float64 Max_White 3141 non-null float64 Black 3141 non-null float64 Min_Black 3141 non-null float64 Max_Black 3141 non-null float64 Native_American 3141 non-null float64 Min_Native_American 3141 non-null float64 Max_Native_American 3141 non-null float64 Asian 3141 non-null float64 Min_Asian 3141 non-null float64 Max_Asian 3141 non-null float64 Hispanic 3141 non-null float64 Min_Hispanic 3141 non-null float64 Max_Hispanic 3141 non-null float64 dtypes: float64(30), int64(10), object(4) memory usage: 1.1+ MB
dataY = dataY[['Poverty','State_FIPS_Code','County_FIPS_Code','Population_Size','Age_65_84']]
Its always good to just take a quick look at what you pulled, that way you can easily spot mistakes if you pulled the wrong thing. head() look at the first 5 rows
dataY.head()
Poverty | State_FIPS_Code | County_FIPS_Code | Population_Size | Age_65_84 | |
---|---|---|---|---|---|
0 | 10.4 | 1 | 1 | 48612 | 9.8 |
1 | 10.2 | 1 | 3 | 162586 | 14.5 |
2 | 22.1 | 1 | 5 | 28414 | 11.6 |
3 | 16.8 | 1 | 7 | 21516 | 10.9 |
4 | 11.9 | 1 | 9 | 55725 | 12.1 |
We can throw up a quick scatter plot to see what the data is doing
plt.scatter(dataY.Poverty,dataY.Age_65_84)
<matplotlib.collections.PathCollection at 0x11d99c3c8>
Notice that point off on the left. There data is corrupted, the percentage of poverty for that data point is approximately -2000%, which is a silly mnumber. So lets restrict ourselves to just positive percentages
dataY = dataY[dataY.Poverty>=0]
plt.scatter(dataY.Poverty,dataY.Age_65_84)
<matplotlib.collections.PathCollection at 0x11dd4d048>
This graph looks better. The next thing I like to do is put up a kde plot, its like a heat map. It basically shows the density of the points smoothed out in countors
g = sns.kdeplot(dataY.Poverty,dataY.Age_65_84,shade=True)
g.set_xlim(0,35)
g.set_ylim(0,30)
(0, 30)
This looks like it might have a very weak or no correlation, lets do a regression
sns.regplot(dataY.Poverty,dataY.Age_65_84)
<matplotlib.axes._subplots.AxesSubplot at 0x11debd048>
ols(y=dataY.Poverty,x=dataY.Age_65_84)
-------------------------Summary of Regression Analysis------------------------- Formula: Y ~ <x> + <intercept> Number of Observations: 3140 Number of Degrees of Freedom: 2 R-squared: 0.0000 Adj R-squared: -0.0003 Rmse: 4.8840 F-stat (1, 3138): 0.0974, p-value: 0.7550 Degrees of Freedom: model 1, resid 3138 -----------------------Summary of Estimated Coefficients------------------------ Variable Coef Std Err t-stat p-value CI 2.5% CI 97.5% -------------------------------------------------------------------------------- x -0.0082 0.0261 -0.31 0.7550 -0.0594 0.0431 intercept 13.4545 0.3455 38.94 0.0000 12.7772 14.1317 ---------------------------------End of Summary---------------------------------
stats.linregress(dataY.Poverty,dataY.Age_65_84)
LinregressResult(slope=-0.0038033633506301401, intercept=12.840552576425711, rvalue=-0.0055699336836162331, pvalue=0.75504561779548907, stderr=0.012189463361527582)
From these results we can say with confidence that there does not seem to be a correlation between the percentage of old people in a county and the Poverty level of that county.
Same resource but I am going to look at obesity and smoking
dataX = pd.read_excel('chsi_dataset/CHSI DataSet.xls',sheetname="RISKFACTORSANDACCESSTOCARE")
dataX = dataX[['State_FIPS_Code','County_FIPS_Code','Obesity','Smoker']]
dataX = dataX[(dataX.Obesity>=0)&(dataX.Smoker>=0)] # Again, taking care of "blips"
To get a quick indication of what we are looking at
plt.scatter(dataX.Obesity,dataX.Smoker)
<matplotlib.collections.PathCollection at 0x11d93d978>
A heat map is always fun about now
g = sns.kdeplot(dataX.Obesity,dataX.Smoker,shade=True)
g.set_xlim(2,45)
g.set_ylim(0,50)
(0, 50)
Lets see a linear regression
sns.regplot(dataX.Obesity,dataX.Smoker)
<matplotlib.axes._subplots.AxesSubplot at 0x11d23e898>
And now for your linear regression test. First using Pandas
ols(x=dataX.Obesity,y=dataX.Smoker)
-------------------------Summary of Regression Analysis------------------------- Formula: Y ~ <x> + <intercept> Number of Observations: 2108 Number of Degrees of Freedom: 2 R-squared: 0.1712 Adj R-squared: 0.1708 Rmse: 5.0683 F-stat (1, 2106): 434.8750, p-value: 0.0000 Degrees of Freedom: model 1, resid 2106 -----------------------Summary of Estimated Coefficients------------------------ Variable Coef Std Err t-stat p-value CI 2.5% CI 97.5% -------------------------------------------------------------------------------- x 0.4841 0.0232 20.85 0.0000 0.4386 0.5296 intercept 11.7003 0.5754 20.34 0.0000 10.5726 12.8280 ---------------------------------End of Summary---------------------------------
And now using scipy
stats.linregress(dataX.Obesity,dataX.Smoker)
LinregressResult(slope=0.48407652227823772, intercept=11.700283669750053, rvalue=0.41370481483170812, pvalue=5.9802577013953361e-88, stderr=0.023213027902751975)
One could infer from this result that there is a very good correlation between smoking and Obesity in counties in the US