The following is part of an ongoing collection of Jupyter notebooks. The goal is to have a library of notebooks that serves as an introduction to mathematics and technology. These were all created by Gavin Waters. If you use these notebooks, be nice and throw up a credit somewhere on your page.

Linear regression and pretty plots

Linear regression is a process that tries to model the relationship between two sets of data. This is done by creating a linear equation that best "fits" the relationship. I am not going to talk about how linear regression works; that is not the goal of this notebook.

But what I will do is create two sets and find a linear regression between them.

First create a random set of numbers and then the same number of points to relate these numbers to. For the purpose of example I am going to artificially create a relationship between the two sets of points.
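One way this setup could look is sketched below. The names `X` and `rng`, the seed, the slope of 2, the intercept of 1 and the noise level are all assumptions made for illustration; only the name `YY` comes from the text further down.

```python
import numpy as np

# Fixed seed so the sketch is reproducible (an assumption, not a requirement).
rng = np.random.default_rng(seed=0)

n_points = 100

# First set: random numbers spread uniformly between 0 and 10.
X = rng.uniform(0, 10, size=n_points)

# Second set: an artificial linear relationship (slope 2, intercept 1)
# plus Gaussian noise so the points do not sit exactly on a line.
noise = rng.normal(0, 1.0, size=n_points)
YY = 2.0 * X + 1.0 + noise

print(X[:5])
print(YY[:5])
```

Because the relationship is built in by hand, we know roughly what slope and intercept a regression should recover.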

The stats package from SciPy has quite a lot of tools you can use. For our example we will just call the linregress function to find everything we need to know about the linear regression between these two sets.
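A minimal sketch of that call, assuming synthetic data of the kind described above (the data generation here is illustrative, not the notebook's original cell):

```python
import numpy as np
from scipy import stats

# Synthetic data with a known linear relationship (assumed for illustration).
rng = np.random.default_rng(seed=0)
X = rng.uniform(0, 10, size=100)
YY = 2.0 * X + 1.0 + rng.normal(0, 1.0, size=100)

# linregress returns slope, intercept, r-value, p-value and standard error.
result = stats.linregress(X, YY)

print("slope:    ", result.slope)
print("intercept:", result.intercept)
print("r-value:  ", result.rvalue)
print("p-value:  ", result.pvalue)
print("std err:  ", result.stderr)
```

The slope and intercept should land close to the values used to build the data, and the r-value should be close to 1.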

You can use this information to plot the regression line onto the plot.
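As a rough sketch of that idea with plain matplotlib (the original notebook may well have used a different plotting call; the data and variable names are assumptions):

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

# Synthetic data, assumed for illustration only.
rng = np.random.default_rng(seed=0)
X = rng.uniform(0, 10, size=100)
YY = 2.0 * X + 1.0 + rng.normal(0, 1.0, size=100)

result = stats.linregress(X, YY)

fig, ax = plt.subplots()

# Scatter plot of the raw points.
ax.scatter(X, YY, s=10, label="data")

# Regression line drawn from the fitted slope and intercept.
x_line = np.linspace(X.min(), X.max(), 50)
y_line = result.intercept + result.slope * x_line
ax.plot(x_line, y_line, color="red", label="regression line")

ax.set_xlabel("X")
ax.set_ylabel("YY")
ax.legend()
```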

Notice there is a "ci" term. That stands for the confidence interval for the regression plot.
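A `ci` argument of this kind appears in seaborn's `regplot`. Assuming that is the sort of call being described, a sketch might look like this (the data is synthetic and the value 95 is seaborn's documented default):

```python
import numpy as np
import seaborn as sns

# Synthetic data, assumed for illustration only.
rng = np.random.default_rng(seed=0)
X = rng.uniform(0, 10, size=100)
YY = 2.0 * X + 1.0 + rng.normal(0, 1.0, size=100)

# ci=95 asks for a 95% confidence band around the fitted line;
# ci=None would draw the line without the shaded band.
ax = sns.regplot(x=X, y=YY, ci=95)
ax.set_xlabel("X")
ax.set_ylabel("YY")
```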

We can plot a scatter plot of the residuals
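The residuals are the observed values minus the values predicted by the fitted line. A small sketch of computing them by hand is below. It assumes the data is a line plus Gaussian noise, so here the residuals recover roughly that noise; the original notebook's data may have been built differently.

```python
import numpy as np
from scipy import stats

# Synthetic data, assumed for illustration only.
rng = np.random.default_rng(seed=0)
X = rng.uniform(0, 10, size=100)
YY = 2.0 * X + 1.0 + rng.normal(0, 1.0, size=100)

result = stats.linregress(X, YY)

# Residual = observed value minus value predicted by the fitted line.
predicted = result.intercept + result.slope * X
residuals = YY - predicted

# For a least-squares fit with an intercept, the residuals average to zero.
print("mean residual:", residuals.mean())
print("residual std: ", residuals.std())
```

Plotting `residuals` against `X` as a scatter plot should show no obvious pattern when a straight line is a reasonable model.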

Because of the way we artificially created our example, this should be virtually the same as our original YY data set.

A larger example using Pandas

Example one

For this example I am going to go out into the world and download some data. I then store my data in the same directory I am in, or in a subfolder of it. I found the following:

Community Health Status Indicators (CHSI) to Combat Obesity, Heart Disease and Cancer.

Governmental data set from U.S. Department of Health & Human Services

Obtained from data.gov https://catalog.data.gov/

"ols" stands for Ordinary Least Squares, which is the method used to fit our linear function to the data.
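For a single predictor, ordinary least squares has a closed-form answer: the slope is the covariance of the two sets divided by the variance of the first, and the intercept makes the line pass through the point of means. The sketch below checks that by hand against scipy on synthetic data (the data and names are assumptions for illustration):

```python
import numpy as np
from scipy import stats

# Synthetic data, assumed for illustration only.
rng = np.random.default_rng(seed=0)
X = rng.uniform(0, 10, size=100)
YY = 2.0 * X + 1.0 + rng.normal(0, 1.0, size=100)

# Closed-form OLS for one predictor.
x_mean, y_mean = X.mean(), YY.mean()
slope = np.sum((X - x_mean) * (YY - y_mean)) / np.sum((X - x_mean) ** 2)
intercept = y_mean - slope * x_mean

# scipy's linregress should agree with the hand calculation.
result = stats.linregress(X, YY)
print(slope, result.slope)
print(intercept, result.intercept)
```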

This file came in lots of formats. I just used the Excel version to pull everything and then looked at the sheet I wanted. Unfortunately/fortunately, when you look at large data sets you get a lot of stuff. Below I am trying to figure out what I am looking at.

It's always good to take a quick look at what you pulled; that way you can easily spot mistakes if you pulled the wrong thing. head() shows the first 5 rows.
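A small sketch of that quick look, using a made-up stand-in table. The column names `county`, `pct_elderly` and `pct_poverty` and all of the values are hypothetical, not the real data set's:

```python
import pandas as pd

# With the real file you would load a sheet first, for example with
# pd.read_excel(path, sheet_name=...); the path and sheet name depend
# on your download, so a tiny hypothetical table is used here instead.
df = pd.DataFrame({
    "county": ["A", "B", "C", "D", "E", "F", "G"],
    "pct_elderly": [11.2, 14.8, 9.5, 13.1, 16.0, 12.4, 10.9],
    "pct_poverty": [12.5, 9.7, 18.1, 14.3, 11.0, 20.2, 15.6],
})

# head() returns the first 5 rows by default.
first_rows = df.head()
print(first_rows)

# shape and columns are also handy for a sanity check.
print(df.shape)
print(list(df.columns))
```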

We can throw up a quick scatter plot to see what the data is doing

Notice that point off on the left. That data is corrupted: the percentage of poverty for that data point is approximately -2000%, which is a silly number. So let's restrict ourselves to just positive percentages.
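Restricting to positive percentages can be done with a boolean mask. The sketch below uses a hypothetical table with one obviously bad row; the column names and values are made up for illustration:

```python
import pandas as pd

# Hypothetical stand-in data with one corrupted poverty value.
df = pd.DataFrame({
    "pct_elderly": [11.2, 14.8, 9.5, 13.1],
    "pct_poverty": [12.5, -2000.0, 18.1, 9.7],
})

# Keep only the rows where the poverty percentage is positive.
clean = df[df["pct_poverty"] > 0]

print(len(df), "rows before,", len(clean), "rows after")
print(clean)
```

Filtering on the row keeps the two columns aligned, so the matching value in the other column is dropped too.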

This graph looks better. The next thing I like to do is put up a kde plot; it's like a heat map. It basically shows the density of the points smoothed out in contours.
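To give a feel for what a kde plot is drawing, here is a sketch that estimates the smoothed density directly with scipy's `gaussian_kde`. The data is synthetic, and a plotting library may pick a different bandwidth or grid, so treat this as an illustration of the idea rather than what any particular plot function does internally.

```python
import numpy as np
from scipy import stats

# Synthetic two-column data, assumed for illustration only.
rng = np.random.default_rng(seed=1)
x = rng.normal(12, 3, size=500)
y = rng.normal(15, 5, size=500)

# Fit a 2-D Gaussian kernel density estimate to the points.
kde = stats.gaussian_kde(np.vstack([x, y]))

# Evaluate the density on a regular grid covering the data.
xg, yg = np.meshgrid(
    np.linspace(x.min(), x.max(), 50),
    np.linspace(y.min(), y.max(), 50),
)
density = kde(np.vstack([xg.ravel(), yg.ravel()])).reshape(xg.shape)

# Contours of `density` over (xg, yg) are what the kde plot shows.
print(density.shape, density.max())
```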

This looks like it might have a very weak correlation or none at all. Let's do a regression.

From these results we can say with confidence that there does not seem to be a correlation between the percentage of old people in a county and the poverty level of that county.

Example two

Same resource but I am going to look at obesity and smoking

To get a quick indication of what we are looking at

A heat map is always fun about now

Let's see a linear regression

And now for your linear regression test. First using Pandas

And now using scipy

One could infer from this result that there is a very good correlation between smoking and obesity in counties in the US.
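As a cross-check between the two libraries, the correlation coefficient from pandas should match the r-value from scipy. The sketch below shows this on synthetic data; the column names `pct_smokers` and `pct_obese` and the built-in relationship are hypothetical, so the numbers say nothing about the real counties.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in data with an artificial positive relationship.
rng = np.random.default_rng(seed=2)
smoking = rng.uniform(10, 35, size=200)
obesity = 0.5 * smoking + 15 + rng.normal(0, 3, size=200)
df = pd.DataFrame({"pct_smokers": smoking, "pct_obese": obesity})

# Pearson correlation computed by pandas.
r_pandas = df["pct_smokers"].corr(df["pct_obese"])

# The same quantity reported by scipy's linear regression.
fit = stats.linregress(df["pct_smokers"], df["pct_obese"])

print("pandas r:", r_pandas)
print("scipy r: ", fit.rvalue)
print("p-value: ", fit.pvalue)
```

A small p-value together with an r-value well away from zero is what supports reading the fit as a real linear association.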