The following is part of an ongoing collection of Jupyter notebooks. The goal is to build a library of notebooks that serve as an introduction to mathematics and technology. These were all created by Gavin Waters. If you use these notebooks, be nice and throw up a credit somewhere on your page.

Pandas - Basic

Pandas is a package that helps with data analysis. Its central object is the data frame, a table made up of labeled columns and indexed rows.

We start by calling the pandas package along with the os package and a plotting package.

What we are going to look at is all of the baby names registered with Social Security since 1880. This data is freely available from data.gov.

Load the file by reading it straight into a data frame.

You should notice that the file is a text file, but pandas reads it as a CSV because the values are separated by commas.

To look at the data, we can look at the head or tail of the data frame.
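
Here is a minimal sketch of loading and inspecting, assuming the file holds comma-separated name, sex, count rows with no header line. The rows below are made up and are read from memory in place of names/yob1880.txt so the sketch runs anywhere.

```python
import io
import pandas as pd

# Made-up rows in a name,sex,count layout; they stand in for names/yob1880.txt.
sample_txt = "Mary,F,70\nAnna,F,26\nJohn,M,96\n"

# header=None tells pandas the first line is data, so the columns become 0, 1, 2.
dataX = pd.read_csv(io.StringIO(sample_txt), header=None)

print(dataX.head())  # the first rows (up to five)
print(dataX.tail())  # the last rows (up to five)
```

With the real file, the first argument would simply be the path to the file.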

As you can see, the data has an index and columns. Unfortunately, a column called "0" or "1" is not really informative, so we can assign new names to these columns.
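
A small sketch of the renaming. The labels "name", "sex" and "number" are my assumption here; any descriptive names would work the same way.

```python
import pandas as pd

# Made-up rows with the default integer column labels 0, 1, 2.
dataX = pd.DataFrame([["Mary", "F", 70], ["John", "M", 96]])

# Assumed labels; assigning a list to .columns renames every column at once.
dataX.columns = ["name", "sex", "number"]

print(dataX.head())
```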

As you can see, these are the names from the file names/yob1880.txt. In the names folder we have yobyyyy.txt for every year up to 2015. If we load all that information into the data frame, we will not know which name is associated with which year, so we will want to assign the year 1880 to these particular names and numbers.
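
One way to attach the year, sketched on made-up rows: assigning a single value to a new column fills every row with it. The column name "year" matches the one used later on; the other names are assumptions.

```python
import pandas as pd

# Made-up rows standing in for the 1880 file.
dataX = pd.DataFrame({"name": ["Mary", "John"],
                      "sex": ["F", "M"],
                      "number": [70, 96]})

# A scalar on the right-hand side is broadcast to every row.
dataX["year"] = 1880

print(dataX.head())
```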

Next step: building onto our data frame

Let's load up the year 1881 into a new data frame and call it dataY.

So I have my new data frame and I want to join dataX with dataY. As long as they have the same columns with the same column names, this is easy. I can just concatenate.
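
A sketch of the concatenation on two tiny made-up frames, assuming the column names "name", "sex", "number" and "year".

```python
import pandas as pd

# Two made-up frames with identical columns, one per year.
dataX = pd.DataFrame({"name": ["Mary", "John"], "sex": ["F", "M"],
                      "number": [70, 96], "year": [1880, 1880]})
dataY = pd.DataFrame({"name": ["Mary", "John"], "sex": ["F", "M"],
                      "number": [69, 87], "year": [1881, 1881]})

# pd.concat stacks the frames row-wise; each keeps its own index labels.
dataX = pd.concat([dataX, dataY])

print(dataX)
```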

An important thing to note is that the index is not correct now, because I have just stacked the two data frames one on top of the other and each kept its own index. We can illustrate this by finding the length of the data frame and noticing that it does not match the index shown at the tail.

This can be rectified by resetting the index. We can call tail() to see if it worked.
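
To make the index problem concrete, here is a sketch on a made-up frame whose index labels repeat the way they do after stacking two frames.

```python
import pandas as pd

# Made-up frame with repeated index labels, as left behind by stacking two frames.
dataX = pd.DataFrame({"name": ["Mary", "John", "Mary", "John"],
                      "year": [1880, 1880, 1881, 1881]},
                     index=[0, 1, 0, 1])

print(len(dataX), dataX.index[-1])  # four rows, but the last label is 1

# drop=True throws the old labels away instead of keeping them as a column.
dataX = dataX.reset_index(drop=True)

print(dataX.tail())
```

After the reset, the last index label is one less than the length of the data frame, as it should be.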

These methods are fine if you are looking to join one or two data files, but we have over a hundred. Doing them by hand leads to multiple problems, not least that we might run out of names to call our data frames.

What the next piece of code does is go to the "names" folder and look for ".txt" files. It then methodically works through those files: it reads each file name, parses out the year, loads the file into a data frame with that year attached, and concatenates the new data frame to the one already in memory.

For this to work we are going to start with a data frame. Since I already have the code for it, we first load in the original data frame.

Next, I write a loop that loads the other data frames one at a time and concatenates them to my original data frame dataX.
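
The loop might look something like the sketch below. It builds a throwaway "names" folder with two made-up files so that it runs anywhere; the helper load_year and the column names are my own assumptions for illustration.

```python
import os
import tempfile
import pandas as pd

# Throwaway "names" folder holding two made-up files, so the sketch runs anywhere.
folder = os.path.join(tempfile.mkdtemp(), "names")
os.makedirs(folder)
samples = {
    "yob1880.txt": "Mary,F,70\nJohn,M,96\n",
    "yob1881.txt": "Mary,F,69\nJohn,M,87\n",
}
for filename, text in samples.items():
    with open(os.path.join(folder, filename), "w") as handle:
        handle.write(text)


def load_year(path):
    """Read one yobYYYY.txt file and tag every row with the year in its name."""
    frame = pd.read_csv(path, header=None, names=["name", "sex", "number"])
    frame["year"] = int(os.path.basename(path)[3:7])  # "yob1880.txt" -> 1880
    return frame


# Start from the original data frame, then add every file found in the folder.
dataX = load_year(os.path.join(folder, "yob1880.txt"))
for filename in sorted(os.listdir(folder)):
    if filename.endswith(".txt"):
        dataY = load_year(os.path.join(folder, filename))
        dataX = pd.concat([dataX, dataY], ignore_index=True)

print(len(dataX))
print(dataX.tail())
```

Because the loop also picks up yob1880.txt, the 1880 rows appear twice in this sketch. With many files, collecting the frames in a list and calling pd.concat once at the end avoids copying the growing data frame on every pass.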

This should contain all of the names now. We can check the length of the data frame and also look at the end of it.

The one issue I have is that the original data frame got counted twice, so I would like to remove duplicate entries. Remember, I will not remove duplicate names, since their "year" will be different; only rows that match in every column are dropped.
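
A sketch of the clean-up on made-up rows, where the third row is an exact copy of the first.

```python
import pandas as pd

# Made-up rows: row 2 repeats row 0 exactly; row 3 shares the name but not the year.
dataX = pd.DataFrame({"name": ["Mary", "John", "Mary", "Mary"],
                      "sex": ["F", "M", "F", "F"],
                      "number": [70, 96, 70, 69],
                      "year": [1880, 1880, 1880, 1881]})

# By default drop_duplicates compares all columns and keeps the first occurrence.
dataX = dataX.drop_duplicates().reset_index(drop=True)

print(dataX)
```

Mary still appears twice afterwards, once for each year.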

We can quickly garner information from this data frame.

Of course it would be nice to find that name and year.
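
As a sketch of this, assuming the value of interest is the largest single count: max() gives the number itself, and idxmax() gives the index label of the row that holds it. The rows and column names are made up.

```python
import pandas as pd

# Made-up rows; the counts are for illustration only.
dataX = pd.DataFrame({"name": ["James", "Linda", "Mary"],
                      "sex": ["M", "F", "F"],
                      "number": [940, 995, 710],
                      "year": [1947, 1947, 1946]})

largest = dataX["number"].max()                  # the biggest count
top_row = dataX.loc[dataX["number"].idxmax()]    # the whole row holding it

print(largest)
print(top_row["name"], top_row["year"])
```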

This is one of the best things about pandas: it allows you to quickly slice and dice your data frame.

So if I call a data frame as dataX[some condition], it only shows me the rows that meet that condition. You can also create a sub-data frame by assigning the result to a different name, or reduce the data frame by assigning the result back to itself.

It seems like 1947 was a good year for James and Linda.
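
The three uses of a condition can be sketched like this, on made-up rows with assumed column names.

```python
import pandas as pd

# Made-up rows for two names in two years.
dataX = pd.DataFrame({"name": ["James", "Linda", "James", "Linda"],
                      "sex": ["M", "F", "M", "F"],
                      "number": [50, 10, 90, 95],
                      "year": [1900, 1900, 1947, 1947]})

# 1. Just look at the rows that meet a condition.
print(dataX[dataX["name"] == "Linda"])

# 2. Keep those rows as a sub-data frame under a new name.
linda = dataX[dataX["name"] == "Linda"]

# 3. Reduce the data frame itself by assigning the result back.
dataX = dataX[dataX["year"] == 1947]
```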

How many different male and female names have there been?
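
One way to answer this, sketched on made-up rows: group by sex and count the distinct names in each group. A name that shows up in several years counts only once, and a name used for both sexes counts once in each group.

```python
import pandas as pd

# Made-up rows; "Mary" repeats across years and "Leslie" appears for both sexes.
dataX = pd.DataFrame({"name": ["Mary", "Anna", "John", "Mary", "Leslie", "Leslie"],
                      "sex": ["F", "F", "M", "F", "M", "F"],
                      "number": [70, 26, 96, 69, 8, 5],
                      "year": [1880, 1880, 1880, 1881, 1881, 1881]})

# nunique counts distinct values within each group.
names_by_sex = dataX.groupby("sex")["name"].nunique()

print(names_by_sex)
```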

Maybe males are less likely to be registered for an SSN?

Nope, that is about even. Let's take a look at the trend of James and Linda.

We can look at the number of names registered each year by using a command called groupby.

This plainly shows that the number of names registered with Social Security peaked in about the 1950s. We can easily find the maximum.

This is what is commonly called the baby boom.
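
A sketch of the grouping on made-up rows. I am assuming that "number of names registered" means the total of the "number" column per year; counting rows per year with .size() would be the other reading.

```python
import pandas as pd

# Made-up rows covering two years.
dataX = pd.DataFrame({"name": ["Mary", "John", "Mary", "John"],
                      "sex": ["F", "M", "F", "M"],
                      "number": [70, 96, 69, 87],
                      "year": [1880, 1880, 1881, 1881]})

# One total per year; the result is a Series indexed by year.
per_year = dataX.groupby("year")["number"].sum()

print(per_year)
```

A Series like this can be handed straight to a plotting call to see the trend over time.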

I am going to bring in another file, one that shows the US population from 1900 until 2016.

What we want to do now is add the population for each year as another column on our large dataX set. First we make a dictionary from year to population, and then map the year column in dataX through that dictionary. Since we do not have population numbers for 1880-1899, we shall leave out that part of the dataX set. Setting up a dictionary and letting pandas do the mapping is much quicker than writing a loop.

The population set unfortunately only covers 1900 to 2016, so we need to restrict dataX first and then map our dictionary onto it. Next I find out what percentage of the population each name registration represents.
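
The restrict, map and divide steps can be sketched as follows. All figures are made up, and the "population" column name is an assumption; "PopPercent" is the name used later on.

```python
import pandas as pd

# Made-up name counts, including one year before the population data starts.
dataX = pd.DataFrame({"name": ["Mary", "Mary", "John"],
                      "sex": ["F", "F", "M"],
                      "number": [5, 10, 40],
                      "year": [1899, 1900, 1901]})

# Made-up population table standing in for the population file.
population = pd.DataFrame({"year": [1900, 1901],
                           "population": [1000, 2000]})

# Dictionary of year -> population.
pop_by_year = population.set_index("year")["population"].to_dict()

# Keep only the years we have a population for, then look each year up.
dataX = dataX[dataX["year"] >= 1900].copy()
dataX["population"] = dataX["year"].map(pop_by_year)

# Each registration count as a percentage of that year's population.
dataX["PopPercent"] = 100 * dataX["number"] / dataX["population"]

print(dataX)
```

Any year missing from the dictionary would come back from map as NaN, which is why the restriction to 1900 onwards comes first.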

This again will give us an estimate of the popularity of the name compared to the population at that time.

What this population percentage number can also do is show us, roughly, the growth rate of the population, by simply adding up the PopPercent of every name within each year. Strictly speaking this is registered births as a share of the population, so it ignores deaths and migration.

People always talk about the baby boom. It is visible from the above graph: around 1947 the growth rate of the population was double what it is today.
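
Adding up PopPercent per year is one more groupby, sketched here on made-up values.

```python
import pandas as pd

# Made-up PopPercent values for a handful of names in two years.
dataX = pd.DataFrame({"name": ["Mary", "John", "Mary", "John"],
                      "year": [1900, 1900, 1901, 1901],
                      "PopPercent": [1.0, 0.5, 0.25, 0.25]})

# Sum over every name within each year.
yearly_percent = dataX.groupby("year")["PopPercent"].sum()

print(yearly_percent)
```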