The following is part of a on-going collection of Jupyter notebooks. The goal being to have a library of notebooks as an introduction to Mathematics and technology. These were all created by Gavin Waters. If you use these notebooks, be nice and throw up a credit somewhere on your page.
Pandas is a package that helps with data analysis. Each data-frame has columns and rows.
import pandas as pd
import os, fnmatch
import matplotlib.pyplot as plt
%matplotlib inline
The above code calls the pandas package along with the OS package and a plotting package.
What we are going to look at is all of the names that the social security have given out since 1880. This is freely available from data.gov
Load the file, by just reading it into a data frame.
dataX = pd.read_csv('names/yob1880.txt',header=None)
You should notice that the file is a text file, but pandas read it as a csv.
To look at the data, we can look at the head or tail of the data frame
dataX.head()
0 | 1 | 2 | |
---|---|---|---|
0 | Mary | F | 7065 |
1 | Anna | F | 2604 |
2 | Emma | F | 2003 |
3 | Elizabeth | F | 1939 |
4 | Minnie | F | 1746 |
As you can see the data has an index and columns, unfortunately column "0" or column "1"" is not really informative. We can assign new names to these columns
dataX.columns =('Name','Sex','number')
dataX.head()
Name | Sex | number | |
---|---|---|---|
0 | Mary | F | 7065 |
1 | Anna | F | 2604 |
2 | Emma | F | 2003 |
3 | Elizabeth | F | 1939 |
4 | Minnie | F | 1746 |
As you can see, these are the names from the file names/yob1880.txt. In the names folder we have yobyyyy.txt for every year up to 2015. If we load all that information into the data frame, we will not know which name is associated with which year, so we will want to assign the year 1880 to these particular names and numbers.
dataX['Year']=int(1880)
dataX.head()
Name | Sex | number | Year | |
---|---|---|---|---|
0 | Mary | F | 7065 | 1880 |
1 | Anna | F | 2604 | 1880 |
2 | Emma | F | 2003 | 1880 |
3 | Elizabeth | F | 1939 | 1880 |
4 | Minnie | F | 1746 | 1880 |
Lets load up year 1881 and create a new data frame and call it dataY
dataY=pd.read_csv('names/yob1881.txt',header=None)
dataY.columns =('Name','Sex','number')
dataY['Year']=int(1881)
dataY.head()
Name | Sex | number | Year | |
---|---|---|---|---|
0 | Mary | F | 6919 | 1881 |
1 | Anna | F | 2698 | 1881 |
2 | Emma | F | 2034 | 1881 |
3 | Elizabeth | F | 1852 | 1881 |
4 | Margaret | F | 1658 | 1881 |
So I have my new data frame and I want to join dataX with dataY. So long as they have the same columns along with column names then this is easy. I can just concatenate.
dataXY = pd.concat([dataX,dataY])
print('The top of my new data frame')
print(dataXY.head(3))
print('and the bottom of my new data frame')
print(dataXY.tail(3))
The top of my new data frame Name Sex number Year 0 Mary F 7065 1880 1 Anna F 2604 1880 2 Emma F 2003 1880 and the bottom of my new data frame Name Sex number Year 1932 Wing M 5 1881 1933 Wood M 5 1881 1934 Wright M 5 1881
Whats an important thing to note is that the index is not correct now, I have just joined the two data frames one on top of each other. We can illustrate this by finding the length of the data frame and notice that it is not the same as the tail.
len(dataXY)
3935
This can be rectified by reseting the index. We can call the tail() to see if it worked.
dataXY=dataXY.reset_index(drop=True)
dataXY.tail()
Name | Sex | number | Year | |
---|---|---|---|---|
3930 | Wiliam | M | 5 | 1881 |
3931 | Wilton | M | 5 | 1881 |
3932 | Wing | M | 5 | 1881 |
3933 | Wood | M | 5 | 1881 |
3934 | Wright | M | 5 | 1881 |
These methods are good if you are looking to join one or two data files, but we have over a hundred. This will lead to multiple problems, not the least we might run out of names to call out data frames.
What the next piece of code does is goes to the "names" folder, looks for ".txt" files. Then it methodically works through thoses files, asseses the name, parses out the date of the file and adds it to a dataframe and then concates the new dataframe to the dataframe that is in memory.
For this to work we care going to start with a dataframe, since I already have the code made, we first load in the original dataframe.
dataX=pd.read_csv('names/yob1880.txt',header=None)
dataX.columns =('Name','Sex','number')
dataX['Year']=int(1880)
Next, I write a loop that loads one at a time the other dataframes and concatenates them to my original data frame dataX
files=os.listdir(os.path.abspath("names"))
for filename in fnmatch.filter(files, "*.txt"):
dataY=pd.read_csv('names/'+filename,header=None)
dataY.columns =('Name','Sex','number')
dataY['Year']=int(filename[3:7])
dataX = pd.concat([dataX,dataY])
This should contain all of the names now. We can check the length of the data frame and also look at the end of the file.
print(len(dataX))
dataX.tail()
1860689
Name | Sex | number | Year | |
---|---|---|---|---|
32947 | Zykell | M | 5 | 2015 |
32948 | Zyking | M | 5 | 2015 |
32949 | Zykir | M | 5 | 2015 |
32950 | Zyrus | M | 5 | 2015 |
32951 | Zyus | M | 5 | 2015 |
The one issue that I have is the original data frame got counted twice, so I would like to remove duplicate enteries. Remember I will not remove duplicate names since the "year" will be different.
dataX = dataX.drop_duplicates()
dataX=dataX.reset_index(drop=True)
dataX=pd.DataFrame(dataX)
print(len(dataX))
print(dataX.tail())
1858689 Name Sex number Year 1858684 Zykell M 5 2015 1858685 Zyking M 5 2015 1858686 Zykir M 5 2015 1858687 Zyrus M 5 2015 1858688 Zyus M 5 2015
We can quickly garner information from this data frame.
print('The total number of names registered for a SNN since 1880 is :' + str(dataX.number.sum()))
print('The highest name registered in one particular year was :'+ str(dataX.number.max()))
The total number of names registered for a SNN since 1880 is :340851912 The highest name registered in one particular year was :99680
Of course it would be nice to find that name and year.
dataX[dataX.number==99680].head()
Name | Sex | number | Year | |
---|---|---|---|---|
431053 | Linda | F | 99680 | 1947 |
This is one of the best things about pandas, it allows you to quickly slice and dice your data frame.
So if I call a data frame dataX[some condition] then it only show me the values that meet that condition. You can also create sub-data frames by equating it to a different name. You can also reduce the data frame by equating it to itself.
dataXF = dataX[dataX.Sex=="F"]
dataXM = dataX[dataX.Sex=="M"]
print(dataXM[dataXM.number==dataXM.number.max()])
print(dataXF[dataXF.number==dataXF.number.max()])
Name Sex number Year 437156 James M 94763 1947 Name Sex number Year 431053 Linda F 99680 1947
Seems like 1947 was a good year for James and Linda
How many different male and female names have there been?
print('Number of different Male names :' + str(dataXM.Name.nunique()))
print('Number of different Female names :' + str(dataXF.Name.nunique()))
Number of different Male names :39728 Number of different Female names :65658
Maybe, Males are less likely to be chosen to obtain a SSN number?
print('Number of Males :' + str(dataXM.number.sum()))
print('Number of Females:' + str(dataXF.number.sum()))
Number of Males :171990331 Number of Females:168861581
Nope, that is about even. Lets take a look at the trend of James and Linda.
dataNF = dataXF[dataXF.Name=="Linda"]
plt.plot(dataNF.Year,dataNF.number)
dataNM = dataXM[dataXM.Name=="James"]
plt.plot(dataNM.Year,dataNM.number)
[<matplotlib.lines.Line2D at 0x11a2e6e10>]
We can look at the number of names registered each year, by using a command called groupby
dataXFT = dataXF.groupby(('Year')).number.sum()
dataXMT = dataXM.groupby(('Year')).number.sum()
plt.plot(dataXFT)
plt.plot(dataXMT)
[<matplotlib.lines.Line2D at 0x11a4e7080>]
This plainly shows that the number of new names registered with the IRS happened at about 1950's. We can find out the maximum easily
print('The maximum female registrations happened in : '+str(dataXFT.argmax()) + ', with a total number of females being registered: '+ str(dataXFT.max()))
print('The maximum male registrations happened in : '+str(dataXMT.argmax()) + ', with a total number of males being registered: '+ str(dataXMT.max()))
The maximum female registrations happened in : 1957, with a total number of females being registered: 2044225 The maximum male registrations happened in : 1957, with a total number of males being registered: 2155921
This is what commonly is called the Baby boom
I am going to bring in another file, one that shows US population from 1900 until 2016
dataP = pd.read_csv('names/USpop.csv')
dataP = dataP[['Date','Pop']]
dataP=dataP.set_index('Date')
dataP.head()
plt.plot(dataP)
[<matplotlib.lines.Line2D at 0x11af56198>]
What we want to do now is put the population for each year as another column on our large dataX set. First we can make a dictionary and then map the year in the dataX dataframe to the dictionary. Since we do not have numbers for 1880-1899, we shall leave out that part of the dataX set. This method of setting up a dictionary and using pandas is much much quicker than writing a loop.
# make a dictionary
TotalPop = pd.Series(dataP.Pop.values, index=dataP.index).to_dict()
This set unfortunately is only from 1900 to 2016, so we need to restrict our dataX and then map our dictionary onto our dataX. Next I find out what population percentage is each name registration
dataX = dataX[dataX.Year>=1900]
dataX['TotalPop'] = dataX.Year.map(TotalPop)
dataX['PopPercent'] = dataX.number/dataX.TotalPop
dataX.head()
Name | Sex | number | Year | TotalPop | PopPercent | |
---|---|---|---|---|---|---|
52265 | Mary | F | 16707 | 1900 | 76090000 | 0.000220 |
52266 | Helen | F | 6343 | 1900 | 76090000 | 0.000083 |
52267 | Anna | F | 6114 | 1900 | 76090000 | 0.000080 |
52268 | Margaret | F | 5304 | 1900 | 76090000 | 0.000070 |
52269 | Ruth | F | 4765 | 1900 | 76090000 | 0.000063 |
This again will give us an estimate of the popularity of the name compared to the population at that time.
dataXF = dataX[dataX.Sex=="F"]
dataXM = dataX[dataX.Sex=="M"]
print(dataXM[dataXM.PopPercent==dataXM.PopPercent.max()])
print(dataXF[dataXF.PopPercent==dataXF.PopPercent.max()])
Name Sex number Year TotalPop PopPercent 437156 James M 94763 1947 144130000 0.000657 Name Sex number Year TotalPop PopPercent 431053 Linda F 99680 1947 144130000 0.000692
dataNF = dataXF[dataXF.Name=="Linda"]
plt.plot(dataNF.Year,dataNF.PopPercent)
dataNM = dataXM[dataXM.Name=="James"]
plt.plot(dataNM.Year,dataNM.PopPercent)
[<matplotlib.lines.Line2D at 0x11a2e6780>]
What this population percentage number can also do is show us the growth rate of the population, by simply adding up all of the PopPercent for each name
dataXFT = dataXF.groupby(('Year')).PopPercent.sum()
dataXMT = dataXM.groupby(('Year')).PopPercent.sum()
plt.plot(dataXFT)
plt.plot(dataXMT)
[<matplotlib.lines.Line2D at 0x1392677b8>]
print('The maximum percentage growth of female registrations happened in : '+str(dataXFT.argmax()) + ', with a total growth rate of females being registered: '+ str(dataXFT.max()*2))
print('The maximum percentage growth of male registrations happened in : '+str(dataXMT.argmax()) + ', with a total growth rate of males being registered: '+ str(dataXMT.max()*2))
The maximum percentage growth of female registrations happened in : 1947, with a total growth rate of females being registered: 0.0245979324221 The maximum percentage growth of male registrations happened in : 1947, with a total growth rate of males being registered: 0.0253839311732
People always talk about the baby boom. It is visible from the above graph, in 1947-ish the growth rate of the population is double of what it is today.
print("The growth rate of females in 1947 was: " +str(dataXFT.max()*100*2)+'%')
print("But in 2015, the rate was : " +str(dataXFT[2015]*100*2)+'%')
The growth rate of females in 1947 was: 2.45979324221% But in 2015, the rate was : 1.10506839048%