Part 4: More operations on DataFrames - SOLVED
Contents
Part 4: More operations on DataFrames - SOLVED#
This notebook will make use of the titanic dataset, which you will need to load. You will also need to import pandas and numpy by completing the cells below:
# complete this cell to import numpy and pandas
import numpy as np
import pandas as pd
Use the cell below to import the titanic dataset as a pandas dataframe:
# complete this cell to import the titanic dataset.
df_titanic = pd.read_excel("titanic.xlsx")
df_titanic.columns
Index(['PassengerId', 'Name', 'Sex', 'Age', 'Ticket', 'Fare', 'Cabin',
'Survived'],
dtype='object')
Operations on rows, columns and entries#
We have already seen how to display stats of a data frame. How about individual stats?
df_titanic.mean(numeric_only=True)
PassengerId 446.000000
Age 29.699118
Fare 32.204208
Survived 0.383838
dtype: float64
The above displayed the average of every column. How about the average of every numerical row? Remember the axis argument in numpy? You can use it here.
df_titanic.mean(axis=1,numeric_only=True)
0 7.562500
1 28.070825
2 9.481250
3 23.275000
4 12.012500
...
886 231.750000
887 234.500000
888 304.150000
889 236.750000
890 232.687500
Length: 891, dtype: float64
We can also use the other functions such as sum, std, median, max, min.
df_titanic.sum(numeric_only=True)
PassengerId 397386.0000
Age 21205.1700
Fare 28693.9493
Survived 342.0000
dtype: float64
Note that sum is a bit silly and it also contatenates the strings.
df_titanic.std(numeric_only=True)
PassengerId 257.353842
Age 14.526497
Fare 49.693429
Survived 0.486592
dtype: float64
df_titanic.median(numeric_only=True)
PassengerId 446.0000
Age 28.0000
Fare 14.4542
Survived 0.0000
dtype: float64
df_titanic.max(numeric_only=True)
PassengerId 891.0000
Age 80.0000
Fare 512.3292
Survived 1.0000
dtype: float64
df_titanic.min(numeric_only=True)
PassengerId 1.00
Age 0.42
Fare 0.00
Survived 0.00
dtype: float64
Exercise 4.1#
From the df_titanic data frame find the total number of passengers who survived the sinking.
df_titanic_survived = df_titanic[df_titanic['Survived'] == 1]
n_survived = df_titanic_survived['Survived'].sum()
print(n_survived)
342
From the df_titanic data frame find the average price of the fare.
mean_fare = df_titanic['Fare'].mean()
print(mean_fare)
32.204207968574636
Correlation.#
To find how are the columns related to each other, for that we display the correlation matrix.
The correlation is a coefficient between \(-1\) and \(1\):
\(0\) correlation, or very low absolute value of correlation means that both columns are almost not related at all.
\(1\) correlation, or very high value of correlation means that they are related and their behaviours are alike.
\(-1\) correlation, or very low value of correlation means that they are related and their behaviour are opposite.
df_titanic.corr(numeric_only=True)
| PassengerId | Age | Fare | Survived | |
|---|---|---|---|---|
| PassengerId | 1.000000 | 0.036847 | 0.012658 | -0.005007 |
| Age | 0.036847 | 1.000000 | 0.096067 | -0.077221 |
| Fare | 0.012658 | 0.096067 | 1.000000 | 0.257307 |
| Survived | -0.005007 | -0.077221 | 0.257307 | 1.000000 |
Exercise 4.2#
Take a moment to look at the above correlations, do you understand why on the main diagonal the value of the correlation is 1?
String operations on columns#
If we wish to split a values of data frame columns with respect to some separator (e.g. comma, fullstop, space). We can do it as follows:
surnames = df_titanic.Name.str.split(",")
surnames
0 [Braund, Mr. Owen Harris]
1 [Cumings, Mrs. John Bradley (Florence Briggs ...
2 [Heikkinen, Miss. Laina]
3 [Futrelle, Mrs. Jacques Heath (Lily May Peel)]
4 [Allen, Mr. William Henry]
...
886 [Montvila, Rev. Juozas]
887 [Graham, Miss. Margaret Edith]
888 [Johnston, Miss. Catherine Helen "Carrie"]
889 [Behr, Mr. Karl Howell]
890 [Dooley, Mr. Patrick]
Name: Name, Length: 891, dtype: object
To extract the surname we can use the for loop
surnames = pd.Series([surname[0] for surname in surnames.values])
display(surnames)
0 Braund
1 Cumings
2 Heikkinen
3 Futrelle
4 Allen
...
886 Montvila
887 Graham
888 Johnston
889 Behr
890 Dooley
Length: 891, dtype: object
Exercise 4.5#
We wish to check all unique surnames from the pd.Series surnames.
First check the size of surnames check its size similarly as you would do that in the numpy array, using surnames.size.
To get only the unique surnames use your_pd_series.unique() and then print its size.
Why do you think the number reduced?
s = surnames.size
print(s)
surnames_unique = surnames.unique()
print(surnames_unique.size)
891
667
We can also replace symbols inside the column that contains strings:
df_titanic.Name.str.replace(".", "||")
0 Braund, Mr|| Owen Harris
1 Cumings, Mrs|| John Bradley (Florence Briggs T...
2 Heikkinen, Miss|| Laina
3 Futrelle, Mrs|| Jacques Heath (Lily May Peel)
4 Allen, Mr|| William Henry
...
886 Montvila, Rev|| Juozas
887 Graham, Miss|| Margaret Edith
888 Johnston, Miss|| Catherine Helen "Carrie"
889 Behr, Mr|| Karl Howell
890 Dooley, Mr|| Patrick
Name: Name, Length: 891, dtype: object
Exercise 4.6#
In the df_titanic.Name column replace all the dots and commas with |.
Then display your result
name_col = df_titanic.Name.str.replace(".", "|")
name_col = name_col.str.replace(",", "|")
display(name_col)
0 Braund| Mr| Owen Harris
1 Cumings| Mrs| John Bradley (Florence Briggs Th...
2 Heikkinen| Miss| Laina
3 Futrelle| Mrs| Jacques Heath (Lily May Peel)
4 Allen| Mr| William Henry
...
886 Montvila| Rev| Juozas
887 Graham| Miss| Margaret Edith
888 Johnston| Miss| Catherine Helen "Carrie"
889 Behr| Mr| Karl Howell
890 Dooley| Mr| Patrick
Name: Name, Length: 891, dtype: object
We can also check use str for selection:
df_titanic[df_titanic.Name.str.contains("Mrs.")]
| PassengerId | Name | Sex | Age | Ticket | Fare | Cabin | Survived | |
|---|---|---|---|---|---|---|---|---|
| 1 | 2 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | PC 17599 | 71.2833 | C85 | 1 |
| 3 | 4 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 113803 | 53.1000 | C123 | 1 |
| 8 | 9 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 347742 | 11.1333 | NaN | 1 |
| 9 | 10 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 237736 | 30.0708 | NaN | 1 |
| 15 | 16 | Hewlett, Mrs. (Mary D Kingcome) | female | 55.0 | 248706 | 16.0000 | NaN | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 871 | 872 | Beckwith, Mrs. Richard Leonard (Sallie Monypeny) | female | 47.0 | 11751 | 52.5542 | D35 | 1 |
| 874 | 875 | Abelson, Mrs. Samuel (Hannah Wizosky) | female | 28.0 | P/PP 3381 | 24.0000 | NaN | 1 |
| 879 | 880 | Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) | female | 56.0 | 11767 | 83.1583 | C50 | 1 |
| 880 | 881 | Shelley, Mrs. William (Imanita Parrish Hall) | female | 25.0 | 230433 | 26.0000 | NaN | 1 |
| 885 | 886 | Rice, Mrs. William (Margaret Norton) | female | 39.0 | 382652 | 29.1250 | NaN | 0 |
129 rows × 8 columns
Exercise 4.7#
Was there a doctor (Dr. ) on board? Find the list of all doctors (hopefully these are medial doctors not academic ones … I (Ian) deliberately never use my Dr title when travelling for this reason!). Sam has been woken up on a flight where a medical doctor was needed rather than an astrophysicist. From the list of all doctors display their name and the Cabin.
df_titanic_drs = df_titanic[df_titanic.Name.str.contains("Dr. ")][["Name", "Cabin"]]
display(df_titanic_drs)
| Name | Cabin | |
|---|---|---|
| 245 | Minahan, Dr. William Edward | C78 |
| 317 | Moraweck, Dr. Ernest | NaN |
| 398 | Pain, Dr. Alfred | NaN |
| 632 | Stahelin-Maeglin, Dr. Max | B50 |
| 660 | Frauenthal, Dr. Henry William | NaN |
| 766 | Brewe, Dr. Arthur Jackson | NaN |
| 796 | Leader, Dr. Alice (Farnham) | D17 |