Part 4: More operations on DataFrames - SOLVED#

This notebook will make use of the titanic dataset, which you will need to load. You will also need to import pandas and numpy by completing the cells below:

# complete this cell to import numpy and pandas
import numpy as np
import pandas as pd

Use the cell below to import the titanic dataset as a pandas dataframe:

# complete this cell to import the titanic dataset.
df_titanic = pd.read_excel("titanic.xlsx")
df_titanic.columns
Index(['PassengerId', 'Name', 'Sex', 'Age', 'Ticket', 'Fare', 'Cabin',
       'Survived'],
      dtype='object')

Operations on rows, columns and entries#

We have already seen how to display stats of a data frame. How about individual stats?

df_titanic.mean(numeric_only=True)
PassengerId    446.000000
Age             29.699118
Fare            32.204208
Survived         0.383838
dtype: float64

The above displayed the average of every column. How about the average of every numerical row? Remember the axis argument in numpy? You can use it here.

df_titanic.mean(axis=1,numeric_only=True)
0        7.562500
1       28.070825
2        9.481250
3       23.275000
4       12.012500
          ...    
886    231.750000
887    234.500000
888    304.150000
889    236.750000
890    232.687500
Length: 891, dtype: float64

We can also use the other functions such as sum, std, median, max, min.

df_titanic.sum(numeric_only=True)
PassengerId    397386.0000
Age             21205.1700
Fare            28693.9493
Survived          342.0000
dtype: float64

Note that sum is a bit silly and it also contatenates the strings.

df_titanic.std(numeric_only=True)
PassengerId    257.353842
Age             14.526497
Fare            49.693429
Survived         0.486592
dtype: float64
df_titanic.median(numeric_only=True)
PassengerId    446.0000
Age             28.0000
Fare            14.4542
Survived         0.0000
dtype: float64
df_titanic.max(numeric_only=True)
PassengerId    891.0000
Age             80.0000
Fare           512.3292
Survived         1.0000
dtype: float64
df_titanic.min(numeric_only=True)
PassengerId    1.00
Age            0.42
Fare           0.00
Survived       0.00
dtype: float64

Exercise 4.1#

From the df_titanic data frame find the total number of passengers who survived the sinking.

df_titanic_survived = df_titanic[df_titanic['Survived'] == 1]
n_survived = df_titanic_survived['Survived'].sum()
print(n_survived)
342

From the df_titanic data frame find the average price of the fare.

mean_fare = df_titanic['Fare'].mean()
print(mean_fare)
32.204207968574636

Correlation.#

To find how are the columns related to each other, for that we display the correlation matrix.

The correlation is a coefficient between \(-1\) and \(1\):

  • \(0\) correlation, or very low absolute value of correlation means that both columns are almost not related at all.

  • \(1\) correlation, or very high value of correlation means that they are related and their behaviours are alike.

  • \(-1\) correlation, or very low value of correlation means that they are related and their behaviour are opposite.

df_titanic.corr(numeric_only=True)
PassengerId Age Fare Survived
PassengerId 1.000000 0.036847 0.012658 -0.005007
Age 0.036847 1.000000 0.096067 -0.077221
Fare 0.012658 0.096067 1.000000 0.257307
Survived -0.005007 -0.077221 0.257307 1.000000

Exercise 4.2#

Take a moment to look at the above correlations, do you understand why on the main diagonal the value of the correlation is 1?

String operations on columns#

If we wish to split a values of data frame columns with respect to some separator (e.g. comma, fullstop, space). We can do it as follows:

surnames = df_titanic.Name.str.split(",")
surnames
0                             [Braund,  Mr. Owen Harris]
1      [Cumings,  Mrs. John Bradley (Florence Briggs ...
2                              [Heikkinen,  Miss. Laina]
3        [Futrelle,  Mrs. Jacques Heath (Lily May Peel)]
4                            [Allen,  Mr. William Henry]
                             ...                        
886                             [Montvila,  Rev. Juozas]
887                      [Graham,  Miss. Margaret Edith]
888          [Johnston,  Miss. Catherine Helen "Carrie"]
889                             [Behr,  Mr. Karl Howell]
890                               [Dooley,  Mr. Patrick]
Name: Name, Length: 891, dtype: object

To extract the surname we can use the for loop

surnames = pd.Series([surname[0] for surname in surnames.values])
display(surnames)
0         Braund
1        Cumings
2      Heikkinen
3       Futrelle
4          Allen
         ...    
886     Montvila
887       Graham
888     Johnston
889         Behr
890       Dooley
Length: 891, dtype: object

Exercise 4.5#

We wish to check all unique surnames from the pd.Series surnames.

First check the size of surnames check its size similarly as you would do that in the numpy array, using surnames.size.

To get only the unique surnames use your_pd_series.unique() and then print its size.

Why do you think the number reduced?

s = surnames.size
print(s)

surnames_unique = surnames.unique()
print(surnames_unique.size)
891
667

We can also replace symbols inside the column that contains strings:

df_titanic.Name.str.replace(".", "||")
0                               Braund, Mr|| Owen Harris
1      Cumings, Mrs|| John Bradley (Florence Briggs T...
2                                Heikkinen, Miss|| Laina
3          Futrelle, Mrs|| Jacques Heath (Lily May Peel)
4                              Allen, Mr|| William Henry
                             ...                        
886                               Montvila, Rev|| Juozas
887                        Graham, Miss|| Margaret Edith
888            Johnston, Miss|| Catherine Helen "Carrie"
889                               Behr, Mr|| Karl Howell
890                                 Dooley, Mr|| Patrick
Name: Name, Length: 891, dtype: object

Exercise 4.6#

In the df_titanic.Name column replace all the dots and commas with |. Then display your result

name_col = df_titanic.Name.str.replace(".", "|")
name_col = name_col.str.replace(",", "|")
display(name_col)
0                                Braund| Mr| Owen Harris
1      Cumings| Mrs| John Bradley (Florence Briggs Th...
2                                 Heikkinen| Miss| Laina
3           Futrelle| Mrs| Jacques Heath (Lily May Peel)
4                               Allen| Mr| William Henry
                             ...                        
886                                Montvila| Rev| Juozas
887                         Graham| Miss| Margaret Edith
888             Johnston| Miss| Catherine Helen "Carrie"
889                                Behr| Mr| Karl Howell
890                                  Dooley| Mr| Patrick
Name: Name, Length: 891, dtype: object

We can also check use str for selection:

df_titanic[df_titanic.Name.str.contains("Mrs.")]
PassengerId Name Sex Age Ticket Fare Cabin Survived
1 2 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 PC 17599 71.2833 C85 1
3 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 113803 53.1000 C123 1
8 9 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 347742 11.1333 NaN 1
9 10 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 237736 30.0708 NaN 1
15 16 Hewlett, Mrs. (Mary D Kingcome) female 55.0 248706 16.0000 NaN 1
... ... ... ... ... ... ... ... ...
871 872 Beckwith, Mrs. Richard Leonard (Sallie Monypeny) female 47.0 11751 52.5542 D35 1
874 875 Abelson, Mrs. Samuel (Hannah Wizosky) female 28.0 P/PP 3381 24.0000 NaN 1
879 880 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.0 11767 83.1583 C50 1
880 881 Shelley, Mrs. William (Imanita Parrish Hall) female 25.0 230433 26.0000 NaN 1
885 886 Rice, Mrs. William (Margaret Norton) female 39.0 382652 29.1250 NaN 0

129 rows × 8 columns

Exercise 4.7#

Was there a doctor (Dr. ) on board? Find the list of all doctors (hopefully these are medial doctors not academic ones … I (Ian) deliberately never use my Dr title when travelling for this reason!). Sam has been woken up on a flight where a medical doctor was needed rather than an astrophysicist. From the list of all doctors display their name and the Cabin.

df_titanic_drs = df_titanic[df_titanic.Name.str.contains("Dr. ")][["Name", "Cabin"]]
display(df_titanic_drs)
Name Cabin
245 Minahan, Dr. William Edward C78
317 Moraweck, Dr. Ernest NaN
398 Pain, Dr. Alfred NaN
632 Stahelin-Maeglin, Dr. Max B50
660 Frauenthal, Dr. Henry William NaN
766 Brewe, Dr. Arthur Jackson NaN
796 Leader, Dr. Alice (Farnham) D17