Part 4: More operations on DataFrames - SOLVED#

This notebook will make use of the titanic dataset, which you will need to load. You will also need to import pandas and numpy by completing the cells below:

# complete this cell to import numpy and pandas
import numpy as np
import pandas as pd

Use the cell below to import the titanic dataset as a pandas dataframe:

# complete this cell to import the titanic dataset.
df_titanic = pd.read_excel("titanic.xlsx")

df_titanic.columns

Index(['PassengerId', 'Name', 'Sex', 'Age', 'Ticket', 'Fare', 'Cabin',
       'Survived'],
      dtype='object')

Operations on rows, columns and entries#

We have already seen how to display stats of a data frame. How about individual stats?

df_titanic.mean(numeric_only=True)

PassengerId    446.000000
Age             29.699118
Fare            32.204208
Survived         0.383838
dtype: float64

The above displayed the average of every column. How about the average of every numerical row? Remember the axis argument in numpy? You can use it here.

df_titanic.mean(axis=1,numeric_only=True)

      7.562500
     28.070825
      9.481250
     23.275000
     12.012500
          ...    
  231.750000
  234.500000
  304.150000
  236.750000
  232.687500
Length: 891, dtype: float64

We can also use the other functions such as sum, std, median, max, min.

df_titanic.sum(numeric_only=True)

PassengerId    397386.0000
Age             21205.1700
Fare            28693.9493
Survived          342.0000
dtype: float64

Note that sum is a bit silly and it also contatenates the strings.

df_titanic.std(numeric_only=True)

PassengerId    257.353842
Age             14.526497
Fare            49.693429
Survived         0.486592
dtype: float64

df_titanic.median(numeric_only=True)

PassengerId    446.0000
Age             28.0000
Fare            14.4542
Survived         0.0000
dtype: float64

df_titanic.max(numeric_only=True)

PassengerId    891.0000
Age             80.0000
Fare           512.3292
Survived         1.0000
dtype: float64

df_titanic.min(numeric_only=True)

PassengerId    1.00
Age            0.42
Fare           0.00
Survived       0.00
dtype: float64

Exercise 4.1#

From the df_titanic data frame find the total number of passengers who survived the sinking.

df_titanic_survived = df_titanic[df_titanic['Survived'] == 1]
n_survived = df_titanic_survived['Survived'].sum()
print(n_survived)

From the df_titanic data frame find the average price of the fare.

mean_fare = df_titanic['Fare'].mean()
print(mean_fare)

32.204207968574636

Correlation.#

To find how are the columns related to each other, for that we display the correlation matrix.

The correlation is a coefficient between \(-1\) and \(1\):

\(0\) correlation, or very low absolute value of correlation means that both columns are almost not related at all.
\(1\) correlation, or very high value of correlation means that they are related and their behaviours are alike.
\(-1\) correlation, or very low value of correlation means that they are related and their behaviour are opposite.

df_titanic.corr(numeric_only=True)

	PassengerId	Age	Fare	Survived
PassengerId	1.000000	0.036847	0.012658	-0.005007
Age	0.036847	1.000000	0.096067	-0.077221
Fare	0.012658	0.096067	1.000000	0.257307
Survived	-0.005007	-0.077221	0.257307	1.000000

Exercise 4.2#

Take a moment to look at the above correlations, do you understand why on the main diagonal the value of the correlation is 1?

String operations on columns#

If we wish to split a values of data frame columns with respect to some separator (e.g. comma, fullstop, space). We can do it as follows:

surnames = df_titanic.Name.str.split(",")
surnames

                           [Braund,  Mr. Owen Harris]
    [Cumings,  Mrs. John Bradley (Florence Briggs ...
                            [Heikkinen,  Miss. Laina]
      [Futrelle,  Mrs. Jacques Heath (Lily May Peel)]
                          [Allen,  Mr. William Henry]
                             ...                        
                           [Montvila,  Rev. Juozas]
                    [Graham,  Miss. Margaret Edith]
        [Johnston,  Miss. Catherine Helen "Carrie"]
                           [Behr,  Mr. Karl Howell]
                             [Dooley,  Mr. Patrick]
Name: Name, Length: 891, dtype: object

To extract the surname we can use the for loop

surnames = pd.Series([surname[0] for surname in surnames.values])
display(surnames)

       Braund
      Cumings
    Heikkinen
     Futrelle
        Allen
         ...    
   Montvila
     Graham
   Johnston
       Behr
     Dooley
Length: 891, dtype: object

Exercise 4.5#

We wish to check all unique surnames from the pd.Series surnames.

First check the size of surnames check its size similarly as you would do that in the numpy array, using surnames.size.

To get only the unique surnames use your_pd_series.unique() and then print its size.

Why do you think the number reduced?

s = surnames.size
print(s)

surnames_unique = surnames.unique()
print(surnames_unique.size)

891
667

We can also replace symbols inside the column that contains strings:

df_titanic.Name.str.replace(".", "||")

                             Braund, Mr|| Owen Harris
    Cumings, Mrs|| John Bradley (Florence Briggs T...
                              Heikkinen, Miss|| Laina
        Futrelle, Mrs|| Jacques Heath (Lily May Peel)
                            Allen, Mr|| William Henry
                             ...                        
                             Montvila, Rev|| Juozas
                      Graham, Miss|| Margaret Edith
          Johnston, Miss|| Catherine Helen "Carrie"
                             Behr, Mr|| Karl Howell
                               Dooley, Mr|| Patrick
Name: Name, Length: 891, dtype: object

Exercise 4.6#

In the df_titanic.Name column replace all the dots and commas with |. Then display your result

name_col = df_titanic.Name.str.replace(".", "|")
name_col = name_col.str.replace(",", "|")
display(name_col)

                              Braund| Mr| Owen Harris
    Cumings| Mrs| John Bradley (Florence Briggs Th...
                               Heikkinen| Miss| Laina
         Futrelle| Mrs| Jacques Heath (Lily May Peel)
                             Allen| Mr| William Henry
                             ...                        
                              Montvila| Rev| Juozas
                       Graham| Miss| Margaret Edith
           Johnston| Miss| Catherine Helen "Carrie"
                              Behr| Mr| Karl Howell
                                Dooley| Mr| Patrick
Name: Name, Length: 891, dtype: object

We can also check use str for selection:

df_titanic[df_titanic.Name.str.contains("Mrs.")]

	PassengerId	Name	Sex	Age	Ticket	Fare	Cabin	Survived
1	2	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	PC 17599	71.2833	C85	1
3	4	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	113803	53.1000	C123	1
8	9	Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)	female	27.0	347742	11.1333	NaN	1
9	10	Nasser, Mrs. Nicholas (Adele Achem)	female	14.0	237736	30.0708	NaN	1
15	16	Hewlett, Mrs. (Mary D Kingcome)	female	55.0	248706	16.0000	NaN	1
...	...	...	...	...	...	...	...	...
871	872	Beckwith, Mrs. Richard Leonard (Sallie Monypeny)	female	47.0	11751	52.5542	D35	1
874	875	Abelson, Mrs. Samuel (Hannah Wizosky)	female	28.0	P/PP 3381	24.0000	NaN	1
879	880	Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)	female	56.0	11767	83.1583	C50	1
880	881	Shelley, Mrs. William (Imanita Parrish Hall)	female	25.0	230433	26.0000	NaN	1
885	886	Rice, Mrs. William (Margaret Norton)	female	39.0	382652	29.1250	NaN	0

129 rows × 8 columns

Exercise 4.7#

Was there a doctor (Dr. ) on board? Find the list of all doctors (hopefully these are medial doctors not academic ones … I (Ian) deliberately never use my Dr title when travelling for this reason!). Sam has been woken up on a flight where a medical doctor was needed rather than an astrophysicist. From the list of all doctors display their name and the Cabin.

df_titanic_drs = df_titanic[df_titanic.Name.str.contains("Dr. ")][["Name", "Cabin"]]

display(df_titanic_drs)

	Name	Cabin
245	Minahan, Dr. William Edward	C78
317	Moraweck, Dr. Ernest	NaN
398	Pain, Dr. Alfred	NaN
632	Stahelin-Maeglin, Dr. Max	B50
660	Frauenthal, Dr. Henry William	NaN
766	Brewe, Dr. Arthur Jackson	NaN
796	Leader, Dr. Alice (Farnham)	D17

M24568 and M20741 Intro to CompSMAP

Part 4: More operations on DataFrames - SOLVED

Contents