Part 3: selecting elements from a data frame - SOLVED#

This workbook requires you to load the titanicdataset. You will also need to run the following block of code to import numpy and pandas:

import pandas as pd
import numpy as np

Load the titanic data set as a pandas dataframes in the code block below:

# Load the titanic dataset
df_titanic = pd.read_excel("titanic.xlsx")

Selecting particular elements of a Data Frame#

Let us display the titanic data frame again:

df_titanic.head()
PassengerId Name Sex Age Ticket Fare Cabin Survived
0 1 Braund, Mr. Owen Harris male 22.0 A/5 21171 7.2500 NaN 0
1 2 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 PC 17599 71.2833 C85 1
2 3 Heikkinen, Miss. Laina female 26.0 STON/O2. 3101282 7.9250 NaN 1
3 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 113803 53.1000 C123 1
4 5 Allen, Mr. William Henry male 35.0 373450 8.0500 NaN 0

So far we have learnt how to display a particular column of a data frame.

df_titanic["Name"]
0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

How to select only entries that satisfy a particular condition. This is very similar to boolean indexing in numpy:

df_titanic[df_titanic["Age"] > 40]
PassengerId Name Sex Age Ticket Fare Cabin Survived
6 7 McCarthy, Mr. Timothy J male 54.0 17463 51.8625 E46 0
11 12 Bonnell, Miss. Elizabeth female 58.0 113783 26.5500 C103 1
15 16 Hewlett, Mrs. (Mary D Kingcome) female 55.0 248706 16.0000 NaN 1
33 34 Wheadon, Mr. Edward H male 66.0 C.A. 24579 10.5000 NaN 0
35 36 Holverson, Mr. Alexander Oskar male 42.0 113789 52.0000 NaN 0
... ... ... ... ... ... ... ... ...
862 863 Swift, Mrs. Frederick Joel (Margaret Welles Ba... female 48.0 17466 25.9292 D17 1
865 866 Bystrom, Mrs. (Karolina) female 42.0 236852 13.0000 NaN 1
871 872 Beckwith, Mrs. Richard Leonard (Sallie Monypeny) female 47.0 11751 52.5542 D35 1
873 874 Vander Cruyssen, Mr. Victor male 47.0 345765 9.0000 NaN 0
879 880 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.0 11767 83.1583 C50 1

150 rows × 8 columns

Note that the above has displayed a new data frame so that it contains only the elements that have Age value \(> 40\).

Let us save it in a new variable:

df_titanic_forty_plus = df_titanic[df_titanic["Age"] > 40]
df_titanic_forty_plus
PassengerId Name Sex Age Ticket Fare Cabin Survived
6 7 McCarthy, Mr. Timothy J male 54.0 17463 51.8625 E46 0
11 12 Bonnell, Miss. Elizabeth female 58.0 113783 26.5500 C103 1
15 16 Hewlett, Mrs. (Mary D Kingcome) female 55.0 248706 16.0000 NaN 1
33 34 Wheadon, Mr. Edward H male 66.0 C.A. 24579 10.5000 NaN 0
35 36 Holverson, Mr. Alexander Oskar male 42.0 113789 52.0000 NaN 0
... ... ... ... ... ... ... ... ...
862 863 Swift, Mrs. Frederick Joel (Margaret Welles Ba... female 48.0 17466 25.9292 D17 1
865 866 Bystrom, Mrs. (Karolina) female 42.0 236852 13.0000 NaN 1
871 872 Beckwith, Mrs. Richard Leonard (Sallie Monypeny) female 47.0 11751 52.5542 D35 1
873 874 Vander Cruyssen, Mr. Victor male 47.0 345765 9.0000 NaN 0
879 880 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.0 11767 83.1583 C50 1

150 rows × 8 columns

The index seems to be a bit odd, it links to the old df_titanic Data Frame, but what if we wanted this to start from \(0\).

df_titanic_forty_plus.reset_index()
index PassengerId Name Sex Age Ticket Fare Cabin Survived
0 6 7 McCarthy, Mr. Timothy J male 54.0 17463 51.8625 E46 0
1 11 12 Bonnell, Miss. Elizabeth female 58.0 113783 26.5500 C103 1
2 15 16 Hewlett, Mrs. (Mary D Kingcome) female 55.0 248706 16.0000 NaN 1
3 33 34 Wheadon, Mr. Edward H male 66.0 C.A. 24579 10.5000 NaN 0
4 35 36 Holverson, Mr. Alexander Oskar male 42.0 113789 52.0000 NaN 0
... ... ... ... ... ... ... ... ... ...
145 862 863 Swift, Mrs. Frederick Joel (Margaret Welles Ba... female 48.0 17466 25.9292 D17 1
146 865 866 Bystrom, Mrs. (Karolina) female 42.0 236852 13.0000 NaN 1
147 871 872 Beckwith, Mrs. Richard Leonard (Sallie Monypeny) female 47.0 11751 52.5542 D35 1
148 873 874 Vander Cruyssen, Mr. Victor male 47.0 345765 9.0000 NaN 0
149 879 880 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.0 11767 83.1583 C50 1

150 rows × 9 columns

The above has created a new data frame, and it saved the old index for you. What if we wanted to get rid of this index?

df_titanic_forty_plus.reset_index(drop=True)
PassengerId Name Sex Age Ticket Fare Cabin Survived
0 7 McCarthy, Mr. Timothy J male 54.0 17463 51.8625 E46 0
1 12 Bonnell, Miss. Elizabeth female 58.0 113783 26.5500 C103 1
2 16 Hewlett, Mrs. (Mary D Kingcome) female 55.0 248706 16.0000 NaN 1
3 34 Wheadon, Mr. Edward H male 66.0 C.A. 24579 10.5000 NaN 0
4 36 Holverson, Mr. Alexander Oskar male 42.0 113789 52.0000 NaN 0
... ... ... ... ... ... ... ... ...
145 863 Swift, Mrs. Frederick Joel (Margaret Welles Ba... female 48.0 17466 25.9292 D17 1
146 866 Bystrom, Mrs. (Karolina) female 42.0 236852 13.0000 NaN 1
147 872 Beckwith, Mrs. Richard Leonard (Sallie Monypeny) female 47.0 11751 52.5542 D35 1
148 874 Vander Cruyssen, Mr. Victor male 47.0 345765 9.0000 NaN 0
149 880 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.0 11767 83.1583 C50 1

150 rows × 8 columns

Exercise 3.1#

In the cell below filter the titanic data frame so that it only contains the people who survived sinking the titanic. Save it in a variable df_titanic_survived and display it.

df_titanic_survived = df_titanic[df_titanic["Survived"] == 1]

display(df_titanic_survived)
PassengerId Name Sex Age Ticket Fare Cabin Survived
1 2 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 PC 17599 71.2833 C85 1
2 3 Heikkinen, Miss. Laina female 26.0 STON/O2. 3101282 7.9250 NaN 1
3 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 113803 53.1000 C123 1
8 9 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 347742 11.1333 NaN 1
9 10 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 237736 30.0708 NaN 1
... ... ... ... ... ... ... ... ...
875 876 Najib, Miss. Adele Kiamie "Jane" female 15.0 2667 7.2250 NaN 1
879 880 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.0 11767 83.1583 C50 1
880 881 Shelley, Mrs. William (Imanita Parrish Hall) female 25.0 230433 26.0000 NaN 1
887 888 Graham, Miss. Margaret Edith female 19.0 112053 30.0000 B42 1
889 890 Behr, Mr. Karl Howell male 26.0 111369 30.0000 C148 1

342 rows × 8 columns

Then reset its index and drop the old index. Then again overwrite df_titanic_survived with this result and display it again.

df_titanic_survived = df_titanic_survived.reset_index(drop=True)
display(df_titanic_survived)
PassengerId Name Sex Age Ticket Fare Cabin Survived
0 2 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 PC 17599 71.2833 C85 1
1 3 Heikkinen, Miss. Laina female 26.0 STON/O2. 3101282 7.9250 NaN 1
2 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 113803 53.1000 C123 1
3 9 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 347742 11.1333 NaN 1
4 10 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 237736 30.0708 NaN 1
... ... ... ... ... ... ... ... ...
337 876 Najib, Miss. Adele Kiamie "Jane" female 15.0 2667 7.2250 NaN 1
338 880 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.0 11767 83.1583 C50 1
339 881 Shelley, Mrs. William (Imanita Parrish Hall) female 25.0 230433 26.0000 NaN 1
340 888 Graham, Miss. Margaret Edith female 19.0 112053 30.0000 B42 1
341 890 Behr, Mr. Karl Howell male 26.0 111369 30.0000 C148 1

342 rows × 8 columns

Selection by label#

What if we wanted to display only particular columns of a data frame. We know how to display one column, but we want more.

df_titanic[["PassengerId", "Name", "Sex"]]
PassengerId Name Sex
0 1 Braund, Mr. Owen Harris male
1 2 Cumings, Mrs. John Bradley (Florence Briggs Th... female
2 3 Heikkinen, Miss. Laina female
3 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female
4 5 Allen, Mr. William Henry male
... ... ... ...
886 887 Montvila, Rev. Juozas male
887 888 Graham, Miss. Margaret Edith female
888 889 Johnston, Miss. Catherine Helen "Carrie" female
889 890 Behr, Mr. Karl Howell male
890 891 Dooley, Mr. Patrick male

891 rows × 3 columns

We can also use the location to get fewer rows (we have to make sure to use proper index values).

df_titanic.loc[5:10, ["PassengerId", "Name", "Sex"]]
PassengerId Name Sex
5 6 Moran, Mr. James male
6 7 McCarthy, Mr. Timothy J male
7 8 Palsson, Master. Gosta Leonard male
8 9 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female
9 10 Nasser, Mrs. Nicholas (Adele Achem) female
10 11 Sandstrom, Miss. Marguerite Rut female

Exercise 3.2#

In the cell below create a data frame from df_titanic so that it contains only the columns “Name”, “Age” and “Survived” and it contains only 100 first rows.

df_titanic_new = df_titanic.loc[0:99,["Name", "Age","Survived"]]
df_titanic_new
Name Age Survived
0 Braund, Mr. Owen Harris 22.0 0
1 Cumings, Mrs. John Bradley (Florence Briggs Th... 38.0 1
2 Heikkinen, Miss. Laina 26.0 1
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) 35.0 1
4 Allen, Mr. William Henry 35.0 0
... ... ... ...
95 Shorney, Mr. Charles Joseph NaN 0
96 Goldschmidt, Mr. George B 71.0 0
97 Greenfield, Mr. William Bertram 23.0 1
98 Doling, Mrs. John T (Ada Julia Bone) 34.0 1
99 Kantor, Mr. Sinai 34.0 0

100 rows × 3 columns

Then filter it so that it only contains people whose age is less than 18.

df_titanic_u18 = df_titanic_new[df_titanic_new["Age"] < 18]
display(df_titanic_u18)
Name Age Survived
7 Palsson, Master. Gosta Leonard 2.00 0
9 Nasser, Mrs. Nicholas (Adele Achem) 14.00 1
10 Sandstrom, Miss. Marguerite Rut 4.00 1
14 Vestrom, Miss. Hulda Amanda Adolfina 14.00 0
16 Rice, Master. Eugene 2.00 0
22 McGowan, Miss. Anna "Annie" 15.00 1
24 Palsson, Miss. Torborg Danira 8.00 0
39 Nicola-Yarred, Miss. Jamila 14.00 1
43 Laroche, Miss. Simonne Marie Anne Andree 3.00 1
50 Panula, Master. Juha Niilo 7.00 0
58 West, Miss. Constance Mirium 5.00 1
59 Goodwin, Master. William Frederick 11.00 0
63 Skoog, Master. Harald 4.00 0
68 Andersson, Miss. Erna Alexandra 17.00 1
71 Goodwin, Miss. Lillian Amy 16.00 0
78 Caldwell, Master. Alden Gates 0.83 1
84 Ilett, Miss. Bertha 17.00 1
86 Ford, Mr. William Neal 16.00 0

Selection by position#

We can also select particular rows and columns by a numerical index. For that we use iloc rather than loc.

df_titanic.iloc[:10, 2:4]
Sex Age
0 male 22.0
1 female 38.0
2 female 26.0
3 female 35.0
4 male 35.0
5 male NaN
6 male 54.0
7 male 2.0
8 female 27.0
9 female 14.0

Note that we displayed first 10 rows and the columns that have index 2 and 3.

df_titanic.columns[2], df_titanic.columns[3]
('Sex', 'Age')

Exercise 3.3#

In the cell below create a data frame from df_titanic so that it only contains the rows between 10th and 30th, and columns from 4th to 6th.

df_titanic_shrink = df_titanic.iloc[10:31,4:7]
display(df_titanic_shrink)
Ticket Fare Cabin
10 PP 9549 16.7000 G6
11 113783 26.5500 C103
12 A/5. 2151 8.0500 NaN
13 347082 31.2750 NaN
14 350406 7.8542 NaN
15 248706 16.0000 NaN
16 382652 29.1250 NaN
17 244373 13.0000 NaN
18 345763 18.0000 NaN
19 2649 7.2250 NaN
20 239865 26.0000 NaN
21 248698 13.0000 D56
22 330923 8.0292 NaN
23 113788 35.5000 A6
24 349909 21.0750 NaN
25 347077 31.3875 NaN
26 2631 7.2250 NaN
27 19950 263.0000 C23 C25 C27
28 330959 7.8792 NaN
29 349216 7.8958 NaN
30 PC 17601 27.7208 NaN

Important note that iloc works as follows#

DataFrame.iloc[ start:end:step, start:end:step]

The first entry represents rows the second entry represents columns.

We can also display single elements:

df_titanic.iloc[30, 7]
0

For single elements we recommend using iat, as it runs faster if the element is a scalar.

df_titanic.iat[30, 7]
0