Part 3: selecting elements from a data frame#

This workbook requires you to load the titanicdataset. You will also need to run the following block of code to import numpy and pandas:

import pandas as pd
import numpy as np

Load the titanic data set as a pandas dataframe in the code block below:

# Write code to load the Titanic data sets
df_titanic = pd.read_excel("titanic.xlsx")

Selecting particular elements of a Data Frame#

Let us display the titanic data frame again:

df_titanic.head()
PassengerId Name Sex Age Ticket Fare Cabin Survived
0 1 Braund, Mr. Owen Harris male 22.0 A/5 21171 7.2500 NaN 0
1 2 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 PC 17599 71.2833 C85 1
2 3 Heikkinen, Miss. Laina female 26.0 STON/O2. 3101282 7.9250 NaN 1
3 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 113803 53.1000 C123 1
4 5 Allen, Mr. William Henry male 35.0 373450 8.0500 NaN 0

So far we have learnt how to display a particular column of a data frame.

df_titanic["Name"]
0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

How to select only entries that satisfy a particular condition. This is very similar to boolean indexing in numpy:

df_titanic[df_titanic["Age"] > 40]
PassengerId Name Sex Age Ticket Fare Cabin Survived
6 7 McCarthy, Mr. Timothy J male 54.0 17463 51.8625 E46 0
11 12 Bonnell, Miss. Elizabeth female 58.0 113783 26.5500 C103 1
15 16 Hewlett, Mrs. (Mary D Kingcome) female 55.0 248706 16.0000 NaN 1
33 34 Wheadon, Mr. Edward H male 66.0 C.A. 24579 10.5000 NaN 0
35 36 Holverson, Mr. Alexander Oskar male 42.0 113789 52.0000 NaN 0
... ... ... ... ... ... ... ... ...
862 863 Swift, Mrs. Frederick Joel (Margaret Welles Ba... female 48.0 17466 25.9292 D17 1
865 866 Bystrom, Mrs. (Karolina) female 42.0 236852 13.0000 NaN 1
871 872 Beckwith, Mrs. Richard Leonard (Sallie Monypeny) female 47.0 11751 52.5542 D35 1
873 874 Vander Cruyssen, Mr. Victor male 47.0 345765 9.0000 NaN 0
879 880 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.0 11767 83.1583 C50 1

150 rows × 8 columns

Note that the above has displayed a new data frame so that it contains only the elements that have Age value \(> 40\).

Let us save it in a new variable:

df_titanic_forty_plus = df_titanic[df_titanic["Age"] > 40]
df_titanic_forty_plus
PassengerId Name Sex Age Ticket Fare Cabin Survived
6 7 McCarthy, Mr. Timothy J male 54.0 17463 51.8625 E46 0
11 12 Bonnell, Miss. Elizabeth female 58.0 113783 26.5500 C103 1
15 16 Hewlett, Mrs. (Mary D Kingcome) female 55.0 248706 16.0000 NaN 1
33 34 Wheadon, Mr. Edward H male 66.0 C.A. 24579 10.5000 NaN 0
35 36 Holverson, Mr. Alexander Oskar male 42.0 113789 52.0000 NaN 0
... ... ... ... ... ... ... ... ...
862 863 Swift, Mrs. Frederick Joel (Margaret Welles Ba... female 48.0 17466 25.9292 D17 1
865 866 Bystrom, Mrs. (Karolina) female 42.0 236852 13.0000 NaN 1
871 872 Beckwith, Mrs. Richard Leonard (Sallie Monypeny) female 47.0 11751 52.5542 D35 1
873 874 Vander Cruyssen, Mr. Victor male 47.0 345765 9.0000 NaN 0
879 880 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.0 11767 83.1583 C50 1

150 rows × 8 columns

The index seems to be a bit odd, it links to the old df_titanic Data Frame, but what if we wanted this to start from \(0\).

df_titanic_forty_plus.reset_index()
index PassengerId Name Sex Age Ticket Fare Cabin Survived
0 6 7 McCarthy, Mr. Timothy J male 54.0 17463 51.8625 E46 0
1 11 12 Bonnell, Miss. Elizabeth female 58.0 113783 26.5500 C103 1
2 15 16 Hewlett, Mrs. (Mary D Kingcome) female 55.0 248706 16.0000 NaN 1
3 33 34 Wheadon, Mr. Edward H male 66.0 C.A. 24579 10.5000 NaN 0
4 35 36 Holverson, Mr. Alexander Oskar male 42.0 113789 52.0000 NaN 0
... ... ... ... ... ... ... ... ... ...
145 862 863 Swift, Mrs. Frederick Joel (Margaret Welles Ba... female 48.0 17466 25.9292 D17 1
146 865 866 Bystrom, Mrs. (Karolina) female 42.0 236852 13.0000 NaN 1
147 871 872 Beckwith, Mrs. Richard Leonard (Sallie Monypeny) female 47.0 11751 52.5542 D35 1
148 873 874 Vander Cruyssen, Mr. Victor male 47.0 345765 9.0000 NaN 0
149 879 880 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.0 11767 83.1583 C50 1

150 rows × 9 columns

The above has created a new data frame, and it saved the old index for you. What if we wanted to get rid of this index?

df_titanic_forty_plus.reset_index(drop=True)
PassengerId Name Sex Age Ticket Fare Cabin Survived
0 7 McCarthy, Mr. Timothy J male 54.0 17463 51.8625 E46 0
1 12 Bonnell, Miss. Elizabeth female 58.0 113783 26.5500 C103 1
2 16 Hewlett, Mrs. (Mary D Kingcome) female 55.0 248706 16.0000 NaN 1
3 34 Wheadon, Mr. Edward H male 66.0 C.A. 24579 10.5000 NaN 0
4 36 Holverson, Mr. Alexander Oskar male 42.0 113789 52.0000 NaN 0
... ... ... ... ... ... ... ... ...
145 863 Swift, Mrs. Frederick Joel (Margaret Welles Ba... female 48.0 17466 25.9292 D17 1
146 866 Bystrom, Mrs. (Karolina) female 42.0 236852 13.0000 NaN 1
147 872 Beckwith, Mrs. Richard Leonard (Sallie Monypeny) female 47.0 11751 52.5542 D35 1
148 874 Vander Cruyssen, Mr. Victor male 47.0 345765 9.0000 NaN 0
149 880 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.0 11767 83.1583 C50 1

150 rows × 8 columns

Exercise 3.1#

In the cell below filter the titanic data frame so that it only contains the people who survived sinking the titanic. Save it in a variable df_titanic_survived and display it.

Then reset its index and drop the old index. Then again overwrite df_titanic_survived with this result and display it again.

Selection by label#

What if we wanted to display only particular columns of a data frame. We know how to display one column, but we want more.

df_titanic[["PassengerId", "Name", "Sex"]]
PassengerId Name Sex
0 1 Braund, Mr. Owen Harris male
1 2 Cumings, Mrs. John Bradley (Florence Briggs Th... female
2 3 Heikkinen, Miss. Laina female
3 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female
4 5 Allen, Mr. William Henry male
... ... ... ...
886 887 Montvila, Rev. Juozas male
887 888 Graham, Miss. Margaret Edith female
888 889 Johnston, Miss. Catherine Helen "Carrie" female
889 890 Behr, Mr. Karl Howell male
890 891 Dooley, Mr. Patrick male

891 rows × 3 columns

We can also use the location to get fewer rows (we have to make sure to use proper index values).

df_titanic.loc[5:10, ["PassengerId", "Name", "Sex"]]
PassengerId Name Sex
5 6 Moran, Mr. James male
6 7 McCarthy, Mr. Timothy J male
7 8 Palsson, Master. Gosta Leonard male
8 9 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female
9 10 Nasser, Mrs. Nicholas (Adele Achem) female
10 11 Sandstrom, Miss. Marguerite Rut female

Exercise 3.2#

In the cell below create a data frame from df_titanic so that it contains only the columns “Name”, “Age” and “Survived” and it contains only 100 first rows.

Then filter it so that it only contains people whose age is less than 18.

Selection by position#

We can also select particular rows and columns by a numerical index. For that we use iloc rather than loc.

df_titanic.iloc[:10, 2:4]
Sex Age
0 male 22.0
1 female 38.0
2 female 26.0
3 female 35.0
4 male 35.0
5 male NaN
6 male 54.0
7 male 2.0
8 female 27.0
9 female 14.0

Note that we displayed first 10 rows and the columns that have index 2 and 3.

df_titanic.columns[2], df_titanic.columns[3]
('Sex', 'Age')

Exercise 3.3#

In the cell below create a data frame from df_titanic so that it only contains the rows between 10th and 30th, and columns from 4th to 6th.

Important note that iloc works as follows#

DataFrame.iloc[ start:end:step, start:end:step]

The first entry represents rows the second entry represents columns.

We can also display single elements:

df_titanic.iloc[30, 7]
0

For single elements we recommend using iat, as it runs faster if the element is a scalar.

df_titanic.iat[30, 7]
0