Part 3: selecting elements from a data frame#

This workbook requires you to load the titanicdataset. You will also need to run the following block of code to import numpy and pandas:

import pandas as pd
import numpy as np

Load the titanic data set as a pandas dataframe in the code block below:

# Write code to load the Titanic data sets
df_titanic = pd.read_excel("titanic.xlsx")

Selecting particular elements of a Data Frame#

Let us display the titanic data frame again:

df_titanic.head()

	PassengerId	Name	Sex	Age	Ticket	Fare	Cabin	Survived
0	1	Braund, Mr. Owen Harris	male	22.0	A/5 21171	7.2500	NaN	0
1	2	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	PC 17599	71.2833	C85	1
2	3	Heikkinen, Miss. Laina	female	26.0	STON/O2. 3101282	7.9250	NaN	1
3	4	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	113803	53.1000	C123	1
4	5	Allen, Mr. William Henry	male	35.0	373450	8.0500	NaN	0

So far we have learnt how to display a particular column of a data frame.

df_titanic["Name"]

                              Braund, Mr. Owen Harris
    Cumings, Mrs. John Bradley (Florence Briggs Th...
                               Heikkinen, Miss. Laina
         Futrelle, Mrs. Jacques Heath (Lily May Peel)
                             Allen, Mr. William Henry
                             ...                        
                              Montvila, Rev. Juozas
                       Graham, Miss. Margaret Edith
           Johnston, Miss. Catherine Helen "Carrie"
                              Behr, Mr. Karl Howell
                                Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

How to select only entries that satisfy a particular condition. This is very similar to boolean indexing in numpy:

df_titanic[df_titanic["Age"] > 40]

	PassengerId	Name	Sex	Age	Ticket	Fare	Cabin	Survived
6	7	McCarthy, Mr. Timothy J	male	54.0	17463	51.8625	E46	0
11	12	Bonnell, Miss. Elizabeth	female	58.0	113783	26.5500	C103	1
15	16	Hewlett, Mrs. (Mary D Kingcome)	female	55.0	248706	16.0000	NaN	1
33	34	Wheadon, Mr. Edward H	male	66.0	C.A. 24579	10.5000	NaN	0
35	36	Holverson, Mr. Alexander Oskar	male	42.0	113789	52.0000	NaN	0
...	...	...	...	...	...	...	...	...
862	863	Swift, Mrs. Frederick Joel (Margaret Welles Ba...	female	48.0	17466	25.9292	D17	1
865	866	Bystrom, Mrs. (Karolina)	female	42.0	236852	13.0000	NaN	1
871	872	Beckwith, Mrs. Richard Leonard (Sallie Monypeny)	female	47.0	11751	52.5542	D35	1
873	874	Vander Cruyssen, Mr. Victor	male	47.0	345765	9.0000	NaN	0
879	880	Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)	female	56.0	11767	83.1583	C50	1

150 rows × 8 columns

Note that the above has displayed a new data frame so that it contains only the elements that have Age value \(> 40\).

Let us save it in a new variable:

df_titanic_forty_plus = df_titanic[df_titanic["Age"] > 40]
df_titanic_forty_plus

	PassengerId	Name	Sex	Age	Ticket	Fare	Cabin	Survived
6	7	McCarthy, Mr. Timothy J	male	54.0	17463	51.8625	E46	0
11	12	Bonnell, Miss. Elizabeth	female	58.0	113783	26.5500	C103	1
15	16	Hewlett, Mrs. (Mary D Kingcome)	female	55.0	248706	16.0000	NaN	1
33	34	Wheadon, Mr. Edward H	male	66.0	C.A. 24579	10.5000	NaN	0
35	36	Holverson, Mr. Alexander Oskar	male	42.0	113789	52.0000	NaN	0
...	...	...	...	...	...	...	...	...
862	863	Swift, Mrs. Frederick Joel (Margaret Welles Ba...	female	48.0	17466	25.9292	D17	1
865	866	Bystrom, Mrs. (Karolina)	female	42.0	236852	13.0000	NaN	1
871	872	Beckwith, Mrs. Richard Leonard (Sallie Monypeny)	female	47.0	11751	52.5542	D35	1
873	874	Vander Cruyssen, Mr. Victor	male	47.0	345765	9.0000	NaN	0
879	880	Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)	female	56.0	11767	83.1583	C50	1

150 rows × 8 columns

The index seems to be a bit odd, it links to the old df_titanic Data Frame, but what if we wanted this to start from \(0\).

df_titanic_forty_plus.reset_index()

	index	PassengerId	Name	Sex	Age	Ticket	Fare	Cabin	Survived
0	6	7	McCarthy, Mr. Timothy J	male	54.0	17463	51.8625	E46	0
1	11	12	Bonnell, Miss. Elizabeth	female	58.0	113783	26.5500	C103	1
2	15	16	Hewlett, Mrs. (Mary D Kingcome)	female	55.0	248706	16.0000	NaN	1
3	33	34	Wheadon, Mr. Edward H	male	66.0	C.A. 24579	10.5000	NaN	0
4	35	36	Holverson, Mr. Alexander Oskar	male	42.0	113789	52.0000	NaN	0
...	...	...	...	...	...	...	...	...	...
145	862	863	Swift, Mrs. Frederick Joel (Margaret Welles Ba...	female	48.0	17466	25.9292	D17	1
146	865	866	Bystrom, Mrs. (Karolina)	female	42.0	236852	13.0000	NaN	1
147	871	872	Beckwith, Mrs. Richard Leonard (Sallie Monypeny)	female	47.0	11751	52.5542	D35	1
148	873	874	Vander Cruyssen, Mr. Victor	male	47.0	345765	9.0000	NaN	0
149	879	880	Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)	female	56.0	11767	83.1583	C50	1

150 rows × 9 columns

The above has created a new data frame, and it saved the old index for you. What if we wanted to get rid of this index?

df_titanic_forty_plus.reset_index(drop=True)

	PassengerId	Name	Sex	Age	Ticket	Fare	Cabin	Survived
0	7	McCarthy, Mr. Timothy J	male	54.0	17463	51.8625	E46	0
1	12	Bonnell, Miss. Elizabeth	female	58.0	113783	26.5500	C103	1
2	16	Hewlett, Mrs. (Mary D Kingcome)	female	55.0	248706	16.0000	NaN	1
3	34	Wheadon, Mr. Edward H	male	66.0	C.A. 24579	10.5000	NaN	0
4	36	Holverson, Mr. Alexander Oskar	male	42.0	113789	52.0000	NaN	0
...	...	...	...	...	...	...	...	...
145	863	Swift, Mrs. Frederick Joel (Margaret Welles Ba...	female	48.0	17466	25.9292	D17	1
146	866	Bystrom, Mrs. (Karolina)	female	42.0	236852	13.0000	NaN	1
147	872	Beckwith, Mrs. Richard Leonard (Sallie Monypeny)	female	47.0	11751	52.5542	D35	1
148	874	Vander Cruyssen, Mr. Victor	male	47.0	345765	9.0000	NaN	0
149	880	Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)	female	56.0	11767	83.1583	C50	1

150 rows × 8 columns

Exercise 3.1#

In the cell below filter the titanic data frame so that it only contains the people who survived sinking the titanic. Save it in a variable df_titanic_survived and display it.

Then reset its index and drop the old index. Then again overwrite df_titanic_survived with this result and display it again.

Selection by label#

What if we wanted to display only particular columns of a data frame. We know how to display one column, but we want more.

df_titanic[["PassengerId", "Name", "Sex"]]

	PassengerId	Name	Sex
0	1	Braund, Mr. Owen Harris	male
1	2	Cumings, Mrs. John Bradley (Florence Briggs Th...	female
2	3	Heikkinen, Miss. Laina	female
3	4	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female
4	5	Allen, Mr. William Henry	male
...	...	...	...
886	887	Montvila, Rev. Juozas	male
887	888	Graham, Miss. Margaret Edith	female
888	889	Johnston, Miss. Catherine Helen "Carrie"	female
889	890	Behr, Mr. Karl Howell	male
890	891	Dooley, Mr. Patrick	male

891 rows × 3 columns

We can also use the location to get fewer rows (we have to make sure to use proper index values).

df_titanic.loc[5:10, ["PassengerId", "Name", "Sex"]]

	PassengerId	Name	Sex
5	6	Moran, Mr. James	male
6	7	McCarthy, Mr. Timothy J	male
7	8	Palsson, Master. Gosta Leonard	male
8	9	Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)	female
9	10	Nasser, Mrs. Nicholas (Adele Achem)	female
10	11	Sandstrom, Miss. Marguerite Rut	female

Exercise 3.2#

In the cell below create a data frame from df_titanic so that it contains only the columns “Name”, “Age” and “Survived” and it contains only 100 first rows.

Then filter it so that it only contains people whose age is less than 18.

Selection by position#

We can also select particular rows and columns by a numerical index. For that we use iloc rather than loc.

df_titanic.iloc[:10, 2:4]

	Sex	Age
0	male	22.0
1	female	38.0
2	female	26.0
3	female	35.0
4	male	35.0
5	male	NaN
6	male	54.0
7	male	2.0
8	female	27.0
9	female	14.0

Note that we displayed first 10 rows and the columns that have index 2 and 3.

df_titanic.columns[2], df_titanic.columns[3]

('Sex', 'Age')

Exercise 3.3#

In the cell below create a data frame from df_titanic so that it only contains the rows between 10th and 30th, and columns from 4th to 6th.

Important note that `iloc` works as follows#

DataFrame.iloc[ start:end:step, start:end:step]

The first entry represents rows the second entry represents columns.

We can also display single elements:

df_titanic.iloc[30, 7]

For single elements we recommend using iat, as it runs faster if the element is a scalar.

df_titanic.iat[30, 7]

M24568 and M20741 Intro to CompSMAP

Part 3: selecting elements from a data frame

Contents

Part 3: selecting elements from a data frame#

Selecting particular elements of a Data Frame#

Exercise 3.1#

Selection by label#

Exercise 3.2#

Selection by position#

Exercise 3.3#

Important note that `iloc` works as follows#

M24568 and M20741 Intro to CompSMAP

Part 3: selecting elements from a data frame

Contents

Part 3: selecting elements from a data frame#

Selecting particular elements of a Data Frame#

Exercise 3.1#

Selection by label#

Exercise 3.2#

Selection by position#

Exercise 3.3#

Important note that iloc works as follows#

Important note that `iloc` works as follows#