Part 3: selecting elements from a data frame - SOLVED#

This workbook requires you to load the titanicdataset. You will also need to run the following block of code to import numpy and pandas:

import pandas as pd
import numpy as np

Load the titanic data set as a pandas dataframes in the code block below:

# Load the titanic dataset
df_titanic = pd.read_excel("titanic.xlsx")

Selecting particular elements of a Data Frame#

Let us display the titanic data frame again:

df_titanic.head()

	PassengerId	Name	Sex	Age	Ticket	Fare	Cabin	Survived
0	1	Braund, Mr. Owen Harris	male	22.0	A/5 21171	7.2500	NaN	0
1	2	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	PC 17599	71.2833	C85	1
2	3	Heikkinen, Miss. Laina	female	26.0	STON/O2. 3101282	7.9250	NaN	1
3	4	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	113803	53.1000	C123	1
4	5	Allen, Mr. William Henry	male	35.0	373450	8.0500	NaN	0

So far we have learnt how to display a particular column of a data frame.

df_titanic["Name"]

                              Braund, Mr. Owen Harris
    Cumings, Mrs. John Bradley (Florence Briggs Th...
                               Heikkinen, Miss. Laina
         Futrelle, Mrs. Jacques Heath (Lily May Peel)
                             Allen, Mr. William Henry
                             ...                        
                              Montvila, Rev. Juozas
                       Graham, Miss. Margaret Edith
           Johnston, Miss. Catherine Helen "Carrie"
                              Behr, Mr. Karl Howell
                                Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

How to select only entries that satisfy a particular condition. This is very similar to boolean indexing in numpy:

df_titanic[df_titanic["Age"] > 40]

	PassengerId	Name	Sex	Age	Ticket	Fare	Cabin	Survived
6	7	McCarthy, Mr. Timothy J	male	54.0	17463	51.8625	E46	0
11	12	Bonnell, Miss. Elizabeth	female	58.0	113783	26.5500	C103	1
15	16	Hewlett, Mrs. (Mary D Kingcome)	female	55.0	248706	16.0000	NaN	1
33	34	Wheadon, Mr. Edward H	male	66.0	C.A. 24579	10.5000	NaN	0
35	36	Holverson, Mr. Alexander Oskar	male	42.0	113789	52.0000	NaN	0
...	...	...	...	...	...	...	...	...
862	863	Swift, Mrs. Frederick Joel (Margaret Welles Ba...	female	48.0	17466	25.9292	D17	1
865	866	Bystrom, Mrs. (Karolina)	female	42.0	236852	13.0000	NaN	1
871	872	Beckwith, Mrs. Richard Leonard (Sallie Monypeny)	female	47.0	11751	52.5542	D35	1
873	874	Vander Cruyssen, Mr. Victor	male	47.0	345765	9.0000	NaN	0
879	880	Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)	female	56.0	11767	83.1583	C50	1

150 rows × 8 columns

Note that the above has displayed a new data frame so that it contains only the elements that have Age value \(> 40\).

Let us save it in a new variable:

df_titanic_forty_plus = df_titanic[df_titanic["Age"] > 40]
df_titanic_forty_plus

	PassengerId	Name	Sex	Age	Ticket	Fare	Cabin	Survived
6	7	McCarthy, Mr. Timothy J	male	54.0	17463	51.8625	E46	0
11	12	Bonnell, Miss. Elizabeth	female	58.0	113783	26.5500	C103	1
15	16	Hewlett, Mrs. (Mary D Kingcome)	female	55.0	248706	16.0000	NaN	1
33	34	Wheadon, Mr. Edward H	male	66.0	C.A. 24579	10.5000	NaN	0
35	36	Holverson, Mr. Alexander Oskar	male	42.0	113789	52.0000	NaN	0
...	...	...	...	...	...	...	...	...
862	863	Swift, Mrs. Frederick Joel (Margaret Welles Ba...	female	48.0	17466	25.9292	D17	1
865	866	Bystrom, Mrs. (Karolina)	female	42.0	236852	13.0000	NaN	1
871	872	Beckwith, Mrs. Richard Leonard (Sallie Monypeny)	female	47.0	11751	52.5542	D35	1
873	874	Vander Cruyssen, Mr. Victor	male	47.0	345765	9.0000	NaN	0
879	880	Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)	female	56.0	11767	83.1583	C50	1

150 rows × 8 columns

The index seems to be a bit odd, it links to the old df_titanic Data Frame, but what if we wanted this to start from \(0\).

df_titanic_forty_plus.reset_index()

	index	PassengerId	Name	Sex	Age	Ticket	Fare	Cabin	Survived
0	6	7	McCarthy, Mr. Timothy J	male	54.0	17463	51.8625	E46	0
1	11	12	Bonnell, Miss. Elizabeth	female	58.0	113783	26.5500	C103	1
2	15	16	Hewlett, Mrs. (Mary D Kingcome)	female	55.0	248706	16.0000	NaN	1
3	33	34	Wheadon, Mr. Edward H	male	66.0	C.A. 24579	10.5000	NaN	0
4	35	36	Holverson, Mr. Alexander Oskar	male	42.0	113789	52.0000	NaN	0
...	...	...	...	...	...	...	...	...	...
145	862	863	Swift, Mrs. Frederick Joel (Margaret Welles Ba...	female	48.0	17466	25.9292	D17	1
146	865	866	Bystrom, Mrs. (Karolina)	female	42.0	236852	13.0000	NaN	1
147	871	872	Beckwith, Mrs. Richard Leonard (Sallie Monypeny)	female	47.0	11751	52.5542	D35	1
148	873	874	Vander Cruyssen, Mr. Victor	male	47.0	345765	9.0000	NaN	0
149	879	880	Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)	female	56.0	11767	83.1583	C50	1

150 rows × 9 columns

The above has created a new data frame, and it saved the old index for you. What if we wanted to get rid of this index?

df_titanic_forty_plus.reset_index(drop=True)

	PassengerId	Name	Sex	Age	Ticket	Fare	Cabin	Survived
0	7	McCarthy, Mr. Timothy J	male	54.0	17463	51.8625	E46	0
1	12	Bonnell, Miss. Elizabeth	female	58.0	113783	26.5500	C103	1
2	16	Hewlett, Mrs. (Mary D Kingcome)	female	55.0	248706	16.0000	NaN	1
3	34	Wheadon, Mr. Edward H	male	66.0	C.A. 24579	10.5000	NaN	0
4	36	Holverson, Mr. Alexander Oskar	male	42.0	113789	52.0000	NaN	0
...	...	...	...	...	...	...	...	...
145	863	Swift, Mrs. Frederick Joel (Margaret Welles Ba...	female	48.0	17466	25.9292	D17	1
146	866	Bystrom, Mrs. (Karolina)	female	42.0	236852	13.0000	NaN	1
147	872	Beckwith, Mrs. Richard Leonard (Sallie Monypeny)	female	47.0	11751	52.5542	D35	1
148	874	Vander Cruyssen, Mr. Victor	male	47.0	345765	9.0000	NaN	0
149	880	Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)	female	56.0	11767	83.1583	C50	1

150 rows × 8 columns

Exercise 3.1#

In the cell below filter the titanic data frame so that it only contains the people who survived sinking the titanic. Save it in a variable df_titanic_survived and display it.

df_titanic_survived = df_titanic[df_titanic["Survived"] == 1]

display(df_titanic_survived)

	PassengerId	Name	Sex	Age	Ticket	Fare	Cabin	Survived
1	2	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	PC 17599	71.2833	C85	1
2	3	Heikkinen, Miss. Laina	female	26.0	STON/O2. 3101282	7.9250	NaN	1
3	4	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	113803	53.1000	C123	1
8	9	Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)	female	27.0	347742	11.1333	NaN	1
9	10	Nasser, Mrs. Nicholas (Adele Achem)	female	14.0	237736	30.0708	NaN	1
...	...	...	...	...	...	...	...	...
875	876	Najib, Miss. Adele Kiamie "Jane"	female	15.0	2667	7.2250	NaN	1
879	880	Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)	female	56.0	11767	83.1583	C50	1
880	881	Shelley, Mrs. William (Imanita Parrish Hall)	female	25.0	230433	26.0000	NaN	1
887	888	Graham, Miss. Margaret Edith	female	19.0	112053	30.0000	B42	1
889	890	Behr, Mr. Karl Howell	male	26.0	111369	30.0000	C148	1

342 rows × 8 columns

Then reset its index and drop the old index. Then again overwrite df_titanic_survived with this result and display it again.

df_titanic_survived = df_titanic_survived.reset_index(drop=True)
display(df_titanic_survived)

	PassengerId	Name	Sex	Age	Ticket	Fare	Cabin	Survived
0	2	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	PC 17599	71.2833	C85	1
1	3	Heikkinen, Miss. Laina	female	26.0	STON/O2. 3101282	7.9250	NaN	1
2	4	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	113803	53.1000	C123	1
3	9	Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)	female	27.0	347742	11.1333	NaN	1
4	10	Nasser, Mrs. Nicholas (Adele Achem)	female	14.0	237736	30.0708	NaN	1
...	...	...	...	...	...	...	...	...
337	876	Najib, Miss. Adele Kiamie "Jane"	female	15.0	2667	7.2250	NaN	1
338	880	Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)	female	56.0	11767	83.1583	C50	1
339	881	Shelley, Mrs. William (Imanita Parrish Hall)	female	25.0	230433	26.0000	NaN	1
340	888	Graham, Miss. Margaret Edith	female	19.0	112053	30.0000	B42	1
341	890	Behr, Mr. Karl Howell	male	26.0	111369	30.0000	C148	1

342 rows × 8 columns

Selection by label#

What if we wanted to display only particular columns of a data frame. We know how to display one column, but we want more.

df_titanic[["PassengerId", "Name", "Sex"]]

	PassengerId	Name	Sex
0	1	Braund, Mr. Owen Harris	male
1	2	Cumings, Mrs. John Bradley (Florence Briggs Th...	female
2	3	Heikkinen, Miss. Laina	female
3	4	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female
4	5	Allen, Mr. William Henry	male
...	...	...	...
886	887	Montvila, Rev. Juozas	male
887	888	Graham, Miss. Margaret Edith	female
888	889	Johnston, Miss. Catherine Helen "Carrie"	female
889	890	Behr, Mr. Karl Howell	male
890	891	Dooley, Mr. Patrick	male

891 rows × 3 columns

We can also use the location to get fewer rows (we have to make sure to use proper index values).

df_titanic.loc[5:10, ["PassengerId", "Name", "Sex"]]

	PassengerId	Name	Sex
5	6	Moran, Mr. James	male
6	7	McCarthy, Mr. Timothy J	male
7	8	Palsson, Master. Gosta Leonard	male
8	9	Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)	female
9	10	Nasser, Mrs. Nicholas (Adele Achem)	female
10	11	Sandstrom, Miss. Marguerite Rut	female

Exercise 3.2#

In the cell below create a data frame from df_titanic so that it contains only the columns “Name”, “Age” and “Survived” and it contains only 100 first rows.

df_titanic_new = df_titanic.loc[0:99,["Name", "Age","Survived"]]
df_titanic_new

	Name	Age	Survived
0	Braund, Mr. Owen Harris	22.0	0
1	Cumings, Mrs. John Bradley (Florence Briggs Th...	38.0	1
2	Heikkinen, Miss. Laina	26.0	1
3	Futrelle, Mrs. Jacques Heath (Lily May Peel)	35.0	1
4	Allen, Mr. William Henry	35.0	0
...	...	...	...
95	Shorney, Mr. Charles Joseph	NaN	0
96	Goldschmidt, Mr. George B	71.0	0
97	Greenfield, Mr. William Bertram	23.0	1
98	Doling, Mrs. John T (Ada Julia Bone)	34.0	1
99	Kantor, Mr. Sinai	34.0	0

100 rows × 3 columns

Then filter it so that it only contains people whose age is less than 18.

df_titanic_u18 = df_titanic_new[df_titanic_new["Age"] < 18]
display(df_titanic_u18)

	Name	Age	Survived
7	Palsson, Master. Gosta Leonard	2.00	0
9	Nasser, Mrs. Nicholas (Adele Achem)	14.00	1
10	Sandstrom, Miss. Marguerite Rut	4.00	1
14	Vestrom, Miss. Hulda Amanda Adolfina	14.00	0
16	Rice, Master. Eugene	2.00	0
22	McGowan, Miss. Anna "Annie"	15.00	1
24	Palsson, Miss. Torborg Danira	8.00	0
39	Nicola-Yarred, Miss. Jamila	14.00	1
43	Laroche, Miss. Simonne Marie Anne Andree	3.00	1
50	Panula, Master. Juha Niilo	7.00	0
58	West, Miss. Constance Mirium	5.00	1
59	Goodwin, Master. William Frederick	11.00	0
63	Skoog, Master. Harald	4.00	0
68	Andersson, Miss. Erna Alexandra	17.00	1
71	Goodwin, Miss. Lillian Amy	16.00	0
78	Caldwell, Master. Alden Gates	0.83	1
84	Ilett, Miss. Bertha	17.00	1
86	Ford, Mr. William Neal	16.00	0

Selection by position#

We can also select particular rows and columns by a numerical index. For that we use iloc rather than loc.

df_titanic.iloc[:10, 2:4]

	Sex	Age
0	male	22.0
1	female	38.0
2	female	26.0
3	female	35.0
4	male	35.0
5	male	NaN
6	male	54.0
7	male	2.0
8	female	27.0
9	female	14.0

Note that we displayed first 10 rows and the columns that have index 2 and 3.

df_titanic.columns[2], df_titanic.columns[3]

('Sex', 'Age')

Exercise 3.3#

In the cell below create a data frame from df_titanic so that it only contains the rows between 10th and 30th, and columns from 4th to 6th.

df_titanic_shrink = df_titanic.iloc[10:31,4:7]
display(df_titanic_shrink)

	Ticket	Fare	Cabin
10	PP 9549	16.7000	G6
11	113783	26.5500	C103
12	A/5. 2151	8.0500	NaN
13	347082	31.2750	NaN
14	350406	7.8542	NaN
15	248706	16.0000	NaN
16	382652	29.1250	NaN
17	244373	13.0000	NaN
18	345763	18.0000	NaN
19	2649	7.2250	NaN
20	239865	26.0000	NaN
21	248698	13.0000	D56
22	330923	8.0292	NaN
23	113788	35.5000	A6
24	349909	21.0750	NaN
25	347077	31.3875	NaN
26	2631	7.2250	NaN
27	19950	263.0000	C23 C25 C27
28	330959	7.8792	NaN
29	349216	7.8958	NaN
30	PC 17601	27.7208	NaN

Important note that `iloc` works as follows#

DataFrame.iloc[ start:end:step, start:end:step]

The first entry represents rows the second entry represents columns.

We can also display single elements:

df_titanic.iloc[30, 7]

For single elements we recommend using iat, as it runs faster if the element is a scalar.

df_titanic.iat[30, 7]

M24568 and M20741 Intro to CompSMAP

Part 3: selecting elements from a data frame - SOLVED

Contents

Part 3: selecting elements from a data frame - SOLVED#

Selecting particular elements of a Data Frame#

Exercise 3.1#

Selection by label#

Exercise 3.2#

Selection by position#

Exercise 3.3#

Important note that `iloc` works as follows#

M24568 and M20741 Intro to CompSMAP

Part 3: selecting elements from a data frame - SOLVED

Contents

Part 3: selecting elements from a data frame - SOLVED#

Selecting particular elements of a Data Frame#

Exercise 3.1#

Selection by label#

Exercise 3.2#

Selection by position#

Exercise 3.3#

Important note that iloc works as follows#

Important note that `iloc` works as follows#