Part 3: selecting elements from a data frame - SOLVED
Contents
Part 3: selecting elements from a data frame - SOLVED#
This workbook requires you to load the titanicdataset. You will also need to run the following block of code to import numpy and pandas:
import pandas as pd
import numpy as np
Load the titanic data set as a pandas dataframes in the code block below:
# Load the titanic dataset
df_titanic = pd.read_excel("titanic.xlsx")
Selecting particular elements of a Data Frame#
Let us display the titanic data frame again:
df_titanic.head()
| PassengerId | Name | Sex | Age | Ticket | Fare | Cabin | Survived | |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Braund, Mr. Owen Harris | male | 22.0 | A/5 21171 | 7.2500 | NaN | 0 |
| 1 | 2 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | PC 17599 | 71.2833 | C85 | 1 |
| 2 | 3 | Heikkinen, Miss. Laina | female | 26.0 | STON/O2. 3101282 | 7.9250 | NaN | 1 |
| 3 | 4 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 113803 | 53.1000 | C123 | 1 |
| 4 | 5 | Allen, Mr. William Henry | male | 35.0 | 373450 | 8.0500 | NaN | 0 |
So far we have learnt how to display a particular column of a data frame.
df_titanic["Name"]
0 Braund, Mr. Owen Harris
1 Cumings, Mrs. John Bradley (Florence Briggs Th...
2 Heikkinen, Miss. Laina
3 Futrelle, Mrs. Jacques Heath (Lily May Peel)
4 Allen, Mr. William Henry
...
886 Montvila, Rev. Juozas
887 Graham, Miss. Margaret Edith
888 Johnston, Miss. Catherine Helen "Carrie"
889 Behr, Mr. Karl Howell
890 Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object
How to select only entries that satisfy a particular condition. This is very similar to boolean indexing in numpy:
df_titanic[df_titanic["Age"] > 40]
| PassengerId | Name | Sex | Age | Ticket | Fare | Cabin | Survived | |
|---|---|---|---|---|---|---|---|---|
| 6 | 7 | McCarthy, Mr. Timothy J | male | 54.0 | 17463 | 51.8625 | E46 | 0 |
| 11 | 12 | Bonnell, Miss. Elizabeth | female | 58.0 | 113783 | 26.5500 | C103 | 1 |
| 15 | 16 | Hewlett, Mrs. (Mary D Kingcome) | female | 55.0 | 248706 | 16.0000 | NaN | 1 |
| 33 | 34 | Wheadon, Mr. Edward H | male | 66.0 | C.A. 24579 | 10.5000 | NaN | 0 |
| 35 | 36 | Holverson, Mr. Alexander Oskar | male | 42.0 | 113789 | 52.0000 | NaN | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 862 | 863 | Swift, Mrs. Frederick Joel (Margaret Welles Ba... | female | 48.0 | 17466 | 25.9292 | D17 | 1 |
| 865 | 866 | Bystrom, Mrs. (Karolina) | female | 42.0 | 236852 | 13.0000 | NaN | 1 |
| 871 | 872 | Beckwith, Mrs. Richard Leonard (Sallie Monypeny) | female | 47.0 | 11751 | 52.5542 | D35 | 1 |
| 873 | 874 | Vander Cruyssen, Mr. Victor | male | 47.0 | 345765 | 9.0000 | NaN | 0 |
| 879 | 880 | Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) | female | 56.0 | 11767 | 83.1583 | C50 | 1 |
150 rows × 8 columns
Note that the above has displayed a new data frame so that it contains only the elements that have Age value \(> 40\).
Let us save it in a new variable:
df_titanic_forty_plus = df_titanic[df_titanic["Age"] > 40]
df_titanic_forty_plus
| PassengerId | Name | Sex | Age | Ticket | Fare | Cabin | Survived | |
|---|---|---|---|---|---|---|---|---|
| 6 | 7 | McCarthy, Mr. Timothy J | male | 54.0 | 17463 | 51.8625 | E46 | 0 |
| 11 | 12 | Bonnell, Miss. Elizabeth | female | 58.0 | 113783 | 26.5500 | C103 | 1 |
| 15 | 16 | Hewlett, Mrs. (Mary D Kingcome) | female | 55.0 | 248706 | 16.0000 | NaN | 1 |
| 33 | 34 | Wheadon, Mr. Edward H | male | 66.0 | C.A. 24579 | 10.5000 | NaN | 0 |
| 35 | 36 | Holverson, Mr. Alexander Oskar | male | 42.0 | 113789 | 52.0000 | NaN | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 862 | 863 | Swift, Mrs. Frederick Joel (Margaret Welles Ba... | female | 48.0 | 17466 | 25.9292 | D17 | 1 |
| 865 | 866 | Bystrom, Mrs. (Karolina) | female | 42.0 | 236852 | 13.0000 | NaN | 1 |
| 871 | 872 | Beckwith, Mrs. Richard Leonard (Sallie Monypeny) | female | 47.0 | 11751 | 52.5542 | D35 | 1 |
| 873 | 874 | Vander Cruyssen, Mr. Victor | male | 47.0 | 345765 | 9.0000 | NaN | 0 |
| 879 | 880 | Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) | female | 56.0 | 11767 | 83.1583 | C50 | 1 |
150 rows × 8 columns
The index seems to be a bit odd, it links to the old df_titanic Data Frame, but what if we wanted this to start from \(0\).
df_titanic_forty_plus.reset_index()
| index | PassengerId | Name | Sex | Age | Ticket | Fare | Cabin | Survived | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 7 | McCarthy, Mr. Timothy J | male | 54.0 | 17463 | 51.8625 | E46 | 0 |
| 1 | 11 | 12 | Bonnell, Miss. Elizabeth | female | 58.0 | 113783 | 26.5500 | C103 | 1 |
| 2 | 15 | 16 | Hewlett, Mrs. (Mary D Kingcome) | female | 55.0 | 248706 | 16.0000 | NaN | 1 |
| 3 | 33 | 34 | Wheadon, Mr. Edward H | male | 66.0 | C.A. 24579 | 10.5000 | NaN | 0 |
| 4 | 35 | 36 | Holverson, Mr. Alexander Oskar | male | 42.0 | 113789 | 52.0000 | NaN | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 145 | 862 | 863 | Swift, Mrs. Frederick Joel (Margaret Welles Ba... | female | 48.0 | 17466 | 25.9292 | D17 | 1 |
| 146 | 865 | 866 | Bystrom, Mrs. (Karolina) | female | 42.0 | 236852 | 13.0000 | NaN | 1 |
| 147 | 871 | 872 | Beckwith, Mrs. Richard Leonard (Sallie Monypeny) | female | 47.0 | 11751 | 52.5542 | D35 | 1 |
| 148 | 873 | 874 | Vander Cruyssen, Mr. Victor | male | 47.0 | 345765 | 9.0000 | NaN | 0 |
| 149 | 879 | 880 | Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) | female | 56.0 | 11767 | 83.1583 | C50 | 1 |
150 rows × 9 columns
The above has created a new data frame, and it saved the old index for you. What if we wanted to get rid of this index?
df_titanic_forty_plus.reset_index(drop=True)
| PassengerId | Name | Sex | Age | Ticket | Fare | Cabin | Survived | |
|---|---|---|---|---|---|---|---|---|
| 0 | 7 | McCarthy, Mr. Timothy J | male | 54.0 | 17463 | 51.8625 | E46 | 0 |
| 1 | 12 | Bonnell, Miss. Elizabeth | female | 58.0 | 113783 | 26.5500 | C103 | 1 |
| 2 | 16 | Hewlett, Mrs. (Mary D Kingcome) | female | 55.0 | 248706 | 16.0000 | NaN | 1 |
| 3 | 34 | Wheadon, Mr. Edward H | male | 66.0 | C.A. 24579 | 10.5000 | NaN | 0 |
| 4 | 36 | Holverson, Mr. Alexander Oskar | male | 42.0 | 113789 | 52.0000 | NaN | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 145 | 863 | Swift, Mrs. Frederick Joel (Margaret Welles Ba... | female | 48.0 | 17466 | 25.9292 | D17 | 1 |
| 146 | 866 | Bystrom, Mrs. (Karolina) | female | 42.0 | 236852 | 13.0000 | NaN | 1 |
| 147 | 872 | Beckwith, Mrs. Richard Leonard (Sallie Monypeny) | female | 47.0 | 11751 | 52.5542 | D35 | 1 |
| 148 | 874 | Vander Cruyssen, Mr. Victor | male | 47.0 | 345765 | 9.0000 | NaN | 0 |
| 149 | 880 | Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) | female | 56.0 | 11767 | 83.1583 | C50 | 1 |
150 rows × 8 columns
Exercise 3.1#
In the cell below filter the titanic data frame so that it only contains the people who survived sinking the titanic. Save it in a variable df_titanic_survived and display it.
df_titanic_survived = df_titanic[df_titanic["Survived"] == 1]
display(df_titanic_survived)
| PassengerId | Name | Sex | Age | Ticket | Fare | Cabin | Survived | |
|---|---|---|---|---|---|---|---|---|
| 1 | 2 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | PC 17599 | 71.2833 | C85 | 1 |
| 2 | 3 | Heikkinen, Miss. Laina | female | 26.0 | STON/O2. 3101282 | 7.9250 | NaN | 1 |
| 3 | 4 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 113803 | 53.1000 | C123 | 1 |
| 8 | 9 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 347742 | 11.1333 | NaN | 1 |
| 9 | 10 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 237736 | 30.0708 | NaN | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 875 | 876 | Najib, Miss. Adele Kiamie "Jane" | female | 15.0 | 2667 | 7.2250 | NaN | 1 |
| 879 | 880 | Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) | female | 56.0 | 11767 | 83.1583 | C50 | 1 |
| 880 | 881 | Shelley, Mrs. William (Imanita Parrish Hall) | female | 25.0 | 230433 | 26.0000 | NaN | 1 |
| 887 | 888 | Graham, Miss. Margaret Edith | female | 19.0 | 112053 | 30.0000 | B42 | 1 |
| 889 | 890 | Behr, Mr. Karl Howell | male | 26.0 | 111369 | 30.0000 | C148 | 1 |
342 rows × 8 columns
Then reset its index and drop the old index. Then again overwrite df_titanic_survived with this result and display it again.
df_titanic_survived = df_titanic_survived.reset_index(drop=True)
display(df_titanic_survived)
| PassengerId | Name | Sex | Age | Ticket | Fare | Cabin | Survived | |
|---|---|---|---|---|---|---|---|---|
| 0 | 2 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | PC 17599 | 71.2833 | C85 | 1 |
| 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | STON/O2. 3101282 | 7.9250 | NaN | 1 |
| 2 | 4 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 113803 | 53.1000 | C123 | 1 |
| 3 | 9 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 347742 | 11.1333 | NaN | 1 |
| 4 | 10 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 237736 | 30.0708 | NaN | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 337 | 876 | Najib, Miss. Adele Kiamie "Jane" | female | 15.0 | 2667 | 7.2250 | NaN | 1 |
| 338 | 880 | Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) | female | 56.0 | 11767 | 83.1583 | C50 | 1 |
| 339 | 881 | Shelley, Mrs. William (Imanita Parrish Hall) | female | 25.0 | 230433 | 26.0000 | NaN | 1 |
| 340 | 888 | Graham, Miss. Margaret Edith | female | 19.0 | 112053 | 30.0000 | B42 | 1 |
| 341 | 890 | Behr, Mr. Karl Howell | male | 26.0 | 111369 | 30.0000 | C148 | 1 |
342 rows × 8 columns
Selection by label#
What if we wanted to display only particular columns of a data frame. We know how to display one column, but we want more.
df_titanic[["PassengerId", "Name", "Sex"]]
| PassengerId | Name | Sex | |
|---|---|---|---|
| 0 | 1 | Braund, Mr. Owen Harris | male |
| 1 | 2 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female |
| 2 | 3 | Heikkinen, Miss. Laina | female |
| 3 | 4 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female |
| 4 | 5 | Allen, Mr. William Henry | male |
| ... | ... | ... | ... |
| 886 | 887 | Montvila, Rev. Juozas | male |
| 887 | 888 | Graham, Miss. Margaret Edith | female |
| 888 | 889 | Johnston, Miss. Catherine Helen "Carrie" | female |
| 889 | 890 | Behr, Mr. Karl Howell | male |
| 890 | 891 | Dooley, Mr. Patrick | male |
891 rows × 3 columns
We can also use the location to get fewer rows (we have to make sure to use proper index values).
df_titanic.loc[5:10, ["PassengerId", "Name", "Sex"]]
| PassengerId | Name | Sex | |
|---|---|---|---|
| 5 | 6 | Moran, Mr. James | male |
| 6 | 7 | McCarthy, Mr. Timothy J | male |
| 7 | 8 | Palsson, Master. Gosta Leonard | male |
| 8 | 9 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female |
| 9 | 10 | Nasser, Mrs. Nicholas (Adele Achem) | female |
| 10 | 11 | Sandstrom, Miss. Marguerite Rut | female |
Exercise 3.2#
In the cell below create a data frame from df_titanic so that it contains only the columns “Name”, “Age” and “Survived” and it contains only 100 first rows.
df_titanic_new = df_titanic.loc[0:99,["Name", "Age","Survived"]]
df_titanic_new
| Name | Age | Survived | |
|---|---|---|---|
| 0 | Braund, Mr. Owen Harris | 22.0 | 0 |
| 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 38.0 | 1 |
| 2 | Heikkinen, Miss. Laina | 26.0 | 1 |
| 3 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 35.0 | 1 |
| 4 | Allen, Mr. William Henry | 35.0 | 0 |
| ... | ... | ... | ... |
| 95 | Shorney, Mr. Charles Joseph | NaN | 0 |
| 96 | Goldschmidt, Mr. George B | 71.0 | 0 |
| 97 | Greenfield, Mr. William Bertram | 23.0 | 1 |
| 98 | Doling, Mrs. John T (Ada Julia Bone) | 34.0 | 1 |
| 99 | Kantor, Mr. Sinai | 34.0 | 0 |
100 rows × 3 columns
Then filter it so that it only contains people whose age is less than 18.
df_titanic_u18 = df_titanic_new[df_titanic_new["Age"] < 18]
display(df_titanic_u18)
| Name | Age | Survived | |
|---|---|---|---|
| 7 | Palsson, Master. Gosta Leonard | 2.00 | 0 |
| 9 | Nasser, Mrs. Nicholas (Adele Achem) | 14.00 | 1 |
| 10 | Sandstrom, Miss. Marguerite Rut | 4.00 | 1 |
| 14 | Vestrom, Miss. Hulda Amanda Adolfina | 14.00 | 0 |
| 16 | Rice, Master. Eugene | 2.00 | 0 |
| 22 | McGowan, Miss. Anna "Annie" | 15.00 | 1 |
| 24 | Palsson, Miss. Torborg Danira | 8.00 | 0 |
| 39 | Nicola-Yarred, Miss. Jamila | 14.00 | 1 |
| 43 | Laroche, Miss. Simonne Marie Anne Andree | 3.00 | 1 |
| 50 | Panula, Master. Juha Niilo | 7.00 | 0 |
| 58 | West, Miss. Constance Mirium | 5.00 | 1 |
| 59 | Goodwin, Master. William Frederick | 11.00 | 0 |
| 63 | Skoog, Master. Harald | 4.00 | 0 |
| 68 | Andersson, Miss. Erna Alexandra | 17.00 | 1 |
| 71 | Goodwin, Miss. Lillian Amy | 16.00 | 0 |
| 78 | Caldwell, Master. Alden Gates | 0.83 | 1 |
| 84 | Ilett, Miss. Bertha | 17.00 | 1 |
| 86 | Ford, Mr. William Neal | 16.00 | 0 |
Selection by position#
We can also select particular rows and columns by a numerical index. For that we use iloc rather than loc.
df_titanic.iloc[:10, 2:4]
| Sex | Age | |
|---|---|---|
| 0 | male | 22.0 |
| 1 | female | 38.0 |
| 2 | female | 26.0 |
| 3 | female | 35.0 |
| 4 | male | 35.0 |
| 5 | male | NaN |
| 6 | male | 54.0 |
| 7 | male | 2.0 |
| 8 | female | 27.0 |
| 9 | female | 14.0 |
Note that we displayed first 10 rows and the columns that have index 2 and 3.
df_titanic.columns[2], df_titanic.columns[3]
('Sex', 'Age')
Exercise 3.3#
In the cell below create a data frame from df_titanic so that it only contains the rows between 10th and 30th, and columns from 4th to 6th.
df_titanic_shrink = df_titanic.iloc[10:31,4:7]
display(df_titanic_shrink)
| Ticket | Fare | Cabin | |
|---|---|---|---|
| 10 | PP 9549 | 16.7000 | G6 |
| 11 | 113783 | 26.5500 | C103 |
| 12 | A/5. 2151 | 8.0500 | NaN |
| 13 | 347082 | 31.2750 | NaN |
| 14 | 350406 | 7.8542 | NaN |
| 15 | 248706 | 16.0000 | NaN |
| 16 | 382652 | 29.1250 | NaN |
| 17 | 244373 | 13.0000 | NaN |
| 18 | 345763 | 18.0000 | NaN |
| 19 | 2649 | 7.2250 | NaN |
| 20 | 239865 | 26.0000 | NaN |
| 21 | 248698 | 13.0000 | D56 |
| 22 | 330923 | 8.0292 | NaN |
| 23 | 113788 | 35.5000 | A6 |
| 24 | 349909 | 21.0750 | NaN |
| 25 | 347077 | 31.3875 | NaN |
| 26 | 2631 | 7.2250 | NaN |
| 27 | 19950 | 263.0000 | C23 C25 C27 |
| 28 | 330959 | 7.8792 | NaN |
| 29 | 349216 | 7.8958 | NaN |
| 30 | PC 17601 | 27.7208 | NaN |
Important note that iloc works as follows#
DataFrame.iloc[ start:end:step, start:end:step]
The first entry represents rows the second entry represents columns.
We can also display single elements:
df_titanic.iloc[30, 7]
0
For single elements we recommend using iat, as it runs faster if the element is a scalar.
df_titanic.iat[30, 7]
0