Part 3: selecting elements from a data frame
Contents
Part 3: selecting elements from a data frame#
This workbook requires you to load the titanicdataset. You will also need to run the following block of code to import numpy and pandas:
import pandas as pd
import numpy as np
Load the titanic data set as a pandas dataframe in the code block below:
# Write code to load the Titanic data sets
df_titanic = pd.read_excel("titanic.xlsx")
Selecting particular elements of a Data Frame#
Let us display the titanic data frame again:
df_titanic.head()
| PassengerId | Name | Sex | Age | Ticket | Fare | Cabin | Survived | |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Braund, Mr. Owen Harris | male | 22.0 | A/5 21171 | 7.2500 | NaN | 0 |
| 1 | 2 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | PC 17599 | 71.2833 | C85 | 1 |
| 2 | 3 | Heikkinen, Miss. Laina | female | 26.0 | STON/O2. 3101282 | 7.9250 | NaN | 1 |
| 3 | 4 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 113803 | 53.1000 | C123 | 1 |
| 4 | 5 | Allen, Mr. William Henry | male | 35.0 | 373450 | 8.0500 | NaN | 0 |
So far we have learnt how to display a particular column of a data frame.
df_titanic["Name"]
0 Braund, Mr. Owen Harris
1 Cumings, Mrs. John Bradley (Florence Briggs Th...
2 Heikkinen, Miss. Laina
3 Futrelle, Mrs. Jacques Heath (Lily May Peel)
4 Allen, Mr. William Henry
...
886 Montvila, Rev. Juozas
887 Graham, Miss. Margaret Edith
888 Johnston, Miss. Catherine Helen "Carrie"
889 Behr, Mr. Karl Howell
890 Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object
How to select only entries that satisfy a particular condition. This is very similar to boolean indexing in numpy:
df_titanic[df_titanic["Age"] > 40]
| PassengerId | Name | Sex | Age | Ticket | Fare | Cabin | Survived | |
|---|---|---|---|---|---|---|---|---|
| 6 | 7 | McCarthy, Mr. Timothy J | male | 54.0 | 17463 | 51.8625 | E46 | 0 |
| 11 | 12 | Bonnell, Miss. Elizabeth | female | 58.0 | 113783 | 26.5500 | C103 | 1 |
| 15 | 16 | Hewlett, Mrs. (Mary D Kingcome) | female | 55.0 | 248706 | 16.0000 | NaN | 1 |
| 33 | 34 | Wheadon, Mr. Edward H | male | 66.0 | C.A. 24579 | 10.5000 | NaN | 0 |
| 35 | 36 | Holverson, Mr. Alexander Oskar | male | 42.0 | 113789 | 52.0000 | NaN | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 862 | 863 | Swift, Mrs. Frederick Joel (Margaret Welles Ba... | female | 48.0 | 17466 | 25.9292 | D17 | 1 |
| 865 | 866 | Bystrom, Mrs. (Karolina) | female | 42.0 | 236852 | 13.0000 | NaN | 1 |
| 871 | 872 | Beckwith, Mrs. Richard Leonard (Sallie Monypeny) | female | 47.0 | 11751 | 52.5542 | D35 | 1 |
| 873 | 874 | Vander Cruyssen, Mr. Victor | male | 47.0 | 345765 | 9.0000 | NaN | 0 |
| 879 | 880 | Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) | female | 56.0 | 11767 | 83.1583 | C50 | 1 |
150 rows × 8 columns
Note that the above has displayed a new data frame so that it contains only the elements that have Age value \(> 40\).
Let us save it in a new variable:
df_titanic_forty_plus = df_titanic[df_titanic["Age"] > 40]
df_titanic_forty_plus
| PassengerId | Name | Sex | Age | Ticket | Fare | Cabin | Survived | |
|---|---|---|---|---|---|---|---|---|
| 6 | 7 | McCarthy, Mr. Timothy J | male | 54.0 | 17463 | 51.8625 | E46 | 0 |
| 11 | 12 | Bonnell, Miss. Elizabeth | female | 58.0 | 113783 | 26.5500 | C103 | 1 |
| 15 | 16 | Hewlett, Mrs. (Mary D Kingcome) | female | 55.0 | 248706 | 16.0000 | NaN | 1 |
| 33 | 34 | Wheadon, Mr. Edward H | male | 66.0 | C.A. 24579 | 10.5000 | NaN | 0 |
| 35 | 36 | Holverson, Mr. Alexander Oskar | male | 42.0 | 113789 | 52.0000 | NaN | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 862 | 863 | Swift, Mrs. Frederick Joel (Margaret Welles Ba... | female | 48.0 | 17466 | 25.9292 | D17 | 1 |
| 865 | 866 | Bystrom, Mrs. (Karolina) | female | 42.0 | 236852 | 13.0000 | NaN | 1 |
| 871 | 872 | Beckwith, Mrs. Richard Leonard (Sallie Monypeny) | female | 47.0 | 11751 | 52.5542 | D35 | 1 |
| 873 | 874 | Vander Cruyssen, Mr. Victor | male | 47.0 | 345765 | 9.0000 | NaN | 0 |
| 879 | 880 | Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) | female | 56.0 | 11767 | 83.1583 | C50 | 1 |
150 rows × 8 columns
The index seems to be a bit odd, it links to the old df_titanic Data Frame, but what if we wanted this to start from \(0\).
df_titanic_forty_plus.reset_index()
| index | PassengerId | Name | Sex | Age | Ticket | Fare | Cabin | Survived | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 7 | McCarthy, Mr. Timothy J | male | 54.0 | 17463 | 51.8625 | E46 | 0 |
| 1 | 11 | 12 | Bonnell, Miss. Elizabeth | female | 58.0 | 113783 | 26.5500 | C103 | 1 |
| 2 | 15 | 16 | Hewlett, Mrs. (Mary D Kingcome) | female | 55.0 | 248706 | 16.0000 | NaN | 1 |
| 3 | 33 | 34 | Wheadon, Mr. Edward H | male | 66.0 | C.A. 24579 | 10.5000 | NaN | 0 |
| 4 | 35 | 36 | Holverson, Mr. Alexander Oskar | male | 42.0 | 113789 | 52.0000 | NaN | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 145 | 862 | 863 | Swift, Mrs. Frederick Joel (Margaret Welles Ba... | female | 48.0 | 17466 | 25.9292 | D17 | 1 |
| 146 | 865 | 866 | Bystrom, Mrs. (Karolina) | female | 42.0 | 236852 | 13.0000 | NaN | 1 |
| 147 | 871 | 872 | Beckwith, Mrs. Richard Leonard (Sallie Monypeny) | female | 47.0 | 11751 | 52.5542 | D35 | 1 |
| 148 | 873 | 874 | Vander Cruyssen, Mr. Victor | male | 47.0 | 345765 | 9.0000 | NaN | 0 |
| 149 | 879 | 880 | Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) | female | 56.0 | 11767 | 83.1583 | C50 | 1 |
150 rows × 9 columns
The above has created a new data frame, and it saved the old index for you. What if we wanted to get rid of this index?
df_titanic_forty_plus.reset_index(drop=True)
| PassengerId | Name | Sex | Age | Ticket | Fare | Cabin | Survived | |
|---|---|---|---|---|---|---|---|---|
| 0 | 7 | McCarthy, Mr. Timothy J | male | 54.0 | 17463 | 51.8625 | E46 | 0 |
| 1 | 12 | Bonnell, Miss. Elizabeth | female | 58.0 | 113783 | 26.5500 | C103 | 1 |
| 2 | 16 | Hewlett, Mrs. (Mary D Kingcome) | female | 55.0 | 248706 | 16.0000 | NaN | 1 |
| 3 | 34 | Wheadon, Mr. Edward H | male | 66.0 | C.A. 24579 | 10.5000 | NaN | 0 |
| 4 | 36 | Holverson, Mr. Alexander Oskar | male | 42.0 | 113789 | 52.0000 | NaN | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 145 | 863 | Swift, Mrs. Frederick Joel (Margaret Welles Ba... | female | 48.0 | 17466 | 25.9292 | D17 | 1 |
| 146 | 866 | Bystrom, Mrs. (Karolina) | female | 42.0 | 236852 | 13.0000 | NaN | 1 |
| 147 | 872 | Beckwith, Mrs. Richard Leonard (Sallie Monypeny) | female | 47.0 | 11751 | 52.5542 | D35 | 1 |
| 148 | 874 | Vander Cruyssen, Mr. Victor | male | 47.0 | 345765 | 9.0000 | NaN | 0 |
| 149 | 880 | Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) | female | 56.0 | 11767 | 83.1583 | C50 | 1 |
150 rows × 8 columns
Exercise 3.1#
In the cell below filter the titanic data frame so that it only contains the people who survived sinking the titanic. Save it in a variable df_titanic_survived and display it.
Then reset its index and drop the old index. Then again overwrite df_titanic_survived with this result and display it again.
Selection by label#
What if we wanted to display only particular columns of a data frame. We know how to display one column, but we want more.
df_titanic[["PassengerId", "Name", "Sex"]]
| PassengerId | Name | Sex | |
|---|---|---|---|
| 0 | 1 | Braund, Mr. Owen Harris | male |
| 1 | 2 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female |
| 2 | 3 | Heikkinen, Miss. Laina | female |
| 3 | 4 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female |
| 4 | 5 | Allen, Mr. William Henry | male |
| ... | ... | ... | ... |
| 886 | 887 | Montvila, Rev. Juozas | male |
| 887 | 888 | Graham, Miss. Margaret Edith | female |
| 888 | 889 | Johnston, Miss. Catherine Helen "Carrie" | female |
| 889 | 890 | Behr, Mr. Karl Howell | male |
| 890 | 891 | Dooley, Mr. Patrick | male |
891 rows × 3 columns
We can also use the location to get fewer rows (we have to make sure to use proper index values).
df_titanic.loc[5:10, ["PassengerId", "Name", "Sex"]]
| PassengerId | Name | Sex | |
|---|---|---|---|
| 5 | 6 | Moran, Mr. James | male |
| 6 | 7 | McCarthy, Mr. Timothy J | male |
| 7 | 8 | Palsson, Master. Gosta Leonard | male |
| 8 | 9 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female |
| 9 | 10 | Nasser, Mrs. Nicholas (Adele Achem) | female |
| 10 | 11 | Sandstrom, Miss. Marguerite Rut | female |
Exercise 3.2#
In the cell below create a data frame from df_titanic so that it contains only the columns “Name”, “Age” and “Survived” and it contains only 100 first rows.
Then filter it so that it only contains people whose age is less than 18.
Selection by position#
We can also select particular rows and columns by a numerical index. For that we use iloc rather than loc.
df_titanic.iloc[:10, 2:4]
| Sex | Age | |
|---|---|---|
| 0 | male | 22.0 |
| 1 | female | 38.0 |
| 2 | female | 26.0 |
| 3 | female | 35.0 |
| 4 | male | 35.0 |
| 5 | male | NaN |
| 6 | male | 54.0 |
| 7 | male | 2.0 |
| 8 | female | 27.0 |
| 9 | female | 14.0 |
Note that we displayed first 10 rows and the columns that have index 2 and 3.
df_titanic.columns[2], df_titanic.columns[3]
('Sex', 'Age')
Exercise 3.3#
In the cell below create a data frame from df_titanic so that it only contains the rows between 10th and 30th, and columns from 4th to 6th.
Important note that iloc works as follows#
DataFrame.iloc[ start:end:step, start:end:step]
The first entry represents rows the second entry represents columns.
We can also display single elements:
df_titanic.iloc[30, 7]
0
For single elements we recommend using iat, as it runs faster if the element is a scalar.
df_titanic.iat[30, 7]
0