Part 4: More operations on DataFrames
Contents
Part 4: More operations on DataFrames#
This notebook will make use of the titanic dataset, which you will need to load. You will also need to import pandas and numpy by completing the cells below:
# complete this cell to import numpy and pandas
Use the cell below to import the titanic dataset as a pandas dataframe:
# Load the titanic dataset
df_titanic = pd.read_excel("titanic.xlsx")
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Input In [2], in <module>
1 # Load the titanic dataset
----> 2 df_titanic = pd.read_excel("titanic.xlsx")
NameError: name 'pd' is not defined
Operations on rows, columns and entries#
We have already seen how to display stats of a data frame. How about individual stats?
df_titanic.mean(numeric_only=True)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Input In [3], in <module>
----> 1 df_titanic.mean(numeric_only=True)
NameError: name 'df_titanic' is not defined
The above displayed the average of every column. How about the average of every numerical row? Remember the axis argument in numpy? You can use it here.
df_titanic.mean(axis=1,numeric_only=True)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Input In [4], in <module>
----> 1 df_titanic.mean(axis=1,numeric_only=True)
NameError: name 'df_titanic' is not defined
We can also use the other functions such as sum, std, median, max, min.
df_titanic.sum(numeric_only=True)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Input In [5], in <module>
----> 1 df_titanic.sum(numeric_only=True)
NameError: name 'df_titanic' is not defined
df_titanic.std(numeric_only=True)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Input In [6], in <module>
----> 1 df_titanic.std(numeric_only=True)
NameError: name 'df_titanic' is not defined
df_titanic.median(numeric_only=True)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Input In [7], in <module>
----> 1 df_titanic.median(numeric_only=True)
NameError: name 'df_titanic' is not defined
df_titanic.max(numeric_only=True)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Input In [8], in <module>
----> 1 df_titanic.max(numeric_only=True)
NameError: name 'df_titanic' is not defined
df_titanic.min(numeric_only=True)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Input In [9], in <module>
----> 1 df_titanic.min(numeric_only=True)
NameError: name 'df_titanic' is not defined
Exercise 4.1#
From the df_titanic data frame find the total number of passengers who survived the sinking.
From the df_titanic data frame find the average price of the fare.
Correlation.#
To find how are the columns related to each other, for that we display the correlation matrix.
The correlation is a coefficient between \(-1\) and \(1\):
\(0\) correlation, or very low absolute value of correlation means that both columns are almost not related at all.
\(1\) correlation, or very high value of correlation means that they are related and their behaviours are alike.
\(-1\) correlation, or very low value of correlation means that they are related and their behaviour are opposite.
df_titanic.corr(numeric_only=True)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Input In [10], in <module>
----> 1 df_titanic.corr(numeric_only=True)
NameError: name 'df_titanic' is not defined
Exercise 4.2#
Take a moment to look at the above correlations, do you understand why on the main diagonal the value of the correlation is 1?
String operations on columns#
If we wish to split a values of data frame columns with respect to some separator (e.g. comma, fullstop, space). We can do it as follows:
surnames = df_titanic.Name.str.split(",")
surnames
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Input In [11], in <module>
----> 1 surnames = df_titanic.Name.str.split(",")
2 surnames
NameError: name 'df_titanic' is not defined
To extract the surname we can use the for loop
surnames = pd.Series([surname[0] for surname in surnames.values])
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Input In [12], in <module>
----> 1 surnames = pd.Series([surname[0] for surname in surnames.values])
NameError: name 'pd' is not defined
Exercise 4.5#
We wish to check all unique surnames from the pd.Series surnames.
First check the size of surnames check its size similarly as you would do that in the numpy array, using surnames.size.
To get only the unique surnames use your_pd_series.unique() and then print its size.
Why do you think the number reduced?
We can also replace symbols inside the column that contains strings:
df_titanic.Name.str.replace(".", "||")
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Input In [13], in <module>
----> 1 df_titanic.Name.str.replace(".", "||")
NameError: name 'df_titanic' is not defined
Exercise 4.6#
In the df_titanic.Name column replace all the dots and commas with |.
Then display your result
We can also check use str for selection:
df_titanic[df_titanic.Name.str.contains("Mrs.")]
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Input In [14], in <module>
----> 1 df_titanic[df_titanic.Name.str.contains("Mrs.")]
NameError: name 'df_titanic' is not defined
Exercise 4.7#
Was there a doctor (Dr. ) on board? Find the list of all doctors (hopefully these are medial doctors not academic ones … I (Ian) deliberately never use my Dr title when travelling for this reason!). Sam has been woken up on a flight where a medical doctor was needed rather than an astrophysicist. From the list of all doctors display their name and the Cabin.