Part 4: More operations on DataFrames#

This notebook will make use of the titanic dataset, which you will need to load. You will also need to import pandas and numpy by completing the cells below:

# complete this cell to import numpy and pandas

Use the cell below to import the titanic dataset as a pandas dataframe:

# Load the titanic dataset
df_titanic = pd.read_excel("titanic.xlsx")
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [2], in <module>
      1 # Load the titanic dataset
----> 2 df_titanic = pd.read_excel("titanic.xlsx")

NameError: name 'pd' is not defined

Operations on rows, columns and entries#

We have already seen how to display stats of a data frame. How about individual stats?

df_titanic.mean(numeric_only=True)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [3], in <module>
----> 1 df_titanic.mean(numeric_only=True)

NameError: name 'df_titanic' is not defined

The above displayed the average of every column. How about the average of every numerical row? Remember the axis argument in numpy? You can use it here.

df_titanic.mean(axis=1,numeric_only=True)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [4], in <module>
----> 1 df_titanic.mean(axis=1,numeric_only=True)

NameError: name 'df_titanic' is not defined

We can also use the other functions such as sum, std, median, max, min.

df_titanic.sum(numeric_only=True)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [5], in <module>
----> 1 df_titanic.sum(numeric_only=True)

NameError: name 'df_titanic' is not defined
df_titanic.std(numeric_only=True)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [6], in <module>
----> 1 df_titanic.std(numeric_only=True)

NameError: name 'df_titanic' is not defined
df_titanic.median(numeric_only=True)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [7], in <module>
----> 1 df_titanic.median(numeric_only=True)

NameError: name 'df_titanic' is not defined
df_titanic.max(numeric_only=True)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [8], in <module>
----> 1 df_titanic.max(numeric_only=True)

NameError: name 'df_titanic' is not defined
df_titanic.min(numeric_only=True)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [9], in <module>
----> 1 df_titanic.min(numeric_only=True)

NameError: name 'df_titanic' is not defined

Exercise 4.1#

From the df_titanic data frame find the total number of passengers who survived the sinking.

From the df_titanic data frame find the average price of the fare.

Correlation.#

To find how are the columns related to each other, for that we display the correlation matrix.

The correlation is a coefficient between \(-1\) and \(1\):

  • \(0\) correlation, or very low absolute value of correlation means that both columns are almost not related at all.

  • \(1\) correlation, or very high value of correlation means that they are related and their behaviours are alike.

  • \(-1\) correlation, or very low value of correlation means that they are related and their behaviour are opposite.

df_titanic.corr(numeric_only=True)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [10], in <module>
----> 1 df_titanic.corr(numeric_only=True)

NameError: name 'df_titanic' is not defined

Exercise 4.2#

Take a moment to look at the above correlations, do you understand why on the main diagonal the value of the correlation is 1?

String operations on columns#

If we wish to split a values of data frame columns with respect to some separator (e.g. comma, fullstop, space). We can do it as follows:

surnames = df_titanic.Name.str.split(",")
surnames
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [11], in <module>
----> 1 surnames = df_titanic.Name.str.split(",")
      2 surnames

NameError: name 'df_titanic' is not defined

To extract the surname we can use the for loop

surnames = pd.Series([surname[0] for surname in surnames.values])
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [12], in <module>
----> 1 surnames = pd.Series([surname[0] for surname in surnames.values])

NameError: name 'pd' is not defined

Exercise 4.5#

We wish to check all unique surnames from the pd.Series surnames.

First check the size of surnames check its size similarly as you would do that in the numpy array, using surnames.size.

To get only the unique surnames use your_pd_series.unique() and then print its size.

Why do you think the number reduced?

We can also replace symbols inside the column that contains strings:

df_titanic.Name.str.replace(".", "||")
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [13], in <module>
----> 1 df_titanic.Name.str.replace(".", "||")

NameError: name 'df_titanic' is not defined

Exercise 4.6#

In the df_titanic.Name column replace all the dots and commas with |. Then display your result

We can also check use str for selection:

df_titanic[df_titanic.Name.str.contains("Mrs.")]
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [14], in <module>
----> 1 df_titanic[df_titanic.Name.str.contains("Mrs.")]

NameError: name 'df_titanic' is not defined

Exercise 4.7#

Was there a doctor (Dr. ) on board? Find the list of all doctors (hopefully these are medial doctors not academic ones … I (Ian) deliberately never use my Dr title when travelling for this reason!). Sam has been woken up on a flight where a medical doctor was needed rather than an astrophysicist. From the list of all doctors display their name and the Cabin.