Part 2: viewing data#

This workbook requires you to load the titanic and avocado datasets. You will also need to run the following block of code to import numpy and pandas:

import pandas as pd
import numpy as np

Load the titanic and avocado data sets as a pandas dataframes in the code block below:

# Load the Titanic and avocado data sets
df_titanic = pd.read_excel("titanic.xlsx")
df_avocado = pd.read_excel("avocado.xlsx")

Exploring datasets in more detail#

Let’s come back to our titanic example. We can access the index of the DataFrame as follows:

df_titanic.index
RangeIndex(start=0, stop=891, step=1)

By default it is a pandas RangeIndex type, it works similarly to range it starts at 0, the last entry is stop - 1, the step is 1.

We may use different types of indexing, but for now we are going to use the default one.

How about displaying all the column names of our Data Frame? We do it as follows:

df_titanic.columns
Index(['PassengerId', 'Name', 'Sex', 'Age', 'Ticket', 'Fare', 'Cabin',
       'Survived'],
      dtype='object')

This type can be treated as a list or numpy array, we can call its elements via an index.

df_titanic.columns[0], df_titanic.columns[-1]
('PassengerId', 'Survived')

We already know how to view values of a particular column, e.g.

df_titanic["Survived"]

If the name of the column does not contain spaces we can also view the values by

df_titanic.Survived
0      0
1      1
2      1
3      1
4      0
      ..
886    0
887    1
888    0
889    1
890    0
Name: Survived, Length: 891, dtype: int64

If we wish to get a numpy array from the pd.Series we use your_pd_series.values:

df_titanic.Survived.values
array([0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1,
       1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1,
       1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1,
       1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0,
       1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0,
       0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0,
       1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1,
       0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1,
       1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0,
       0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0,
       0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0,
       1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,
       1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0,
       0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1,
       1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1,
       0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0,
       0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0,
       0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0,
       1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1,
       0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1,
       0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1,
       1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1,
       1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0])

Pandas is very compatible with numpy, in fact, we can simply convert a DataFrame to a numpy array.

titanic_np_array = df_titanic.to_numpy()
print(titanic_np_array)
[[1 'Braund, Mr. Owen Harris' 'male' ... 7.25 nan 0]
 [2 'Cumings, Mrs. John Bradley (Florence Briggs Thayer)' 'female' ...
  71.2833 'C85' 1]
 [3 'Heikkinen, Miss. Laina' 'female' ... 7.925 nan 1]
 ...
 [889 'Johnston, Miss. Catherine Helen "Carrie"' 'female' ... 23.45 nan 0]
 [890 'Behr, Mr. Karl Howell' 'male' ... 30.0 'C148' 1]
 [891 'Dooley, Mr. Patrick' 'male' ... 7.75 nan 0]]

We can also get a quick statistical summary of our data. This is done via:

df_titanic.describe()
PassengerId Age Fare Survived
count 891.000000 714.000000 891.000000 891.000000
mean 446.000000 29.699118 32.204208 0.383838
std 257.353842 14.526497 49.693429 0.486592
min 1.000000 0.420000 0.000000 0.000000
25% 223.500000 20.125000 7.910400 0.000000
50% 446.000000 28.000000 14.454200 0.000000
75% 668.500000 38.000000 31.000000 1.000000
max 891.000000 80.000000 512.329200 1.000000

Note that for pandas displaying produces nicer outputs than printing

print(df_titanic.describe())
       PassengerId         Age        Fare    Survived
count   891.000000  714.000000  891.000000  891.000000
mean    446.000000   29.699118   32.204208    0.383838
std     257.353842   14.526497   49.693429    0.486592
min       1.000000    0.420000    0.000000    0.000000
25%     223.500000   20.125000    7.910400    0.000000
50%     446.000000   28.000000   14.454200    0.000000
75%     668.500000   38.000000   31.000000    1.000000
max     891.000000   80.000000  512.329200    1.000000

Displaying can be also achieved through display command as follows:

display(df_titanic.describe())
PassengerId Age Fare Survived
count 891.000000 714.000000 891.000000 891.000000
mean 446.000000 29.699118 32.204208 0.383838
std 257.353842 14.526497 49.693429 0.486592
min 1.000000 0.420000 0.000000 0.000000
25% 223.500000 20.125000 7.910400 0.000000
50% 446.000000 28.000000 14.454200 0.000000
75% 668.500000 38.000000 31.000000 1.000000
max 891.000000 80.000000 512.329200 1.000000

Note that the above stats are only for the numerical columns.

Exercise 2.1#

With the avocado data frame from Exercise 0.1:

Please display its columns.

Display the stats of this data frame:

Display all the entries of this data frame in the column Total Bags.

Convert your data frame to a numpy array and then print it.

Transposing your data#

You may have heard about the transposing operation. In matrices, a transpose swaps the rows with columns. This operation makes sense with numpy arrays and Data Frames as well.

my_matrix = np.array([[1, 2], [3, 4]])
print(f"Original matrix \n {my_matrix}")
print(f"Transposed matrix \n {my_matrix.T}")
Original matrix 
 [[1 2]
 [3 4]]
Transposed matrix 
 [[1 3]
 [2 4]]
df_titanic_t = df_titanic.T
df_titanic_t
0 1 2 3 4 5 6 7 8 9 ... 881 882 883 884 885 886 887 888 889 890
PassengerId 1 2 3 4 5 6 7 8 9 10 ... 882 883 884 885 886 887 888 889 890 891
Name Braund, Mr. Owen Harris Cumings, Mrs. John Bradley (Florence Briggs Th... Heikkinen, Miss. Laina Futrelle, Mrs. Jacques Heath (Lily May Peel) Allen, Mr. William Henry Moran, Mr. James McCarthy, Mr. Timothy J Palsson, Master. Gosta Leonard Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) Nasser, Mrs. Nicholas (Adele Achem) ... Markun, Mr. Johann Dahlberg, Miss. Gerda Ulrika Banfield, Mr. Frederick James Sutehall, Mr. Henry Jr Rice, Mrs. William (Margaret Norton) Montvila, Rev. Juozas Graham, Miss. Margaret Edith Johnston, Miss. Catherine Helen "Carrie" Behr, Mr. Karl Howell Dooley, Mr. Patrick
Sex male female female female male male male male female female ... male female male male female male female female male male
Age 22.0 38.0 26.0 35.0 35.0 NaN 54.0 2.0 27.0 14.0 ... 33.0 22.0 28.0 25.0 39.0 27.0 19.0 NaN 26.0 32.0
Ticket A/5 21171 PC 17599 STON/O2. 3101282 113803 373450 330877 17463 349909 347742 237736 ... 349257 7552 C.A./SOTON 34068 SOTON/OQ 392076 382652 211536 112053 W./C. 6607 111369 370376
Fare 7.25 71.2833 7.925 53.1 8.05 8.4583 51.8625 21.075 11.1333 30.0708 ... 7.8958 10.5167 10.5 7.05 29.125 13.0 30.0 23.45 30.0 7.75
Cabin NaN C85 NaN C123 NaN NaN E46 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN B42 NaN C148 NaN
Survived 0 1 1 1 0 0 0 0 1 1 ... 0 0 0 0 0 0 1 0 1 0

8 rows × 891 columns

With this particular example it is not the best thing to do. However, we have got an interesting data frame. Let’s spend some time on it.

Exercise 2.2#

Display the index of the above transposed titanic DataFrame.

Note that it starts with an index but its elements can be accessed as lists.