Part 2: viewing data#

This workbook requires you to load the titanic and avocado datasets. You will also need to run the following block of code to import numpy and pandas:

import pandas as pd
import numpy as np

Load the titanic and avocado data sets as a pandas dataframes in the code block below:

# Load the Titanic and avocado data sets
df_titanic = pd.read_excel("titanic.xlsx")
df_avocado = pd.read_excel("avocado.xlsx")

Exploring datasets in more detail#

Let’s come back to our titanic example. We can access the index of the DataFrame as follows:

df_titanic.index

RangeIndex(start=0, stop=891, step=1)

By default it is a pandas RangeIndex type, it works similarly to range it starts at 0, the last entry is stop - 1, the step is 1.

We may use different types of indexing, but for now we are going to use the default one.

How about displaying all the column names of our Data Frame? We do it as follows:

df_titanic.columns

Index(['PassengerId', 'Name', 'Sex', 'Age', 'Ticket', 'Fare', 'Cabin',
       'Survived'],
      dtype='object')

This type can be treated as a list or numpy array, we can call its elements via an index.

df_titanic.columns[0], df_titanic.columns[-1]

('PassengerId', 'Survived')

We already know how to view values of a particular column, e.g.

df_titanic["Survived"]

If the name of the column does not contain spaces we can also view the values by

df_titanic.Survived

    0
    1
    1
    1
    0
      ..
  0
  1
  0
  1
  0
Name: Survived, Length: 891, dtype: int64

If we wish to get a numpy array from the pd.Series we use your_pd_series.values:

df_titanic.Survived.values

array([0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1,
       1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1,
       1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1,
       1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0,
       1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0,
       0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0,
       1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1,
       0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1,
       1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0,
       0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0,
       0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0,
       1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,
       1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0,
       0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1,
       1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1,
       0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0,
       0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0,
       0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0,
       1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1,
       0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1,
       0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1,
       1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1,
       1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0])

Pandas is very compatible with numpy, in fact, we can simply convert a DataFrame to a numpy array.

titanic_np_array = df_titanic.to_numpy()
print(titanic_np_array)

[[1 'Braund, Mr. Owen Harris' 'male' ... 7.25 nan 0]
 [2 'Cumings, Mrs. John Bradley (Florence Briggs Thayer)' 'female' ...
  71.2833 'C85' 1]
 [3 'Heikkinen, Miss. Laina' 'female' ... 7.925 nan 1]
 ...
 [889 'Johnston, Miss. Catherine Helen "Carrie"' 'female' ... 23.45 nan 0]
 [890 'Behr, Mr. Karl Howell' 'male' ... 30.0 'C148' 1]
 [891 'Dooley, Mr. Patrick' 'male' ... 7.75 nan 0]]

We can also get a quick statistical summary of our data. This is done via:

df_titanic.describe()

	PassengerId	Age	Fare	Survived
count	891.000000	714.000000	891.000000	891.000000
mean	446.000000	29.699118	32.204208	0.383838
std	257.353842	14.526497	49.693429	0.486592
min	1.000000	0.420000	0.000000	0.000000
25%	223.500000	20.125000	7.910400	0.000000
50%	446.000000	28.000000	14.454200	0.000000
75%	668.500000	38.000000	31.000000	1.000000
max	891.000000	80.000000	512.329200	1.000000

Note that for pandas displaying produces nicer outputs than printing

print(df_titanic.describe())

       PassengerId         Age        Fare    Survived
count   891.000000  714.000000  891.000000  891.000000
mean    446.000000   29.699118   32.204208    0.383838
std     257.353842   14.526497   49.693429    0.486592
min       1.000000    0.420000    0.000000    0.000000
25%     223.500000   20.125000    7.910400    0.000000
50%     446.000000   28.000000   14.454200    0.000000
75%     668.500000   38.000000   31.000000    1.000000
max     891.000000   80.000000  512.329200    1.000000

Displaying can be also achieved through display command as follows:

display(df_titanic.describe())

	PassengerId	Age	Fare	Survived
count	891.000000	714.000000	891.000000	891.000000
mean	446.000000	29.699118	32.204208	0.383838
std	257.353842	14.526497	49.693429	0.486592
min	1.000000	0.420000	0.000000	0.000000
25%	223.500000	20.125000	7.910400	0.000000
50%	446.000000	28.000000	14.454200	0.000000
75%	668.500000	38.000000	31.000000	1.000000
max	891.000000	80.000000	512.329200	1.000000

Note that the above stats are only for the numerical columns.

Exercise 2.1#

With the avocado data frame from Exercise 0.1:

Please display its columns.

Display the stats of this data frame:

Display all the entries of this data frame in the column Total Bags.

Convert your data frame to a numpy array and then print it.

Transposing your data#

You may have heard about the transposing operation. In matrices, a transpose swaps the rows with columns. This operation makes sense with numpy arrays and Data Frames as well.

my_matrix = np.array([[1, 2], [3, 4]])
print(f"Original matrix \n {my_matrix}")
print(f"Transposed matrix \n {my_matrix.T}")

Original matrix 
 [[1 2]
 [3 4]]
Transposed matrix 
 [[1 3]
 [2 4]]

df_titanic_t = df_titanic.T
df_titanic_t

	0	1	2	3	4	5	6	7	8	9	...	881	882	883	884	885	886	887	888	889	890
PassengerId	1	2	3	4	5	6	7	8	9	10	...	882	883	884	885	886	887	888	889	890	891
Name	Braund, Mr. Owen Harris	Cumings, Mrs. John Bradley (Florence Briggs Th...	Heikkinen, Miss. Laina	Futrelle, Mrs. Jacques Heath (Lily May Peel)	Allen, Mr. William Henry	Moran, Mr. James	McCarthy, Mr. Timothy J	Palsson, Master. Gosta Leonard	Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)	Nasser, Mrs. Nicholas (Adele Achem)	...	Markun, Mr. Johann	Dahlberg, Miss. Gerda Ulrika	Banfield, Mr. Frederick James	Sutehall, Mr. Henry Jr	Rice, Mrs. William (Margaret Norton)	Montvila, Rev. Juozas	Graham, Miss. Margaret Edith	Johnston, Miss. Catherine Helen "Carrie"	Behr, Mr. Karl Howell	Dooley, Mr. Patrick
Sex	male	female	female	female	male	male	male	male	female	female	...	male	female	male	male	female	male	female	female	male	male
Age	22.0	38.0	26.0	35.0	35.0	NaN	54.0	2.0	27.0	14.0	...	33.0	22.0	28.0	25.0	39.0	27.0	19.0	NaN	26.0	32.0
Ticket	A/5 21171	PC 17599	STON/O2. 3101282	113803	373450	330877	17463	349909	347742	237736	...	349257	7552	C.A./SOTON 34068	SOTON/OQ 392076	382652	211536	112053	W./C. 6607	111369	370376
Fare	7.25	71.2833	7.925	53.1	8.05	8.4583	51.8625	21.075	11.1333	30.0708	...	7.8958	10.5167	10.5	7.05	29.125	13.0	30.0	23.45	30.0	7.75
Cabin	NaN	C85	NaN	C123	NaN	NaN	E46	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	B42	NaN	C148	NaN
Survived	0	1	1	1	0	0	0	0	1	1	...	0	0	0	0	0	0	1	0	1	0

8 rows × 891 columns

With this particular example it is not the best thing to do. However, we have got an interesting data frame. Let’s spend some time on it.

Exercise 2.2#

Display the index of the above transposed titanic DataFrame.

Note that it starts with an index but its elements can be accessed as lists.

M24568 and M20741 Intro to CompSMAP

Part 2: viewing data

Contents

Part 2: viewing data#

Exploring datasets in more detail#

Exercise 2.1#

Transposing your data#

Exercise 2.2#