# Part 2: viewing data

This workbook requires you to load the ``titanic`` and ``avocado`` datasets. You will also need to run the following block of code to import ``pandas`` and ``numpy``:

In [1]:
import pandas as pd
import numpy as np

Load the ``titanic`` and ``avocado`` datasets as ``pandas`` DataFrames in the code block below:

In [2]:
# Load the Titanic and avocado data sets
df_titanic = pd.read_excel("titanic.xlsx")
df_avocado = pd.read_excel("avocado.xlsx")

## Exploring datasets in more detail

Let's come back to our ``titanic`` example. We can access the index of the DataFrame as follows:


In [3]:
df_titanic.index

RangeIndex(start=0, stop=891, step=1)

By default it is a pandas ``RangeIndex``. It works similarly to Python's ``range``: it starts at 0, the step is 1, and the last entry is ``stop - 1`` (here, 890).
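To make the comparison with ``range`` concrete, here is a minimal sketch. It uses a tiny stand-in DataFrame (``df_small``, a hypothetical name with made-up values) rather than the Titanic file, so it can be run on its own:

```python
import pandas as pd

# Tiny stand-in DataFrame (hypothetical data, not the Titanic file)
df_small = pd.DataFrame({"Survived": [0, 1, 1, 0]})

idx = df_small.index
print(idx)                           # the default index is a RangeIndex
print(idx.start, idx.stop, idx.step) # the same three numbers that define a range
print(list(idx))                     # the labels it generates

# The labels match what range(start, stop, step) would produce
matches_range = list(idx) == list(range(idx.start, idx.stop, idx.step))
print(matches_range)
```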

We may use different types of indexing, but for now we are going to use the default one.

How about displaying all the column names of our DataFrame?
We do it as follows:

In [4]:
df_titanic.columns

Index(['PassengerId', 'Name', 'Sex', 'Age', 'Ticket', 'Fare', 'Cabin',
       'Survived'],
      dtype='object')

This ``Index`` object can be treated much like a list or numpy array: we can access its elements by position.

In [5]:
df_titanic.columns[0], df_titanic.columns[-1]

('PassengerId', 'Survived')
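A few more list-like operations on the column ``Index`` are sketched below, again on a small made-up DataFrame (``df_small`` is a hypothetical name). One difference from a list worth knowing: an ``Index`` is immutable, so assigning to one of its positions raises a ``TypeError``.

```python
import pandas as pd

# Small stand-in DataFrame (hypothetical data)
df_small = pd.DataFrame(
    {"PassengerId": [1, 2], "Name": ["A", "B"], "Survived": [0, 1]}
)
cols = df_small.columns

first_two = cols[:2]       # slicing returns another Index
as_list = cols.tolist()    # convert to a plain Python list
has_name = "Name" in cols  # membership test, as with a list
n_cols = len(cols)         # number of columns

# Unlike a list, an Index cannot be modified in place
try:
    cols[0] = "Id"
    index_is_mutable = True
except TypeError:
    index_is_mutable = False

print(as_list, list(first_two), has_name, n_cols, index_is_mutable)
```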

We already know how to view values of a particular column, e.g.

``df_titanic["Survived"]``

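If the column name is a valid Python identifier (for example, it contains no spaces) and does not clash with an existing DataFrame attribute or method, we can also view the values using attribute access: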

In [6]:
df_titanic.Survived

Unnamed: 0,Survived
0,0
1,1
2,1
3,1
4,0
...,...
886,0
887,1
888,0
889,1
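The sketch below illustrates when attribute access works and when bracket access is needed. The DataFrame and its column names are made up for illustration; the column called ``count`` is chosen deliberately because ``DataFrame.count`` is an existing method.

```python
import pandas as pd

# Stand-in DataFrame (hypothetical data and column names)
df_small = pd.DataFrame(
    {"Survived": [0, 1], "Total Bags": [10.5, 20.0], "count": [3, 4]}
)

# Both access styles return the same Series for a simple column name
same = df_small.Survived.equals(df_small["Survived"])

# A name containing a space can only be reached with brackets
bags = df_small["Total Bags"]

# "count" clashes with the DataFrame.count method, so attribute access
# returns the method, not the column
count_attr = df_small.count
count_col = df_small["count"]

print(same, callable(count_attr), list(count_col))
```

Bracket access always works, which is why it is the safer habit in scripts.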


If we wish to get a numpy array from a ``pd.Series``, we use ``your_pd_series.values``:

In [7]:
df_titanic.Survived.values

array([0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1,
       1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1,
       1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1,
       1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0,
       1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0,
       0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0,
       1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1,
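As a side note, the pandas documentation recommends the ``to_numpy()`` method over ``.values`` for this conversion. The short sketch below, on a made-up Series, shows that both give the same array here:

```python
import numpy as np
import pandas as pd

# Stand-in Series (hypothetical data)
survived = pd.Series([0, 1, 1, 0], name="Survived")

arr_values = survived.values     # attribute used in the cell above
arr_numpy = survived.to_numpy()  # method form recommended by the pandas docs

print(type(arr_numpy))
print(np.array_equal(arr_values, arr_numpy))

# Once we have a numpy array, numpy methods are available directly
survival_rate = arr_numpy.mean()
print(survival_rate)
```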

Pandas works well with numpy; in fact, we can convert a whole DataFrame to a numpy array directly.

In [8]:
titanic_np_array = df_titanic.to_numpy()
print(titanic_np_array)

[[1 'Braund, Mr. Owen Harris' 'male' ... 7.25 nan 0]
 [2 'Cumings, Mrs. John Bradley (Florence Briggs Thayer)' 'female' ...
  71.2833 'C85' 1]
 [3 'Heikkinen, Miss. Laina' 'female' ... 7.925 nan 1]
 ...
 [889 'Johnston, Miss. Catherine Helen "Carrie"' 'female' ... 23.45 nan 0]
 [890 'Behr, Mr. Karl Howell' 'male' ... 30.0 'C148' 1]
 [891 'Dooley, Mr. Patrick' 'male' ... 7.75 nan 0]]
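A numpy array has a single dtype, so a DataFrame that mixes numbers and text is converted to an array of generic Python objects. The sketch below, using a small made-up DataFrame, contrasts this with converting only the numeric columns:

```python
import pandas as pd

# Stand-in DataFrame mixing integers, text and floats (hypothetical data)
df_mixed = pd.DataFrame(
    {"PassengerId": [1, 2], "Name": ["A", "B"], "Fare": [7.25, 71.2833]}
)

mixed_array = df_mixed.to_numpy()                            # all columns
numeric_array = df_mixed[["PassengerId", "Fare"]].to_numpy() # numeric columns only

print(mixed_array.dtype)    # generic object dtype
print(numeric_array.dtype)  # a common numeric dtype
print(mixed_array.shape, numeric_array.shape)
```

Arithmetic on an ``object`` array is much slower than on a numeric one, so it is usually worth selecting the numeric columns before converting.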


We can also get a quick statistical summary of our data. This is done via:

In [9]:
df_titanic.describe()

Unnamed: 0,PassengerId,Age,Fare,Survived
count,891.0,714.0,891.0,891.0
mean,446.0,29.699118,32.204208,0.383838
std,257.353842,14.526497,49.693429,0.486592
min,1.0,0.42,0.0,0.0
25%,223.5,20.125,7.9104,0.0
50%,446.0,28.0,14.4542,0.0
75%,668.5,38.0,31.0,1.0
max,891.0,80.0,512.3292,1.0


Note that in a notebook, displaying a pandas object (leaving it as the last expression in a cell) produces nicer output than printing it:

In [10]:
print(df_titanic.describe())

       PassengerId         Age        Fare    Survived
count   891.000000  714.000000  891.000000  891.000000
mean    446.000000   29.699118   32.204208    0.383838
std     257.353842   14.526497   49.693429    0.486592
min       1.000000    0.420000    0.000000    0.000000
25%     223.500000   20.125000    7.910400    0.000000
50%     446.000000   28.000000   14.454200    0.000000
75%     668.500000   38.000000   31.000000    1.000000
max     891.000000   80.000000  512.329200    1.000000


Displaying can also be achieved with the ``display`` function, which is available by default in Jupyter notebooks:

In [11]:
display(df_titanic.describe())

Unnamed: 0,PassengerId,Age,Fare,Survived
count,891.0,714.0,891.0,891.0
mean,446.0,29.699118,32.204208,0.383838
std,257.353842,14.526497,49.693429,0.486592
min,1.0,0.42,0.0,0.0
25%,223.5,20.125,7.9104,0.0
50%,446.0,28.0,14.4542,0.0
75%,668.5,38.0,31.0,1.0
max,891.0,80.0,512.3292,1.0


Note that, by default, the above statistics cover only the numerical columns.
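If a summary of the non-numerical columns is also wanted, ``describe`` accepts an ``include`` argument. The following is a minimal sketch on a made-up DataFrame, not on the Titanic data:

```python
import pandas as pd

# Stand-in DataFrame with one text and one numeric column (hypothetical data)
df_small = pd.DataFrame(
    {"Sex": ["male", "female", "male"], "Age": [22.0, 38.0, 26.0]}
)

numeric_summary = df_small.describe()           # default: numeric columns only
full_summary = df_small.describe(include="all") # every column

print(list(numeric_summary.columns))
print(list(full_summary.columns))

# For text columns the summary reports count, unique, top and freq
print(full_summary.loc["unique", "Sex"], full_summary.loc["top", "Sex"])
```

Statistics that do not apply to a column (for example, the mean of ``Sex``) appear as ``NaN`` in the combined summary.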

### Exercise 2.1

With the avocado data frame from Exercise 0.1:

Please display its columns.

In [12]:
df_avocado.columns

Index([        'Date', 'AveragePrice', 'Total Volume',           4046,
                 4225,           4770,   'Total Bags',   'Small Bags',
         'Large Bags',  'XLarge Bags',         'type',         'year',
             'region'],
      dtype='object')
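In the output above, ``4046``, ``4225`` and ``4770`` appear without quotes, which indicates that these column labels are integers rather than strings. The sketch below, on a made-up DataFrame with one such label, shows how this affects column access; renaming with ``str`` is one possible way to make all labels strings, not something the exercise requires.

```python
import pandas as pd

# Stand-in DataFrame with an integer column label (hypothetical values)
df_small = pd.DataFrame({"AveragePrice": [1.33, 1.35], 4046: [10.0, 20.0]})

by_int = df_small[4046]  # works: the label is the integer 4046

# The string "4046" is a different label, so this lookup fails
try:
    df_small["4046"]
    string_lookup_works = True
except KeyError:
    string_lookup_works = False

# Optionally convert every label to a string
df_renamed = df_small.rename(columns=str)

print(list(by_int), string_lookup_works, list(df_renamed.columns))
```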

Display the stats of this data frame:

In [13]:
df_avocado.describe()

Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,year
count,18249,18249.0,18249.0,18249.0,18249.0,18249.0,18249.0,18249.0,18249.0,18249.0,18249.0
mean,2016-08-13 23:30:43.498273792,1.405978,850644.0,293008.4,295154.6,22839.74,239639.2,182194.7,54338.09,3106.426507,2016.147899
min,2015-01-04 00:00:00,0.44,84.56,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2015.0
25%,2015-10-25 00:00:00,1.1,10838.58,854.07,3008.78,0.0,5088.64,2849.42,127.47,0.0,2015.0
50%,2016-08-14 00:00:00,1.37,107376.8,8645.3,29061.02,184.99,39743.83,26362.82,2647.71,0.0,2016.0
75%,2017-06-04 00:00:00,1.66,432962.3,111020.2,150206.9,6243.42,110783.4,83337.67,22029.25,132.5,2017.0
max,2018-03-25 00:00:00,3.25,62505650.0,22743620.0,20470570.0,2546439.0,19373130.0,13384590.0,5719097.0,551693.65,2018.0
std,,0.402677,3453545.0,1264989.0,1204120.0,107464.1,986242.4,746178.5,243966.0,17692.894652,0.939938


Display all the entries of this data frame in the column ``Total Bags``.

In [14]:
df_avocado["Total Bags"]

Unnamed: 0,Total Bags
0,8696.87
1,9505.56
2,8145.35
3,5811.16
4,6183.95
...,...
18244,13498.67
18245,9264.84
18246,9394.11
18247,10969.54


Convert your data frame to a numpy array and then print it.

In [15]:
avocado_np_array = df_avocado.to_numpy()
print(avocado_np_array)

[[Timestamp('2015-12-27 00:00:00') 1.33 64236.62 ... 'conventional' 2015
  'Albany']
 [Timestamp('2015-12-20 00:00:00') 1.35 54876.98 ... 'conventional' 2015
  'Albany']
 [Timestamp('2015-12-13 00:00:00') 0.93 118220.22 ... 'conventional' 2015
  'Albany']
 ...
 [Timestamp('2018-01-21 00:00:00') 1.87 13766.76 ... 'organic' 2018
  'WestTexNewMexico']
 [Timestamp('2018-01-14 00:00:00') 1.93 16205.22 ... 'organic' 2018
  'WestTexNewMexico']
 [Timestamp('2018-01-07 00:00:00') 1.62 17489.58 ... 'organic' 2018
  'WestTexNewMexico']]


### Transposing your data

You may have heard of the transpose operation. For a matrix, transposing swaps the rows and columns. This operation also makes sense for numpy arrays and DataFrames.

In [16]:
my_matrix = np.array([[1, 2], [3, 4]])
print(f"Original matrix \n {my_matrix}")
print(f"Transposed matrix \n {my_matrix.T}")

Original matrix 
 [[1 2]
 [3 4]]
Transposed matrix 
 [[1 3]
 [2 4]]
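With a square matrix the change of shape is not visible, so here is a small additional sketch with a non-square array (``wide`` is a hypothetical name):

```python
import numpy as np

# A 2 x 3 array: two rows, three columns
wide = np.array([[1, 2, 3], [4, 5, 6]])
tall = wide.T

print(wide.shape)  # rows, columns of the original
print(tall.shape)  # swapped after transposing

# Element (i, j) of the original becomes element (j, i) of the transpose
swapped_ok = tall[2, 0] == wide[0, 2]
print(swapped_ok)
```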


In [17]:
df_titanic_t = df_titanic.T
df_titanic_t

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,881,882,883,884,885,886,887,888,889,890
PassengerId,1,2,3,4,5,6,7,8,9,10,...,882,883,884,885,886,887,888,889,890,891
Name,"Braund, Mr. Owen Harris","Cumings, Mrs. John Bradley (Florence Briggs Th...","Heikkinen, Miss. Laina","Futrelle, Mrs. Jacques Heath (Lily May Peel)","Allen, Mr. William Henry","Moran, Mr. James","McCarthy, Mr. Timothy J","Palsson, Master. Gosta Leonard","Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)","Nasser, Mrs. Nicholas (Adele Achem)",...,"Markun, Mr. Johann","Dahlberg, Miss. Gerda Ulrika","Banfield, Mr. Frederick James","Sutehall, Mr. Henry Jr","Rice, Mrs. William (Margaret Norton)","Montvila, Rev. Juozas","Graham, Miss. Margaret Edith","Johnston, Miss. Catherine Helen ""Carrie""","Behr, Mr. Karl Howell","Dooley, Mr. Patrick"
Sex,male,female,female,female,male,male,male,male,female,female,...,male,female,male,male,female,male,female,female,male,male
Age,22.0,38.0,26.0,35.0,35.0,,54.0,2.0,27.0,14.0,...,33.0,22.0,28.0,25.0,39.0,27.0,19.0,,26.0,32.0
Ticket,A/5 21171,PC 17599,STON/O2. 3101282,113803,373450,330877,17463,349909,347742,237736,...,349257,7552,C.A./SOTON 34068,SOTON/OQ 392076,382652,211536,112053,W./C. 6607,111369,370376
Fare,7.25,71.2833,7.925,53.1,8.05,8.4583,51.8625,21.075,11.1333,30.0708,...,7.8958,10.5167,10.5,7.05,29.125,13.0,30.0,23.45,30.0,7.75
Cabin,,C85,,C123,,,E46,,,,...,,,,,,,B42,,C148,
Survived,0,1,1,1,0,0,0,0,1,1,...,0,0,0,0,0,0,1,0,1,0


Transposing is not particularly useful for this dataset. However, the result is an interesting DataFrame, so let's spend some time on it.
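One reason transposing suits this kind of data poorly can be sketched as follows. After transposing a DataFrame with mixed column types, each new column holds values of different types, so pandas falls back to the generic ``object`` dtype. The example uses a small made-up DataFrame:

```python
import pandas as pd

# Stand-in DataFrame with a text and a numeric column (hypothetical data)
df_small = pd.DataFrame(
    {"Name": ["A", "B", "C"], "Age": [22.0, 38.0, 26.0]}
)
df_t = df_small.T

print(df_small.shape, df_t.shape)  # rows and columns are swapped
print(list(df_t.index))            # old column names become the index
print(list(df_t.columns))          # old index labels become the columns
print(df_t.dtypes)                 # every transposed column mixes text and numbers
```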

### Exercise 2.2

Display the index of the above transposed titanic DataFrame.

In [19]:
df_titanic_t.index

Index(['PassengerId', 'Name', 'Sex', 'Age', 'Ticket', 'Fare', 'Cabin',
       'Survived'],
      dtype='object')

Note that the result is an ``Index`` object, but its elements can be accessed by position just like those of a list.