Introduction - SOLVED#

a panda

In this lecture, we will introduce you to the data analysis library for Python pandas. pandas is used extensively analysing and manipulating data, designed to be open source, powerful, flexible and easy to use. It is used widely in a variety of academic and commercial settings.

OUTLINE:

  • DataFrames - dictionaries on steroids

  • Viewing data

  • Selection - Selecting particular elements of a Data Frame

  • More operations on DataFrame’s rows/columns/entries

  • Grouping

  • Visualisation of DataFrames

  • Importing and Exporting files/spreadsheets.

  • Summary Exercises

Importing Pandas#

Like numpy, pandas is a libary and will therefore need to be imported before we can use its commands. We’ll also be making use of the numpy library, so let’s import this at the same time as pandas. We do this by running the following cell:

import pandas as pd # we will be using pandas via the abbreviation pd
import numpy as np

## Important: please read below

Before starting this week’s material please download the following files from Moodle:

  • titanic.xlsx

  • avocado.xlsx

  • C19_portsmouth.xlsx

Upload the titanic and avocado data files to this notebook (ask how if you’ve forgotten).

Each individual notebook will start with a section telling you which data files are required for the exercises contained in that notebook.

Spreadsheets in Pandas - No more Excel!#

In excel or google sheets you deal with spreadsheets. Here is an example of a Titanic spreadsheet containing the basic information about the passengers including if they survived the incident (0 if did not survived, 1 if survived).

titanicspreadsheet

The same spreadsheet can be opened by pandas and viewed as so-called DataFrame. We have attached this spreadsheet to this notebook; it is contained in the titanic.xlsx file.

To import it as pandas Data Frame we use:

df_titanic = pd.read_excel("titanic.xlsx")

First let’s explain the meaning of each column:#

  • PassengerId, an id was assigned to every passenger,

  • Name contains the full name of the passenger,

  • Sex is the stated gender of a passenger.

  • Age contains the age of the passenger during the cruise.

  • Ticket contains ticket number

  • Fare contains passenger fare.

  • Cabin contains the cabin number for each passenger.

  • Survived contains information if the passenger survived the sinking of titanic (0 = No; 1 = Yes)

To display the header of this file we use:

df_titanic.head() #run this cell
PassengerId Name Sex Age Ticket Fare Cabin Survived
0 1 Braund, Mr. Owen Harris male 22.0 A/5 21171 7.2500 NaN 0
1 2 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 PC 17599 71.2833 C85 1
2 3 Heikkinen, Miss. Laina female 26.0 STON/O2. 3101282 7.9250 NaN 1
3 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 113803 53.1000 C123 1
4 5 Allen, Mr. William Henry male 35.0 373450 8.0500 NaN 0

You can also display the tail (end rows) of this spreadsheet:

df_titanic.tail()
PassengerId Name Sex Age Ticket Fare Cabin Survived
886 887 Montvila, Rev. Juozas male 27.0 211536 13.00 NaN 0
887 888 Graham, Miss. Margaret Edith female 19.0 112053 30.00 B42 1
888 889 Johnston, Miss. Catherine Helen "Carrie" female NaN W./C. 6607 23.45 NaN 0
889 890 Behr, Mr. Karl Howell male 26.0 111369 30.00 C148 1
890 891 Dooley, Mr. Patrick male 32.0 370376 7.75 NaN 0

Both head and tail function take an integer input, inputting an integer \(n\) allows us to display \(n\) entries

n = 8
df_titanic.head(n)
PassengerId Name Sex Age Ticket Fare Cabin Survived
0 1 Braund, Mr. Owen Harris male 22.0 A/5 21171 7.2500 NaN 0
1 2 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 PC 17599 71.2833 C85 1
2 3 Heikkinen, Miss. Laina female 26.0 STON/O2. 3101282 7.9250 NaN 1
3 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 113803 53.1000 C123 1
4 5 Allen, Mr. William Henry male 35.0 373450 8.0500 NaN 0
5 6 Moran, Mr. James male NaN 330877 8.4583 NaN 0
6 7 McCarthy, Mr. Timothy J male 54.0 17463 51.8625 E46 0
7 8 Palsson, Master. Gosta Leonard male 2.0 349909 21.0750 NaN 0
df_titanic.tail(n)
PassengerId Name Sex Age Ticket Fare Cabin Survived
883 884 Banfield, Mr. Frederick James male 28.0 C.A./SOTON 34068 10.500 NaN 0
884 885 Sutehall, Mr. Henry Jr male 25.0 SOTON/OQ 392076 7.050 NaN 0
885 886 Rice, Mrs. William (Margaret Norton) female 39.0 382652 29.125 NaN 0
886 887 Montvila, Rev. Juozas male 27.0 211536 13.000 NaN 0
887 888 Graham, Miss. Margaret Edith female 19.0 112053 30.000 B42 1
888 889 Johnston, Miss. Catherine Helen "Carrie" female NaN W./C. 6607 23.450 NaN 0
889 890 Behr, Mr. Karl Howell male 26.0 111369 30.000 C148 1
890 891 Dooley, Mr. Patrick male 32.0 370376 7.750 NaN 0

Exercise 0.1#

Import the file avocado.xlsx as pandas Data Frame.

df_avocado = pd.read_excel("avocado.xlsx")

Display its header and its tail:

df_avocado.head()
Date AveragePrice Total Volume 4046 4225 4770 Total Bags Small Bags Large Bags XLarge Bags type year region
0 2015-12-27 1.33 64236.62 1036.74 54454.85 48.16 8696.87 8603.62 93.25 0.0 conventional 2015 Albany
1 2015-12-20 1.35 54876.98 674.28 44638.81 58.33 9505.56 9408.07 97.49 0.0 conventional 2015 Albany
2 2015-12-13 0.93 118220.22 794.70 109149.67 130.50 8145.35 8042.21 103.14 0.0 conventional 2015 Albany
3 2015-12-06 1.08 78992.15 1132.00 71976.41 72.58 5811.16 5677.40 133.76 0.0 conventional 2015 Albany
4 2015-11-29 1.28 51039.60 941.48 43838.39 75.78 6183.95 5986.26 197.69 0.0 conventional 2015 Albany
df_avocado.tail()
Date AveragePrice Total Volume 4046 4225 4770 Total Bags Small Bags Large Bags XLarge Bags type year region
18244 2018-02-04 1.63 17074.83 2046.96 1529.20 0.00 13498.67 13066.82 431.85 0.0 organic 2018 WestTexNewMexico
18245 2018-01-28 1.71 13888.04 1191.70 3431.50 0.00 9264.84 8940.04 324.80 0.0 organic 2018 WestTexNewMexico
18246 2018-01-21 1.87 13766.76 1191.92 2452.79 727.94 9394.11 9351.80 42.31 0.0 organic 2018 WestTexNewMexico
18247 2018-01-14 1.93 16205.22 1527.63 2981.04 727.01 10969.54 10919.54 50.00 0.0 organic 2018 WestTexNewMexico
18248 2018-01-07 1.62 17489.58 2894.77 2356.13 224.53 12014.15 11988.14 26.01 0.0 organic 2018 WestTexNewMexico

Display first 20 rows of the header:

n = 20
df_avocado.head(n)
Date AveragePrice Total Volume 4046 4225 4770 Total Bags Small Bags Large Bags XLarge Bags type year region
0 2015-12-27 1.33 64236.62 1036.74 54454.85 48.16 8696.87 8603.62 93.25 0.0 conventional 2015 Albany
1 2015-12-20 1.35 54876.98 674.28 44638.81 58.33 9505.56 9408.07 97.49 0.0 conventional 2015 Albany
2 2015-12-13 0.93 118220.22 794.70 109149.67 130.50 8145.35 8042.21 103.14 0.0 conventional 2015 Albany
3 2015-12-06 1.08 78992.15 1132.00 71976.41 72.58 5811.16 5677.40 133.76 0.0 conventional 2015 Albany
4 2015-11-29 1.28 51039.60 941.48 43838.39 75.78 6183.95 5986.26 197.69 0.0 conventional 2015 Albany
5 2015-11-22 1.26 55979.78 1184.27 48067.99 43.61 6683.91 6556.47 127.44 0.0 conventional 2015 Albany
6 2015-11-15 0.99 83453.76 1368.92 73672.72 93.26 8318.86 8196.81 122.05 0.0 conventional 2015 Albany
7 2015-11-08 0.98 109428.33 703.75 101815.36 80.00 6829.22 6266.85 562.37 0.0 conventional 2015 Albany
8 2015-11-01 1.02 99811.42 1022.15 87315.57 85.34 11388.36 11104.53 283.83 0.0 conventional 2015 Albany
9 2015-10-25 1.07 74338.76 842.40 64757.44 113.00 8625.92 8061.47 564.45 0.0 conventional 2015 Albany
10 2015-10-18 1.12 84843.44 924.86 75595.85 117.07 8205.66 7877.86 327.80 0.0 conventional 2015 Albany
11 2015-10-11 1.28 64489.17 1582.03 52677.92 105.32 10123.90 9866.27 257.63 0.0 conventional 2015 Albany
12 2015-10-04 1.31 61007.10 2268.32 49880.67 101.36 8756.75 8379.98 376.77 0.0 conventional 2015 Albany
13 2015-09-27 0.99 106803.39 1204.88 99409.21 154.84 6034.46 5888.87 145.59 0.0 conventional 2015 Albany
14 2015-09-20 1.33 69759.01 1028.03 59313.12 150.50 9267.36 8489.10 778.26 0.0 conventional 2015 Albany
15 2015-09-13 1.28 76111.27 985.73 65696.86 142.00 9286.68 8665.19 621.49 0.0 conventional 2015 Albany
16 2015-09-06 1.11 99172.96 879.45 90062.62 240.79 7990.10 7762.87 227.23 0.0 conventional 2015 Albany
17 2015-08-30 1.07 105693.84 689.01 94362.67 335.43 10306.73 10218.93 87.80 0.0 conventional 2015 Albany
18 2015-08-23 1.34 79992.09 733.16 67933.79 444.78 10880.36 10745.79 134.57 0.0 conventional 2015 Albany
19 2015-08-16 1.33 80043.78 539.65 68666.01 394.90 10443.22 10297.68 145.54 0.0 conventional 2015 Albany