Part 7: Importing and exporting files / spreadsheets#

This workbook does not require you to import any new datasets. Instead, it will make use of sample data provided to use in pandas. However, as usual you will need to import numpy and pandas

# Use this cell to import numpy and pandas
import numpy as np
import pandas as pd

Importing and exporting datasets#

So far we have learnt how to import files such as Excel files. However, often the data will be available to you in a .csv format, this is a spreadsheet without fancy Excel formatting.

In this section we will learn how to import csv.

We will also learn how to export our data frames to csv and xlsx files.

Let us import csv from sample_data folder

df_california_housing = pd.read_csv("sample_data/california_housing_train.csv")
df_california_housing.head()
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Input In [2], in <module>
----> 1 df_california_housing = pd.read_csv("sample_data/california_housing_train.csv")
      2 df_california_housing.head()

File /opt/homebrew/Caskroom/miniconda/base/envs/pycbc_test/lib/python3.9/site-packages/pandas/io/parsers/readers.py:1026, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, date_format, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options, dtype_backend)
   1013 kwds_defaults = _refine_defaults_read(
   1014     dialect,
   1015     delimiter,
   (...)
   1022     dtype_backend=dtype_backend,
   1023 )
   1024 kwds.update(kwds_defaults)
-> 1026 return _read(filepath_or_buffer, kwds)

File /opt/homebrew/Caskroom/miniconda/base/envs/pycbc_test/lib/python3.9/site-packages/pandas/io/parsers/readers.py:620, in _read(filepath_or_buffer, kwds)
    617 _validate_names(kwds.get("names", None))
    619 # Create the parser.
--> 620 parser = TextFileReader(filepath_or_buffer, **kwds)
    622 if chunksize or iterator:
    623     return parser

File /opt/homebrew/Caskroom/miniconda/base/envs/pycbc_test/lib/python3.9/site-packages/pandas/io/parsers/readers.py:1620, in TextFileReader.__init__(self, f, engine, **kwds)
   1617     self.options["has_index_names"] = kwds["has_index_names"]
   1619 self.handles: IOHandles | None = None
-> 1620 self._engine = self._make_engine(f, self.engine)

File /opt/homebrew/Caskroom/miniconda/base/envs/pycbc_test/lib/python3.9/site-packages/pandas/io/parsers/readers.py:1880, in TextFileReader._make_engine(self, f, engine)
   1878     if "b" not in mode:
   1879         mode += "b"
-> 1880 self.handles = get_handle(
   1881     f,
   1882     mode,
   1883     encoding=self.options.get("encoding", None),
   1884     compression=self.options.get("compression", None),
   1885     memory_map=self.options.get("memory_map", False),
   1886     is_text=is_text,
   1887     errors=self.options.get("encoding_errors", "strict"),
   1888     storage_options=self.options.get("storage_options", None),
   1889 )
   1890 assert self.handles is not None
   1891 f = self.handles.handle

File /opt/homebrew/Caskroom/miniconda/base/envs/pycbc_test/lib/python3.9/site-packages/pandas/io/common.py:873, in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
    868 elif isinstance(handle, str):
    869     # Check whether the filename is to be opened in binary mode.
    870     # Binary mode does not support 'encoding' and 'newline'.
    871     if ioargs.encoding and "b" not in ioargs.mode:
    872         # Encoding
--> 873         handle = open(
    874             handle,
    875             ioargs.mode,
    876             encoding=ioargs.encoding,
    877             errors=errors,
    878             newline="",
    879         )
    880     else:
    881         # Binary mode
    882         handle = open(handle, ioargs.mode)

FileNotFoundError: [Errno 2] No such file or directory: 'sample_data/california_housing_train.csv'

let us add a column bedroom per person

df_california_housing["bedroom per person"] = df_california_housing["total_bedrooms"] / df_california_housing["population"]
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [3], in <module>
----> 1 df_california_housing["bedroom per person"] = df_california_housing["total_bedrooms"] / df_california_housing["population"]

NameError: name 'df_california_housing' is not defined
df_california_housing.head()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [4], in <module>
----> 1 df_california_housing.head()

NameError: name 'df_california_housing' is not defined

Let us now export this dataframe to a csv and xls spreadsheets

df_california_housing.to_csv("california_housing_new.csv", index=False)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [5], in <module>
----> 1 df_california_housing.to_csv("california_housing_new.csv", index=False)

NameError: name 'df_california_housing' is not defined
df_california_housing.to_excel("california_housing_new.xlsx", index=False)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [6], in <module>
----> 1 df_california_housing.to_excel("california_housing_new.xlsx", index=False)

NameError: name 'df_california_housing' is not defined

Exercise 7.1#

Import mnist_test.csv from sample data as a data frame. Call the dataframe df_mnist. We will need to import it without the header. To achieve this set header parameter to None inside read_csv function. After that display the data frame.

Each row consists of the label of the handwritten digit, e.g., first row represents 7, and 784 pixels, value 255 represents white and 0 represents black. Originally these were 28 x 28 pixels images representing handwritten digits. We will reshape and plot one row.

import matplotlib.pyplot as plt
seven = df_mnist.iloc[0,1:].values.reshape(28, 28)
plt.imshow(seven)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [8], in <module>
----> 1 seven = df_mnist.iloc[0,1:].values.reshape(28, 28)
      2 plt.imshow(seven)

NameError: name 'df_mnist' is not defined

Your task is to plot the first ten rows as “digit images”. Please use the for loop. Don’t forget to use plt.show() after plt.imshow, otherwise you will see only one picture.