Part 7: Importing and exporting files / spreadsheets - SOLVED#

This workbook does not require you to import any new datasets. Instead, it will make use of sample data provided to use in pandas. However, as usual you will need to import numpy and pandas

# Use this cell to import numpy and pandas
import numpy as np
import pandas as pd

Importing and exporting datasets#

So far we have learnt how to import files such as Excel files. However, often the data will be available to you in a .csv format, this is a spreadsheet without fancy Excel formatting.

In this section we will learn how to import csv.

We will also learn how to export our data frames to csv and xlsx files.

Let us import csv from sample_data folder

df_california_housing = pd.read_csv("sample_data/california_housing_train.csv")
df_california_housing.head()

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Input In [2], in <module>
----> 1 df_california_housing = pd.read_csv("sample_data/california_housing_train.csv")
df_california_housing.head()

File /opt/homebrew/Caskroom/miniconda/base/envs/pycbc_test/lib/python3.9/site-packages/pandas/io/parsers/readers.py:1026, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, date_format, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options, dtype_backend)
kwds_defaults = _refine_defaults_read(
   dialect,
   delimiter,
   (...)
   dtype_backend=dtype_backend,
)
kwds.update(kwds_defaults)
-> 1026 return _read(filepath_or_buffer, kwds)

File /opt/homebrew/Caskroom/miniconda/base/envs/pycbc_test/lib/python3.9/site-packages/pandas/io/parsers/readers.py:620, in _read(filepath_or_buffer, kwds)
_validate_names(kwds.get("names", None))
# Create the parser.
--> 620 parser = TextFileReader(filepath_or_buffer, **kwds)
if chunksize or iterator:
   return parser

File /opt/homebrew/Caskroom/miniconda/base/envs/pycbc_test/lib/python3.9/site-packages/pandas/io/parsers/readers.py:1620, in TextFileReader.__init__(self, f, engine, **kwds)
   self.options["has_index_names"] = kwds["has_index_names"]
self.handles: IOHandles | None = None
-> 1620 self._engine = self._make_engine(f, self.engine)

File /opt/homebrew/Caskroom/miniconda/base/envs/pycbc_test/lib/python3.9/site-packages/pandas/io/parsers/readers.py:1880, in TextFileReader._make_engine(self, f, engine)
   if "b" not in mode:
       mode += "b"
-> 1880 self.handles = get_handle(
   f,
   mode,
   encoding=self.options.get("encoding", None),
   compression=self.options.get("compression", None),
   memory_map=self.options.get("memory_map", False),
   is_text=is_text,
   errors=self.options.get("encoding_errors", "strict"),
   storage_options=self.options.get("storage_options", None),
)
assert self.handles is not None
f = self.handles.handle

File /opt/homebrew/Caskroom/miniconda/base/envs/pycbc_test/lib/python3.9/site-packages/pandas/io/common.py:873, in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
elif isinstance(handle, str):
   # Check whether the filename is to be opened in binary mode.
   # Binary mode does not support 'encoding' and 'newline'.
   if ioargs.encoding and "b" not in ioargs.mode:
       # Encoding
--> 873         handle = open(
           handle,
           ioargs.mode,
           encoding=ioargs.encoding,
           errors=errors,
           newline="",
       )
   else:
       # Binary mode
       handle = open(handle, ioargs.mode)

FileNotFoundError: [Errno 2] No such file or directory: 'sample_data/california_housing_train.csv'

let us add a column bedroom per person

df_california_housing["bedroom per person"] = df_california_housing["total_bedrooms"] / df_california_housing["population"]

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [3], in <module>
----> 1 df_california_housing["bedroom per person"] = df_california_housing["total_bedrooms"] / df_california_housing["population"]

NameError: name 'df_california_housing' is not defined

df_california_housing.head()

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [4], in <module>
----> 1 df_california_housing.head()

NameError: name 'df_california_housing' is not defined

Let us now export this dataframe to a csv and xls spreadsheets

df_california_housing.to_csv("california_housing_new.csv", index=False)

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [5], in <module>
----> 1 df_california_housing.to_csv("california_housing_new.csv", index=False)

NameError: name 'df_california_housing' is not defined

df_california_housing.to_excel("california_housing_new.xlsx", index=False)

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [6], in <module>
----> 1 df_california_housing.to_excel("california_housing_new.xlsx", index=False)

NameError: name 'df_california_housing' is not defined

Exercise 7.1#

Import mnist_test.csv from sample data as a data frame. Call the dataframe df_mnist. We will need to import it without the header. To achieve this set header parameter to None inside read_csv function. After that display the data frame.

df_mnist = pd.read_csv("sample_data/mnist_test.csv", header=None)
df_mnist

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Input In [7], in <module>
----> 1 df_mnist = pd.read_csv("sample_data/mnist_test.csv", header=None)
df_mnist

File /opt/homebrew/Caskroom/miniconda/base/envs/pycbc_test/lib/python3.9/site-packages/pandas/io/parsers/readers.py:1026, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, date_format, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options, dtype_backend)
kwds_defaults = _refine_defaults_read(
   dialect,
   delimiter,
   (...)
   dtype_backend=dtype_backend,
)
kwds.update(kwds_defaults)
-> 1026 return _read(filepath_or_buffer, kwds)

File /opt/homebrew/Caskroom/miniconda/base/envs/pycbc_test/lib/python3.9/site-packages/pandas/io/parsers/readers.py:620, in _read(filepath_or_buffer, kwds)
_validate_names(kwds.get("names", None))
# Create the parser.
--> 620 parser = TextFileReader(filepath_or_buffer, **kwds)
if chunksize or iterator:
   return parser

File /opt/homebrew/Caskroom/miniconda/base/envs/pycbc_test/lib/python3.9/site-packages/pandas/io/parsers/readers.py:1620, in TextFileReader.__init__(self, f, engine, **kwds)
   self.options["has_index_names"] = kwds["has_index_names"]
self.handles: IOHandles | None = None
-> 1620 self._engine = self._make_engine(f, self.engine)

File /opt/homebrew/Caskroom/miniconda/base/envs/pycbc_test/lib/python3.9/site-packages/pandas/io/parsers/readers.py:1880, in TextFileReader._make_engine(self, f, engine)
   if "b" not in mode:
       mode += "b"
-> 1880 self.handles = get_handle(
   f,
   mode,
   encoding=self.options.get("encoding", None),
   compression=self.options.get("compression", None),
   memory_map=self.options.get("memory_map", False),
   is_text=is_text,
   errors=self.options.get("encoding_errors", "strict"),
   storage_options=self.options.get("storage_options", None),
)
assert self.handles is not None
f = self.handles.handle

File /opt/homebrew/Caskroom/miniconda/base/envs/pycbc_test/lib/python3.9/site-packages/pandas/io/common.py:873, in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
elif isinstance(handle, str):
   # Check whether the filename is to be opened in binary mode.
   # Binary mode does not support 'encoding' and 'newline'.
   if ioargs.encoding and "b" not in ioargs.mode:
       # Encoding
--> 873         handle = open(
           handle,
           ioargs.mode,
           encoding=ioargs.encoding,
           errors=errors,
           newline="",
       )
   else:
       # Binary mode
       handle = open(handle, ioargs.mode)

FileNotFoundError: [Errno 2] No such file or directory: 'sample_data/mnist_test.csv'

Each row consists of the label of the handwritten digit, e.g., first row represents 7, and 784 pixels, value 255 represents white and 0 represents black. Originally these were 28 x 28 pixels images representing handwritten digits. We will reshape and plot one row.

import matplotlib.pyplot as plt

seven = df_mnist.iloc[0,1:].values.reshape(28, 28)
plt.imshow(seven)

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [9], in <module>
----> 1 seven = df_mnist.iloc[0,1:].values.reshape(28, 28)
      2 plt.imshow(seven)

NameError: name 'df_mnist' is not defined

Your task is to plot the first ten rows as “digit images”. Please use the for loop. Don’t forget to use plt.show() after plt.imshow, otherwise you will see only one picture.

for k in range(0,10):
    number = df_mnist.iloc[k,1:].values.reshape(28, 28)
    plt.imshow(number)
    plt.show()

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [10], in <module>
      1 for k in range(0,10):
----> 2     number = df_mnist.iloc[k,1:].values.reshape(28, 28)
      3     plt.imshow(number)
      4     plt.show()

NameError: name 'df_mnist' is not defined

M24568 and M20741 Intro to CompSMAP

Part 7: Importing and exporting files / spreadsheets - SOLVED

Contents

Part 7: Importing and exporting files / spreadsheets - SOLVED#

Importing and exporting datasets#

Exercise 7.1#