Part 7: Importing and exporting files / spreadsheets - SOLVED
Contents
Part 7: Importing and exporting files / spreadsheets - SOLVED#
This workbook does not require you to import any new datasets. Instead, it will make use of sample data provided to use in pandas. However, as usual you will need to import numpy and pandas
# Use this cell to import numpy and pandas
import numpy as np
import pandas as pd
Importing and exporting datasets#
So far we have learnt how to import files such as Excel files. However, often the data will be available to you in a .csv format, this is a spreadsheet without fancy Excel formatting.
In this section we will learn how to import csv.
We will also learn how to export our data frames to csv and xlsx files.
Let us import csv from sample_data folder
df_california_housing = pd.read_csv("sample_data/california_housing_train.csv")
df_california_housing.head()
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
Input In [2], in <module>
----> 1 df_california_housing = pd.read_csv("sample_data/california_housing_train.csv")
2 df_california_housing.head()
File /opt/homebrew/Caskroom/miniconda/base/envs/pycbc_test/lib/python3.9/site-packages/pandas/io/parsers/readers.py:1026, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, date_format, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options, dtype_backend)
1013 kwds_defaults = _refine_defaults_read(
1014 dialect,
1015 delimiter,
(...)
1022 dtype_backend=dtype_backend,
1023 )
1024 kwds.update(kwds_defaults)
-> 1026 return _read(filepath_or_buffer, kwds)
File /opt/homebrew/Caskroom/miniconda/base/envs/pycbc_test/lib/python3.9/site-packages/pandas/io/parsers/readers.py:620, in _read(filepath_or_buffer, kwds)
617 _validate_names(kwds.get("names", None))
619 # Create the parser.
--> 620 parser = TextFileReader(filepath_or_buffer, **kwds)
622 if chunksize or iterator:
623 return parser
File /opt/homebrew/Caskroom/miniconda/base/envs/pycbc_test/lib/python3.9/site-packages/pandas/io/parsers/readers.py:1620, in TextFileReader.__init__(self, f, engine, **kwds)
1617 self.options["has_index_names"] = kwds["has_index_names"]
1619 self.handles: IOHandles | None = None
-> 1620 self._engine = self._make_engine(f, self.engine)
File /opt/homebrew/Caskroom/miniconda/base/envs/pycbc_test/lib/python3.9/site-packages/pandas/io/parsers/readers.py:1880, in TextFileReader._make_engine(self, f, engine)
1878 if "b" not in mode:
1879 mode += "b"
-> 1880 self.handles = get_handle(
1881 f,
1882 mode,
1883 encoding=self.options.get("encoding", None),
1884 compression=self.options.get("compression", None),
1885 memory_map=self.options.get("memory_map", False),
1886 is_text=is_text,
1887 errors=self.options.get("encoding_errors", "strict"),
1888 storage_options=self.options.get("storage_options", None),
1889 )
1890 assert self.handles is not None
1891 f = self.handles.handle
File /opt/homebrew/Caskroom/miniconda/base/envs/pycbc_test/lib/python3.9/site-packages/pandas/io/common.py:873, in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
868 elif isinstance(handle, str):
869 # Check whether the filename is to be opened in binary mode.
870 # Binary mode does not support 'encoding' and 'newline'.
871 if ioargs.encoding and "b" not in ioargs.mode:
872 # Encoding
--> 873 handle = open(
874 handle,
875 ioargs.mode,
876 encoding=ioargs.encoding,
877 errors=errors,
878 newline="",
879 )
880 else:
881 # Binary mode
882 handle = open(handle, ioargs.mode)
FileNotFoundError: [Errno 2] No such file or directory: 'sample_data/california_housing_train.csv'
let us add a column bedroom per person
df_california_housing["bedroom per person"] = df_california_housing["total_bedrooms"] / df_california_housing["population"]
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Input In [3], in <module>
----> 1 df_california_housing["bedroom per person"] = df_california_housing["total_bedrooms"] / df_california_housing["population"]
NameError: name 'df_california_housing' is not defined
df_california_housing.head()
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Input In [4], in <module>
----> 1 df_california_housing.head()
NameError: name 'df_california_housing' is not defined
Let us now export this dataframe to a csv and xls spreadsheets
df_california_housing.to_csv("california_housing_new.csv", index=False)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Input In [5], in <module>
----> 1 df_california_housing.to_csv("california_housing_new.csv", index=False)
NameError: name 'df_california_housing' is not defined
df_california_housing.to_excel("california_housing_new.xlsx", index=False)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Input In [6], in <module>
----> 1 df_california_housing.to_excel("california_housing_new.xlsx", index=False)
NameError: name 'df_california_housing' is not defined
Exercise 7.1#
Import mnist_test.csv from sample data as a data frame. Call the dataframe df_mnist. We will need to import it without the header. To achieve this set header parameter to None inside read_csv function. After that display the data frame.
df_mnist = pd.read_csv("sample_data/mnist_test.csv", header=None)
df_mnist
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
Input In [7], in <module>
----> 1 df_mnist = pd.read_csv("sample_data/mnist_test.csv", header=None)
2 df_mnist
File /opt/homebrew/Caskroom/miniconda/base/envs/pycbc_test/lib/python3.9/site-packages/pandas/io/parsers/readers.py:1026, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, date_format, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options, dtype_backend)
1013 kwds_defaults = _refine_defaults_read(
1014 dialect,
1015 delimiter,
(...)
1022 dtype_backend=dtype_backend,
1023 )
1024 kwds.update(kwds_defaults)
-> 1026 return _read(filepath_or_buffer, kwds)
File /opt/homebrew/Caskroom/miniconda/base/envs/pycbc_test/lib/python3.9/site-packages/pandas/io/parsers/readers.py:620, in _read(filepath_or_buffer, kwds)
617 _validate_names(kwds.get("names", None))
619 # Create the parser.
--> 620 parser = TextFileReader(filepath_or_buffer, **kwds)
622 if chunksize or iterator:
623 return parser
File /opt/homebrew/Caskroom/miniconda/base/envs/pycbc_test/lib/python3.9/site-packages/pandas/io/parsers/readers.py:1620, in TextFileReader.__init__(self, f, engine, **kwds)
1617 self.options["has_index_names"] = kwds["has_index_names"]
1619 self.handles: IOHandles | None = None
-> 1620 self._engine = self._make_engine(f, self.engine)
File /opt/homebrew/Caskroom/miniconda/base/envs/pycbc_test/lib/python3.9/site-packages/pandas/io/parsers/readers.py:1880, in TextFileReader._make_engine(self, f, engine)
1878 if "b" not in mode:
1879 mode += "b"
-> 1880 self.handles = get_handle(
1881 f,
1882 mode,
1883 encoding=self.options.get("encoding", None),
1884 compression=self.options.get("compression", None),
1885 memory_map=self.options.get("memory_map", False),
1886 is_text=is_text,
1887 errors=self.options.get("encoding_errors", "strict"),
1888 storage_options=self.options.get("storage_options", None),
1889 )
1890 assert self.handles is not None
1891 f = self.handles.handle
File /opt/homebrew/Caskroom/miniconda/base/envs/pycbc_test/lib/python3.9/site-packages/pandas/io/common.py:873, in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
868 elif isinstance(handle, str):
869 # Check whether the filename is to be opened in binary mode.
870 # Binary mode does not support 'encoding' and 'newline'.
871 if ioargs.encoding and "b" not in ioargs.mode:
872 # Encoding
--> 873 handle = open(
874 handle,
875 ioargs.mode,
876 encoding=ioargs.encoding,
877 errors=errors,
878 newline="",
879 )
880 else:
881 # Binary mode
882 handle = open(handle, ioargs.mode)
FileNotFoundError: [Errno 2] No such file or directory: 'sample_data/mnist_test.csv'
Each row consists of the label of the handwritten digit, e.g., first row represents 7, and 784 pixels, value 255 represents white and 0 represents black. Originally these were 28 x 28 pixels images representing handwritten digits. We will reshape and plot one row.
import matplotlib.pyplot as plt
seven = df_mnist.iloc[0,1:].values.reshape(28, 28)
plt.imshow(seven)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Input In [9], in <module>
----> 1 seven = df_mnist.iloc[0,1:].values.reshape(28, 28)
2 plt.imshow(seven)
NameError: name 'df_mnist' is not defined
Your task is to plot the first ten rows as “digit images”. Please use the for loop. Don’t forget to use plt.show() after plt.imshow, otherwise you will see only one picture.
for k in range(0,10):
number = df_mnist.iloc[k,1:].values.reshape(28, 28)
plt.imshow(number)
plt.show()
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Input In [10], in <module>
1 for k in range(0,10):
----> 2 number = df_mnist.iloc[k,1:].values.reshape(28, 28)
3 plt.imshow(number)
4 plt.show()
NameError: name 'df_mnist' is not defined