pandas, csv, xlrd, which one is best for csv and xls files?

 

1, .csv vs .xls:

csv hold plain text as a series of values (cells) separated by commas (,) in a series of lines (rows). However, when I tested in Mac, it seems that values are separated by (;), not (,) as default.  While xls is an MS Excel workbook binary file.

2, python modules: csv & pandas & xlrd:

Both csv and pandas can be used to read .csv file.csv can be used to do simple work with the data that stores in .csv file, while pandas enable you to perform analysis or intense data manipulations.

Both pandas and xlrd can be used to read .xls files. Actually, pandas has used module xlrd when reading or writing .xls files in lower level. So it is xlrd that gave the possibility for reading .xls. However pandas provide calculation and analysis capabities on top of that. The evidence is that if you install pandas but without installing xlrd module, it will shows error when running pandas.read_excel function.

In summary, pandas is a good tool to use for reading both .csv or xls file

3, example codes using pandas to read a .xls file:

Below is a simple code example which can be used to read a .xls files and read the index value or column value accordingly

import sys
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile

filename = sys.argv[1]
sheet_name = sys.argv[2]
df = pd.read_excel(filename, sheet_name = sheet_name)

print(“Column headings:”)
print(df.columns) # print all index value, the first row in excel
print(df[‘$IPv4 Network’]) #print column value whose index is “$IPv4 Network”

4, code example to use csv module to read a csv file and save the data into dictionary:

import csv
import sys
#coding=utf-8

filename = sys.argv[1]

with open(filename,’rU’) as csvfile:
reader = csv.DictReader(csvfile,delimiter= ‘;’)
for line in reader:
print(line)

with .csv file example:

A B C
qer ewr tr
ewe dfe cbv

The output of the above .csv are:

$ python3 Documents/python/readcsvfile.py Documents/test.csv

OrderedDict([(‘\ufeffA’, ‘qer’), (‘B’, ‘ewr’), (‘C’, ‘tr’)])

OrderedDict([(‘\ufeffA’, ‘ewe’), (‘B’, ‘dfe’), (‘C’, ‘cbv’)])

$ python Documents/python/readcsvfile.py Documents/test.csv

{‘\xef\xbb\xbfA’: ‘qer’, ‘B’: ‘ewr’, ‘C’: ‘tr’}

{‘\xef\xbb\xbfA’: ‘ewe’, ‘B’: ‘dfe’, ‘C’: ‘cbv’}

Especially \xef\xbb\xbf  is byte order mark (BOM) of utf-8 unicode, because my test.csv is using utf-8. Normally .csv is encoded with utf-8 or ASCII. And in my example, delimiter = ‘;’ (default is ‘ , ‘) in order to get the correct result.

The result shows that in dictionary there are 3 keywords: A, B and C, each row shows values for the key word, that is ‘qer’, ‘ewe’ for A, ‘ewe’ and ‘die’ for B, etc.

5, Unicode

It is quite often to meet encoding/decoding problem when you are reading a file using unknown unicode. The best way is to find out what unicode used in the original file, then covert it according to the needs. defaultcodevalue in python 2 is ASCII, when reading a utf-8 file, it may generate some error, but not always, we need keep in mind of this.

Some people use the following script to solve above problem, but this is not recommended. More discussion can be found here:

import sys
reload(sys)
sys.setdefaultencoding("utf-8")

One thought on “pandas, csv, xlrd, which one is best for csv and xls files?”

Leave a comment