Tuesday 13 October 2015

python pandas - data analysis library - quickstart / introduction: 1. read data (pandas IO) 2. database style merge (inner join indexing on date+time) 3. write data.

I discovered python pandas recently.
Using to read in ifconfig logs and add up total network traffic (VM network) across multiple hosts.
I needed to combine files from multiple hosts, sort values and combine by date and time.

I was sorting and parsing into python dicts.
I looked at perl PDL and other data manipulation libs.
But looks like python pandas wins because it combines VERY FLEXIBLE file IO with a lot of methods for selecting rows/columns and manipulating the data.

I made a simple test to make sure I knew how it all worked and described it here:
http://stackoverflow.com/questions/19222043/parse-two-files-and-merge-lines-if-time-stamp-matches/33100756#33100756

A flexible and general way of manipulating data is python pandas. Worth mentioning here as it really is the right tool for the job. Allows spreadsheet or database style merges/joins/concats on selected index rows or columns.
Two example files to illustrate how it works
$ cat File1
date0,time0,data01,data02,data03
date1,time1,data11,data12,data13
date2,time2,data21,data22,data23
date3,time3,data31,data32,data33
date4,time4,data41,data42,data43
date5,time5,data51,data52,data53
$ cat File2
date1,time1,data14
date4,time4,data44
date2,time2,data24
Run python . . .
  • Use pandas read_csv to slurp in files in pandas table structure. (read_csv is very clever and can read in many formats not just csv)
  • Use pandas merge to do inner(intersection of indices) join, using date+time as indices (index list=[0,1]).
  • Use pandas to_csv to write output.
THE IMPORTANT BIT:
$ python
>>> from pandas import merge, read_csv
>>> f1=read_csv("File1",header=None)
>>> f2=read_csv("File2",header=None)
>>> merged = merge(f1, f2, how='inner', left_on=[0,1], right_on=[0,1])
>>> merged.to_csv("Out", na_rep=0, index=False, header=False)
>>> [Ctrl-D]
Job done!
$ cat Out
date1,time1,data11,data12,data13,data14
date2,time2,data21,data22,data23,data24
date4,time4,data41,data42,data43,data44
Easy as 1, 2, 3.
1. read data (pandas IO) 
2. database style merge (inner join indexing on date+time) 
3. write data. 
VERY clean, no messing. I really do love bash/grep/sed/awk also perl and python manipulating data in structures BUT right tool for the job makes the job much easier and gives much more potential for use of the data.
Breakdown:
1. read_csv A bog-standard(plain, unadorned) 'read_csv("File1")' treats first line as header names. So we use 'header=None'.
>>> f1=read_csv("File1")
>>> f1
date0  time0  data01  data02  data03
0  date1  time1  data11  data12  data13
1  date2  time2  data21  data22  data23
2  date3  time3  data31  data32  data33
3  date4  time4  data41  data42  data43
4  date5  time5  data51  data52  data53
>>> f1=read_csv("File1",header=None)
>>> f1
0      1       2       3       4
0  date0  time0  data01  data02  data03
1  date1  time1  data11  data12  data13
2  date2  time2  data21  data22  data23
3  date3  time3  data31  data32  data33
4  date4  time4  data41  data42  data43
5  date5  time5  data51  data52  data53
>>> f2=read_csv("File2",header=None)
pandas DataFrame 'describe()' gives a useful summary especially for big tables. For numeric data you also get total, max, min, mean, e.t.c.
>>> f1.describe()
0      1       2       3       4
count       6      6       6       6       6
unique      6      6       6       6       6
top     date4  time3  data01  data12  data13
freq        1      1       1       1       1
2. merge
how specifies left/right/inner/outer merge style sql join terminology. how='left' takes first file indices (date+time) as output and merge in the second files data. how='right' takes the second files indices as output and merge in the first files data. how='inner' does an intersection between each files indices (date+time) so only data for which you have entries in both files are taken. how='outer' does an union between each files indices (date+time) so all data is written, data for which you do not have entries in both files is filled with 'NaN' values.
on/left_on/right_on index select We could also use 'on=[0,1]' as our input files have the same columns of indices (and they are named the same 0 and 1 as we read in the file with 'header=None').
>>> merged = merge(f1, f2, how='inner', left_on=[0,1], right_on=[0,1])
>>> merged
0      1     2_x       3       4     2_y
0  date1  time1  data11  data12  data13  data14
1  date2  time2  data21  data22  data23  data24
2  date4  time4  data41  data42  data43  data44
>>> mergedOut = merge(f1, f2, how='outer', left_on=[0,1], right_on=[0,1])
>>> mergedOut
0      1     2_x       3       4     2_y
0  date0  time0  data01  data02  data03     NaN
1  date1  time1  data11  data12  data13  data14
2  date2  time2  data21  data22  data23  data24
3  date3  time3  data31  data32  data33     NaN
4  date4  time4  data41  data42  data43  data44
5  date5  time5  data51  data52  data53     NaN
3. to_csv we write out without and index or a header using 'index=False' and 'header=False'. See the output with index and header written to the "Out2" file:
>>> merged.to_csv("Out2")
>>> merged.to_csv("Out", na_rep=0, index=False, header=False)

$ cat Out2
,0,1,2_x,3,4,2_y
0,date1,time1,data11,data12,data13,data14
1,date2,time2,data21,data22,data23,data24
2,date4,time4,data41,data42,data43,data44
$ cat Out
date1,time1,data11,data12,data13,data14
date2,time2,data21,data22,data23,data24
date4,time4,data41,data42,data43,data44
Documentation to get started with:
To get pandas and install: http://pandas.pydata.org/getpandas.html
# download, unpack and:
sudo python setup.py install

No comments: