Using python pandas to read in ifconfig logs and add up total network traffic (VM network) across multiple hosts.
I needed to combine files from multiple hosts, sort the values, and merge lines by date and time.
I was sorting and parsing into python dicts.
I looked at perl PDL and other data manipulation libs.
But it looks like python pandas wins because it combines VERY FLEXIBLE file IO with a lot of methods for selecting rows/columns and manipulating the data.
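For context, a minimal sketch of what the final traffic-totalling job could look like (file names and column layout are invented here; the real ifconfig output needs parsing into csv lines first):
from pandas import read_csv, concat
# hypothetical per-host files with columns: date,time,rx_bytes,tx_bytes
cols = ['date', 'time', 'rx_bytes', 'tx_bytes']
frames = [read_csv(f, header=None, names=cols) for f in ('host1.csv', 'host2.csv')]
# stack all hosts, then sum the traffic for matching date+time entries
total = concat(frames).groupby(['date', 'time']).sum()
total.to_csv("total_traffic.csv")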
I made a simple test to make sure I knew how it all worked and described it here:
http://stackoverflow.com/questions/19222043/parse-two-files-and-merge-lines-if-time-stamp-matches/33100756#33100756
Python pandas is a flexible and general way of manipulating data, and worth describing here as it really is the right tool for the job. It allows spreadsheet or database style merges/joins/concats on selected index rows or columns.
Two example files to illustrate how it works:
$ cat File1
date0,time0,data01,data02,data03
date1,time1,data11,data12,data13
date2,time2,data21,data22,data23
date3,time3,data31,data32,data33
date4,time4,data41,data42,data43
date5,time5,data51,data52,data53
$ cat File2
date1,time1,data14
date4,time4,data44
date2,time2,data24
Run python . . .
- Use pandas read_csv to slurp the files into pandas table structures (DataFrames). (read_csv is very clever and can read in many formats, not just csv.)
- Use pandas merge to do an inner (intersection of indices) join, using date+time as the indices (index list = [0,1]).
- Use pandas to_csv to write output.
THE IMPORTANT BIT:
$ python
>>> from pandas import merge, read_csv
>>> f1=read_csv("File1",header=None)
>>> f2=read_csv("File2",header=None)
>>> merged = merge(f1, f2, how='inner', left_on=[0,1], right_on=[0,1])
>>> merged.to_csv("Out", na_rep=0, index=False, header=False)
>>> [Ctrl-D]
Job done!
$ cat Out
date1,time1,data11,data12,data13,data14
date2,time2,data21,data22,data23,data24
date4,time4,data41,data42,data43,data44
Easy as 1, 2, 3.
1. read data (pandas IO)
2. database style merge (inner join indexing on date+time)
3. write data.
VERY clean, no messing. I really do love bash/grep/sed/awk, and also perl and python for manipulating data in structures, BUT the right tool for the job makes the job much easier and gives much more potential for further use of the data.
Breakdown:
1. read_csv: A bog-standard (plain, unadorned) 'read_csv("File1")' treats the first line as header names, so we use 'header=None'.
>>> f1=read_csv("File1")
>>> f1
date0 time0 data01 data02 data03
0 date1 time1 data11 data12 data13
1 date2 time2 data21 data22 data23
2 date3 time3 data31 data32 data33
3 date4 time4 data41 data42 data43
4 date5 time5 data51 data52 data53
>>> f1=read_csv("File1",header=None)
>>> f1
0 1 2 3 4
0 date0 time0 data01 data02 data03
1 date1 time1 data11 data12 data13
2 date2 time2 data21 data22 data23
3 date3 time3 data31 data32 data33
4 date4 time4 data41 data42 data43
5 date5 time5 data51 data52 data53
>>> f2=read_csv("File2",header=None)
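If you want meaningful column names instead of 0..4, read_csv also takes a 'names' list (these names are invented for illustration):
>>> f1n = read_csv("File1", header=None, names=['date', 'time', 'd1', 'd2', 'd3'])
You could then merge with on=['date', 'time'] instead of the numeric labels, provided both files were read with the same names.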
The pandas DataFrame 'describe()' method gives a useful summary, especially for big tables. For numeric data you also get count, mean, std, min, max, etc.
>>> f1.describe()
0 1 2 3 4
count 6 6 6 6 6
unique 6 6 6 6 6
top date4 time3 data01 data12 data13
freq 1 1 1 1 1
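To see what describe() reports for numbers, here is a tiny made-up numeric column (the values are invented):
>>> from pandas import DataFrame
>>> DataFrame({'bytes': [100, 175, 250]}).describe()
That reports count, mean, std, min, the quartiles and max, rather than the unique/top/freq summary shown above for strings.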
2. merge
how: specifies the merge style in SQL join terminology: left/right/inner/outer.
- how='left' takes the first file's indices (date+time) as the output and merges in the second file's data.
- how='right' takes the second file's indices as the output and merges in the first file's data.
- how='inner' takes the intersection of the two files' indices (date+time), so only data with entries in both files is kept.
- how='outer' takes the union of the two files' indices (date+time), so all data is written; data missing from one of the files is filled with 'NaN' values.
The inner and outer cases are shown below, with a 'left' example after them.
on/left_on/right_on index select: We could also use 'on=[0,1]' here, as our input files have the same index columns (and they carry the same column names, 0 and 1, since we read both files with 'header=None').
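For example, this should give the same merged result as the explicit left_on/right_on call that follows:
>>> merged = merge(f1, f2, how='inner', on=[0,1])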
>>> merged = merge(f1, f2, how='inner', left_on=[0,1], right_on=[0,1])
>>> merged
0 1 2_x 3 4 2_y
0 date1 time1 data11 data12 data13 data14
1 date2 time2 data21 data22 data23 data24
2 date4 time4 data41 data42 data43 data44
>>> mergedOut = merge(f1, f2, how='outer', left_on=[0,1], right_on=[0,1])
>>> mergedOut
0 1 2_x 3 4 2_y
0 date0 time0 data01 data02 data03 NaN
1 date1 time1 data11 data12 data13 data14
2 date2 time2 data21 data22 data23 data24
3 date3 time3 data31 data32 data33 NaN
4 date4 time4 data41 data42 data43 data44
5 date5 time5 data51 data52 data53 NaN
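For completeness, a 'left' merge keeps every File1 row and fills NaN where File2 has no match; with these particular files it should look identical to the outer merge above, since every File2 key also appears in File1:
>>> mergedLeft = merge(f1, f2, how='left', left_on=[0,1], right_on=[0,1])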
3. to_csv: We write the output without an index or a header using 'index=False' and 'header=False'. Compare with the output written to the "Out2" file, which includes the index and header:
>>> merged.to_csv("Out2")
>>> merged.to_csv("Out", na_rep=0, index=False, header=False)
$ cat Out2
,0,1,2_x,3,4,2_y
0,date1,time1,data11,data12,data13,data14
1,date2,time2,data21,data22,data23,data24
2,date4,time4,data41,data42,data43,data44
$ cat Out
date1,time1,data11,data12,data13,data14
date2,time2,data21,data22,data23,data24
date4,time4,data41,data42,data43,data44
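The na_rep=0 option only matters when the merge actually produced NaN entries; writing the outer merge the same way should replace each NaN with a 0 in the output file (the "OutAll" name is just for illustration):
>>> mergedOut.to_csv("OutAll", na_rep=0, index=False, header=False)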
Documentation to get started with:
- pandas IO tools for reading and writing data files: http://pandas.pydata.org/pandas-docs/stable/io.html
- the pandas DataFrame objects: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html Select rows/columns/data. Operate on part or all of dataset. Loads of stuff.
- pandas merging/join/concat http://pandas.pydata.org/pandas-docs/stable/merging.html
To get pandas and install: http://pandas.pydata.org/getpandas.html
# download, unpack and:
sudo python setup.py install
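Or, assuming pip is available, install from PyPI instead:
pip install pandas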