we like it when it is windy: October 2015

rsync is great
very useful
current favourite invocation:

rsync -avzhP --bwlimit=1000 -e ssh <from_files> <touser@host:/path/>

-P is short for --partial --progress
-avhz .. h=humanoid, v=verbose, a=archive, z=compression
.. archive instructs it to maintain time_t values so even if clocks are out rsync knows the true date of each file

2(kind of 3) usability caveats:
(may depend on version of rsync at source and dest?)
(may be related to using ssh?)

1.1 user interrupt leaves truncated file with full file name on destination

IF the user hits ctrl-c on rsync and interrupts transfer then the temporary file at destination (by default named .<from_file>.<6randomchars> is renamed to <from_file>. This could cause a user to think file transfer was successful but file is truncated. If the rsync is resumed then rsync uses the partially transferred file and resumes with new temporary file.

It would be better if the temporary file was kept. (so user knows transfer was not successful).

1.2 resume after network interrupt starts from scratch - doesn't resume from temporary file

IF there is a network interruption and rsync is interrupted then the temporary file at destination is not renamed. If the rsync is resumed then rsync doesn't use the already partially transferred file. rsync starts from scratch. If you rename the (truncated) temporary file to original file name and then resume rsync the rsync uses the partially transferred file.

It would be better rsync resume checks for temporary file and resumes transfer from there.

2. permission to write on destination error not reported until files fully transferred over network

IF touser cannot write to /path.
(and if from_files are large and network is slow)
I have seen rsync transfer 100% of all files over network.
Then - possibly hours after transfer start - rsync reports permission denied.
There is no temporary file on destination.

$ rsync -avzhP TEST -e ssh touser@host:/path/
sending incremental file list
TEST
5 100% 0.00kB/s 0:00:00 (xfer#1, to-check=0/1)
rsync: mkstemp "/path/.TEST.zXCDQB" failed: Permission denied (13)

sent 74 bytes received 31 bytes 14.00 bytes/sec
total size is 5 speedup is 0.05
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1042) [sender=3.0.7]
$

### rsync versions at source and dest:
$ rsync --version
rsync version 3.0.7 protocol version 30

$ rsync --version
rsync version 3.0.9 protocol version 30

This is a KNOWN rsync bug/issue:
https://bugzilla.samba.org/show_bug.cgi?id=6788
Bug 6788 - Skip file transfer if destination file cannot be opened for writing

Other bugs:
https://bugzilla.samba.org/buglist.cgi?bug_status=UNCONFIRMED&bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&bug_status=RESOLVED&bug_status=VERIFIED&product=rsync

Tuesday, 13 October 2015

python pandas - data analysis library - quickstart / introduction: 1. read data (pandas IO) 2. database style merge (inner join indexing on date+time) 3. write data.

I discovered python pandas recently.
Using to read in ifconfig logs and add up total network traffic (VM network) across multiple hosts.
I needed to combine files from multiple hosts, sort values and combine by date and time.

I was sorting and parsing into python dicts.
I looked at perl PDL and other data manipulation libs.
But looks like python pandas wins because it combines VERY FLEXIBLE file IO with a lot of methods for selecting rows/columns and manipulating the data.

I made a simple test to make sure I knew how it all worked and described it here:
http://stackoverflow.com/questions/19222043/parse-two-files-and-merge-lines-if-time-stamp-matches/33100756#33100756

A flexible and general way of manipulating data is python pandas. Worth mentioning here as it really is the right tool for the job. Allows spreadsheet or database style merges/joins/concats on selected index rows or columns.

Two example files to illustrate how it works

$ cat File1
date0,time0,data01,data02,data03
date1,time1,data11,data12,data13
date2,time2,data21,data22,data23
date3,time3,data31,data32,data33
date4,time4,data41,data42,data43
date5,time5,data51,data52,data53
$ cat File2
date1,time1,data14
date4,time4,data44
date2,time2,data24

Run python . . .

Use pandas read_csv to slurp in files in pandas table structure. (read_csv is very clever and can read in many formats not just csv)
Use pandas merge to do inner(intersection of indices) join, using date+time as indices (index list=[0,1]).
Use pandas to_csv to write output.

THE IMPORTANT BIT:

$ python
>>> from pandas import merge, read_csv
>>> f1=read_csv("File1",header=None)
>>> f2=read_csv("File2",header=None)
>>> merged = merge(f1, f2, how='inner', left_on=[0,1], right_on=[0,1])
>>> merged.to_csv("Out", na_rep=0, index=False, header=False)
>>> [Ctrl-D]

Job done!

$ cat Out
date1,time1,data11,data12,data13,data14
date2,time2,data21,data22,data23,data24
date4,time4,data41,data42,data43,data44

Easy as 1, 2, 3.

1. read data (pandas IO)

2. database style merge (inner join indexing on date+time)

3. write data.

VERY clean, no messing. I really do love bash/grep/sed/awk also perl and python manipulating data in structures BUT right tool for the job makes the job much easier and gives much more potential for use of the data.

Breakdown:

1. read_csv A bog-standard(plain, unadorned) 'read_csv("File1")' treats first line as header names. So we use 'header=None'.

>>> f1=read_csv("File1")
>>> f1
date0  time0  data01  data02  data03
0  date1  time1  data11  data12  data13
1  date2  time2  data21  data22  data23
2  date3  time3  data31  data32  data33
3  date4  time4  data41  data42  data43
4  date5  time5  data51  data52  data53
>>> f1=read_csv("File1",header=None)
>>> f1
0      1       2       3       4
0  date0  time0  data01  data02  data03
1  date1  time1  data11  data12  data13
2  date2  time2  data21  data22  data23
3  date3  time3  data31  data32  data33
4  date4  time4  data41  data42  data43
5  date5  time5  data51  data52  data53
>>> f2=read_csv("File2",header=None)

pandas DataFrame 'describe()' gives a useful summary especially for big tables. For numeric data you also get total, max, min, mean, e.t.c.

>>> f1.describe()
0      1       2       3       4
count       6      6       6       6       6
unique      6      6       6       6       6
top     date4  time3  data01  data12  data13
freq        1      1       1       1       1

2. merge

how specifies left/right/inner/outer merge style sql join terminology. how='left' takes first file indices (date+time) as output and merge in the second files data. how='right' takes the second files indices as output and merge in the first files data. how='inner' does an intersection between each files indices (date+time) so only data for which you have entries in both files are taken. how='outer' does an union between each files indices (date+time) so all data is written, data for which you do not have entries in both files is filled with 'NaN' values.

on/left_on/right_on index select We could also use 'on=[0,1]' as our input files have the same columns of indices (and they are named the same 0 and 1 as we read in the file with 'header=None').

>>> merged = merge(f1, f2, how='inner', left_on=[0,1], right_on=[0,1])
>>> merged
0      1     2_x       3       4     2_y
0  date1  time1  data11  data12  data13  data14
1  date2  time2  data21  data22  data23  data24
2  date4  time4  data41  data42  data43  data44
>>> mergedOut = merge(f1, f2, how='outer', left_on=[0,1], right_on=[0,1])
>>> mergedOut
0      1     2_x       3       4     2_y
0  date0  time0  data01  data02  data03     NaN
1  date1  time1  data11  data12  data13  data14
2  date2  time2  data21  data22  data23  data24
3  date3  time3  data31  data32  data33     NaN
4  date4  time4  data41  data42  data43  data44
5  date5  time5  data51  data52  data53     NaN

3. to_csv we write out without and index or a header using 'index=False' and 'header=False'. See the output with index and header written to the "Out2" file:

>>> merged.to_csv("Out2")
>>> merged.to_csv("Out", na_rep=0, index=False, header=False)

$ cat Out2
,0,1,2_x,3,4,2_y
0,date1,time1,data11,data12,data13,data14
1,date2,time2,data21,data22,data23,data24
2,date4,time4,data41,data42,data43,data44
$ cat Out
date1,time1,data11,data12,data13,data14
date2,time2,data21,data22,data23,data24
date4,time4,data41,data42,data43,data44

Documentation to get started with:

pandas IO tools for reading and writing data files: http://pandas.pydata.org/pandas-docs/stable/io.html
the pandas DataFrame objects: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html Select rows/columns/data. Operate on part or all of dataset. Loads of stuff.
pandas merging/join/concat http://pandas.pydata.org/pandas-docs/stable/merging.html

To get pandas and install: http://pandas.pydata.org/getpandas.html

# download, unpack and:
sudo python setup.py install