Friday, 22 September 2017

Searching for non-printable chars with 'grep -c -P -n "[\x00-\x08\x0E-\x1F\x80-\xFF]" *', and finding corrupt files on a cvs server

A cvs rlog (and log) command was failing.
This prevented a jenkins job from starting.

The cvs log failed when it hit a certain file:

    $ cvs log dir/dir/file
    cvs [server aborted]: unexpected '\x49' reading revision number in RCS file cvsroot/module/dir/dir/file,v

On the cvs server, the file started with the usual cvs version/log details and ended with the latest text of the file, but had a section of binary junk in the middle.

So to check for corruption in other cvs files . . .

https://stackoverflow.com/questions/3001177/how-do-i-grep-for-all-non-ascii-characters-in-unix

Searching for non-ASCII chars just means matching any byte >= 0x80.
Searching for non-printable chars is more useful.
Useful grep:

    grep -c -P -n "[\x00-\x08\x0E-\x1F\x80-\xFF]" *

breakdown:

    \x00-\x08 - non-printable control chars 0 - 8 decimal
    \x0E-\x1F - more non-printable control chars 14 - 31 decimal
    \x80-\xFF - non-ASCII chars 128 - 255 decimal
    -c - print count of matching lines instead of lines
    -P - perl style regexps
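
A quick way to check that the pattern behaves as broken down above is to run it on two throwaway files. This is only a sketch: count_nonprintable and the file names are made up for illustration, it assumes GNU grep built with PCRE support (for -P), and LC_ALL=C is an added assumption so that grep works on raw bytes rather than on multi-byte characters.

```shell
# Sketch: count lines containing suspect bytes in each given file.
# count_nonprintable is a hypothetical helper name.
PATTERN='[\x00-\x08\x0E-\x1F\x80-\xFF]'

count_nonprintable() {
    # LC_ALL=C makes grep work on raw bytes, so \x80-\xFF are single bytes.
    LC_ALL=C grep -c -P "$PATTERN" "$@"
}

# Demo on two throwaway files: one clean, one with a 0x01 byte on line 2.
demo_dir=$(mktemp -d)
printf 'plain text\nwith a tab\there\n' > "$demo_dir/clean.txt"
printf 'line one\nbad \001 byte\nline three\n' > "$demo_dir/corrupt.txt"
( cd "$demo_dir" && count_nonprintable clean.txt corrupt.txt )
rm -rf "$demo_dir"
```

The demo should report a count of 0 for clean.txt (the tab is not in the pattern) and 1 for corrupt.txt.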

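Instead of -c you may prefer -n (optionally with -b) or -l. While -c is present grep prints only the counts, so the -n in the command above has no effect until -c is dropped.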

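    -n, --line-number - prefix each matching line with its line number
    -b, --byte-offset - prefix each matching line with its byte offset in the file
    -l, --files-with-matches - print only the names of files that match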
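
To see where in a single file the suspect bytes sit, -n and -b can be combined. A minimal sketch, assuming GNU grep with -P support; locate_nonprintable is a hypothetical helper name and LC_ALL=C is an added assumption to force byte-wise matching.

```shell
# Sketch: show where the suspect bytes are in one file.
# locate_nonprintable is a hypothetical helper name.
locate_nonprintable() {
    # Output format is "line-number:byte-offset:line"; the offset is the
    # 0-based position of the start of the matching line.
    LC_ALL=C grep -n -b -P '[\x00-\x08\x0E-\x1F\x80-\xFF]' "$1"
}
```

If the file contains NUL bytes, GNU grep treats it as binary and only reports that the file matches; adding -a forces the lines to be printed, at the risk of dumping junk to the terminal.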

E.g. a practical example using find to grep all files under the current directory:

    find . -type f -exec grep -c -P -n "[\x00-\x08\x0E-\x1F\x80-\xFF]" {} + 
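
On a cvs server the same find can be narrowed to the RCS files. The following is a sketch rather than the exact command used here: list_suspect_rcs_files is a hypothetical helper, the directory argument is whatever part of the repository you want to scan, and it assumes GNU grep with -P support.

```shell
# Sketch: list RCS files under a repository directory that contain suspect bytes.
# list_suspect_rcs_files is a hypothetical helper; pass the directory to scan.
list_suspect_rcs_files() {
    # -name '*,v' limits the search to RCS files;
    # -l prints only the names of files with at least one match.
    LC_ALL=C find "$1" -type f -name '*,v' \
        -exec grep -l -P '[\x00-\x08\x0E-\x1F\x80-\xFF]' {} +
}
```

A hit is not proof of corruption: files checked in as binary, or text containing non-ASCII chars, will be listed too, so each one still needs a look.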

You may wish to adjust the grep at times, e.g. to stop flagging BS (0x08, backspace), which is used in some printable files, or to also flag VT (0x0B, vertical tab). BEL (0x07) and ESC (0x1B) can also be deemed printable in some cases.
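
The adjustments above come down to editing the byte ranges in the bracket expression. A sketch of some variants, reading "exclude VT" as "treat VT as non-printable"; the variable names and the has_match helper are hypothetical, and GNU grep with -P support is assumed.

```shell
# Baseline: flags 0x00-0x08, 0x0E-0x1F and 0x80-0xFF.
BASE='[\x00-\x08\x0E-\x1F\x80-\xFF]'
# Tolerate BS (0x08): the first range stops at 0x07.
ALLOW_BS='[\x00-\x07\x0E-\x1F\x80-\xFF]'
# Also flag VT (0x0B): add it as a single byte.
FLAG_VT='[\x00-\x08\x0B\x0E-\x1F\x80-\xFF]'
# Tolerate BEL (0x07) and ESC (0x1B): split the ranges around them.
ALLOW_BEL_ESC='[\x00-\x06\x08\x0E-\x1A\x1C-\x1F\x80-\xFF]'

# Hypothetical helper: exit status 0 if file $2 contains a byte matching $1.
has_match() {
    LC_ALL=C grep -q -P "$1" "$2"
}
```

For example, has_match "$ALLOW_BS" somefile succeeds only if somefile contains a flagged byte other than backspace.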

Non-Printable ASCII Chars
** marks control chars that count as PRINTABLE but are sometimes useful to exclude
Dec  Hex  Ctrl  Char description            Dec  Hex  Ctrl  Char description
0    00   ^@    NULL                        16   10   ^P    DATA LINK ESCAPE (DLE)
1    01   ^A    START OF HEADING (SOH)      17   11   ^Q    DEVICE CONTROL 1 (DC1)
2    02   ^B    START OF TEXT (STX)         18   12   ^R    DEVICE CONTROL 2 (DC2)
3    03   ^C    END OF TEXT (ETX)           19   13   ^S    DEVICE CONTROL 3 (DC3)
4    04   ^D    END OF TRANSMISSION (EOT)   20   14   ^T    DEVICE CONTROL 4 (DC4)
5    05   ^E    ENQUIRY (ENQ)               21   15   ^U    NEGATIVE ACKNOWLEDGEMENT (NAK)
6    06   ^F    ACKNOWLEDGE (ACK)           22   16   ^V    SYNCHRONOUS IDLE (SYN)
7    07   ^G    BELL (BEL)                  23   17   ^W    END OF TRANSMISSION BLOCK (ETB)
8    08   ^H    BACKSPACE (BS)**            24   18   ^X    CANCEL (CAN)
9    09   ^I    HORIZONTAL TAB (HT)**       25   19   ^Y    END OF MEDIUM (EM)
10   0A   ^J    LINE FEED (LF)**            26   1A   ^Z    SUBSTITUTE (SUB)
11   0B   ^K    VERTICAL TAB (VT)**         27   1B   ^[    ESCAPE (ESC)
12   0C   ^L    FORM FEED (FF)**            28   1C   ^\    FILE SEPARATOR (FS) RIGHT ARROW
13   0D   ^M    CARRIAGE RETURN (CR)**      29   1D   ^]    GROUP SEPARATOR (GS) LEFT ARROW
14   0E   ^N    SHIFT OUT (SO)              30   1E   ^^    RECORD SEPARATOR (RS) UP ARROW
15   0F   ^O    SHIFT IN (SI)               31   1F   ^_    UNIT SEPARATOR (US) DOWN ARROW

There was just one corrupt file on the cvs server.
To fix it: move the file out of the local work area, cvs remove it and commit, remove the Attic file on the server, move the file back into the work area, then cvs add and commit. The log, version history and tags are lost; only the latest version of the file remains.
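
The recovery steps can be written down as a command sequence. Because they are destructive, the sketch below only prints the commands for review instead of running them. print_recovery_steps, the .keep suffix, the commit messages and both arguments are hypothetical, and the Attic location assumes the file was removed on the trunk.

```shell
# Sketch only: prints the recovery steps for one file instead of running them.
# print_recovery_steps and its arguments are hypothetical names.
#   $1 = file path inside the work area, e.g. dir/dir/file
#   $2 = directory on the server that held the file's ,v file
print_recovery_steps() {
    file=$1
    rcs_dir=$2
    name=$(basename "$file")
    cat <<EOF
mv $file $file.keep
cvs remove $file
cvs commit -m "Remove corrupt file" $file
# on the cvs server:
rm $rcs_dir/Attic/$name,v
# back in the work area:
mv $file.keep $file
cvs add $file
cvs commit -m "Re-add latest version" $file
EOF
}
```

E.g. print_recovery_steps dir/dir/file cvsroot/module/dir/dir lists the steps for the file from the error message above. Take a copy of the ,v file before deleting it in case anything in it turns out to be salvageable.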