Friday, 23 August 2019

grepping for non-ascii chars . . grep from 0x80->0xFF doesn't find > 2 byte encoded chars unless set locale e.g. with LC_ALL=C


A cow-orker asked this old chestnut again.

Actually that problem ended up being a CR in between {} in a big .json file, not funny characters.

But along the way we found something confusing and learned that grep for chars from 0x80 to 0xff does show extended chars encoded in 2 bytes BUT weirdly NON intuitively it doesn't seem to match with extended chars encoded in 3 and 4 bytes.

 * © copyright c2a9 or   underscore c2a0 chars would match

 * ๐Ÿ’˜ Heart With Arrow Emoji - Emojipedia == UTF8? f09f9298  did not match

 * ใ‚ a in japanese hiragana  e38182 did not match

If you give the magical pre-invocation LC_ALL=C to the grep . . then it DOES match 2/3/4 byte extended characters.


This ANSWER updated goes into full explanation:


https://stackoverflow.com/questions/3001177/how-do-i-grep-for-all-non-ascii-characters/46343900#46343900

I agree with Harvey above buried in the comments, it is often more useful to search for non-printable characters OR it is easy to think non-ASCII when you really should be thinking non-printable. Harvey suggests "use this: "[^\n -~]". Add \r for DOS text files. That translates to "[^\x0A\x020-\x07E]" and add \x0D for CR"

Also, adding -c (show count of patterns matched) to grep is useful when searching for non-printable chars as the strings matched can mess up terminal.

I found adding range 0-8 and 0x0e-0x1f (to the 0x80-0xff range) is a useful pattern. This excludes the TAB, CR and LF and one or two more uncommon printable chars. So IMHO a quite a useful (albeit crude) grep pattern is THIS one:

grep -c -P -n "[\x00-\x08\x0E-\x1F\x80-\xFF]" *

ACTUALLY, generally you will need to do this:

LC_ALL=C grep -c -P -n "[\x00-\x08\x0E-\x1F\x80-\xFF]" *

breakdown:

LC_ALL=C - set locale to C, otherwise many extended chars will not match (even though they look like they are encoded > 0x80)
\x00-\x08 - non-printable control chars 0 - 7 decimal
\x0E-\x1F - more non-printable control chars 14 - 31 decimal
\x80-1xFF - non-printable chars > 128 decimal
-c - print count of matching lines instead of lines
-P - perl style regexps 
Instead of -c you may prefer to use -n (and optionally -b) or -l
-n, --line-number
-b, --byte-offset
-l, --files-with-matches

E.g. practical example of use find to grep all files under current directory:

LC_ALL=C find . -type f -exec grep -c -P -n "[\x00-\x08\x0E-\x1F\x80-\xFF]" {} +

You may wish to adjust the grep at times. e.g. BS(0x08 - backspace) char used in some printable files or to exclude VT(0x0B - vertical tab). The BEL(0x07) and ESC(0x1B) chars can also be deemed printable in some cases.

Non-Printable ASCII Chars
** marks PRINTABLE but CONTROL chars that is useful to exclude sometimes

Dec   Hex Ctrl Char description           Dec Hex Ctrl Char description
0     00  ^@  NULL                        16  10  ^P  DATA LINK ESCAPE (DLE)
1     01  ^A  START OF HEADING (SOH)      17  11  ^Q  DEVICE CONTROL 1 (DC1)
2     02  ^B  START OF TEXT (STX)         18  12  ^R  DEVICE CONTROL 2 (DC2)
3     03  ^C  END OF TEXT (ETX)           19  13  ^S  DEVICE CONTROL 3 (DC3)
4     04  ^D  END OF TRANSMISSION (EOT)   20  14  ^T  DEVICE CONTROL 4 (DC4)
5     05  ^E  END OF QUERY (ENQ)          21  15  ^U  NEGATIVE ACKNOWLEDGEMENT (NAK)
6     06  ^F  ACKNOWLEDGE (ACK)           22  16  ^V  SYNCHRONIZE (SYN)
7     07  ^G  BEEP (BEL)                  23  17  ^W  END OF TRANSMISSION BLOCK (ETB)
8     08  ^H  BACKSPACE (BS)**            24  18  ^X  CANCEL (CAN)
9     09  ^I  HORIZONTAL TAB (HT)**       25  19  ^Y  END OF MEDIUM (EM)
10    0A  ^J  LINE FEED (LF)**            26  1A  ^Z  SUBSTITUTE (SUB)
11    0B  ^K  VERTICAL TAB (VT)**         27  1B  ^[  ESCAPE (ESC)
12    0C  ^L  FF (FORM FEED)**            28  1C  ^\  FILE SEPARATOR (FS) RIGHT ARROW
13    0D  ^M  CR (CARRIAGE RETURN)**      29  1D  ^]  GROUP SEPARATOR (GS) LEFT ARROW
14    0E  ^N  SO (SHIFT OUT)              30  1E  ^^  RECORD SEPARATOR (RS) UP ARROW
15    0F  ^O  SI (SHIFT IN)               31  1F  ^_  UNIT SEPARATOR (US) DOWN ARROW


UPDATE: I had to revisit this recently. And, YYMV depending on terminal settings/solar weather forecast BUT . . I noticed that grep was not finding many unicode or extended characters. Even though intuitively they should match the range 0x80 to 0xff, 3 and 4 byte unicode characters were not matched. ??? Can anyone explain this? YES. @frabjous asked and @calandoa explained that LC_ALL=C should be used to set locale for the command to make grep match.

e.g. my locale LC_ALL= empty

$ locale
LANG=en_IE.UTF-8
LC_CTYPE="en_IE.UTF-8"
.
.
LC_ALL=
grep with LC_ALL= empty matches 2 byte encoded chars but not 3 and 4 byte encoded:

$ grep -P -n "[\x00-\x08\x0E-\x1F\x80-\xFF]" notes_unicode_emoji_test
5:© copyright c2a9
7:call  underscore c2a0
9:CTRL
31:5 © copyright
32:7 call  underscore
grep with LC_ALL=C does seem to match all extended characters that you would want:

$ LC_ALL=C grep --color='auto' -P -n "[\x80-\xFF]" notes_unicode_emoji_test
1:���� unicode dashes e28090
3:��� Heart With Arrow Emoji - Emojipedia == UTF8? f09f9298
5:� copyright c2a9
7:call� underscore c2a0
11:LIVE��E! ���������� ���� ���������� ���� �� �� ���� ����  YEOW, mix of japanese and chars from other e38182 e38184 . . e0a487
29:1 ���� unicode dashes
30:3 ��� Heart With Arrow Emoji - Emojipedia == UTF8 e28090
31:5 � copyright
32:7 call� underscore
33:11 LIVE��E! ���������� ���� ���������� ���� �� �� ���� ����  YEOW, mix of japanese and chars from other
34:52 LIVE��E! ���������� ���� ���������� ���� �� �� ���� ����  YEOW, mix of japanese and chars from other
81:LIVE��E! ���������� ���� ���������� ���� �� �� ���� ����  YEOW, mix of japanese and chars from other
THIS perl match (partially found elsewhere on stackoverflow) OR the inverse grep on the top answer DO seem to find ALL the ~weird~ and ~wonderful~ "non-ascii" characters without setting locale:

$ grep --color='auto' -P -n "[^\x00-\x7F]" notes_unicode_emoji_test
$ perl -ne 'print "$. $_" if m/[\x00-\x08\x0E-\x1F\x80-\xFF]/' notes_unicode_emoji_test
1 ‐‐ unicode dashes e28090
3 ๐Ÿ’˜ Heart With Arrow Emoji - Emojipedia == UTF8? f09f9298
5 © copyright c2a9
7 call  underscore c2a0
9 CTRL-H CHARS URK URK URK
11 LIVE‐E! ใ‚ใ„ใ†ใˆใŠ ใ‹ใŒ ใ‚ขใ‚คใ‚ฆใ‚จใ‚ช ใ‚ซใ‚ฌ แšŠ แš‹ เธ‹เธŒ เค†เค‡  YEOW, mix of japanese and chars from other e38182 e38184 . . e0a487
29 1 ‐‐ unicode dashes
30 3 ๐Ÿ’˜ Heart With Arrow Emoji - Emojipedia == UTF8 e28090
31 5 © copyright
32 7 call  underscore
33 11 LIVE‐E! ใ‚ใ„ใ†ใˆใŠ ใ‹ใŒ ใ‚ขใ‚คใ‚ฆใ‚จใ‚ช ใ‚ซใ‚ฌ แšŠ แš‹ เธ‹เธŒ เค†เค‡  YEOW, mix of japanese and chars from other
34 52 LIVE‐E! ใ‚ใ„ใ†ใˆใŠ ใ‹ใŒ ใ‚ขใ‚คใ‚ฆใ‚จใ‚ช ใ‚ซใ‚ฌ แšŠ แš‹ เธ‹เธŒ เค†เค‡  YEOW, mix of japanese and chars from other
73 LIVE‐E! ใ‚ใ„ใ†ใˆใŠ ใ‹ใŒ ใ‚ขใ‚คใ‚ฆใ‚จใ‚ช ใ‚ซใ‚ฌ แšŠ แš‹ เธ‹เธŒ เค†เค‡  YEOW, mix of japanese and chars from other
SO the preferred non-ascii char finders:

$ perl -ne 'print "$. $_" if m/[\x00-\x08\x0E-\x1F\x80-\xFF]/' notes_unicode_emoji_test

as in top answer, the inverse grep:

$ grep --color='auto' -P -n "[^\x00-\x7F]" notes_unicode_emoji_test

as in top answer but WITH LC_ALL=C:

$ LC_ALL=C grep --color='auto' -P -n "[\x80-\xFF]" notes_unicode_emoji_test

No comments: