Friday 7 May 2021

Note on testing/playing with unicode chars on command-line: position bug on command-line and invisible chars.

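Working with invisible unicode chars like zero width joiner (ZWJ, U+200D) and zero width space (ZWSP, U+200B) on command-line and in other places like documents, configuration tools, blogger.com editor ;) can be tricky. Neither is the same as the no-break space (NBSP, U+00A0), which does take up a column.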


You can copy unicode chars and then paste them into the command-line or other locations and see whether it works.
You are dependent on the command-line (or other tool) presenting the chars correctly, and on the backend code being able to handle unicode: store it, then read it back from the store and decode it.

Note on testing/playing with unicode chars on command-line.

On command-line the encoding is UTF-8
e.g. \xE2\x80\x8B for ZWSP (U+200B) and \xE2\x80\x8D for ZWJ (U+200D); whereas in big-endian UTF-16 ZWSP is \x20\x0B
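
To see where those bytes come from, here is a minimal sketch of the UTF-8 encoding arithmetic in bash. The function name utf8_hex is made up for illustration, and the sketch does not reject surrogates or other invalid code points.

```shell
#!/usr/bin/env bash
# utf8_hex is a hypothetical helper, not a standard tool.
# Print the UTF-8 bytes (in hex) for one code point given in hex, e.g. 200B.
utf8_hex() {
  local cp=$((16#$1))
  if (( cp < 0x80 )); then            # 1 byte:  0xxxxxxx
    printf '%02x\n' "$cp"
  elif (( cp < 0x800 )); then         # 2 bytes: 110xxxxx 10xxxxxx
    printf '%02x %02x\n' \
      $(( 0xC0 | (cp >> 6) )) $(( 0x80 | (cp & 0x3F) ))
  elif (( cp < 0x10000 )); then       # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    printf '%02x %02x %02x\n' \
      $(( 0xE0 | (cp >> 12) )) $(( 0x80 | ((cp >> 6) & 0x3F) )) \
      $(( 0x80 | (cp & 0x3F) ))
  else                                # 4 bytes: 11110xxx + three 10xxxxxx
    printf '%02x %02x %02x %02x\n' \
      $(( 0xF0 | (cp >> 18) )) $(( 0x80 | ((cp >> 12) & 0x3F) )) \
      $(( 0x80 | ((cp >> 6) & 0x3F) )) $(( 0x80 | (cp & 0x3F) ))
  fi
}

utf8_hex 200B   # e2 80 8b  (ZWSP)
utf8_hex 200D   # e2 80 8d  (ZWJ)
utf8_hex 2620   # e2 98 a0  (skull)
utf8_hex A0     # c2 a0     (NBSP)
```

The arithmetic does not depend on the locale, so it gives the same answer on any of the machines below.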

For older bash (YMMV depending on linux version, terminal app and shell) working with unicode chars on the command-line was a bit MORE fiddly due to a cursor position bug. Working with ZWJ, ZWSP and similar zero-width chars is tricky anyway as they are invisible.

$ echo "TEST‍‍TEST" | grep -E "^(TEST)(‍‍*)(TEST)"     # <-- zero-width chars invisible in regexp and in text; -E needed so ( ) act as groups
#          ^^                       ^^        # <-- the invisible zero-width chars sit here
#              You can copy unicode chars and paste on command-line.
#              Some unicode chars are visible and others not.
#              If you edit command-line with unicode in it you can get confused! 
#              As you move cursor past a unicode char there ~might~ be a bash/command-line bug,
#                the cursor appears to move one char forward or back each time,
#                but the actual cursor position moves byte by byte (so for a 3-byte UTF-8 char the real position deviates by 2 for each char passed).
#              So, you cannot trust where the cursor appears to be.
#              If you edit the command-line after moving past a unicode char you are probably adding chars in a different place, and you only see this when the command gives unexpected results.
#              **To edit command-lines with unicode in them use Ctrl-L to re-draw your command-line screen after moving cursor past unicode chars to see where you really are.**  

~might~ be: with Terminator or xterm on Ubuntu 20.04, on a direct command-line or over ssh to various machines, even under screen, there is no bug. With earlier Ubuntu versions, shells or connection methods the bug can show up.
"By default, GNU bash assumes that every character is one byte long and one column wide. A patch for bash 2.04, by Marcin 'Qrczak' Kowalczyk and Ricardas Cepas, teaches bash about multibyte characters in UTF-8 encoding. bash-2.04-diff

Double-width characters, combining characters and bidi are not supported by this patch. It seems a complete redesign of the readline redisplay engine is needed."
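
The gap between "bytes" and "characters" that such a bug trips over can be measured directly. This is a rough sketch only: byte_vs_char_count is a hypothetical helper, and it counts characters simply as bytes that are not UTF-8 continuation bytes, which assumes the input is valid UTF-8.

```shell
#!/usr/bin/env bash
# byte_vs_char_count is a hypothetical helper for illustration.
byte_vs_char_count() {
  local hex bytes=0 chars=0
  # od prints every byte as two hex digits, whatever the locale is.
  for hex in $(printf '%s' "$1" | od -An -v -tx1); do
    bytes=$((bytes + 1))
    # UTF-8 continuation bytes are 0x80..0xBF; any other byte starts a char.
    if (( 16#$hex < 0x80 || 16#$hex > 0xBF )); then
      chars=$((chars + 1))
    fi
  done
  printf '%d bytes, %d characters\n' "$bytes" "$chars"
}

# TEST + skull (3 bytes) + ZWJ (3 bytes) + TEST
byte_vs_char_count "$(printf 'TEST\xe2\x98\xa0\xe2\x80\x8dTEST')"
# 14 bytes, 10 characters
```

A line editor that counts bytes would be 4 positions out by the end of that string, 2 for each 3-byte char.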


$ echo "TEST☠‍TEST" | xxd
00000000: 5445 5354 e298 a0e2 808d 5445 5354 0a    TEST......TEST.
# note: hexdump's default output is 16-bit words in host byte order, so on little-endian x86 each pair of bytes appears swapped
$ echo "TEST☠‍TEST" | hexdump
0000000 4554 5453 98e2 e2a0 8d80 4554 5453 000a
000000f
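  == ETTS......ETTS..   (the bytes in the order hexdump prints them, pairwise swapped)
 === TEST<SKULL><ZWJ>TEST\n   (the actual byte order; the trailing 00 is hexdump padding the odd final byte, not a NULL in the data)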
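
To sidestep the swapped pairs, dump one byte at a time. A minimal sketch using od, which is available on all the systems here; dump_bytes is a hypothetical wrapper name. (hexdump -C also prints bytes in file order.)

```shell
#!/usr/bin/env bash
# dump_bytes is a hypothetical helper: one hex pair per byte, in file order.
dump_bytes() {
  # Unquoted on purpose: word splitting collapses od's padding and line breaks.
  echo $(od -An -v -tx1)
}

printf 'TEST\xe2\x98\xa0\xe2\x80\x8dTEST\n' | dump_bytes
# 54 45 53 54 e2 98 a0 e2 80 8d 54 45 53 54 0a
```

Here e2 98 a0 is the skull, e2 80 8d is the ZWJ and 0a is the newline, with no padding byte at the end.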

$ echo "TEST​​TEST" |grep -E "^(TEST)(.*)(TEST)"   # <-- 2 ZWSP in text
TEST​​TEST  <--- 14 bytes here "TEST" 4 chars/bytes, "​​" 2 ZWSP chars, 6 bytes, "TEST" 4 chars/bytes
$ echo "TEST​​TEST" |grep -E "^(TEST)(.*)(TEST)" |xxd
00000000: 5445 5354 e280 8be2 808b 5445 5354 0a    TEST......TEST.

$ echo "TEST​​TEST" |grep -E "^(TEST)(​*)(TEST)"    # <-- 2 ZWSP chars in text and 1 ZWSP in regexp before *
TEST​​TEST  <--- 14 bytes here "TEST" 4 chars/bytes, "​​" 2 ZWSP chars, 6 bytes, "TEST" 4 chars/bytes
$ echo "TEST​​TEST" |grep -E "^(TEST)(​*)(TEST)" |xxd
00000000: 5445 5354 e280 8be2 808b 5445 5354 0a    TEST......TEST.


#                You can use bash echo or printf to make working with unicode chars more explicit/clearer e.g.
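$ printf '\xE2\x98\xA0'          # skull, U+2620; no trailing newline, so the next prompt follows the ☠
☠$ printf '\xE2\x80\x8d'         # ZWJ, U+200D; prints but is invisible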
$ echo -e  '\xE2\x98\xA0'

# Or use echo -e "\u2620" (unicode escape sequence) if bash >= 4.2. U+2620 is the skull ☠; U+2621 is the caution sign ☡.
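
Since \u only works from bash 4.2, a script that has to run on older machines could check the version first and fall back to raw bytes. A sketch under that assumption; supports_u_escape and print_skull are hypothetical names, and the optional arguments exist only so other version numbers can be tried.

```shell
#!/usr/bin/env bash
# supports_u_escape [major] [minor]: true if that bash version has \uHHHH.
# Defaults to the running shell's version from BASH_VERSINFO.
supports_u_escape() {
  local major=${1:-${BASH_VERSINFO[0]}} minor=${2:-${BASH_VERSINFO[1]}}
  (( major > 4 || (major == 4 && minor >= 2) ))
}

print_skull() {
  if supports_u_escape; then
    printf '\u2620\n'          # needs a UTF-8 locale to come out as the skull
  else
    printf '\xe2\x98\xa0\n'    # the same character as raw UTF-8 bytes
  fi
}

print_skull
```

The raw byte form works on every bash version, which is why it is the fallback.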

# e.g. in use in command-line, you can explicitly see chars in test and regexp
$ echo "TEST$(printf '\xE2\x80\x8d')TEST" |grep -E "^(TEST)($(printf '\xE2\x80\x8d')*)(TEST)" |xxd
00000000: 5445 5354 e280 8d54 4553 540a            TEST...TEST.
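
Going the other way, invisible chars already sitting in some text can be made visible by swapping them for a readable tag. A minimal sketch: show_invisible is a hypothetical filter name, and it only knows about the three chars discussed in this note.

```shell
#!/usr/bin/env bash
# show_invisible is a hypothetical filter: stdin to stdout, with
# ZWSP, ZWJ and NBSP replaced by a visible <U+XXXX> tag.
show_invisible() {
  local zwsp zwj nbsp
  zwsp=$(printf '\xe2\x80\x8b')
  zwj=$(printf '\xe2\x80\x8d')
  nbsp=$(printf '\xc2\xa0')
  # LC_ALL=C makes sed match plain bytes, so the terminal locale does not matter.
  LC_ALL=C sed -e "s/$zwsp/<U+200B>/g" \
               -e "s/$zwj/<U+200D>/g" \
               -e "s/$nbsp/<U+00A0>/g"
}

printf 'TEST\xe2\x80\x8b\xe2\x80\x8bTEST\n' | show_invisible
# TEST<U+200B><U+200B>TEST
```

Piping a suspect command-line or config file through a filter like this shows exactly which invisible char is where.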


Ubuntu 20.04 laptop$ bash --version
GNU bash, version 5.0.17(1)-release (x86_64-pc-linux-gnu)
Copyright (C) 2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>

This is free software; you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Ubuntu 20.04 laptop$ echo -e "\u2621"


$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.2 LTS"
$ cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.2 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.2 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal



## older/red-hat centos 
np$ cat /etc/redhat-release 
CentOS Linux release 7.3.1611 (Core) 

np$ bash --version
GNU bash, version 4.2.46(1)-release (x86_64-redhat-linux-gnu)
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>

This is free software; you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
$ echo "TEST☠‍TEST" | hexdump
0000000 4554 5453 98e2 e2a0 8d80 4554 5453 000a
000000f
$ echo -e "\u2621"


v# cat /etc/redhat-release 
Fedora release 12 (Constantine)

v# bash --version
GNU bash, version 4.0.33(1)-release (i386-redhat-linux-gnu)
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>

This is free software; you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
v# echo "TESTfoo☠‍TEST" | hexdump
0000000 4554 5453 6f66 e26f a098 80e2 548d 5345
0000010 0a54                                   
0000012
v# echo "TESTfoo☠‍TEST" | xxd
0000000: 5445 5354 666f 6fe2 98a0 e280 8d54 4553  TESTfoo......TES
0000010: 540a                                     T.


d# cat /etc/redhat-release 
Fedora release 12 (Constantine)

d# echo "TESTfoobar☠‍afterTEST" | xxd
0000000: 5445 5354 666f 6f62 6172 e298 a0e2 808d  TESTfoobar......
0000010: 6166 7465 7254 4553 540a                 afterTEST.
d# echo "TESTfoobar|B4☠‍AFTafterTEST" | xxd
0000000: 5445 5354 666f 6f62 6172 7c42 34e2 98a0  TESTfoobar|B4...
0000010: e280 8d41 4654 6166 7465 7254 4553 540a  ...AFTafterTEST.
d# echo -e "\u2621"
\u2621

d# bash --version
GNU bash, version 4.0.35(1)-release (i386-redhat-linux-gnu)
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>

This is free software; you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.


Wow. All the systems I'm logged into today are clear of the bug.
This leaves me suspecting it's not just an old bash bug but might also be terminal related: when connecting to a system through a VPN or a chain of ssh sessions, terminal capabilities might get messed up, and working with unicode chars on the command-line is one of a few things that are affected.
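
If the terminal or session is the suspect, one cheap thing to compare between a working and a misbehaving login is the locale and TERM the shell ends up with. This is only a guess at a useful check, not a confirmed cause of the bug; is_utf8_charmap and report_terminal_env are hypothetical names.

```shell
#!/usr/bin/env bash
# is_utf8_charmap [name]: true if the charmap (default: this session's) is UTF-8.
is_utf8_charmap() {
  case "${1:-$(locale charmap 2>/dev/null)}" in
    UTF-8|utf-8|UTF8|utf8) return 0 ;;
    *) return 1 ;;
  esac
}

# Print the settings worth comparing between two sessions.
report_terminal_env() {
  printf 'TERM=%s LANG=%s LC_ALL=%s charmap=%s\n' \
    "${TERM:-}" "${LANG:-}" "${LC_ALL:-}" "$(locale charmap 2>/dev/null)"
}

report_terminal_env
if is_utf8_charmap; then
  echo "session charmap is UTF-8"
else
  echo "session charmap is NOT UTF-8: multibyte chars may be mishandled"
fi
```

Running it at each hop of an ssh chain shows where, if anywhere, the UTF-8 setting gets lost.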
