CGI::unescapeHTML currently uses the following logic to decode hexadecimal character references:
  if $1.hex < 256
    $1.hex.chr
  else
    if $1.hex < 65536 and ($KCODE[0] == ?u or $KCODE[0] == ?U)
      [$1.hex].pack("U")
    else
      "&#x#{$1};"
    end
  end
The problem with the above is that UTF-8 is byte-compatible with ASCII only up to code point 127. Anything above that
needs multiple bytes, so for values 128 through 255 the $1.hex.chr branch emits a single byte that is not valid UTF-8.
This should probably be reworked to use iconv and support multiple encodings, but if we are only concerned with UTF-8,
it should probably look something like this:
  if $KCODE == 'UTF8'
    [$1.hex].pack('U')
  elsif $1.hex < 256
    $1.hex.chr
  else
    "&#x#{$1};"
  end
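
The reason for checking the encoding before the "< 256" test can be seen at the byte level. The following is only a minimal sketch and assumes Ruby 2.0 or later (String#bytes returning an array and String#valid_encoding? are not available in the 1.8 series this report targets). The variable names are illustrative.

```ruby
# Code point U+00AE (REGISTERED SIGN), used here purely as an example value
# in the 128..255 range.
code_point = 0xAE

# Integer#chr yields a single raw byte for values below 256.
single_byte = code_point.chr

# Array#pack with the 'U' directive yields the UTF-8 encoding of the code point.
utf8 = [code_point].pack('U')

p single_byte.bytes  # => [174]
p utf8.bytes         # => [194, 174]

# A lone 0xAE byte is not a valid UTF-8 sequence; the two-byte form is.
p single_byte.dup.force_encoding('UTF-8').valid_encoding?  # => false
p utf8.valid_encoding?                                     # => true
```

So in a UTF-8 context the single-byte result is not just a different character, it is an invalid sequence.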
The following is a simple example that reproduces the problem:
  require 'cgi'
  $KCODE = 'UTF-8'
  p CGI::unescapeHTML('&#xAE;')
This should print the registered trademark symbol (®), but instead prints \256, which is the single byte 0xAE in octal.
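
For checking the expected behaviour outside the cgi library, the suggested branching can be written as a standalone function. This is only a sketch: decode_hex_ref is a hypothetical helper, and its utf8 argument is an assumed stand-in for the $KCODE check (which has no effect on Ruby 1.9 and later). It assumes Ruby 2.0 or later for String#bytes.

```ruby
# Hypothetical helper, not part of the cgi library: returns the replacement
# text for one hexadecimal reference "&#x<hex>;". The utf8 flag is an assumed
# substitute for the $KCODE test in the suggested logic.
def decode_hex_ref(hex, utf8)
  code_point = hex.hex
  if utf8
    [code_point].pack('U')  # encode as UTF-8, including values 128..255
  elsif code_point < 256
    code_point.chr          # one raw byte for single-byte encodings
  else
    "&#x#{hex};"            # not representable; leave the reference as-is
  end
end

p decode_hex_ref('AE', true).bytes   # => [194, 174]
p decode_hex_ref('AE', false).bytes  # => [174]
p decode_hex_ref('263A', false)      # => "&#x263A;"
```

With the flag set, the AE reference comes back as the two-byte UTF-8 sequence rather than the single byte \256.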