[#13527] CGI::unescapeHTML doesn't properly handle chars that are >= 128 and < 256 when $KCODE = 'UTF8'

Date: 2007-08-30 15:22
Priority: 3
Submitted By: Paul Haddad (paulpthcom)
Assigned To: Shyouhei Urabe (shyouhei)
Category: Misc / Other Standard Library
State: Open
Platform:
Summary:
CGI::unescapeHTML doesn't properly handle chars that are >= 128 and < 256 when $KCODE = 'UTF8'

Detailed description
CGI::unescapeHTML currently uses the following logic for hexadecimal character references:

        if $1.hex < 256
          $1.hex.chr
        else
          if $1.hex < 65536 and ($KCODE[0] == ?u or $KCODE[0] == ?U)
            [$1.hex].pack("U")
          else
            "&#x#{$1};"
          end
        end


The problem with the above is that UTF-8 is compatible with ASCII only up to char 127.  Any code point above that needs
multiple bytes.  This should probably be reworked to use iconv and support multiple encodings, but if we're worrying
only about UTF-8, it should probably be something like this:

  if $KCODE == 'UTF8'
    [$1.hex].pack('U')
  elsif $1.hex < 256
    $1.hex.chr
  else
    "&#x#{$1};"
  end
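
As a rough illustration of the proposed branch order, here is a minimal standalone sketch. The method name decode_hex_reference and its utf8 flag are hypothetical; they stand in for the code inside cgi.rb and for the $KCODE check, and are not part of the library.

```ruby
# Hypothetical helper, not part of cgi.rb. It takes the hex digits of a
# character reference (for example "ae" from "&#xae;") and a flag that
# stands in for the $KCODE check used in the report.
def decode_hex_reference(hex_digits, utf8)
  code_point = hex_digits.hex
  if utf8
    # Encode the code point as UTF-8; anything above 127 becomes multi-byte.
    [code_point].pack('U')
  elsif code_point < 256
    # Outside UTF-8 mode, emit a single raw byte.
    code_point.chr
  else
    # Cannot be represented; leave the reference untouched.
    "&#x#{hex_digits};"
  end
end

p decode_hex_reference('ae', true).unpack('C*')   # two bytes in UTF-8 mode
p decode_hex_reference('ae', false).unpack('C*')  # one byte otherwise
```

Checking the UTF-8 case first is the point of the change: the "< 256" shortcut is only safe when the target encoding is single-byte.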

Below is a simple example that reproduces the problem:

require 'cgi'
$KCODE = 'UTF-8'
p CGI::unescapeHTML('&#xae;')

This should print the registered trademark sign (®, U+00AE) encoded as UTF-8, but instead prints the single byte \256.
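
The byte-level difference behind that output can be checked directly. The following is a minimal sketch (the variable names are illustrative only) comparing the raw byte produced by Integer#chr with the UTF-8 encoding produced by Array#pack for the same code point:

```ruby
# Code point of the character referenced by &#xae;
code_point = 0xAE

# Integer#chr without arguments gives one raw byte for values below 256.
single_byte = code_point.chr

# The "U" directive of Array#pack gives the UTF-8 encoding of the code point.
utf8 = [code_point].pack('U')

p single_byte.unpack('C*')  # => [174], i.e. octal \256
p utf8.unpack('C*')         # => [194, 174], i.e. "\xC2\xAE"
```

A lone byte 174 (0xAE) is not a valid UTF-8 sequence, which is why the output is shown as an escaped byte rather than as a character.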

Followup

No Followups Have Been Posted
