[#4933] HTML entities mangled by RubyfulSoup?

Date:
2006-07-03 21:40
Priority:
3
Submitted By:
Peter Krantz (peterkz)
Assigned To:
Leonard Richardson (leonardr)
Category:
None
State:
Open
Summary:
HTML entities mangled by RubyfulSoup?

Detailed description
irb(main):194:0> htm = "<a href='/'>Read&nbsp;more</a>"
=> "<a href='/'>Read&nbsp;more</a>"
irb(main):195:0> soup = BeautifulSoup.new(htm)
=> <a href="/">Read%nbspmore</a>

It looks like &nbsp; has been converted to %nbsp: the ampersand became a percent sign and the trailing semicolon was dropped. This may make it difficult to clean parsed data. Or am I doing
something wrong?

Followup

Message
Date: 2007-05-17 02:40
Sender: Daniel Hoey

I think that I have the same issue. The following unit test will
fail:

require 'rubygems'
require 'rubyful_soup'
require 'test/unit'

class RubyfulSoupTest < Test::Unit::TestCase
  def test_greater_than
    html = "<span>a &gt; b</span>"
    @soup = BeautifulSoup.new(html)
    assert_equal(html, @soup.to_s)
  end
end

This code will fix the issue (and I believe will fix Peter's
issue also):
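# Empty the entity table so that the parser treats every entity
# reference as unknown. Ruby warns that the constant is already
# initialized, but the reassignment still takes effect.
class HTML::SGMLParser
  Entitydefs = {}
end

# Pass unknown entity references through unchanged ("&gt;" stays "&gt;").
class BeautifulStoneSoup
  def unknown_entityref(ref)
    handle_data("&#{ref};")
  end
end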
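
For anyone who wants to see the idea behind the patch without installing the gem, here is a minimal standalone sketch. It is not RubyfulSoup code: the method name pass_through_entities and the ENTITY_REF pattern are made up for illustration, and the only assumption is that a reference whose name is not in the entity table should be written back exactly as it was read.

```ruby
# Hypothetical helper, not part of RubyfulSoup or htmltools.
# Matches named entity references such as "&gt;" or "&nbsp;".
ENTITY_REF = /&([A-Za-z][A-Za-z0-9]*);/

# Replace each entity reference with its definition if one is known;
# otherwise emit the reference unchanged, as the patched
# unknown_entityref does.
def pass_through_entities(text, entitydefs = {})
  text.gsub(ENTITY_REF) do
    name = Regexp.last_match(1)
    entitydefs.fetch(name) { "&#{name};" }
  end
end

# With an empty table every reference survives the round trip.
puts pass_through_entities("<span>a &gt; b</span>")
puts pass_through_entities("<a href='/'>Read&nbsp;more</a>")
```

With an empty entity table the input comes back byte for byte, which is what the unit test above expects from @soup.to_s. Passing a table such as { "gt" => ">" } shows the opposite behaviour, where known references are decoded.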

Changes:

Field        Old Value  Date              By
assigned_to  none       2007-03-16 10:09  peterkz