I parsed HTML text using ruby-htmltools 1.10 and did not find the expected attributes on the parsed elements. This seems
to be related to the presence of empty-element tag endings ("/>") later in the input. With a naive patch to
sgml-parser.rb, the attributes appear as expected. Below you will find a 4-line Ruby script that triggers the problem,
followed by the patch.
[Please forgive my somewhat shallow understanding of *ML parsing...]
# HTML with attributes
require 'html/xmltree'
parser = HTMLTree::XMLParser.new(false)
parser.feed("<a><img src='x' alt='1'><c id='2'/></a>")
p parser.document.children[0].children[0].attributes
# Expect {"src"=>src='x', "alt"=>alt='1'}
# End
$Id: sgml-parser.rb,v 1.5 2006/07/25 17:23:21 jhannes Exp
lib/html/sgml-parser.rb $
***************
*** 224,231 ****
@lasttag = tag
end
while k < j
! break if rawdata.index(Endtagfind, k)
! break unless rawdata.index(Attrfind, k)
matched_length = $&.length
attrname, rest, attrvalue = $1, $2, $3
if not rest
--- 224,231 ----
@lasttag = tag
end
while k < j
! break if (rawdata.index(Endtagfind, k) || j) < j
! break unless (rawdata.index(Attrfind, k) || j) < j
matched_length = $&.length
attrname, rest, attrvalue = $1, $2, $3
if not rest
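
The effect of the patch can be sketched in isolation. This is only an illustration under assumptions: ENDTAG_SKETCH is a hypothetical stand-in for the parser's Endtagfind constant (the real pattern in sgml-parser.rb may differ), and the two helper methods are invented for this example. It shows that String#index with a start offset searches to the end of the string, so a "/>" belonging to a later tag is found unless the result is compared against j, the end of the current tag.

```ruby
# Hypothetical stand-in for Endtagfind; the real pattern may differ.
ENDTAG_SKETCH = %r{/>}

# Original check: true if the pattern matches anywhere at or after k,
# even far beyond the tag currently being parsed.
def stops_unbounded?(rawdata, k)
  !rawdata.index(ENDTAG_SKETCH, k).nil?
end

# Patched check: true only if the match starts before j,
# the position of the closing bracket of the current tag.
def stops_bounded?(rawdata, k, j)
  (rawdata.index(ENDTAG_SKETCH, k) || j) < j
end

rawdata = "<a><img src='x' alt='1'><c id='2'/></a>"
k = rawdata.index("<img") + "<img".length  # scan position just past the tag name
j = rawdata.index(">", k)                  # closing bracket of <img ...>

puts stops_unbounded?(rawdata, k)    # true: the "/>" of the later <c/> tag is found
puts stops_bounded?(rawdata, k, j)   # false: that match lies beyond j
```

The second changed line in the patch applies the same bound to the Attrfind search, so an attribute belonging to a later tag is not picked up either.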