
[#5688] Lost attributes in sgml-parser.rb when Endtagfind encountered

Date: 2006-09-08 13:34
Priority: 3
Submitted By: Anders Pikas (apikas)
Assigned To: Nobody (None)
Category: None
State: Open

Detailed description
I parsed HTML text using ruby-htmltools 1.10 and did not get the expected attributes. This seems to be related to the
presence of "/>" tag endings. With a naive patch to sgml-parser.rb the attributes appear as expected. Below
are a 4-line Ruby script that triggers the problem, and the patch.

[Please forgive my somewhat shallow understanding of *ML parsing...]

# HTML with attributes
require 'html/xmltree'
p = HTMLTree::XMLParser.new(false)
p.feed("<a><img src='x' alt='1'><c id='2'/></a>")
p p.document.children[0].children[0].attributes
# Expect {"src"=>src='x', "alt"=>alt='1'}
# End

$Id: sgml-parser.rb,v 1.5 2006/07/25 17:23:21 jhannes Exp 
lib/html/sgml-parser.rb $
***************
*** 224,231 ****
           @lasttag = tag
         end
         while k < j
!          break if rawdata.index(Endtagfind, k)
!          break unless rawdata.index(Attrfind, k)
           matched_length = $&.length
           attrname, rest, attrvalue = $1, $2, $3
           if not rest
--- 224,231 ----
           @lasttag = tag
         end
         while k < j
!          break if (rawdata.index(Endtagfind, k) || j) < j
!          break unless (rawdata.index(Attrfind, k) || j) < j
           matched_length = $&.length
           attrname, rest, attrvalue = $1, $2, $3
           if not rest
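
The effect of the bound added by the patch can be illustrated outside the parser. The following is only a sketch: END_MARKER is a hypothetical stand-in for the parser's Endtagfind constant (the real pattern in sgml-parser.rb may differ), and j and k are computed by hand here to mimic the end of the current start tag and the scan position after the tag name.

```ruby
# Hypothetical stand-in for Endtagfind; the real regexp may differ.
END_MARKER = %r{/>}

rawdata = "<a><img src='x' alt='1'><c id='2'/></a>"

# Mimic the parser's bookkeeping for the <img ...> start tag.
tag_start = rawdata.index("<img")
j = rawdata.index(">", tag_start) + 1   # just past the end of the <img> tag
k = tag_start + "<img".length           # where attribute scanning begins

# Original check: any "/>" anywhere after k stops the attribute loop.
original_breaks = !rawdata.index(END_MARKER, k).nil?

# Patched check: only a "/>" inside the current tag (before j) stops it.
patched_breaks = (rawdata.index(END_MARKER, k) || j) < j

puts "original check breaks: #{original_breaks}"
puts "patched check breaks:  #{patched_breaks}"
```

In this input the only "/>" belongs to the following <c> tag, so the unbounded search stops the loop before any attribute of <img> is read, while the bounded one lets the loop continue.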


Followup

Message
Date: 2008-04-22 18:40
Sender: Stephan Wehner

Here's another case, using the "LinkGrabber" code:

$ irb
irb(main):001:0> require 'html/sgml-parser'
=> true
irb(main):002:0> require 'set'
=> true
irb(main):003:0> 
irb(main):004:0* class LinkGrabber < HTML::SGMLParser
irb(main):005:1>   attr_reader :urls
irb(main):006:1> 
irb(main):007:1*   def initialize
irb(main):008:2>     @urls = Set.new
irb(main):009:2>     super
irb(main):010:2>   end
irb(main):011:1> 
irb(main):012:1*   def do_a(attrs)
irb(main):013:2>     url = attrs.find { |attr| attr[0] == 'href'}
irb(main):014:2>     @urls << url[1] if url
irb(main):015:2>   end
irb(main):016:1> end
irb(main):019:0> s = '<a href="page.html">title</a><br />'; l=LinkGrabber.new ; l.feed(s); l.urls
=> #<Set: {}>  -----> expected page.html
irb(main):020:0> s = '<a href="page.html">title</a>';  l=LinkGrabber.new ; l.feed(s); l.urls
=> #<Set: {"page.html"}>  -------> this looks good.

Date: 2006-11-06 21:22
Sender: Marko Marjanovic

Helpful patch, but it still doesn't solve the problem for tag endings
that contain a space before the slash, such as this one:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

The patch does work for:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>

Any solutions? (I tried to analyze the parser, but lacked knowledge
and patience...)
