
[#5421] a patch for the sgml-parser for various problems with HTML parsing

Date:
2006-08-15 08:42
Priority:
3
Submitted By:
Nobody
Assigned To:
Nobody (None)
Category:
None
State:
Open
Summary:
a patch for the sgml-parser for various problems with HTML parsing

Detailed description
There are some problems with the parse_starttag method in SGMLParser.

Example 1:

<html>
  <body id="test">
    <div><img src="someimage.jpg" /></div>
  </body>
</html>

The id attribute is not recognised. The following line in parse_starttag stops the search for attributes
because of the image tag:

   break if rawdata.index(Endtagfind, k)

------------------------------------------

Example 2:

<html>
  <body >
    <div>Lorem ipsum dolor sit amet ...</div>
  </body>
</html>

An attribute "div" is assigned to the body tag. Because of the whitespace in the body tag, the while loop
inside the parse_starttag method is entered. Inside the loop, the following line identifies "div"
as an attribute of the body tag:

  break unless rawdata.index(Attrfind, k)
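
Both examples appear to come down to String#index searching forward from k instead of matching at k. The following is a minimal sketch of that behaviour; ENDTAGFIND and ATTRFIND are hypothetical stand-ins for the parser's Endtagfind and Attrfind constants, whose real patterns may differ.

```ruby
# Hypothetical stand-ins for the parser's constants (assumptions, not the
# library's exact patterns).
ENDTAGFIND = /\s*\/\s*>/
ATTRFIND   = /[\s,]*([a-zA-Z_][a-zA-Z_0-9.-]*)(\s*=\s*('[^']*'|"[^"]*"|[^\s>]*))?/

# Example 1: k points just past the tag name "body".
doc1 = %(<body id="test">\n<div><img src="someimage.jpg" /></div>)
k = 5
tag_end1 = doc1.index('>')           # closing '>' of the body start tag
end_pos  = doc1.index(ENDTAGFIND, k) # finds the "/>" of the img tag
puts "Example 1: body tag ends at #{tag_end1}, Endtagfind matched at #{end_pos}"

# Example 2: the body tag holds only whitespace after its name.
doc2 = "<body >\n<div>Lorem ipsum</div>"
tag_end2  = doc2.index('>')
attr_pos  = doc2.index(ATTRFIND, k)  # skips ahead to the next tag
attr_name = $1
puts "Example 2: body tag ends at #{tag_end2}, found #{attr_name.inspect} at #{attr_pos}"
```

In both cases the match lies beyond the closing bracket of the body tag, yet index still returns a truthy position.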


------------------------------------------

To solve these problems, two changes inside the parse_starttag method are necessary.

1. Remove the following line so that the parser does not stop looking for attributes:

   break if rawdata.index(Endtagfind, k)

2. After an attribute, or something that resembles an attribute, is found, test whether it lies within the current tag.
This can be achieved by inserting the following line below matched_length = $&.length:

   break unless k+matched_length<=j
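
As a rough illustration of the two changes, here is a simplified attribute loop with the Endtagfind line dropped and the bounds check added. This is a sketch, not the attached file: collect_attrs and ATTRFIND are hypothetical names, the pattern is a stand-in for Attrfind, and k and j are assumed to be the position after the tag name and the index of the closing bracket.

```ruby
# Hypothetical stand-in for the parser's Attrfind pattern (assumption).
ATTRFIND = /[\s,]*([a-zA-Z_][a-zA-Z_0-9.-]*)(\s*=\s*('[^']*'|"[^"]*"|[^\s>]*))?/

# Collects [name, value] pairs for the start tag whose name ends at k
# and whose closing '>' sits at index j.
def collect_attrs(rawdata, k, j)
  attrs = []
  while k < j
    # Change 1: no "break if rawdata.index(Endtagfind, k)" here.
    break unless rawdata.index(ATTRFIND, k)
    matched_length = $&.length
    name, value = $1, $3
    # Change 2: stop when the match does not fit inside the current tag.
    break unless k + matched_length <= j
    # Strip surrounding quotes from a quoted value.
    value = value[1..-2] if value && value =~ /\A(['"]).*\1\z/m
    attrs << [name.downcase, value || name]
    k += matched_length
  end
  attrs
end

doc1 = %(<body id="test">\n<div><img src="someimage.jpg" /></div>)
doc2 = "<body >\n<div>Lorem ipsum</div>"
p collect_attrs(doc1, 5, doc1.index('>'))  # => [["id", "test"]]
p collect_attrs(doc2, 5, doc2.index('>'))  # => []
```

The check compares k plus the match length with j, not the position where the match actually starts. With the stand-in pattern used here, a very short name found beyond the tag (for example "p" after "<body >") could still pass, so comparing the position returned by index may be a stricter variant.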

------------------------------------------

The file containing the changes described above is attached.


Followup

No Followups Have Been Posted

Attached Files:

Name Description Download
sgml-parser.rb Download

Changes:

Field       Old Value            Date              By
File Added  748: sgml-parser.rb  2006-08-15 08:42  None