There are some problems with the parse_starttag method in the SGMLParser.
Example 1:
<html>
<body id="test">
<div><img src="someimage.jpg" /></div>
</body>
</html>
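The id attribute will not be recognised. The following line in parse_starttag cancels the search for attributes
because of the image tag: the search starts at k but is not limited to the current tag, so it finds the "/>" of
the img tag further down.
break if rawdata.index(Endtagfind, k)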
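A minimal sketch of the effect, using Example 1 collapsed onto one line. ENDTAG_FIND is a simplified stand-in for the parser's Endtagfind constant (an assumption, not the library's definition), and k is set by hand to the position just after the tag name.

```ruby
# Simplified stand-in for the parser's Endtagfind pattern (an assumption).
ENDTAG_FIND = %r{\s*/\s*>}

rawdata = '<body id="test"><div><img src="someimage.jpg" /></div></body>'
k = 5                        # index just after "<body"
j = rawdata.index('>', k)    # index of the ">" that closes the body tag

# String#index searches forward from k through the rest of the string,
# so it is not restricted to the body tag.
hit = rawdata.index(ENDTAG_FIND, k)

puts "body tag closes at #{j}, end-tag pattern found at #{hit}"
puts 'the match lies outside the body tag' if hit > j
```

Because the match is not nil, a "break if" on this search leaves the loop before the id attribute is read.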
------------------------------------------
Example 2:
<html>
<body >
<div>Lorem ipsum dolor sit amet ...</div>
</body>
</html>
An attribute "div" will be assigned to the body tag. Because of the whitespace in the body tag, the while-loop
inside the parse_starttag method is entered. Inside the while-loop, the following line identifies "div"
as an attribute of the body tag:
break unless rawdata.index(Attrfind, k)
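A minimal sketch of this behaviour, using Example 2 collapsed onto one line. ATTR_FIND is a simplified stand-in for the parser's Attrfind constant (an assumption, not the library's definition), and k is set by hand to the position just after the tag name.

```ruby
# Simplified stand-in for the parser's Attrfind pattern (an assumption).
ATTR_FIND = /\s*([a-zA-Z_][-.a-zA-Z_0-9]*)(\s*=\s*('[^']*'|"[^"]*"|[^\s>]*))?/

rawdata = '<body ><div>Lorem ipsum dolor sit amet ...</div></body>'
k = 5                        # index just after "<body"
j = rawdata.index('>', k)    # index of the ">" that closes the body tag

# The search is unanchored: it scans forward from k until something
# that looks like an attribute name turns up.
pos  = rawdata.index(ATTR_FIND, k)
name = $~[1]

puts "body tag closes at #{j}; #{name.inspect} matched at #{pos}"
```

The match starts after the closing ">" of the body tag, so the name of the next tag is picked up as if it were an attribute.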
------------------------------------------
In order to solve these problems, two changes inside the parse_starttag method are necessary.
1. Remove the following line, so the parser will not stop looking for attributes.
break if rawdata.index(Endtagfind, k)
2. After an attribute or something that resembles an attribute is found, test whether it lies within the current tag.
This can be achieved by inserting the following line below matched_length = $&.length
break unless k+matched_length<=j
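The two changes can be sketched as a small standalone loop. This is not the parser's actual code: start_tag_attributes is a hypothetical helper, ATTR_FIND is a simplified stand-in for Attrfind, and the quote stripping and the k += matched_length step are assumptions about how the surrounding loop works. Only the two examples from this report are exercised.

```ruby
# Simplified stand-in for the parser's Attrfind pattern (an assumption).
ATTR_FIND = /\s*([a-zA-Z_][-.a-zA-Z_0-9]*)(\s*=\s*('[^']*'|"[^"]*"|[^\s>]*))?/

# Hypothetical helper: returns [name, value] pairs for the start tag whose
# attribute area begins at index k (just after the tag name).
def start_tag_attributes(rawdata, k)
  j = rawdata.index('>', k)   # end of the current tag
  return [] unless j

  attrs = []
  while k < j
    # Change 1: there is no Endtagfind break here, so the scan continues.
    break unless rawdata.index(ATTR_FIND, k)
    match = $~
    matched_length = match[0].length
    # Change 2: stop if the match would run past the end of this tag.
    break unless k + matched_length <= j

    value = match[3]
    # Strip surrounding quotes from a quoted value.
    value = value[1..-2] if value && value =~ /\A(['"]).*\1\z/m
    attrs << [match[1].downcase, value]
    k += matched_length
  end
  attrs
end

example1 = '<body id="test"><div><img src="someimage.jpg" /></div></body>'
example2 = '<body ><div>Lorem ipsum dolor sit amet ...</div></body>'

p start_tag_attributes(example1, 5)   # => [["id", "test"]]
p start_tag_attributes(example2, 5)   # => []
```

With both changes in place, the id attribute of Example 1 is found and no "div" attribute is assigned in Example 2.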
------------------------------------------
The file containing the changes described above is attached.