[Mediacloth-devel] [PATCH] Handle plain HTML tags
owen at fraser-green.com
Sun May 18 13:27:50 EDT 2008
Using Mediacloth today, I found that it chokes on non-XHTML-style <br> tags in Wikipedia content. I looked into whether the Wikipedia editor or Mediacloth was at fault. The closest I came to a conclusion was <http://www.mediawiki.org/wiki/Markup_spec>, which states that "some tags (e.g. <br>) don't need to be closed", so I guess Mediacloth should handle them.
It seems a bit ugly, but I couldn't think of a better way than to hardcode into the lexer an awareness of the tags that may appear without a closing tag, using <http://en.wikipedia.org/wiki/HTML_element> as a guide. The attached patch implements this approach.
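For readers without access to the attachment, the idea of a hardcoded list can be sketched roughly as follows. This is not the patch itself and does not use Mediacloth's lexer API: the names UNCLOSED_TAGS and classify_tags, the token symbols, and the short tag list are all assumptions made up for illustration.

```ruby
require 'set'

# Hypothetical list of tags that may appear without a closing tag.
# Illustrative only; the list in the actual patch may differ.
UNCLOSED_TAGS = Set.new(%w[br hr img]).freeze

# Hypothetical helper (not part of Mediacloth): scans text for HTML tags
# and returns a [kind, tag_name] pair for each one found.
def classify_tags(text)
  text.scan(%r{<(/?)([A-Za-z][A-Za-z0-9]*)[^<>]*?(/?)>}).map do |closing, name, self_closing|
    name = name.downcase
    kind =
      if closing == '/'
        :tag_end          # e.g. </b>
      elsif self_closing == '/' || UNCLOSED_TAGS.include?(name)
        :tag_standalone   # e.g. <br />, or a plain <br> from the list
      else
        :tag_start        # e.g. <b>, which must be closed later
      end
    [kind, name]
  end
end
```

With this sketch, a plain <br> and an XHTML-style <br /> are both classified as standalone tags, so a parser consuming these tokens would not wait for a matching </br>.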
-------------- next part --------------
A non-text attachment was scrubbed...
Size: 1987 bytes
Desc: not available