rexml/httpwrite, why must you mock me?
Daniel Sheppard
daniels at pronto.com.au
Tue Sep 27 01:45:04 EDT 2005
I got struck by a little inspiration last night, and thought "Hey,
mousehole, how would you like to have some fun manipulating rss feeds?".
MouseHole didn't seem too impressed by the idea.
Firstly, after littering my code with debug messages, I found that
MouseHole only wanted to deal with the text/html and
application/xhtml+xml MIME types. Fair enough - if it wants to parse
everything as HTML, it should probably obey that restriction. So I
fiddled with the checking:
    unless script.match req.request_uri
      logger.info "Skipping #{script.name} - no match"
      next
    end
    unless script.document_converter.handles_type?(res.content_type)
      logger.info "Skipping #{script.name} - wrong content type"
      next
    end
    logger.info "Executing #{script.name}"
    script.execute( req, res )
And I added in this bit of code (which I think is a pretty messy way of
going about this, but it works in a pinch):
    class HtmlDocumentConverter
      def parse_string(body)
        parse_xhtml(HTree.parse(body))
      end
      def output_string(document, stream)
        document.write(stream)
      end
      # Return the first <html> element of the HTree document as REXML.
      def parse_xhtml( htree )
        htree.each_child do |child|
          if child.respond_to? :qualified_name
            if child.qualified_name == 'html'
              return HTree::Doc.new( child ).to_rexml
            end
          end
        end
      end
      def handles_type?(type)
        [
          /^text\/html/,
          /^application\/xhtml\+xml/   # "+" escaped so it matches literally
        ].any? {|x| x === type }
      end
    end
    class XmlDocumentConverter
      def parse_string(body)
        REXML::Document.new(body)
      end
      def output_string(document, stream)
        document.write(stream)
      end
      def handles_type?(type)
        p type
        /^text\/xml/ === type
      end
    end
    class UserScript
      attr_accessor :document, :matches, :db, :request, :response,
        :mtime, :active, :install_url, :document_converter
      # Setter when called with an argument, getter (defaulting to HTML) without.
      def document_converter s = nil
        s ? @document_converter = s : (@document_converter || HtmlDocumentConverter.new)
      end
      def name s = nil; s ? @name = s : @name; end
      ......
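An aside on the content type patterns: in a Ruby regex an unescaped "+" is a quantifier, so a pattern for application/xhtml+xml only matches if the plus sign is escaped. A minimal sketch of the difference, with the patterns pulled out into constants and a standalone handles_type? helper that are made up for illustration (they are not MouseHole names):

```ruby
# Illustrative constants, not part of MouseHole.
UNESCAPED_PLUS = /^application\/xhtml+xml/   # "+" here means "one or more l"
ESCAPED_PLUS   = /^application\/xhtml\+xml/  # "\+" matches a literal plus sign

XHTML_TYPE = "application/xhtml+xml; charset=utf-8"

# Hypothetical standalone version of the converter check: true if any
# pattern matches the given content type.
def handles_type?(patterns, type)
  patterns.any? { |pattern| pattern === type }
end

puts handles_type?([/^text\/html/, UNESCAPED_PLUS], XHTML_TYPE)  # prints false
puts handles_type?([/^text\/html/, ESCAPED_PLUS], XHTML_TYPE)    # prints true
```

With the unescaped form, XHTML responses would be skipped as "wrong content type" even though the converter is meant to handle them.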
Which allowed me to write this rule for rewriting the slashdot RSS feed:
    MouseHole.script do
      name "Slashdot Fullfeed RSS"
      namespace "http://members.iinet.net.au/~soxbox/"
      description "Converts the slashdot RSS feed to a full-content feed"
      include_match "http://rss.slashdot.org/Slashdot/slashdot"
      document_converter XmlDocumentConverter.new
      version "0.1"
      rewrite do |req,res|
        p "rewriting"
        document.each_element('//item/') do |e|
          e.each_element('description') {|x| x.remove}
          doc = read_xhtml_from(e.attributes['rdf:about'] + "&mode=nocomment")
          desc = REXML::Element.new('description')
          doc.each_element('//div[@class="intro"]') do |x|
            s = ""
            x.write(s)
            s.gsub!("—","—")
            desc.text = s
          end
          e << desc
        end
      end
    end
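The core of that rewrite block (drop each item's description, attach a new one) can be tried without MouseHole or a network connection. This is only a sketch: the inline feed and the replacement text are invented, and the fetched article body is replaced by a fixed string.

```ruby
require 'rexml/document'

# Invented sample feed, standing in for the real Slashdot RSS.
feed = <<~XML
  <rss>
    <channel>
      <item><title>First</title><description>teaser one</description></item>
      <item><title>Second</title><description>teaser two</description></item>
    </channel>
  </rss>
XML

document = REXML::Document.new(feed)

document.each_element('//item') do |item|
  # Remove the short description that came with the feed.
  item.each_element('description') { |old| old.remove }

  # Attach a replacement; the real script would fill this from the article page.
  desc = REXML::Element.new('description')
  desc.text = "full text for #{item.elements['title'].text}"
  item << desc
end

out = ""
document.write(out)
puts out
```

Each item ends up with exactly one description element holding the new text.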
Then I noticed something strange - my rewritten feed had no content in
the <link/> elements. Apparently, the reason for this is the
rexml/httpwrite.rb code - which seems to be designed to ensure that HTML
elements that aren't meant to have content don't end up having any
content... Why does this code exist? Shouldn't it be the task of the
tree builder to put the right things in the right nodes? Otherwise,
wouldn't it be better to have something walking the tree and trimming
the bad nodes before it gets output?
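To make the "walk the tree and trim" idea concrete, here is a rough sketch using plain REXML. The list of empty elements and the name trim_empty_elements! are my own assumptions for illustration, not anything taken from MouseHole or REXML itself, and a pass like this would only be run on documents known to be HTML, which would leave the RSS link elements alone.

```ruby
require 'rexml/document'

# Assumed list of HTML elements that are defined to have no content.
EMPTY_HTML_ELEMENTS = %w[area base br col hr img input link meta param].freeze

# Hypothetical helper: recursively strip child nodes from elements that
# HTML defines as empty, leaving their attributes untouched.
def trim_empty_elements!(node)
  node.each_element do |el|
    if EMPTY_HTML_ELEMENTS.include?(el.name)
      # Copy the child list first, since removing nodes mutates it.
      el.children.dup.each { |child| child.remove }
    else
      trim_empty_elements!(el)
    end
  end
  node
end

html = REXML::Document.new(
  '<html><head><link rel="alternate">stray</link></head>' \
  '<body><p>keep<br>bad</br></p></body></html>'
)
trim_empty_elements!(html)

out = ""
html.write(out)
puts out
```

The writer could then emit whatever is in the tree without having to second-guess the content type.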