<div dir="ltr">Charlie,<div><br></div><div>I am running on OSX and RedHat. I am using the Node#find method with an XPath expression for the currently desired node in the default namespace of the document. The crashes stopped happening when I set my nodes variable to nil before calling GC.start. The memory does not spike too much if I call GC.start after every single Node#find, but since parsing a single document into the required number of Ruby objects necessitates calling Node#find over a thousand times, GC.start is really slowing things down.</div>
<div><br></div><div>From what I can tell, calling Node#find on such a large document is causing Ruby to add extra object heaps, which increases my memory usage in a way that the program does not recover from. This is unfortunate since I want to run multiple processes per box, but each process is using several hundred megabytes of RAM after parsing a few large documents.</div>
<div><br></div><div>The SAX parser with empty callbacks can rip through the document in about 17ms, which is very fast in my opinion. The speed problem arises when I try to do anything in the callbacks. The nature of the program and the structure of the XML require me to do quite a few lookups in a series of hashes to determine the type of the current node and the type of each text element. When SAX parsing I have to hit the hashes more often, since I don't have as much context information available as I do with a recursive depth-first document walk over the document parser's node objects. With the necessary code in the callbacks I was seeing parse times around 400ms, which is about twice as slow as the document-based approach.</div>
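<div><br></div><div>To make the lookup cost concrete, here is a minimal sketch (an illustration only, not the parser discussed here). It shows a handler that keeps a stack of resolved element types, so that text callbacks reuse the enclosing element's type rather than repeating the hash lookup. The method names mirror libxml-ruby's SAX callbacks (on_start_element, on_characters, on_end_element); ELEMENT_TYPES, TypedHandler, EVENTS and replay are hypothetical names, and the events are replayed by hand so the sketch runs without the library.</div>

```ruby
# Hypothetical lookup table (element name => type), standing in for the
# "series of hashes" described above.
ELEMENT_TYPES = {
  "order" => :record,
  "item"  => :record,
  "name"  => :string,
  "price" => :decimal
}.freeze

# Hypothetical handler. Its method names mirror libxml-ruby's SAX callbacks.
# It keeps its own stack so each text callback can reuse the type that was
# resolved when the enclosing element was opened.
class TypedHandler
  attr_reader :values, :lookups

  def initialize
    @stack = []    # types of the currently open elements
    @values = []   # [type, text] pairs collected from text nodes
    @lookups = 0   # number of hash lookups performed
  end

  def on_start_element(name, attributes)
    @lookups += 1
    @stack.push(ELEMENT_TYPES.fetch(name, :unknown))
  end

  def on_characters(chars)
    text = chars.strip
    return if text.empty?
    # No lookup here: the type comes from the top of the stack.
    @values.push([@stack.last, text])
  end

  def on_end_element(name)
    @stack.pop
  end
end

# Events a SAX parser would emit for an order containing one item
# that has a name and a price.
EVENTS = [
  [:on_start_element, "order", {}],
  [:on_start_element, "item", {}],
  [:on_start_element, "name", {}],
  [:on_characters, "Widget"],
  [:on_end_element, "name"],
  [:on_start_element, "price", {}],
  [:on_characters, "9.99"],
  [:on_end_element, "price"],
  [:on_end_element, "item"],
  [:on_end_element, "order"]
].freeze

# Replays the recorded events against a handler, as a parser would.
def replay(events, handler)
  events.each { |callback, *args| handler.public_send(callback, *args) }
  handler
end
```

<div>Calling replay(EVENTS, TypedHandler.new) leaves the handler with one lookup per opened element (four here) and none per text node. Whether that saves anything in a real parser depends on how much context the type lookup actually needs.</div>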
<div><br></div><div>XMLReader looks very interesting from the API docs, but I am not sure that I grok how to actually use it. I will keep searching for resources, but if you know of any usage examples out there I would love to read some code.</div>
<div><br></div><div>Thank you,</div><div>Matt Margolis</div><div><br><br><div class="gmail_quote">2008/8/16 Charlie Savage <span dir="ltr">&lt;<a href="mailto:cfis@savagexi.com">cfis@savagexi.com</a>&gt;</span><br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
Hi Matt,<div class="Ih2E3d"><br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
I am making the parsed Ruby objects available to a Rails application, and I find that if I call GC.start when using the library with Rails, it takes several seconds to garbage collect and sometimes crashes. If I call GC.start in the loop when the program is running as a standalone process, then GC.start returns in a few dozen milliseconds.<br>
</blockquote>
<br></div>
What platform are you using? Can you run a debug version and get a stack trace so we can see what is going on? Are you using XPath? If so, make sure to free pointers to your XPath result objects and call GC.start before the associated documents get freed (see the rdocs for more info, document#find I think it is).<div class="Ih2E3d">
<br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
I wrote a SAX-style parser using libxml-ruby that does not suffer from the memory growth, but it is about 30 times slower than the document-based parser, so I am really trying to make the document-based approach work.<br>
</blockquote>
<br></div>
Why do you suppose SAX is so much slower? It should be a lot faster since it doesn't build an in-memory tree.<br>
<br>
Any chance the XMLReader would work for you?<br><font color="#888888">
<br>
Charlie<br>
</font><br>_______________________________________________<br>
libxml-devel mailing list<br>
<a href="mailto:libxml-devel@rubyforge.org">libxml-devel@rubyforge.org</a><br>
<a href="http://rubyforge.org/mailman/listinfo/libxml-devel" target="_blank">http://rubyforge.org/mailman/listinfo/libxml-devel</a><br></blockquote></div><br></div></div>