From transfire at gmail.com Fri Jul 14 07:36:40 2006
From: transfire at gmail.com (TRANS)
Date: Fri, 14 Jul 2006 07:36:40 -0400
Subject: [libxml-devel] planned stuff
In-Reply-To: <20060410083646.GW431@mailhost.gigave.com>
References: <20060410083646.GW431@mailhost.gigave.com>
Message-ID: <4b6f054f0607140436yb596758q93bc0e84dc84c88f@mail.gmail.com>

On 4/10/06, Sean Chittenden wrote:
> > FYI, here's a list of things I'm planning to work on over the coming week.
> >
> > * SAX api - ability to register event callback procs
> > * Look into disabling DTD loading
> > * Make a start on libxml reader API
> > * Integrate patches from Mark Van Holstyn
> > * Update website, add benchmarks, xmlsoft.org link, etc.
> > * Other things that escape me right now
>
> I haven't tested it yet, but has anyone had any luck with fragments?
>
> In case anyone's interested, I'm working on caching XML fragments in
> memcache, then transforming the assembled XML doc into HTML via XSLT.
> Think forum software and the suck that is phpBB. -sc

How's development going?

T.

From ruby at thomaszone.com Mon Jul 17 21:34:18 2006
From: ruby at thomaszone.com (Mark Thomas)
Date: Mon, 17 Jul 2006 21:34:18 -0400
Subject: [libxml-devel] Request to add HTML parsing
Message-ID: <44BC3A9A.6070809@thomaszone.com>

I'm switching to Ruby from Perl, and currently I do all my HTML parsing in
perl's XML::LibXML. Applying XPath to parse HTML is extremely powerful and
fast, fast, fast in libxml.

Can you add that feature to the Ruby one? I think it would be easy to do;
it's just a flag on the parser, which tells libxml to create a DOM from
HTML instead of XML, and all the XML methods then magically work on the
HTML!

So it should be really low hanging fruit. Sweet, delicious fruit.

Please consider it!

Thanks,
- Mark.
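For readers unfamiliar with the workflow this request describes, here is a minimal sketch of the parse-then-XPath pattern. It is only an illustration under stated assumptions: it does not use libxml at all, but REXML from Ruby's standard library, which requires well-formed markup and will not cope with real-world tag soup. The sample page, the `texts` and `attribute_values` helpers, and the XPath expressions are all made up for the example.

```ruby
require 'rexml/document'

# Hypothetical, well-formed sample page (REXML cannot parse tag-soup HTML).
PAGE = <<~HTML
  <html>
    <head><title>Forum index</title></head>
    <body>
      <ul id="topics">
        <li><a href="/t/1">First topic</a></li>
        <li><a href="/t/2">Second topic</a></li>
      </ul>
    </body>
  </html>
HTML

# Return the text of every element matched by an XPath expression.
def texts(doc, xpath)
  REXML::XPath.match(doc, xpath).map { |node| node.text }
end

# Return the value of the named attribute on every matching element.
def attribute_values(doc, xpath, name)
  REXML::XPath.match(doc, xpath).map { |el| el.attributes[name] }
end

doc = REXML::Document.new(PAGE)
puts texts(doc, '/html/head/title')
puts attribute_values(doc, "//ul[@id='topics']/li/a", 'href')
```

With an HTML-aware parser behind the same kind of calls, the XPath strings themselves would not need to change, which is the appeal of the request.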
From transfire at gmail.com Mon Jul 17 22:20:17 2006
From: transfire at gmail.com (TRANS)
Date: Mon, 17 Jul 2006 22:20:17 -0400
Subject: [libxml-devel] Request to add HTML parsing
In-Reply-To: <44BC3A9A.6070809@thomaszone.com>
References: <44BC3A9A.6070809@thomaszone.com>
Message-ID: <4b6f054f0607171920o50c4ac0fhe0b6a0981ca0d57e@mail.gmail.com>

On 7/17/06, Mark Thomas wrote:
> I'm switching to Ruby from Perl, and currently I do all my HTML parsing in
> perl's XML::LibXML. Applying XPath to parse HTML is extremely powerful and
> fast, fast, fast in libxml.
>
> Can you add that feature to the Ruby one? I think it would be easy to do;
> it's just a flag on the parser, which tells libxml to create a DOM from
> HTML instead of XML, and all the XML methods then magically work on the
> HTML!
>
> So it should be really low hanging fruit. Sweet, delicious fruit.
>
> Please consider it!

That would be great! Except for one problem. Once again it seems all
the libxml developers have vanished. :( I'm not a C coder or I would
be working on it myself. What has happened? Why is this project so
cursed?

T.

From zdennis at mktec.com Tue Jul 18 09:16:11 2006
From: zdennis at mktec.com (zdennis)
Date: Tue, 18 Jul 2006 09:16:11 -0400
Subject: [libxml-devel] Request to add HTML parsing
In-Reply-To: <4b6f054f0607171920o50c4ac0fhe0b6a0981ca0d57e@mail.gmail.com>
References: <44BC3A9A.6070809@thomaszone.com>
    <4b6f054f0607171920o50c4ac0fhe0b6a0981ca0d57e@mail.gmail.com>
Message-ID: <44BCDF1B.5010807@mktec.com>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

TRANS wrote:
> On 7/17/06, Mark Thomas wrote:
>
>>I'm switching to Ruby from Perl, and currently I do all my HTML parsing in
>>perl's XML::LibXML. Applying XPath to parse HTML is extremely powerful and
>>fast, fast, fast in libxml.
>>
>>Can you add that feature to the Ruby one?
>>I think it would be easy to do;
>>it's just a flag on the parser, which tells libxml to create a DOM from
>>HTML instead of XML, and all the XML methods then magically work on the
>>HTML!
>>
>>So it should be really low hanging fruit. Sweet, delicious fruit.
>>
>>Please consider it!
>
>
> That would be great! Except for one problem. Once again it seems all
> the libxml developers have vanished. :( I'm not a C coder or I would
> be working on it myself. What has happened? Why is this project so
> cursed?
>

I don't think they have vanished. It's been a little over a week since Ross has sent an email to this ML. Perhaps he is on
vacation or is finishing a project outside of libxml atm.

Zach
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFEvN8aMyx0fW1d8G0RAgG/AJ9EOTA1yaBXrUMrlws9IaNnIYYZcQCeNGE/
h8Pm0J7KggB0P+bLXfhyzZw=
=qRcs
-----END PGP SIGNATURE-----

From transfire at gmail.com Tue Jul 18 11:30:35 2006
From: transfire at gmail.com (TRANS)
Date: Tue, 18 Jul 2006 11:30:35 -0400
Subject: [libxml-devel] Request to add HTML parsing
In-Reply-To: <44BCDF1B.5010807@mktec.com>
References: <44BC3A9A.6070809@thomaszone.com>
    <4b6f054f0607171920o50c4ac0fhe0b6a0981ca0d57e@mail.gmail.com>
    <44BCDF1B.5010807@mktec.com>
Message-ID: <4b6f054f0607180830h4bb8172bh3ba3c84d3fc4d28c@mail.gmail.com>

On 7/18/06, zdennis wrote:
> I don't think they have vanished. It's been a little over a week since Ross has sent an email to this ML. Perhaps he is on
> vacation or is finishing a project outside of libxml atm.

Oh! I hope it is so!

Sorry if I've jumped to conclusions. I sent an email to Ross over a
week ago and have not heard from him. I checked the list archives and
saw no activity for July. I guess I am a little on edge about it b/c I
want to use this library very badly but it is still a little too
buggy, and also I saw this project "die" once before and I did a lot
of work to get it revived.

T.
From zdennis at mktec.com Tue Jul 18 11:52:39 2006
From: zdennis at mktec.com (zdennis)
Date: Tue, 18 Jul 2006 11:52:39 -0400
Subject: [libxml-devel] Request to add HTML parsing
In-Reply-To: <4b6f054f0607180830h4bb8172bh3ba3c84d3fc4d28c@mail.gmail.com>
References: <44BC3A9A.6070809@thomaszone.com>
    <4b6f054f0607171920o50c4ac0fhe0b6a0981ca0d57e@mail.gmail.com>
    <44BCDF1B.5010807@mktec.com>
    <4b6f054f0607180830h4bb8172bh3ba3c84d3fc4d28c@mail.gmail.com>
Message-ID: <44BD03C7.8080108@mktec.com>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

TRANS wrote:
> On 7/18/06, zdennis wrote:
>
>
>>I don't think they have vanished. It's been a little over a week since Ross has sent an email to this ML. Perhaps he is on
>>vacation or is finishing a project outside of libxml atm.
>
>
> Oh! I hope it is so!
>
> Sorry if I've jumped to conclusions. I sent an email to Ross over a
> week ago and have not heard from him. I checked the list archives and
> saw no activity for July. I guess I am a little on edge about it b/c I
> want to use this library very badly but it is still a little too
> buggy, and also I saw this project "die" once before and I did a lot
> of work to get it revived.
>

It can't completely die; we use these bindings at work, and we rely on its success. Granted, we only do basic XML parsing, so
we are less inclined to jump into the other areas of development without a need.

Ross / Sean, if you guys need help with something, I'd be interested in having myself and two developers on my team at the
office do an online chat or phone conference with you (we can host). Perhaps you can explain to us how certain things
function to bring us up to speed, and then I can work it into the schedule so that we help out for a few hours every 2 weeks.
This is a possibility I'd be willing to look into. Thoughts?
Zach
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFEvQPHMyx0fW1d8G0RAuWHAJ9klVsKNDQ1hJBYF8oNtcITrMmh6ACfThds
8fAjI9oBYaMxwG6z8dHo4MY=
=3h+Q
-----END PGP SIGNATURE-----

From strass at strasslab.net Tue Jul 18 12:33:02 2006
From: strass at strasslab.net (Yann Klis)
Date: Tue, 18 Jul 2006 18:33:02 +0200
Subject: [libxml-devel] Request to add HTML parsing
In-Reply-To: <44BC3A9A.6070809@thomaszone.com>
References: <44BC3A9A.6070809@thomaszone.com>
Message-ID: <44BD0D3E.8000402@strasslab.net>

If you'd like to do HTML parsing, you'd better use RubyfulSoup
http://www.crummy.com/software/RubyfulSoup/ which is a Ruby port of
Python's BeautifulSoup, and IMHO better adapted to HTML parsing than
libxml.

++

yk

Mark Thomas wrote:
> I'm switching to Ruby from Perl, and currently I do all my HTML parsing in
> perl's XML::LibXML. Applying XPath to parse HTML is extremely powerful and
> fast, fast, fast in libxml.
>
> Can you add that feature to the Ruby one? I think it would be easy to do;
> it's just a flag on the parser, which tells libxml to create a DOM from
> HTML instead of XML, and all the XML methods then magically work on the
> HTML!
>
> So it should be really low hanging fruit. Sweet, delicious fruit.
>
> Please consider it!
>
> Thanks,
> - Mark.
>
> _______________________________________________
> libxml-devel mailing list
> libxml-devel at rubyforge.org
> http://rubyforge.org/mailman/listinfo/libxml-devel
>

From transfire at gmail.com Tue Jul 18 12:49:33 2006
From: transfire at gmail.com (TRANS)
Date: Tue, 18 Jul 2006 12:49:33 -0400
Subject: [libxml-devel] Request to add HTML parsing
In-Reply-To: <44BD0D3E.8000402@strasslab.net>
References: <44BC3A9A.6070809@thomaszone.com> <44BD0D3E.8000402@strasslab.net>
Message-ID: <4b6f054f0607180949j5654a81awf53ede4b2ac6b778@mail.gmail.com>

On 7/18/06, Yann Klis wrote:
> If you'd like to do HTML parsing, you'd better use RubyfulSoup
> http://www.crummy.com/software/RubyfulSoup/ which is a Ruby port of
> Python's BeautifulSoup, and IMHO better adapted to HTML parsing than
> libxml.

Isn't that REXML based though? Be that as it may, I think his point
was that libxml can handle HTML if asked to do so? So why not make
that an option. Is that right?

T.

From aitorgr at gmail.com Tue Jul 18 13:08:20 2006
From: aitorgr at gmail.com (Aitor Garay-Romero)
Date: Tue, 18 Jul 2006 19:08:20 +0200
Subject: [libxml-devel] Request to add HTML parsing
In-Reply-To: <11a866cb0607181001icde21afo6d32e5f31098fca9@mail.gmail.com>
References: <44BC3A9A.6070809@thomaszone.com>
    <11a866cb0607181001icde21afo6d32e5f31098fca9@mail.gmail.com>
Message-ID: <11a866cb0607181008g4e48897ei6347bb431494ce19@mail.gmail.com>

OK, the mailing list software complains that the message is too big. I'm
sending the files compressed (please read below).

On 7/18/06, Aitor Garay-Romero wrote:
>
> Hi there!,
>
> I did some work myself to allow libxml-ruby to parse HTML directly
> from a string. I was thinking of implementing some extra features and then
> sending the patches to the developers.
>
> But I'm busy with some stuff and I didn't find a moment to finish and
> send it.
>
> Anyway, find attached to this message the 3 files I modified. They are
> based on libxml-ruby-0.3.8.
> Just make a diff to the originals to see what
> changed.
>
> Hope that it's useful.
>
> /AITOR
>
>
> On 7/18/06, Mark Thomas wrote:
> >
> > I'm switching to Ruby from Perl, and currently I do all my HTML parsing in
> > perl's XML::LibXML. Applying XPath to parse HTML is extremely powerful and
> > fast, fast, fast in libxml.
> >
> > Can you add that feature to the Ruby one? I think it would be easy to do;
> > it's just a flag on the parser, which tells libxml to create a DOM from
> > HTML instead of XML, and all the XML methods then magically work on the
> > HTML!
> >
> > So it should be really low hanging fruit. Sweet, delicious fruit.
> >
> > Please consider it!
> >
> > Thanks,
> > - Mark.
> >
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://rubyforge.org/pipermail/libxml-devel/attachments/20060718/da78d8c1/attachment-0001.html
-------------- next part --------------
A non-text attachment was scrubbed...
Name: parse_html.tar.gz
Type: application/x-gzip
Size: 11038 bytes
Desc: not available
Url : http://rubyforge.org/pipermail/libxml-devel/attachments/20060718/da78d8c1/attachment-0001.gz

From ruby at thomaszone.com Tue Jul 18 15:05:46 2006
From: ruby at thomaszone.com (Mark Thomas)
Date: Tue, 18 Jul 2006 15:05:46 -0400 (EDT)
Subject: [libxml-devel] Request to add HTML parsing
In-Reply-To: <4b6f054f0607180949j5654a81awf53ede4b2ac6b778@mail.gmail.com>
References: <44BC3A9A.6070809@thomaszone.com> <44BD0D3E.8000402@strasslab.net>
    <4b6f054f0607180949j5654a81awf53ede4b2ac6b778@mail.gmail.com>
Message-ID: <2544.146.142.45.188.1153249546.squirrel@mark.thomaszone.com>

TRANS wrote:
> On 7/18/06, Yann Klis wrote:
>> If you'd like to do HTML parsing, you'd better use RubyfulSoup
>> http://www.crummy.com/software/RubyfulSoup/ which is a Ruby port of
>> Python's BeautifulSoup, and IMHO better adapted to HTML parsing than
>> libxml.
>
> Isn't that REXML based though? Be that as it may, I think his point
> was that libxml can handle HTML if asked to do so? So why not make
> that an option. Is that right?

Yes. The functionality is already in libxml; it just has to be exposed in
the Ruby bindings.

Why use XPath for HTML?

Using XPath for parsing HTML is gaining popularity in many languages.
There's good reason for it. XPath is like regular expressions for tree
structures: it's extremely powerful and, once learned, it can be used for
many other tasks (XSLT, XQuery, etc).

Another advantage is that XPath expressions are just strings. And anything
you want to extract can virtually always be expressed in just one XPath
string. This makes them externalizable: create a config file with your
XPath information. If the page you're scraping changes, simply update the
file; no need to change any code.

And finally, there are plenty of nifty helper tools. Check out the
'XPather' and 'XPath Checker' extensions to Firefox. Select any part of a
page, and Firefox will give you the XPath for it.
Manipulate the XPath to fine-tune it and see the results live. It's really cool.

So even though I *could* use a different HTML parser (like RubyfulSoup), I
can't bring myself to do it. It wouldn't be as powerful, as standardized,
or as familiar to me as XPath. I really feel that the future of tag-based
parsing is XPath, and I'd hate to see Ruby fall behind in this respect.

I realize that Tidy + REXML is a potential workaround, but it's a bit of a
hack and Tidy barfs on some bad HTML.

- Mark.

(Aitor: I'll try your stuff and let you know how it goes. Thanks!)

From rosco at roscopeco.co.uk Wed Jul 19 08:13:05 2006
From: rosco at roscopeco.co.uk (Ross Bamford)
Date: Wed, 19 Jul 2006 13:13:05 +0100
Subject: [libxml-devel] Request to add HTML parsing
In-Reply-To: <44BCDF1B.5010807@mktec.com>
References: <44BC3A9A.6070809@thomaszone.com>
    <4b6f054f0607171920o50c4ac0fhe0b6a0981ca0d57e@mail.gmail.com>
    <44BCDF1B.5010807@mktec.com>
Message-ID:

On Tue, 18 Jul 2006 14:16:11 +0100, zdennis wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> TRANS wrote:
>> ... Once again it seems all
>> the libxml developers have vanished. :( I'm not a C coder or I would
>> be working on it myself. What has happened? Why is this project so
>> cursed?
>>
>
> I don't think they have vanished. It's been a little over a week since
> Ross has sent an email to this ML. Perhaps he is on
> vacation or is finishing a project outside of libxml atm.
>

I've just been taking a bit of a break since we're having the hottest
weather on record in the UK, and I had to wait for an internet connection
at my new place anyway...
I've just today been connected up again and I'll be back into libxml-ruby
very shortly :)

--
Ross Bamford - rosco at roscopeco.co.uk

From grant at janrain.com Wed Jul 19 14:11:49 2006
From: grant at janrain.com (Grant Monroe)
Date: Wed, 19 Jul 2006 11:11:49 -0700
Subject: [libxml-devel] Default namespaces
Message-ID: <826ff8bf0607191111p117361e9v1b49a946dc9fdf69@mail.gmail.com>

Hello,

I'm trying to figure out how to set the default namespace on a node
I've just created. Setting the xmlns attribute like

node['xmlns'] = 'http://www.example.com/'

doesn't seem to affect node.ns_def, and

node.namespace = 'http://www.example.com/'

treats http as the ns prefix. Is this possible with this wrapper?

--
Grant Monroe
JanRain, Inc.

From neelu_dhiman at persistent.co.in Thu Jul 27 10:57:11 2006
From: neelu_dhiman at persistent.co.in (neelu_dhiman at persistent.co.in)
Date: Thu, 27 Jul 2006 20:27:11 +0530 (IST)
Subject: [libxml-devel] How to retrieve attributes and their values
Message-ID: <20060727202711.AMO63288@mail6.persistent.co.in>

Hi,

I am trying some sample code from http://xmlsoft.org/examples/reader1.c
to parse an XML document and retrieve its elements and corresponding
values. I want to retrieve the attributes (with values) of the elements,
wherever they are, along with the elements. Can you please tell me which
function I can use for this, or whether there is some sample code for it?

Thanks,
Neelu.
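On the attributes question above: reader1.c is C code against libxml2's xmlTextReader interface, where functions such as xmlTextReaderMoveToNextAttribute, xmlTextReaderConstName and xmlTextReaderConstValue exist for stepping through the attributes of the current element. The sketch below is not that API. It shows the same idea (visit every element and collect its attributes with their values) in Ruby, using REXML from the standard library as a stand-in. The sample document and the `elements_with_attributes` helper are hypothetical.

```ruby
require 'rexml/document'

# Hypothetical sample document; element and attribute names are made up.
SAMPLE = <<~XML
  <library name="main">
    <book id="b1" lang="en"><title>Ruby</title></book>
    <book id="b2"/>
  </library>
XML

# Walk the tree depth-first and collect, for each element, its name
# together with a hash of its attributes (attribute name => value).
def elements_with_attributes(element, out = [])
  attrs = {}
  element.attributes.each { |name, value| attrs[name] = value }
  out << [element.name, attrs]
  element.each_element { |child| elements_with_attributes(child, out) }
  out
end

doc = REXML::Document.new(SAMPLE)
elements_with_attributes(doc.root).each do |name, attrs|
  puts "#{name}: #{attrs.inspect}"
end
```

Elements without attributes simply come back with an empty hash, so the caller does not need a separate "has attributes" check.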