Release Name: scrubyt-0.2.6

Notes:
The mission of this release was to add even more powerful features,
like crawling to detail pages or compound example specification,
as well as to fix the most frequently reported bugs. Scraping
concrete sites is more and more often the reason for new
features and bugfixes, which in my opinion means that the
framework is beginning to make sense: from a shiny toy which
looks cool and everybody wants to play with, it is moving
towards a tool which you reach for if you seriously want
to scrape a site.

The new stuff in this release is 99% scraping related. If
you are looking for new features in the navigation part,
the next version will probably be for you: there I will
concentrate more on adding new widgets and possibilities
to the navigation process. FireWatir integration is very
close, too; perhaps the next release will already
support FireWatir navigation!


Changes:
* [NEW] Automatic crawling to, and extraction from, detail pages
* [NEW] Compound example specification: so far the example of a pattern had to be a string. Now it can be a hash as well, like {:contains => /\d\d-\d/, :begins_with => 'Telephone'}
* [NEW] More sophisticated example specification: a regexp can be used as well, and it is no longer necessary (though still possible, of course) to specify the whole content of the node; nodes that contain the string or match the regexp are returned, too
* [NEW] Possibility to force writing text in case of non-leaf nodes
* [NEW] Crawling to the next page is now possible via image links as well
* [NEW] Possibility to define examples for any pattern (before, it did not make sense for ancestors)
* [NEW] Implementation of crawling to the next page with different methods
* [NEW] Heuristics: if something ends with _url, it is a shortcut for: some_url 'href', :type => :attribute
* [FIX] Crawling to the next page (the broken Google example): if the next link text is not an <a>, traverse down until the <a> is found; if it is still not found, traverse up until it is found
* [FIX] Crawling to next pages does not break if the next link is greyed out (or otherwise present but has no href attribute) (Credit: sorry, I could not find it in the comments :( )
* [FIX] DRY-ed next link lookup: it should be much more robust now, as it uses the 'standard' example lookup
* [NEW] Correct exporting of detail page extractors
* [NEW] Added more powerful XPath regexp (Credit: Karol Hosiawa)
* [NEW] New examples for the new features
* [FIX] Tons of bugfixes, new blackbox and unit tests, refactoring and stabilization
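
To make the compound example specification from the change list more concrete, here is a minimal plain-Ruby sketch of how such an example could be checked against the text of a node. It is not scRUBYt!'s actual implementation: the helper name example_matches? and the sample node texts are made up for illustration, and only the two hash keys shown in the notes (:contains and :begins_with) are handled.

```ruby
# Hypothetical helper, NOT part of scRUBYt!: decides whether a node's text
# satisfies an example given as a String, a Regexp, or a Hash of constraints.
def example_matches?(text, example)
  case example
  when String
    # A plain string example matches any node text that contains it.
    text.include?(example)
  when Regexp
    !(text =~ example).nil?
  when Hash
    # A compound example matches only if every constraint holds.
    example.all? do |constraint, value|
      case constraint
      when :contains
        example_matches?(text, value)
      when :begins_with
        text.start_with?(value)
      else
        raise ArgumentError, "unknown constraint: #{constraint.inspect}"
      end
    end
  else
    raise ArgumentError, "unsupported example: #{example.inspect}"
  end
end

# The compound example quoted in the release notes.
compound = { :contains => /\d\d-\d/, :begins_with => 'Telephone' }

# Made-up node texts, for illustration only.
node_texts = ['Telephone: 555 12-3456', 'Fax: 555 12-3456', 'Telephone: n/a']

matching_nodes = node_texts.select { |text| example_matches?(text, compound) }
```

In this sketch only the first text is selected: the second does not begin with 'Telephone', and the third contains nothing that matches the regexp.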