Notes:
The mission of this release was to add even more powerful features,
like crawling to detail pages or compound example specification,
as well as fixing the most frequently popping-up bugs. Scraping
of concrete sites is more and more frequently the cause for new
features and bugfixes, which in my opinion means that the
framework is beginning to make sense: from a shiny toy which
looks cool and everybody wants to play with, it is moving
towards a tool which you reach after if you seriously want
to scrape a site.
The new stuff in this release is 99% scraping related - if
you are looking for new features in the navigation part,
probably the next version will be for you, where I will
concentrate more on adding new widgets and possibilities
to the navigation process. Firewatir integration is very
close, too - perhaps already the next release will
support FireWatir navigation!
Changes:
* [NEW] Automatically crawling to and extracting from detail pages
* [NEW] Compound example specification: So far the example of a pattern had to be a string.
Now it can be a hash as well, like {:contains => /\d\d-\d/, :begins_with => 'Telephone'}
* [NEW] More sophisticated example specification: Possible to use regexp as well, and need not
(but still possible of course) to specify the whole content of the node - nodes that
contain the string/match the regexp will be returned, too
* [NEW] Possibility to force writing text in case of non-leaf nodes
* [NEW] Crawling to the next page now possible via image links as well
* [NEW] Possibility to define examples for any pattern (before it did not make sense for ancestors)
* [NEW] Implementation of crawling to the next page with different methods
* [NEW] Heuristics: if something ends with _url, it is a shortcut for:
some_url 'href', :type => :attribute
* [FIX] Crawling to the next page (the broken google example): if the next
link text is not an <a>, traverse down until the <a> is found; if it is
still not found, traverse up until it is found
* [FIX] Crawling to next pages does not break if the next link is greyed out
(or otherwise present but has no href attribute (Credit: sorry, I could not find in the comments :(
* [FIX] DRY-ed next link lookup - it should be much more robust now as it is uses the 'standard' example lookup
* [NEW] Correct exporting of detail page extractors
* [NEW] Added more powerful XPath regexp (Credit: Karol Hosiawa)
* [NEW] New examples for the new featutres
* [FIX] Tons of bugfixes, new blackbox and unit tests, refactoring and stabilization
|