01-07-2014, 03:50 PM
(01-07-2014, 02:18 PM)Phaze Wrote: On the subject of retrieving data, wouldn't it be theoretically possible to make a DOM-parsing program that will auto-download the cached pages for you from Google and parse the data? I've never done this sort of thing before* but if it's really gonna take ages to do by hand, it might be worth it for me or one of the staff to look into.
*While I haven't scraped pages automatically, I once started (but never finished) a DOM-parsing program in PHP to clean up saved pages from a forum, archive them neatly, and fix the broken CSS.
And a chat with Dazz tells me that Google rejects crawling-like activity. Oh well '_;
I wrote a scraper, but the IP address I was using got banned pretty quickly. Even if I randomized the IP and the interval between requests, Google looks for patterns in the search requests and will block too many similar requests with a CAPTCHA. The funny thing is, the biggest web crawler in the world doesn't let you crawl their servers. Huh.
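
For what it's worth, the parsing half of the idea is the easy part once the pages are saved locally. Here's a rough sketch in Python using only the standard library. It assumes each post sits in a `<div class="post">`, which is a guess on my part and not the real markup of any cached page; the `PostExtractor` class, the `extract_posts` helper, and the sample HTML are all made up for illustration.

```python
from html.parser import HTMLParser


class PostExtractor(HTMLParser):
    """Collect the text of every <div class="post"> (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.posts = []    # finished post texts
        self._depth = 0    # > 0 while inside a post div; counts nested divs
        self._chunks = []  # text pieces of the post currently being read

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._depth:
            # A div nested inside a post: track it so we find the right close.
            self._depth += 1
        elif "post" in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                # Join the pieces and collapse runs of whitespace.
                self.posts.append(" ".join("".join(self._chunks).split()))

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def extract_posts(html):
    """Return the text of each post div found in an HTML string."""
    parser = PostExtractor()
    parser.feed(html)
    parser.close()
    return parser.posts


# Made-up sample standing in for a saved cached page.
SAMPLE = """
<html><body>
<div class="post">First <b>saved</b> post</div>
<div class="sidebar">ignore me</div>
<div class="post alt"><div>Nested</div> second post</div>
</body></html>
"""

if __name__ == "__main__":
    for text in extract_posts(SAMPLE):
        print(text)
```

To use it on real files, you'd read each saved page into a string and pass it to `extract_posts`, after swapping the class name for whatever the actual pages use. Getting the pages in the first place is still the hard part.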