Monday, 15 September 2014

Is it ok to scrape data from Google results?

I'd like to fetch results from Google using curl to detect potential duplicate content. Is there a high risk of being banned by Google?

Google will eventually block your IP when you exceed a certain amount of requests.

Google disallows automated access in their TOS, so if you accept their terms you would break them.

That said, I know of no lawsuit from Google against a scraper. Even Microsoft scraped Google, they powered their search engine Bing with it. They got caught in 2011 red handed :)

There are two options to scrape Google results:

1) Use their API
  •     You can issue around 40 requests per hour You are limited to what they give you, it's not really useful if you want to track ranking positions or what a real user would see. That's something you are not allowed to gather.
  •     If you want a higher amount of API requests you need to pay.
  •     60 requests per hour cost 2000 USD per year, more queries require a custom deal.
2) Scrape the normal result pages
  •     Here comes the tricky part. It is possible to scrape the normal result pages. Google does not allow it.
  •     If you scrape at a rate higher than 15 keyword requests per hour you risk detection, higher than 20/h will get you blocked from my experience.
  •     By using multiple IPs you can up the rate, so with 100 IP addresses you can scrape up to 2000 requests per hour. (50k a day)
  •     There is an open source search engine scraper written in PHP at http://scraping.compunect.com It allows to reliable scrape Google, parses the results properly and manages IP addresses, delays, etc. So if you can use PHP it's a nice kickstart, otherwise the code will still be useful to learn how it is done.
Source: http://stackoverflow.com/questions/22657548/is-it-ok-to-scrape-data-from-google-results

No comments:

Post a Comment