I'd like to fetch results from Google using curl to detect potential
duplicate content. Is there a high risk of being banned by Google?
Google will eventually block your IP when you exceed a certain amount of requests.
Google disallows automated access in their TOS, so if you accept their terms you would break them.
That said, I know of no lawsuit from Google against a scraper. Even Microsoft scraped Google, they powered their search engine Bing with it. They got caught in 2011 red handed :)
There are two options to scrape Google results:
1) Use their API
Google will eventually block your IP when you exceed a certain amount of requests.
Google disallows automated access in their TOS, so if you accept their terms you would break them.
That said, I know of no lawsuit from Google against a scraper. Even Microsoft scraped Google, they powered their search engine Bing with it. They got caught in 2011 red handed :)
There are two options to scrape Google results:
1) Use their API
- You can issue around 40 requests per hour You are limited to what they give you, it's not really useful if you want to track ranking positions or what a real user would see. That's something you are not allowed to gather.
- If you want a higher amount of API requests you need to pay.
- 60 requests per hour cost 2000 USD per year, more queries require a custom deal.
- Here comes the tricky part. It is possible to scrape the normal result pages. Google does not allow it.
- If you scrape at a rate higher than 15 keyword requests per hour you risk detection, higher than 20/h will get you blocked from my experience.
- By using multiple IPs you can up the rate, so with 100 IP addresses you can scrape up to 2000 requests per hour. (50k a day)
- There is an open source search engine scraper written in PHP at http://scraping.compunect.com It allows to reliable scrape Google, parses the results properly and manages IP addresses, delays, etc. So if you can use PHP it's a nice kickstart, otherwise the code will still be useful to learn how it is done.