Maryland Web Scraping: September 2014

Is it ok to scrape data from Google results?

I'd like to fetch results from Google using curl to detect potential duplicate content. Is there a high risk of being banned by Google?

Google will eventually block your IP when you exceed a certain number of requests.

Google disallows automated access in their TOS, so if you accept their terms you would break them.

That said, I know of no lawsuit from Google against a scraper. Even Microsoft scraped Google and used the results to power its search engine Bing. They got caught red-handed in 2011 :)

There are two options to scrape Google results:

1) Use their API

You can issue around 40 requests per hour. You are limited to what they give you, so it's not really useful if you want to track ranking positions or see what a real user would see. That's something you are not allowed to gather.
If you want a higher number of API requests, you need to pay.
60 requests per hour cost 2000 USD per year; more queries require a custom deal.

2) Scrape the normal result pages

Here comes the tricky part. It is possible to scrape the normal result pages, but Google does not allow it.
If you scrape at a rate higher than 15 keyword requests per hour you risk detection; in my experience, more than 20 per hour will get you blocked.
By using multiple IPs you can increase the rate, so with 100 IP addresses you can scrape up to 2000 requests per hour (about 50k a day).
There is an open source search engine scraper written in PHP at http://scraping.compunect.com. It lets you scrape Google reliably, parses the results properly, and manages IP addresses, delays, etc. So if you can use PHP it's a nice kickstart; otherwise the code will still be useful for learning how it is done.
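
The per-IP arithmetic above can be sketched in a few lines of Python. This is only an illustration of the figures quoted in the answer: the ceiling of 20 requests per hour per IP is the author's own estimate, and the function names are made up for this example.

```python
SECONDS_PER_HOUR = 3600


def hourly_capacity(ip_count, per_ip_rate=20):
    """Total requests per hour if every IP stays at per_ip_rate."""
    return ip_count * per_ip_rate


def seconds_between_requests(per_ip_rate):
    """Average spacing between two requests sent from the same IP."""
    return SECONDS_PER_HOUR / per_ip_rate


hourly = hourly_capacity(100)   # 100 IPs at 20 requests per hour each
daily = hourly * 24             # 48000, which the answer rounds to 50k
print(hourly, daily, seconds_between_requests(20))
```

At 20 requests per hour, each IP sends one request every three minutes on average, which is why the total only grows by adding addresses rather than by sending faster.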

Source: http://stackoverflow.com/questions/22657548/is-it-ok-to-scrape-data-from-google-results

Is Web Scraping Relevant in Today's Business World?

Different techniques and processes have been developed over time to collect and analyze data. Web scraping is one that has recently reached the business market. It offers businesses vast amounts of data from different sources such as websites and databases.

It is good to clear the air and let people know that data scraping is a legal process. The main reason is that the information or data is already available on the internet. It is not a process of stealing information but rather a process of collecting reliable information. Still, many people regard the technique as unsavory behavior. Their main argument is that, over time, the practice will become so widespread that it amounts to plagiarism.

We can therefore define web scraping as the process of collecting data from a wide variety of websites and databases. It can be done either manually or with software. The rise of data mining companies has led to greater use of web extraction and web crawling. The other main function of such companies is to process and analyze the harvested data. An important aspect of these companies is that they employ experts who know which keywords are viable, what kind of information yields usable statistics, and which pages are worth the effort. The role of data mining companies is therefore not limited to mining data; they also help their clients identify relationships in the data and build models.

Common methods of web scraping include web crawling, text grepping, DOM parsing, and regular expression matching. Extraction can also be done with HTML parsers or semantic annotation. There are therefore many different ways of scraping data, but they all work towards the same goal: to retrieve and compile data contained in databases and websites. This is an essential process for a business that wants to remain relevant.
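
To make two of these methods concrete, here is a minimal sketch in Python using only the standard library. It pulls links out of a page with an HTML parser (event-driven, not a full DOM tree) and pulls prices out with regular expression matching. The sample page, the class name, and the function names are invented for this illustration.

```python
import re
from html.parser import HTMLParser

# Made-up sample page used only for this illustration.
SAMPLE_HTML = """
<html><body>
  <h1>Price list</h1>
  <a href="/widgets">Widgets</a> <span class="price">$19.99</span>
  <a href="/gadgets">Gadgets</a> <span class="price">$5.00</span>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    """Parser-based extraction: walk the tags and keep the link targets."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


def extract_prices(html):
    """Regular expression matching: find dollar amounts such as $19.99."""
    return [float(p) for p in re.findall(r"\$(\d+\.\d{2})", html)]


print(extract_links(SAMPLE_HTML))
print(extract_prices(SAMPLE_HTML))
```

The parser follows the page structure and tends to survive layout changes better, while the regular expression is shorter but depends on the exact text format.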

The main questions asked about web scraping touch on relevance. Is the process relevant in the business world? The answer is yes. The fact that large companies around the world use it and have reaped many rewards from it says it all. It is important to note that some people regard this technology as a plagiarism tool, while others consider it a useful tool that harvests the data required for business success.

Using web scraping to extract data from the internet for competition analysis is highly recommended. If you do so, be sure to look for any pattern or trend that can work in a given market.

Source: http://ezinearticles.com/?Is-Web-Scraping-Relevant-in-Todays-Business-World?&id=7091414

Scraping multiple emails from gmail

I have a gmail address with a lot of data that multiple people have sent me with similar subjects:

(for example: they all start with the string "123")

Each e-mail contains a table that looks like this

#, user, arbitrary number

    user1 5
    user2 3
    user3 4 etc.

How would I create a file that filters for all of these messages and then proceeds to take the information from the table from each of these e-mails?

Would a mail client make it easier? What sort of technology or programming language should I use?

I'm not really sure what to look into or how to start this.

2 Answers

You can use Perl modules for this task. See "How can I read messages in a Gmail account from Perl?" to learn how to read messages from a Gmail account through a POP client. Once you have read them, the messages can easily be processed with regular expressions in Perl. For example:

if ($msg_subject =~ /^123/) {
    # Add your logic for such mails
}

This is a really broad question that breaks down into three: "How to get emails from Gmail?", "How to filter the emails?" and "How to parse structured data from an email?"

How to get emails from Gmail?

You can fetch emails from Gmail using the IMAP protocol. In Python, this can be done with the imaplib module from the standard library.

Another Stack Overflow user gave a snippet that does that (it uses imaplib to fetch mails from a Gmail account): "How can I download emails from Gmail?"
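
As a rough sketch of the imaplib approach (this is not the snippet from the linked answer), the function below searches the inbox and keeps only the messages whose subject starts with a given prefix. The host name imap.gmail.com, the function names, and the choice to pass in an already logged-in connection are assumptions made for this example; Gmail may also require IMAP access to be enabled and an application-specific password.

```python
import email
import imaplib


def connect(user, password, host="imap.gmail.com"):
    """Open an SSL IMAP connection and log in (host is an assumption)."""
    conn = imaplib.IMAP4_SSL(host)
    conn.login(user, password)
    return conn


def fetch_messages(conn, subject_prefix):
    """Return parsed INBOX messages whose subject starts with subject_prefix."""
    conn.select("INBOX", readonly=True)
    # The IMAP SUBJECT criterion is a substring match, so it only narrows
    # the candidates; the exact "starts with" check is done below in Python.
    typ, data = conn.search(None, "SUBJECT", '"%s"' % subject_prefix)
    messages = []
    for num in data[0].split():
        typ, parts = conn.fetch(num, "(RFC822)")
        msg = email.message_from_bytes(parts[0][1])
        if (msg["Subject"] or "").startswith(subject_prefix):
            messages.append(msg)
    return messages


# Hypothetical usage; the credentials are placeholders:
# conn = connect("someone@gmail.com", "app-password")
# for msg in fetch_messages(conn, "123"):
#     print(msg["Subject"])
```

Taking the connection as a parameter keeps the fetch logic separate from login details, so it can be tried against a stand-in object before touching a real mailbox.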

How to filter my mails?

You can easily filter your mails (for instance those whose subject starts with '123') by doing something like the following:

emails = get_emails()
filtered_emails = [email for email in emails if email.subject.startswith('123')]

How to parse data from my mails?

You know that each line of your mail has this format: ID USERNAME SOME_NUMBER, so you only have to split each line using a space as the delimiter.

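# email_body is assumed to be the plain-text body of one message
for line in email_body.splitlines():
    row = line.split()  # split on any run of whitespace
    if len(row) != 3:
        continue  # not a table row
    row_id, username, number = row
    # Do something with that info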
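
To get from parsed rows to a file, as the question asks, one option is to collect the rows and write them out as CSV. The sketch below assumes the three-column layout described above (ID, username, number) and that the number is a non-negative integer; the function names and column headers are made up for this example.

```python
import csv
import io


def parse_table(body):
    """Return (id, username, number) tuples for lines that look like table rows."""
    rows = []
    for line in body.splitlines():
        fields = line.split()
        # Skip headers, blank lines, and any other text in the message.
        if len(fields) != 3 or not fields[2].isdigit():
            continue
        row_id, username, number = fields
        rows.append((row_id, username, int(number)))
    return rows


def write_csv(rows, out):
    """Write the rows to an open text file (or any file-like object)."""
    writer = csv.writer(out)
    writer.writerow(["id", "user", "number"])
    writer.writerows(rows)


sample_body = "#, user, arbitrary number\n1 user1 5\n2 user2 3\n\nthanks\n"
buffer = io.StringIO()
write_csv(parse_table(sample_body), buffer)
print(buffer.getvalue())
```

For a real file, open it with `open("results.csv", "w", newline="")` and pass that object to `write_csv` instead of the in-memory buffer.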

Source: http://stackoverflow.com/questions/14434170/scraping-multiple-emails-from-gmail

Effective Business Intelligence with Web Scraping Services

In today's globally competitive environment, enterprises feel the importance of business intelligence more than ever before. It is an integral part of the analytics domain and contributes significantly to improving overall efficiency and making businesses more effective. In such a dynamic and complex environment, web scraping services can make BI even more intelligent and worthwhile for enterprises to capitalize on. Valuable information – market, competition, customers, business performance – is critical for efficient business operations.

However, data collected from different sources is often erroneous, ambiguous, and inconsistent, which can affect productivity dramatically and, in turn, business intelligence. As a result, professional web scraping services have become indispensable for businesses today.

The Need for Web Scraping in Competitive Business Intelligence (BI):

Well-informed, accurate, and timely decision making is important for responding to today's ever-changing and competitive business environment. This can be better achieved by combining business intelligence with web scraping services.
Gain quick access to valuable data such as sales patterns or consumer behavior, helping the business intelligence system better understand and analyze the information at hand. This is crucial for enterprises that want better insight into improving productivity and operational efficiency, with increased revenues.
Combining BI and web scraping is an innovative and reliable solution that helps enterprises manage business performance and align it with corporate strategy.
Establish a clear and precise strategic direction with dedicated teams, where both Business Intelligence and data scraping services can be integrated to add value.
A major challenge for global enterprises is the need to transform strategies into actionable performance, which can be achieved by combining BI and data scraping services.

How it helps business novices:

Data scraping is an effective solution that makes critical business information more manageable, more accessible, and higher in quality. This helps in collecting and analyzing critical data and insights, which makes the process of Business Intelligence simple and hassle-free. Furthermore, web scraping and monitoring improve the quality of the information being collected, helping the BI system by extracting and delivering only high-quality, reliable, and accurate data from different sources. At the same time, it ensures that the right information is delivered to the right person, in the right format, and at the right time, which is critical for Business Intelligence and analytics.

When web scraping services deliver the right information to the right people with the right business analytics, Business Intelligence improves remarkably. Data extraction and scraping across different domains such as e-commerce and social media can improve BI outputs dramatically and help improve business conversions. By aggregating data from multiple sources that the enterprise BI tool isn’t already tracking, data scraping solutions make Business Intelligence considerably more valuable and deliver better results. Business Intelligence is a comprehensive field of effective decision-making that capitalizes on data scraping services as one of its most effective tools.

The effectiveness of integrating BI and web scraping shows in different applications such as industry research, market research, competition analysis, and consumer analysis. With effective web scraping services, Business Intelligence emerges as one of the most critical aspects of business decision making.

Source: http://www.hitechbposervices.com/blog/effective-business-intelligence-with-web-scraping-services/

Monday, 15 September 2014
