Saturday, 11 July 2015

Web Scraping : To Scrape or Not to Scrape?

Web scraping is on the rise and its legality is being debated. The future of big data could hang in the balance.

I decided that I might want to stop writing about all these successful dot-com businesses and get into the act myself. I mean, how hard could it be, aside from the fact that I have no expertise in any particular vertical, no technological knowledge, and no money? That last was a bit of a problem because I was going to need big data, and big data doesn't come cheap.

So, one day I'm talking to a direct-tech wizard and he says, “Why don't you just find a business you want to be in, find the most successful company, go to their website, and scrape some of their data?”

Scrape? My dad was a house painter. I used to help him during summers. The only scraping I knew was done with a putty knife. But that's what God invented the Internet for. Google turned up endless Web scraping services and I went to one called Automation Anywhere.

Its homepage told me not to try scraping on my own, that I could pay them as little as $1,995 for a program that would have me scraping away in minutes with no programming expertise. A video showed me how. Suppose I wanted to locate all the assisted living facilities in Detroit? (Bedpans! There's a wide-open Web business!) Automation Anywhere showed me how I could request a data pattern—name, address, phone, and service area of targets—apply their program to a rehab facility listing, and minutes later be in possession of tidy customer list on a spreadsheet. A box popped up asking if I had any questions for a live account manager. I did.

“I'm interested in this, but is it totally legal?” I asked Shine, the account rep.

Shine was slow to respond and came back disappointingly noncommittal: “You need to install the software at your end. Hence, you will need to check at your end for legal documents for the website.”

Shine had obviously been reading the same European news sites that I had. Last year Irish airline Ryanair filed suit against PR Aviation, a Dutch airfare comparison site, charging it with copyright infringement and breach of contract for scraping flight data from its site. The Court of Justice of the European Union (ECJ) dismissed the suit, saying the scraping amounted to “normal use” of a website. However, the ECJ did potentially leave the door open for businesses with unprotected databases, such as Ryanair, to establish contractual limitations on use of their databases by third parties. That opening, should it be entered into by airlines, might have businesses such as Expedia, Orbitz, and Priceline reimagining their business plans.

And it could have any other businesses that load up third-party data files through scraping activities doing the same. The reality, however, is that that's a big “and.”

“The airlines have been halfway successful taking travel agents to court, but it can take five years and then they lose,” said Gus Cunningham, CEO of ScrapeSentry, one of only “three and a half” companies, in his words, that block scrapers from websites.

Professional scrapers are not only out-front and plentiful—as the Google search demonstrated—they're also nimble and expensive to chase away. Basically, Cunningham said, it's a matter of stopping scrapers at the website door among the airlines, e-coms, real estate sites, and online gambling companies that are scraped the most. Cunningham's company monitors inbound Web traffic and uses an analysis engine to block suspect visitors per parameters set down by clients. ScanSentry has a nine-year-old database that helps it identify bad actors, much like Interpol with its criminal database. Then a human element must enter the process.

Some of Cunningham's clients feed bogus information to competitors identified as scrapers to ferret them out. In many cases, though, they opt to turn their heads. “Airlines have some flights where they just want to get as many butts as they can in the seats, so they won't concentrate their blocking efforts on those. They'll concentrate on the routes that are always jammed,” Cunningham said.

Unlike botnets that steal money by, say, serving bogus websites to siphon off programmatic ad dollars, scraping is not overtly criminal. In the wide sphere of digital commerce, it's probably most common that the scrap-ee is also a scrap-er. How vigilant vulnerable industries become, and how protective courts and law enforcement agencies grow, will depend on how much scraping activity increases. Cunningham said it's growing fast. More than one fifth of visitors to client websites last year were scrapers, according to a ScanSentry study. Among travel companies, meanwhile, scrapers doubled from 15% in 2013 to 33% last year.

“And,” Cunningham noted, “It is stealing.”

Source: http://www.dmnews.com/direct-line-blog/to-scrape-or-not-to-scrape/article/422662/

Friday, 26 June 2015

Data Scraping - Enjoy the Appeal of the Hand Scraped Flooring

Hand scraped flooring is appreciated for the character it brings into the home. This style of flooring relies on hand scraped planks of wood and not the precise milled boards. The irregularities in the planks provide a certain degree of charm and help to create a more unique feature in the home.

Distressed vs. Hand scraped

There are two types of flooring in the market that have an aged and unique charm with a non perfect finish. However, there is a significant difference in the process used to manufacture the planks. The more standard distresses flooring is cut on a factory production line. The grooves, scratches, dents, or other irregularities in these planks are part of the manufacturing process and achieved by rolling or pressed the wood onto a patterned surface.

The real hand scraped planks are made by craftsmen and they work on each plant individually. By using this working technique, there is complete certainty that each plank will be unique in appearance.

Scraping the planks

The hand scraping process on the highest-quality planks is completed by the trained carpenter or craftsmen who will produce a high-quality end product and take great care in their workmanship. It can benefit to ask the supplier of the flooring to see who completes the work.

Beside the well scraped lumber, there are also those planks that have been bought from the less than desirable sources. This is caused by the increased demand for this type of flooring. At the lower end of the market the unskilled workers are used and the end results aren't so impressive.

The high-quality plank has the distinctive look that feels and functions perfectly well as solid flooring, while the low-quality work can appear quite ugly and cheap.

Even though it might cost a little bit more, it benefits to source the hardwood floor dealers that rely on the skilled workers to complete the scraping process.

Buying the right lumber

Once a genuine supplier is found, it is necessary to determine the finer aspects of the wooden flooring. This hand scraped flooring is available in several hardwoods, such as oak, cherry, hickory, and walnut. Plus, it comes in many different sizes and widths. A further aspect relates to the finish with darker colored woods more effective at highlighting the character of the scraped boards. This makes the shadows and lines appear more prominent once the planks have been installed at home.

Why not visit Bellacerafloors.com for the latest collection of luxury floor materials, including the Handscraped Hardwood Flooring.

Source: http://ezinearticles.com/?Enjoy-the-Appeal-of-the-Hand-Scraped-Flooring&id=8995784

Saturday, 20 June 2015

Rvest: easy web scraping with R

Rvest is new package that makes it easy to scrape (or harvest) data from html web pages, by libraries like beautiful soup. It is designed to work with magrittr so that you can express complex operations as elegant pipelines composed of simple, easily understood pieces. Install it with:

install.packages("rvest")

rvest in action

To see rvest in action, imagine we’d like to scrape some information about The Lego Movie from IMDB. We start by downloading and parsing the file with html():

library(rvest)

lego_movie <- html("http://www.imdb.com/title/tt1490017/")

To extract the rating, we start with selectorgadget to figure out which css selector matches the data we want: strong span. (If you haven’t heard of selectorgadget, make sure to read vignette("selectorgadget") – it’s the easiest way to determine which selector extracts the data that you’re interested in.) We use html_node() to find the first node that matches that selector, extract its contents with html_text(), and convert it to numeric with as.numeric():

lego_movie %>%

  html_node("strong span") %>%
  html_text() %>%
  as.numeric()

#> [1] 7.9

We use a similar process to extract the cast, using html_nodes() to find all nodes that match the selector:

lego_movie %>%

  html_nodes("#titleCast .itemprop span") %>%
  html_text()

#>  [1] "Will Arnett"     "Elizabeth Banks" "Craig Berry"   

#>  [4] "Alison Brie"     "David Burrows"   "Anthony Daniels"

#>  [7] "Charlie Day"     "Amanda Farinos"  "Keith Ferguson"

#> [10] "Will Ferrell"    "Will Forte"      "Dave Franco"   

#> [13] "Morgan Freeman"  "Todd Hansen"     "Jonah Hill"

The titles and authors of recent message board postings are stored in a the third table on the page. We can use html_node() and [[ to find it, then coerce it to a data frame with html_table():

lego_movie %>%

  html_nodes("table") %>%
  .[[3]] %>%
  html_table()

#>                                              X 1            NA

#> 1 this movie is very very deep and philosophical   mrdoctor524

#> 2 This got an 8.0 and Wizard of Oz got an 8.1...  marr-justinm

#> 3                         Discouraging Building?       Laestig

#> 4                              LEGO - the plural      neil-476

#> 5                                 Academy Awards   browncoatjw

#> 6                    what was the funniest part? actionjacksin

Other important functions

    If you prefer, you can use xpath selectors instead of css: html_nodes(doc, xpath = "//table//td")).

    Extract the tag names with html_tag(), text with html_text(), a single attribute with html_attr() or all attributes with html_attrs().

    Detect and repair text encoding problems with guess_encoding() and repair_encoding().
    Navigate around a website as if you’re in a browser with html_session(), jump_to(), follow_link(), back(), and forward(). Extract, modify and submit forms with html_form(), set_values() and submit_form(). (This is still a work in progress, so I’d love your feedback.)

To see these functions in action, check out package demos with demo(package = "rvest").

Source: http://www.r-bloggers.com/rvest-easy-web-scraping-with-r/

Tuesday, 9 June 2015

Web Scraping Services : Data Discovery vs. Data Extraction

Looking at screen-scraping at a simplified level, there are two primary stages involved: data discovery and data extraction. Data discovery deals with navigating a web site to arrive at the pages containing the data you want, and data extraction deals with actually pulling that data off of those pages. Generally when people think of screen-scraping they focus on the data extraction portion of the process, but my experience has been that data discovery is often the more difficult of the two.

The data discovery step in screen-scraping might be as simple as requesting a single URL. For example, you might just need to go to the home page of a site and extract out the latest news headlines. On the other side of the spectrum, data discovery may involve logging in to a web site, traversing a series of pages in order to get needed cookies, submitting a POST request on a search form, traversing through search results pages, and finally following all of the "details" links within the search results pages to get to the data you're actually after. In cases of the former a simple Perl script would often work just fine. For anything much more complex than that, though, a commercial screen-scraping tool can be an incredible time-saver. Especially for sites that require logging in, writing code to handle screen-scraping can be a nightmare when it comes to dealing with cookies and such.

In the data extraction phase you've already arrived at the page containing the data you're interested in, and you now need to pull it out of the HTML. Traditionally this has typically involved creating a series of regular expressions that match the pieces of the page you want (e.g., URL's and link titles). Regular expressions can be a bit complex to deal with, so most screen-scraping applications will hide these details from you, even though they may use regular expressions behind the scenes.

As an addendum, I should probably mention a third phase that is often ignored, and that is, what do you do with the data once you've extracted it? Common examples include writing the data to a CSV or XML file, or saving it to a database. In the case of a live web site you might even scrape the information and display it in the user's web browser in real-time. When shopping around for a screen-scraping tool you should make sure that it gives you the flexibility you need to work with the data once it's been extracted.

Source: http://ezinearticles.com/?Data-Discovery-vs.-Data-Extraction&id=165396

Wednesday, 3 June 2015

Getting Data from the Web Scraping

You’ve tried everything else, and you haven’t managed to get your hands on the data you want. You’ve found the data on the web, but, alas — no download options are available and copy-paste has failed you. Fear not, there may still be a way to get the data out. For example you can:

•    Get data from web-based APIs, such as interfaces provided by online databases and many modern web applications (including Twitter, Facebook and many others). This is a fantastic way to access government or commercial data, as well as data from social media sites.

•    Extract data from PDFs. This is very difficult, as PDF is a language for printers and does not retain much information on the structure of the data that is displayed within a document. Extracting information from PDFs is beyond the scope of this book, but there are some tools and tutorials that may help you do it.

•    Screen scrape web sites. During screen scraping, you’re extracting structured content from a normal web page with the help of a scraping utility or by writing a small piece of code. While this method is very powerful and can be used in many places, it requires a bit of understanding about how the web works.

With all those great technical options, don’t forget the simple options: often it is worth to spend some time searching for a file with machine-readable data or to call the institution which is holding the data you want.

In this chapter we walk through a very basic example of scraping data from an HTML web page.

What is machine-readable data?

The goal for most of these methods is to get access to machine-readable data. Machine readable data is created for processing by a computer, instead of the presentation to a human user. The structure of such data relates to contained information, and not the way it is displayed eventually. Examples of easily machine-readable formats include CSV, XML, JSON and Excel files, while formats like Word documents, HTML pages and PDF files are more concerned with the visual layout of the information. PDF for example is a language which talks directly to your printer, it’s concerned with position of lines and dots on a page, rather than distinguishable characters.

Scraping web sites: what for?

Everyone has done this: you go to a web site, see an interesting table and try to copy it over to Excel so you can add some numbers up or store it for later. Yet this often does not really work, or the information you want is spread across a large number of web sites. Copying by hand can quickly become very tedious, so it makes sense to use a bit of code to do it.

The advantage of scraping is that you can do it with virtually any web site — from weather forecasts to government spending, even if that site does not have an API for raw data access.

What you can and cannot scrape

There are, of course, limits to what can be scraped. Some factors that make it harder to scrape a site include:

•    Badly formatted HTML code with little or no structural information e.g. older government websites.

•    Authentication systems that are supposed to prevent automatic access e.g. CAPTCHA codes and paywalls.

•    Session-based systems that use browser cookies to keep track of what the user has been doing.

•    A lack of complete item listings and possibilities for wildcard search.

•    Blocking of bulk access by the server administrators.

Another set of limitations are legal barriers: some countries recognize database rights, which may limit your right to re-use information that has been published online. Sometimes, you can choose to ignore the license and do it anyway — depending on your jurisdiction, you may have special rights as a journalist. Scraping freely available Government data should be fine, but you may wish to double check before you publish. Commercial organizations — and certain NGOs — react with less tolerance and may try to claim that you’re “sabotaging” their systems. Other information may infringe the privacy of individuals and thereby violate data privacy laws or professional ethics.

Tools that help you scrape

There are many programs that can be used to extract bulk information from a web site, including browser extensions and some web services. Depending on your browser, tools like Readability (which helps extract text from a page) or DownThemAll (which allows you to download many files at once) will help you automate some tedious tasks, while Chrome’s Scraper extension was explicitly built to extract tables from web sites. Developer extensions like FireBug (for Firefox, the same thing is already included in Chrome, Safari and IE) let you track exactly how a web site is structured and what communications happen between your browser and the server.

ScraperWiki is a web site that allows you to code scrapers in a number of different programming languages, including Python, Ruby and PHP. If you want to get started with scraping without the hassle of setting up a programming environment on your computer, this is the way to go. Other web services, such as Google Spreadsheets and Yahoo! Pipes also allow you to perform some extraction from other web sites.

How does a web scraper work?

Web scrapers are usually small pieces of code written in a programming language such as Python, Ruby or PHP. Choosing the right language is largely a question of which community you have access to: if there is someone in your newsroom or city already working with one of these languages, then it makes sense to adopt the same language.

While some of the click-and-point scraping tools mentioned before may be helpful to get started, the real complexity involved in scraping a web site is in addressing the right pages and the right elements within these pages to extract the desired information. These tasks aren’t about programming, but understanding the structure of the web site and database.

When displaying a web site, your browser will almost always make use of two technologies: HTTP is a way for it to communicate with the server and to request specific resource, such as documents, images or videos. HTML is the language in which web sites are composed.

The anatomy of a web page

Any HTML page is structured as a hierarchy of boxes (which are defined by HTML “tags”). A large box will contain many smaller ones — for example a table that has many smaller divisions: rows and cells. There are many types of tags that perform different functions — some produce boxes, others tables, images or links. Tags can also have additional properties (e.g. they can be unique identifiers) and can belong to groups called ‘classes’, which makes it possible to target and capture individual elements within a document. Selecting the appropriate elements this way and extracting their content is the key to writing a scraper.

Viewing the elements in a web page: everything can be broken up into boxes within boxes.

To scrape web pages, you’ll need to learn a bit about the different types of elements that can be in an HTML document. For example, the <table> element wraps a whole table, which has <tr> (table row) elements for its rows, which in turn contain <td> (table data) for each cell. The most common element type you will encounter is <div>, which can basically mean any block of content. The easiest way to get a feel for these elements is by using the developer toolbar in your browser: they will allow you to hover over any part of a web page and see what the underlying code is.

Tags work like book ends, marking the start and the end of a unit. For example <em> signifies the start of an italicized or emphasized piece of text and </em> signifies the end of that section. Easy.

Figure 57. The International Atomic Energy Agency’s (IAEA) portal (news.iaea.org)

An example: scraping nuclear incidents with Python

NEWS is the International Atomic Energy Agency’s (IAEA) portal on world-wide radiation incidents (and a strong contender for membership in the Weird Title Club!). The web page lists incidents in a simple, blog-like site that can be easily scraped.

To start, create a new Python scraper on ScraperWiki and you will be presented with a text area that is mostly empty, except for some scaffolding code. In another browser window, open the IAEA site and open the developer toolbar in your browser. In the “Elements” view, try to find the HTML element for one of the news item titles. Your browser’s developer toolbar helps you connect elements on the web page with the underlying HTML code.

Investigating this page will reveal that the titles are <h4> elements within a <table>. Each event is a <tr> row, which also contains a description and a date. If we want to extract the titles of all events, we should find a way to select each row in the table sequentially, while fetching all the text within the title elements.

In order to turn this process into code, we need to make ourselves aware of all the steps involved. To get a feeling for the kind of steps required, let’s play a simple game: In your ScraperWiki window, try to write up individual instructions for yourself, for each thing you are going to do while writing this scraper, like steps in a recipe (prefix each line with a hash sign to tell Python that this not real computer code). For example:

# Look for all rows in the table

# Unicorn must not overflow on left side.

Try to be as precise as you can and don’t assume that the program knows anything about the page you’re attempting to scrape.

Once you’ve written down some pseudo-code, let’s compare this to the essential code for our first scraper:

import scraperwiki

In this first section, we’re importing existing functionality from libraries — snippets of pre-written code. scraperwiki will give us the ability to download web sites, while lxml is a tool for the structured analysis of HTML documents. Good news: if you are writing a Python scraper with ScraperWiki, these two lines will always be the same.

doc_text = scraperwiki.scrape(url)

doc = html.fromstring(doc_text)

Next, the code makes a name (variable): url, and assigns the URL of the IAEA page as its value. This tells the scraper that this thing exists and we want to pay attention to it. Note that the URL itself is in quotes as it is not part of the program code but a string, a sequence of characters.

We then use the url variable as input to a function, scraperwiki.scrape. A function will provide some defined job — in this case it’ll download a web page. When it’s finished, it’ll assign its output to another variable, doc_text. doc_text will now hold the actual text of the website — not the visual form you see in your browser, but the source code, including all the tags. Since this form is not very easy to parse, we’ll use another function, html.fromstring, to generate a special representation where we can easily address elements, the so-called document object model (DOM).

In this final step, we use the DOM to find each row in our table and extract the event’s title from its header. Two new concepts are used: the for loop and element selection (.cssselect). The for loop essentially does what its name implies; it will traverse a list of items, assigning each a temporary alias (row in this case) and then run any indented instructions for each item.

The other new concept, element selection, is making use of a special language to find elements in the document. CSS selectors are normally used to add layout information to HTML elements and can be used to precisely pick an element out of a page. In this case (Line. 6) we’re selecting #tblEvents tr which will match each <tr> within the table element with the ID tblEvents (the hash simply signifies ID). Note that this will return a list of <tr> elements.

As can be seen on the next line (Line. 7), where we’re applying another selector to find any <a> (which is a hyperlink) within a <h4> (a title). Here we only want to look at a single element (there’s just one title per row), so we have to pop it off the top of the list returned by our selector with the .pop() function.

Note that some elements in the DOM contain actual text, i.e. text that is not part of any markup language, which we can access using the [element].text syntax seen on line 8. Finally, in line 9, we’re printing that text to the ScraperWiki console. If you hit run in your scraper, the smaller window should now start listing the event’s names from the IAEA web site.

You can now see a basic scraper operating: it downloads the web page, transforms it into the DOM form and then allows you to pick and extract certain content. Given this skeleton, you can try and solve some of the remaining problems using the ScraperWiki and Python documentation:

•    Can you find the address for the link in each event’s title?

•    Can you select the small box that contains the date and place by using its CSS class name and extract the element’s text?

•    ScraperWiki offers a small database to each scraper so you can store the results; copy the relevant example from their docs and adapt it so it will save the event titles, links and dates.

•    The event list has many pages; can you scrape multiple pages to get historic events as well?

As you’re trying to solve these challenges, have a look around ScraperWiki: there are many useful examples in the existing scrapers — and quite often, the data is pretty exciting, too. This way, you don’t need to start off your scraper from scratch: just choose one that is similar, fork it and adapt to your problem.

Source: http://datajournalismhandbook.org/1.0/en/getting_data_3.html

Friday, 29 May 2015

Data Scraping Services - Web Scraping Video Tutorial Collection for All Programming Language

Web scraping is a mechanism in which request made to website URL to get  HTML Document text and that text then parsed to extract data from the HTML codes.  Website scraping for data is a generalize approach and can be implemented in any programming language like PHP, Java, C#, Python and many other.

There are many Web scraping software available in market using which you can extract data with no coding knowledge. In many case the scraping doesn’t help due to custom crawling flow for data scraping and in that case you have to make your own web scraping application in one of the programming language you know. In this post I have collected scraping video tutorials for all programming language.

I mostly familiar with web scraping using PHP, C# and some other scraping tools and providing web scraping service.  If you have any scraping requirement send me your requirements and I will get back with sample data scrape and best price.

 Web Scraping Using PHP

You can do web scraping in PHP using CURL library and Simple HTML DOM parsing library.  PHP function file_get_content() can also be useful for making web request. One drawback of scraping using PHP is it can’t parse JavaScript so ajax based scraping can’t be possible using PHP.

Web Scraping Using C#

There are many library available in .Net for HTML parsing and data scraping. I have used Web Browser control and HTML Agility Pack for data extraction in .Net using C#

I have didn’t done web scraping in Java, PERL and Python. I had learned web scraping in node.js using Casper.JS and Phantom.JS library. But I thought below tutorial will be helpful for some one who are Java and Python based.

Web Scraping Using Jsoup in Java

Scraping Stock Data Using Python

Develop Web Crawler Using PERL

Web Scraping Using Node.Js

If you find any other good web scraping video tutorial then you can share the link in comment so other readesr get benefit form that.

Source: http://webdata-scraping.com/web-scraping-video-tutorial-collection-programming-language/

Tuesday, 26 May 2015

Endorsing web scraping

With more than 200 projects delivered, we stand firmly for new challenges every day. We have served above 60 clients and have won 86% of repeat business, as our main core is customer delight. Successive Softwares was approached by a client having a very exclusive set of requirements. For their project they required customised data mining, in real time to offer profitable information to their customers. Requirement stated scrapping of stock exchange data in real time so that end users can be eased in their marketing decisions. This posed as an ambitious task for us because it required processing of huge amount of data on a routine basis. We welcomed it as an event to evolve and do something aside of classic web application development.

We started with mock-ups, pursuing our very first step of IMPART Framework (Innovative Mock-up based Prototypes Analyzed to develop Reengineered Technology). Our team of experts thought of all the potential requirements with a flow and materialized it flawlessly into our mock up. It was a strenuous tasks but our excitement to do something which others still do not think of, filled our team with confidence and energy and things began to roll out perfectly. We presented our mock-up and statistics to the client as per our expectation client choose us, impressed with the efforts.

We started gathering requirements from client side and started to formulate design about the flow. The project required real time monitoring of stock exchange together with Prices, Market Turnover and then implement them into graphs. The front end part was an easy deal, we were already adept in playing with data the way required. The intractable task was to get the data. We researched and found that it can be achieved either with API or with Web Scarping and we moved with latter because of the limitations in API.

Web scraping is a compelling technique to get the required information straight out of the web page. Lack of documentation and not much forbearance forced us to make a slow start, but we kept all the requirements clear and new that we headed in the right direction.  We divided the scraping process into bits of different but related tasks. Firstly we needed to find the data which has to be captured, some of the problems faced were pagination and use of AJAX but with examination of endpoints in URL and the requests made when data is drawn, we surmounted these problems easily.

After targeting our data we focused on HTML parser which could extract data form all the targets. Using PHP we developed a parser extracting all the information and saving them in Database in a structured way.  After the required data present at our end we easily manipulated it into tables and charts and we used HIGHSTOCK for that. Entire Client side was developed in PHP with Zend frame work and we used MySQL 5.7 for server side.

During the whole development cycle our QA team insured we were delivering a quality product following all standards. We kept our client in the loop during the whole process keeping them informed about every step. Clients were also assured as they watched their project starting from scratch which developed into full fledge website. The process followed a strict time line releasing regular builds and implementing new improvements. We stood up to the expectation our client and delivered a product just as they visualized it to be.

Source: http://www.successivesoftwares.com/endorsing-web-scraping/

Monday, 25 May 2015

Which language is the most flexible for scraping websites?

3 down vote favorite

I'm new to programming. I know a little python and a little objective c, and I've been going through tutorials for each. Then it occurred to me, I need to know which language is more flexible (python, obj c, something else) for screen scraping a website for content.

What do I mean by "flexible"?

Well, ideally, I need something that will be easy to refactor and tweak for similar projects. I'm trying to avoid doing a lot of re-writing (well, re-coding) if I wanted to switch some of the variables in the program (i.e., the website to be scraped, the content to fetch, etc).

Anyways, if you could please give me your opinion, that would be great. Oh, and if you know any existing frameworks for the language you recommend, please share. (I know a little about Selenium and BeautifulSoup for python already).

4 Answers

I recently wrote a relatively complex web scraper to harvest a TON of data. It had to do some relatively complex parsing, I needed it to stuff it into a database, etc. I'm C# programmer now and formerly a Perl guy.

I wrote my original scraper using Python. I started on a Thursday and by Sunday morning I was harvesting over about a million scores from a show horse site. I used Python and SQLlite because they were fast.

HOWEVER, as I started putting together programs to regularly keep the data updated and to populate the SQL Server that would backend my MVC3 application, I kept hitting snags and gaps in my Python knowledge.

In the end, I completely rewrote the scraper/parser in C# using the HtmlAgilityPack and it works better than before (and just about as fast).

Because I KNEW THE LANGUAGE and the environment so much better I was able to add better database support, better logging, better error handling, etc. etc.

So... short answer.. Python was the fastest to market with a "good enough for now" solution, but the language I know best (C#) was the best long-term solution.

EDIT: I used BeautifulSoup for my original crawler written in Python.

5 down vote

The most flexible is the one that you're most familiar with.

Personally, I use Python for almost all of my utilities. For scraping, I find that its functionality specific to parsing and string manipulation requires little code, is fast and there are a ton of examples out there (strong community). Chances are that someone's already written whatever you're trying to do already, or there's at least something along the same lines that needs very little refactoring.

1 down vote

I think its safe to say that Python is a better place to start than Objective C. Honestly, just about any language meets the "flexible" requirement. All you need is well thought out configuration parameters. Also, a dynamic language like Python can go a long way in increasing flexibility, provided that you account for runtime type errors.

1 down vote

I recently wrote a very simple web-scraper; I chose Common Lisp as I'm learning the language.

On the basis of my experience - both of the language and the availability of help from experienced Lispers - I recommend investigating Common Lisp for your purpose.

There are excellent XML-parsing libraries available for CL, as well as libraries for parsing invalid HTML, which you'll need unless the sites you're parsing consist solely of valid XHTML.

Also, Common Lisp is a good language in which to implement DSLs; a DSL for web-scraping may be a solution to your requirement for flexibility & re-use.

Source: http://programmers.stackexchange.com/questions/74998/which-language-is-the-most-flexible-for-scraping-websites/75006#75006


Friday, 22 May 2015

Roles of Data Mining in Predicting, Tracking, and Containing the Ebola Outbreak

One of the most diverse continents on earth, Africa astounds the world with its vast savannas and great deserts and with its ancient architecture and modern cities, but Africa also has its share of tragedies and woes.

First identified in Democratic Republic of Congo’s Ebola River in 1976, Ebola Hemorrhagic Fever, a deadly zoonotic disease caused by Ebola virus, has been spreading in West Africa like a wildfire, engulfing everything on its way and creating widespread panic.

What has added insult to injury is the fact that the region has long endured the severe consequences of civil wars and social conflicts, and diseases like malaria, HIV/AIDS, yellow fever, cholera etc. have remained endemic to the region for a long time, causing tens of thousands of deaths every year.

Reportedly, Ebola has already killed at least 2,296 people, and there are about 3,685 confirmed cases of infection. Mortality rate has been swinging between 50% to 90%, depending on the quality of care and nutrition. According to WHO, the disease is likely to infect as much as 20,000 people before it is finally brought under control.

Crisis of Data

When it comes to healthcare management, clinical data is one of the key components. The value of data becomes more urgent in the emergency situation like that of West Africa. The more relevant data you have, the bigger picture you can create for taking aggressive measures. To use Peter Drucker’s words, “What gets measured gets managed.”

Factual data is a precondition for the doctors and health science experts working in the field for measuring and managing the situation. Data helps them to assess their successes or failures and reorient their actions. One of the important reasons why the fight against the Ebola outbreak is turning out into a losing battle is the insufficiency of data. Recently, Scientific American magazine wrote:

Right now, there are not even enough beds for sick patients nor enough data coming in to help track cases. Surveillance and tracking of those who were possibly exposed to Ebola remain inadequate.

In Science magazine, Gretchen Vogel suggests that the death toll of Ebola patients could be much higher than it is currently estimated. She says, “Exactly how many unrecorded Ebola deaths have occurred will never be known. Health officials are keeping track of suspected and probable cases, many of which are people who died before they could be tested.” Greg Slabodkin voices similar concerns in Health Data Management and points at the need of an integrated global biosurveillance system.

The absence of reliable and actionable data has badly hampered the efforts of combatting Ebola and providing proper medical care to the victims. CDC Director Dr. Tom Frieden describes it as a “fog-of-war situation”.

Data Mining: Bots Were the First to Warn

When you flip the coin, however, the situation is not completely bleak and desperate. Even if Big Data technologies have fallen short in predicting, tracking, and containing the epidemic, mainly due to the lack of data from the ground, it has not entirely failed. Data scientists and healthcare experts world over are making concerted efforts to know, track, and defeat the Ebola virus—some on the ground and some in their labs.

The increasing level of collaboration among the biomedical specialists, geneticist, virologists, and IT experts has definitely contributed to slow down the transmission of the virulent disease dubbed as “the plague of modern day”. Médecins Sans Frontières and Healthmap.org are the excellent examples in this regard.

    “By deploying bots and crawlers and by using advanced machine learning algorithms, the Boston-based global infectious disease surveillance system, HealthMap was able to predict and raise concerns about the spread of a mysterious hemorrhagic fever in West Africa nine days earlier than WHO did.”

Run by a team of 45 researchers, epidemiologists, and software developers at Boston Children’s Hospital, HealthMap mines data from search engine queries, social media platforms, health information sites, news reports and crowd-sourced information to track the transmission of the disease and provides an up-to-date timeline report with an interactive map, making it easier for the international health agencies to devise more effective action plans.

HealthMap serves as a good example of how crucial Big Data and data mining technologies could be for handling a healthcare emergency with fact-based and data-driven decisions.

Ebola Data

In their letter to The Lancet, research scientist Rashid Ansumana and his colleagues, working on Ebola in Sierra Leone, highlighted on the need of developing epidemic surveillance systems “by adopting new data-sharing technologies.” They wrote, “Emerging technologies can help early warning systems, outbreak response, and communication between health-care providers, wildlife and veterinary professionals, local and national health authorities, and international health agencies.”

Data-Driven Initiatives to Control the Outbreak

The era of systematic use of data for making better epidemiological predictions and for finding effective healthcare solutions began with Google Flue Trends in 2007, and the rapidly developing tools, technologies, and practices in Big Data have increased the roles of data in healthcare management.

There are a number of data-driven undertakings in progress which have contributed to counter the raging spread of Ebola. Brockmann Lab, run by Professor Dirk Brockmann and his colleagues, for example, has created a computer model for studying correlations and probabilities in the explosion of new cases of infection.

World Airtraffic  Transportation and Relative Import Risk, Source: Brockmann Lab

By applying computational and statistical models, they predict which areas, cities or regions in the world are at the risk of becoming the next Ebola epidemic hotspots. Similarly, Alessandro Vespignani–a network scientist, statistical physicist, and Northeastern professor–has been using human mobility network data to track the cases of Ebola infection and dissemination.

The Swedish NGO Flowminder Foundation has been aggregating, mining, and analyzing anonymized mobile phone location data and is developing national mobility estimates for West Africa to help the local and international agencies to combat the disease.

Meanwhile, innovations with Epi Info VHF, a software tool for case management, contact tracing, analysis and reporting services for Ebola and other hemorrhagic fever outbreaks and OpenStreetMap project for getting location information and spatial data of the affected areas have further helped to guide the intervention initiatives.

However, with all optimism about the growing roles of Big Data and data mining, we also need to be mindful about their limitations. Newsweek aptly puts: “While no media-trawling bot could ever replace national and international health agencies, such tools may be starting to help fill in some of the most gaping holes in real-time knowledge.”

Source: http://www.grepsr.com/blog/data-mining-tracking-ebola-outbreak/

Wednesday, 20 May 2015

The Features of the "Holographic Meridian Scraping Therapy"

1. Systematic nature: Brief introduction to the knowledge of viscera, meridians and points in traditional Chinese medicine, theory of holographic diagnosis and treatment; preliminary discussion of the treatment and health care mechanism of scraping therapy; systemat­ic introduction to the concrete methods of the holographic meridian scraping therapy; enumerating a host of therapeutic methods of scraping for disorders in both Chinese and Western medicine to em­body a combination of disease differentiation and syndrome differen­tiation; and summarizing the health care scraping methods. It is a practical handbook of gua sha.

2. Scientific: Applying the theories of Chinese and Western medicine to explain the health care and treatment mechanism and clinical applications of scraping therapy; introducing in detail the practical manipulations, items for attention, and indications and contraindications of the scraping therapy. Here are introduced repre­sentative diseases in different clinical departments, for which scrap­ing therapy has a better curative effect and the therapeutic methods of scraping for these diseases. Stress is placed on disease differentia­tion in Western medicine and syndrome differentiation in Chinese medicine, which should be combined in practical application.

Although there are more than 140,000 kinds of disease known to modem medicine, all diseases are related to dysfunction of the 14 meridians and internal organs, according to traditional Chinese med­icine. The object of scraping therapy is to correct the disharmony in the meridians and internal organs to recover the normal bodily func­tions. Thus, the scraping of a set of meridian points can be used to treat many diseases. In the section on clinical application only about 100 kinds of common diseases are discussed, although the actual number is much more than that. For easy reference the "Index of Diseases and Symptoms" is appended at the back of the book.

3. Practical: Using simple language and plenty of pictures and diagrams to guarantee that readers can easily leam, memorize and apply the principles of scraping therapy. As long as they master the methods explained in Chapter Three, readers without any medical knowledge can apply scraping therapy to themselves or others, with reference to the pictures in Chapters Four and Five. Besides scraping therapy, herbal treatment for each disease or syndrome is explained and may be used in combination with the scraping techniques.

Referring to the Holographic Meridian Hand Diagnosis and pic­tures at the back of the book will enhance accuracy of diagnosis and increase the effectiveness of scraping therapy.

Since the first publication and distribution of the Chinese edition of the book in July 1995, it has been welcomed by both medical specialists and lay people. In March 1996 this book was republished and adopted as a textbook by the School for Advanced Studies of Traditional Chinese Medicine affiliated to the Institute of the Acu­puncture and Moxibustion of the China Academy of Traditional Chi­nese Medicine.

In order to bring this health care method to more and more peo­ple and to make traditional Chinese medicine better appreciated They have modified and replenished this book in the spirit of constant im­provement. They hope that they may make a contribution to the health care of mankind with this natural therapy which has no side-effects and causes no pollution.

They hope that the Holographic Meridian Scraping Therapy can help the health and happiness of more and more families in the world.

Source: http://ezinearticles.com/?The-Features-of-the-Holographic-Meridian-Scraping-Therapy&id=5005031

Sunday, 17 May 2015

Introducing ScrapeShield: Discover, Defend & Deter Content Scraping

If you're a publisher, whether an individual blogger or major media outlet, you've undoubtedly experienced content scraping. Searching the web for an article you've published or other original content you've created and you find it copied and republished on some other random website. Often the site will be full of ads. And, sometimes, it will even rank higher in search results than your original work.

While you may envision an army of individuals copying and pasting your content on their sites, the truth is content scraping is typically an automated process with bots that grab original content and then republish it without human intervention onto link farm sites. CloudFlare has blocked many of these bots automatically in the past, but we decided it was time to do something to more actively stop them.

Introducing ScrapeShield

ScrapeShield is an app created by the CloudFlare team. It incorporates several existing CloudFlare features like email obfuscation and hotlink protection that serve to protect from content scraping and adds a number of new features as well. Because we believe every publisher of original content should be able to understand and control how their work is used, we're providing ScrapeShield free for every CloudFlare user.

Detect, Defend & Deter

ScrapeShield has different elements to help you detect when your content is scraped, defend your site against content scrapers, and even deter content scrapers from targeting you in the first place. If you enable ScrapeShield, CloudFlare will automatically insert invisible tracking beacons in your content. When automated bots scrape your content, they pull the beacons along with them. CloudFlare detects these beacons when they ping from sites that aren't your own. You can access your ScrapeShield control panel to see where your content is being republished. Not only is this useful in showing scraping, but you can also see users who are reading your content through proxy services like Flipboard or Pulse.

The data from the content beacons is fed back into CloudFlare's protection system. As CloudFlare identifies content scraping bots, we automatically prevent them from accessing your site. Just like Project Honey Pot, the original inspiration for CloudFlare, used traps to detect when spammers were harvesting email addresses, CloudFlare now uses data from ScrapeShield to identify content scrapers and keep them off publishers' sites.

Maze

We didn't want to just stop scrapers from attacking sites on CloudFlare, we also wanted to tie up their resources so they couldn't harm the rest of the web. To do this, we created Maze. Maze routes known content scrapers who are visiting ScrapeShield-protected sites into a virtual labyrinth of gibirish and gobbledygook. We dynamically throttle the bandwidth and speed so instead of the pages loading as fast as possible, the connection is held open to the scrapers and their resources are tied up.

We use excess resources on the CloudFlare network to generate Maze, and it doesn't consume any of our publishers' resources or add any additional load to their sites. What's beautiful about the system is that the only way that content scrapers can be sure they're avoiding Maze is to avoid CloudFlare's IP addresses entirely. For any content scrapers who may be reading this, here's a helpful list of all of our IPs so you can make sure to stay away.

No Pinning

Finally, with the rise of sites like Pinterest, innocent content scraping may become even more prolific. While many sites welcome their images being pinned, we wanted to make it easy to opt out. ScrapeShield includes an option to add the no-pinning meta tag to your site to prevent your images from being pinned to the site. As other similar services include a mechanism to opt out, expect that we'll include an easy way for you to do so right from the ScrapeShield interface.

The health of the web depends on publishers creating original content getting credit for their creations. Cloud Flare is committed to building a better web and we're extremely excited about ScrapeShield as a new tool to help publishers do exactly that.

Source: https://blog.cloudflare.com/introducing-scrapeshield-discover-defend-dete/

Wednesday, 13 May 2015

Web Scraping Services Are Important Tools For Knowledge

Data extraction and web scraping techniques are important tools to find relevant data and information for personal or business use. Many companies, self-employed to copy and paste data from web pages. This process is very reliable, but very expensive as it is a waste of time and effort to get results. This is because the data collected and spent less resources and time required to collect these data are compared.

At present, several mining companies and their websites effective web scraping technique specifically for the thousands of pages of information developed culture can be traced. The information from a CSV file, database, XML file, or any other source with the required format is alameda. understanding of correlations and patterns in the data, so that policies can be designed to assist decision making. The information can also be stored for future reference.

The following are some common examples of data extraction process:

In order to rule through a government portal, citizens who are reliable for a given survey name removed.

Competitive pricing and data products include scraping websites

To access the web site or web design Stock download the videos and photos of scratching

Automatic Data Collection

It regularly collects data on a regular basis. Automated data collection techniques are very important because they find the company’s customer trends and market trends to help. By determining market trends, it is possible to understand customer behavior and predict the likelihood of the data will change.

The following are some examples of automated data collection:

Monitoring of special hourly rates for stocks

collects daily mortgage rates from various financial institutions

on a regular basis is necessary to check the weather

By using web scraping services, you can extract all data related to your business. Then analyzed the data to a spreadsheet or database can be downloaded and compared. Storing data in a database or in a required format and interpretation of the correlations to understand and makes it easier to identify hidden patterns.

Data extraction services, it is possible pricing, email, databases, profile data, and consistently to competitors for information about the data. Different techniques and processes designed to collect and analyze data, and has developed over time. Web Scraping for business processes that have beaten the market recently is one. It is a process from various sources such as websites and databases with large amounts of data provides.

Some of the most common methods used to scrape web crawling, text, fun, DOM analysis and include matching expression. After the process is only analyzers, HTML pages or meaning can be achieved through annotations. There are many different ways of scaling data, but more importantly is working toward the same goal. The main purpose of using web scraping service to retrieve and compile data in databases and web sites. In the business world is to remain relevant to the business process.

The central question about the relevance of web scraping contact. The process is relevant to the business world? The answer is yes. The fact that it is used by large companies in the world and many awards speaks derivatives.

Source: http://www.selfgrowth.com/articles/web-scraping-services-are-important-tools-for-knowledge

Friday, 8 May 2015

4 Web Scraping Tools To Save You Time On Data Extraction

Either you are working on a product website, struggling to add live data feed to your app or merely need to pull out a huge amount of online data for analysis, an accurate web scraping tool can save you loads of time and keep you sane. Here are four powerful web scraping tools to save you from copy-pasting or spending time on writing your own scripts.

1. Uipath

Either you are working on a product website, struggling to add live data feed to your app or merely need to pull out a huge amount of online data for analysis, an accurate web scraping tool can save you loads of time and keep you sane. Here are four powerful web scraping tools to save you from copy-pasting or spending time on writing your own scripts.

1. Uipath

Uipath specializes in developing various process automation software including web scraping and screen scraping software for desktop and web. Uipath web scraper is perfect for non-coders and easily surpasses most common data extraction challenges including page navigation, digging through flash and even scraping PDF files. All you need to do is open the web scraping wizard and simply highlight the data you need to extract. The tool will scrape all the data following this pattern at all pages you’ve chosen and sort it accordingly. You can add as many items for scraping as you like and have them sorted in respective columns. As a result, you receive a neat Excel or CSV document with all the data eliminated from duplicates.

Moreover, Uipath isn’t just about scraping. This software can be used not only for extracting data, but to manipulate the interface of another app, thus establishing data transfers among the two of them. Basically, this tool could be used to conduct any repetitive task a human could do, yet much faster and with higher accuracy.

Pros: You can automate form filling, clicking buttons, navigation etc. Uipath scraper is impressively accurate, fast and simple to use. It “reads” all types of data on screen (JS, HTML, Silverlight and more), plus you can train the software to emulate human actions of various complexity.

Cons: Premium software runs at a premium price. Uipath is an affordable professional solution, but may be a bit too pricey for personal use.

2. Import.io

Data Extraction

Either you are working on a product website, struggling to add live data feed to your app or merely need to pull out a huge amount of online data for analysis, an accurate web scraping tool can save you loads of time and keep you sane. Here are four powerful web scraping tools to save you from copy-pasting or spending time on writing your own scripts.

1. Uipath

Uipath specializes in developing various process automation software including web scraping and screen scraping software for desktop and web. Uipath web scraper is perfect for non-coders and easily surpasses most common data extraction challenges including page navigation, digging through flash and even scraping PDF files. All you need to do is open the web scraping wizard and simply highlight the data you need to extract. The tool will scrape all the data following this pattern at all pages you’ve chosen and sort it accordingly. You can add as many items for scraping as you like and have them sorted in respective columns. As a result, you receive a neat Excel or CSV document with all the data eliminated from duplicates.

Moreover, Uipath isn’t just about scraping. This software can be used not only for extracting data, but to manipulate the interface of another app, thus establishing data transfers among the two of them. Basically, this tool could be used to conduct any repetitive task a human could do, yet much faster and with higher accuracy.

Pros: You can automate form filling, clicking buttons, navigation etc. Uipath scraper is impressively accurate, fast and simple to use. It “reads” all types of data on screen (JS, HTML, Silverlight and more), plus you can train the software to emulate human actions of various complexity.

Cons: Premium software runs at a premium price. Uipath is an affordable professional solution, but may be a bit too pricey for personal use.

2. Import.io

Import.io offers you a free desktop app to help you scrap all the data you need from an unlimited amount of web pages. The service treats each page as a potential data source to generate API from. If the page you’ve submitted has been previously processed, you can access its API and get some of the data. In other case, Import.io will guide you through the process of creating the scraping matrix by building connectors (for navigation) or extractors (to pull out the needed data). Afterwards, you submit a request for extraction and it’s typically processed within 24 hours. All the data is private and you can schedule auto refreshments at any chosen period of time.

Pros: The service is easy-to-use with no tech skills needed. It can  pages with data (those that needed login/pass), plus it’s free. Minimalistic effective design and simple navigation comes along.

Cons: Improt.io has hard times navigating through combinations of javascript/POST and cannot navigate from one page to another (e.g. click next, second page etc).  Sometimes, it takes over 24 hours to receive the report.  Besides, it’s a browser-only app, non-compatible with other applications.

3. Kimono

Either you are working on a product website, struggling to add live data feed to your app or merely need to pull out a huge amount of online data for analysis, an accurate web scraping tool can save you loads of time and keep you sane. Here are four powerful web scraping tools to save you from copy-pasting or spending time on writing your own scripts.

1. Uipath

Uipath specializes in developing various process automation software including web scraping and screen scraping software for desktop and web. Uipath web scraper is perfect for non-coders and easily surpasses most common data extraction challenges including page navigation, digging through flash and even scraping PDF files. All you need to do is open the web scraping wizard and simply highlight the data you need to extract. The tool will scrape all the data following this pattern at all pages you’ve chosen and sort it accordingly. You can add as many items for scraping as you like and have them sorted in respective columns. As a result, you receive a neat Excel or CSV document with all the data eliminated from duplicates.

Moreover, Uipath isn’t just about scraping. This software can be used not only for extracting data, but to manipulate the interface of another app, thus establishing data transfers among the two of them. Basically, this tool could be used to conduct any repetitive task a human could do, yet much faster and with higher accuracy.

Pros: You can automate form filling, clicking buttons, navigation etc. Uipath scraper is impressively accurate, fast and simple to use. It “reads” all types of data on screen (JS, HTML, Silverlight and more), plus you can train the software to emulate human actions of various complexity.

Cons: Premium software runs at a premium price. Uipath is an affordable professional solution, but may be a bit too pricey for personal use.

2. Import.io

Import.io offers you a free desktop app to help you scrap all the data you need from an unlimited amount of web pages. The service treats each page as a potential data source to generate API from. If the page you’ve submitted has been previously processed, you can access its API and get some of the data. In other case, Import.io will guide you through the process of creating the scraping matrix by building connectors (for navigation) or extractors (to pull out the needed data). Afterwards, you submit a request for extraction and it’s typically processed within 24 hours. All the data is private and you can schedule auto refreshments at any chosen period of time.

Pros: The service is easy-to-use with no tech skills needed. It can  pages with data (those that needed login/pass), plus it’s free. Minimalistic effective design and simple navigation comes along.

Cons: Improt.io has hard times navigating through combinations of javascript/POST and cannot navigate from one page to another (e.g. click next, second page etc).  Sometimes, it takes over 24 hours to receive the report.  Besides, it’s a browser-only app, non-compatible with other applications.

3. Kimono

Kimono is a popular web scraper among app developers who prefer to power up their products with live data and no additional code. It saves you tons of time when you need to fill up your app with mashing data. Install Kimono Browser bookmarklet; highlight page elements you need to and provide some positive/negative examples to train the tool. After labeling all the data you can download it in CSV/JSON/a web endpoint format. The APIs created for your pages are stored in the cloud and you can run them on schedule. So far, Kimono is free to use with pro and enterprise solutions to be launched soon.

Pros: The tool works pretty fast and works great with scraping newsfeeds and prices. The data is rather accurate.

Cons: No page navigation available and you need to spend quite a lot of time to train Kimono before it starts to pull out the multi items data accurate enough. In general, I’d say Kimono is more of an app mash-ups creator than a full-scale web scraper.

4. Screen Scraper

Either you are working on a product website, struggling to add live data feed to your app or merely need to pull out a huge amount of online data for analysis, an accurate web scraping tool can save you loads of time and keep you sane. Here are four powerful web scraping tools to save you from copy-pasting or spending time on writing your own scripts.

1. Uipath

Uipath specializes in developing various process automation software including web scraping and screen scraping software for desktop and web. Uipath web scraper is perfect for non-coders and easily surpasses most common data extraction challenges including page navigation, digging through flash and even scraping PDF files. All you need to do is open the web scraping wizard and simply highlight the data you need to extract. The tool will scrape all the data following this pattern at all pages you’ve chosen and sort it accordingly. You can add as many items for scraping as you like and have them sorted in respective columns. As a result, you receive a neat Excel or CSV document with all the data eliminated from duplicates.

Moreover, Uipath isn’t just about scraping. This software can be used not only for extracting data, but to manipulate the interface of another app, thus establishing data transfers among the two of them. Basically, this tool could be used to conduct any repetitive task a human could do, yet much faster and with higher accuracy.

Pros: You can automate form filling, clicking buttons, navigation etc. Uipath scraper is impressively accurate, fast and simple to use. It “reads” all types of data on screen (JS, HTML, Silverlight and more), plus you can train the software to emulate human actions of various complexity.

Cons: Premium software runs at a premium price. Uipath is an affordable professional solution, but may be a bit too pricey for personal use.

2. Import.io

Import.io offers you a free desktop app to help you scrap all the data you need from an unlimited amount of web pages. The service treats each page as a potential data source to generate API from. If the page you’ve submitted has been previously processed, you can access its API and get some of the data. In other case, Import.io will guide you through the process of creating the scraping matrix by building connectors (for navigation) or extractors (to pull out the needed data). Afterwards, you submit a request for extraction and it’s typically processed within 24 hours. All the data is private and you can schedule auto refreshments at any chosen period of time.

Pros: The service is easy-to-use with no tech skills needed. It can  pages with data (those that needed login/pass), plus it’s free. Minimalistic effective design and simple navigation comes along.

Cons: Improt.io has hard times navigating through combinations of javascript/POST and cannot navigate from one page to another (e.g. click next, second page etc).  Sometimes, it takes over 24 hours to receive the report.  Besides, it’s a browser-only app, non-compatible with other applications.

3. Kimono

Kimono is a popular web scraper among app developers who prefer to power up their products with live data and no additional code. It saves you tons of time when you need to fill up your app with mashing data. Install Kimono Browser bookmarklet; highlight page elements you need to and provide some positive/negative examples to train the tool. After labeling all the data you can download it in CSV/JSON/a web endpoint format. The APIs created for your pages are stored in the cloud and you can run them on schedule. So far, Kimono is free to use with pro and enterprise solutions to be launched soon.

Pros: The tool works pretty fast and works great with scraping newsfeeds and prices. The data is rather accurate.

Cons: No page navigation available and you need to spend quite a lot of time to train Kimono before it starts to pull out the multi items data accurate enough. In general, I’d say Kimono is more of an app mash-ups creator than a full-scale web scraper.

4. Screen Scraper

Screen scraper is pretty neat and tackles a lot of difficult tasks including navigation and precise data extractions, however it requires a bit of programming/tokenization skills if you’d like to run it super smooth. Launch the software, add a proxy, start recording the list of your actions and creating extracting patterns (some coding required). Works great with HTML and Javascript, however you should test it with Citrix and other platforms. Basically, screen scraper helps you writing simple web scraping scripts and lets you download the extracted data in txt/csv/excel format.

Pros: When set correctly, there’s no data extraction tasks Screen scraper fails to handle.

Cons: The tool is pricey and you’ll have to go through documentation and have basic coding skills to use it.

Source: http://tech.co/4-web-scraping-tools-save-time-data-extraction-2015-03

Thursday, 30 April 2015

Earn Money From Price Comparison Through Web Scraping

Many individuals discover the pot of gold just within their reach. They have realized that there is money in the web. Cyber technology has blessed mankind with so many benefits that makes money very possible by just some clicks on the mouse and keyboard. Building a price comparison website is an effective way of helping clients find their desired products while you as the owner earn money at the same time.

Building price comparison websites

There is indeed much money in building price comparison websites but it is not an easy task especially for a novice in maintaining a website of one’s own. Since this entails some serious programming and ample familiarity with data feeds, you have to have a good working plan. In addition, what you are venturing into is greater than the usual blogs about just anything that you can think of. Furthermore, you are stepping into the vast field of electronic marketing, therefore you must be ready.

The first point of consideration is to identify which products or services are you going to include in your website. Choose a product or service that you and a majority of clients are mostly interested in. Suppose you want you to choose sports as your theme then you can include items and prices of sports gear, clothing such as uniforms, training videos, books, and other safety stuff. You need to do some research and even a survey to determine whether the goods and services you are promoting on your website are in demand and are what most people want to know. Moreover, it is on this stage that you may need the help of experts and veterans in the field of building to be assured that you are on the right track.

In addition, be willing to change in case your chosen category is not gaining readership or visitors. Then evaluate whether you need to expand or to be more specific in your description of the products and the comparison of the prices. Make your site prominent by search engine optimization (SEO) and make sure to acknowledge also that not too many people visit a site that is not free.

Helping visitors choose the best product/services

Good marketing strategy starts with knowing who your target audience are. There is indeed a need to do a lot of planning and research in order to understand your client’s needs and preferences. Moreover, knowing them thoroughly leads to achieving 100% consumer satisfaction. When you have provided everything they need to know about certain products, they would not need to seek elsewhere which will also gain you more regular visitors. Remember that your audience are members of communities and social networks such that there is a great possibility that they would spread the word around about the good services you are offering.

If there is a need to conduct a survey in addition to research, you should resort to it. In this manner you can discover what goods and services are not yet completely exhausted by the other websites or web creators. Ample knowledge about your potential visitors and consumers will surely make you effectively provide them with adequate statistics for their needs.

Your site will then look like a complete guidebook for them that will give them the best value for their money. Therefore, it must be thoroughly filled with product details, uses, options, and prices.

Making money as affiliate of eCommerce websites

Maintaining a price comparison website gives you less worry about getting paid or having your products bought and sold because income comes in through advertising and affiliate sales. Affiliate marketing is a way of earning money online by serving as a publisher for promotion of products, services or sites of businesses. The affiliate receives rewards from businesses for each visitor or client that comes to the business website or buys its product through the efforts of the advertising and promotion that is made by the affiliate. This is the online version of the concept of agent or referral fee sales channel. Aside from website owners, bloggers as well as members of community forums can also serve as affiliates. The affiliate earns money in three ways: through pay per link; pay per sale and pay per lead.

Trust in the reliability of the product - You should have a personal belief or confidence in the product you are promoting not only because it makes you sound more convincing, but also because you need to maintain your clients and establish credibility in your blog or website. In other words, don’t just pick any product. If you cannot use them personally, they should at least have several positive reviews and no negative ones.

Maintain credibility with readers and fellow bloggers - Befriend your readers and your co-bloggers by answering their queries sincerely and quickly. Your friendly attitude can win you their trust which is a very vital element of affiliate marketing.

Do reviews - In addition to publishing price comparison, you can gain more visitors by writing about the product and do proper SEO (Search Engine Optimization). So the expected happens, the more prominent the product becomes online, the higher will be your income.

Link with friends thru social media - Your friends have friends and their friends have also friends. Just think of how powerful your social media site can be when you post your link on your account on Facebook, Twitter or MySpace and others. Since trust is built on friendship, it is easy to get clients from among your friends and their friends.

Overall, you get all pertinent information about certain products through web data mining or web scraping. All you need to do is to be keen to the needs of your clients and use web content extraction efficiently.

Source: http://www.loginworks.com/earn-money-price-comparison-web-scraping/

Tuesday, 28 April 2015

Scraping a website from a windows service

Question

Hi there.  I have a windows forms application that scrapes a website to retrieve some data.  I would like to implement the same functionality as a windows service.  The reason for this is to allow the program to run 24/7 without  having a user signed in. 

To that end, my current version of the program uses a web browser control (system.windows.forms.webbrowser) to navigate the pages, click the buttons, allow scripts to do their thing, etc.  I cannot figure out a way to do the same without the web browser control, but the web browser control cannot be instantiated in a windows service (because there is no user interface in a web service).

Does anyone have any brilliant ideas on how to get around this?

Thank you very much!

Answers

Hi Andy,

There is a tool which could let you manipulate anything you want on the website. This agile HTML parser builds a read/write DOM and supports plain XPATH or XSLT. It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams). More information, please check:

http://htmlagilitypack.codeplex.com/

Have a nice day.

Best regards

All replies

You are not telling if you are using a .NET Express edition or not

You are not telling which Framework

You are not realy saying what data you are getting from the web site.

So

I made an example of service that work on any Studio edition (including the Express)

to install it, I supposed that you have at least the Framework2, so you will use something similar to:

    %SystemRoot%\Microsoft.NET\Framework\v2.0.50727\installutil /i C:\Test\MyWindowService\MyWindowService\bin\Release\MyWindowService.exe

In the example, I supposed that you are downloading some file from the site

You will need a reference to Windows.Form for the timer


Imports System.ServiceProcess

Imports System.Configuration.Install

Public Class WindowsService : Inherits ServiceBase

  Private Minute As Integer = 60000

  Private WithEvents Timer As New Timer With {.Interval = 30 * Minute, .Enabled = True}

  Public Sub New()

    Me.ServiceName = "MyService"

    Me.EventLog.Log = "Application"

    Me.CanHandlePowerEvent = True

    Me.CanHandleSessionChangeEvent = True

    Me.CanPauseAndContinue = True

    Me.CanShutdown = True

    Me.CanStop = True

  End Sub


  Private Sub Timer_Tick(ByVal sender As Object, ByVal e As System.EventArgs) Handles Timer.Tick

    If IO.File.Exists("C:\MyPath.Data") Then IO.File.Delete("C:\MyPath.Data")

    My.Computer.Network.DownloadFile("http://MyURL.com", "C:\MyPath.Data", "MyUserName", "MyPassword")

    'Do Something with the data downloaded

  End Sub

End Class

<Microsoft.VisualBasic.HideModuleName()> _

Module MainModule

  Public TheServiceName As String

  Public Sub main()

    Dim TheServiceApplication As New WindowsService

    TheServiceName = TheServiceApplication.ServiceName

    ServiceBase.Run(TheServiceApplication)

  End Sub

End Module

<System.ComponentModel.RunInstaller(True)> _

Public Class WindowsServiceInstaller : Inherits Installer

  Public Sub New()

    Dim serviceProcessInstaller As ServiceProcessInstaller = New ServiceProcessInstaller()

    Dim serviceInstaller As ServiceInstaller = New ServiceInstaller()

    serviceProcessInstaller.Account = ServiceAccount.LocalSystem

    serviceProcessInstaller.Username = Nothing

    serviceProcessInstaller.Password = Nothing

    serviceInstaller.DisplayName = "My Windows Service"

    serviceInstaller.StartType = ServiceStartMode.Automatic

    serviceInstaller.ServiceName = TheServiceName

    Me.Installers.Add(serviceProcessInstaller)

    Me.Installers.Add(serviceInstaller)

  End Sub

End Class



 Hello Andy,

Thanks for your post.

What do you want to scrape from the page? HttpWebRequest class ans WebClient class may be what you need. More information, please check:

The HttpWebRequest class provides support for the properties and methods defined in WebRequest and for additional properties and methods that enable the user to interact directly with servers using HTTP.

http://msdn.microsoft.com/en-us/library/system.net.httpwebrequest.aspx

The WebClient class provides common methods for sending data to or receiving data from any local, intranet, or Internet resource identified by a URI

http://msdn.microsoft.com/en-us/library/system.net.webclient.aspx

If you have any concenrs, please feel free to follow up.

Best regards



Hi Andy,

What about this problem on your side now? If you have any concerns, please feel free to follow up.

Have a nice day.

Best regards



Hi Andy,

When you come back, if you need further assistance about this issue, please feel free to let us know. We will continue to work with this issue.

Have a nice day.

Best regards



Thank you for the reply. Sorry it has taken me so long to respond.  I did not receive any notification that someone had replied!

I am using Visual Studio 2010 Ultimate Edition and the .NET framework 4.0.  Actually, I am upgrading some old code written in VB 6.0, but I can use the latest and greatest thats available.

The application uses a browser control to go to the page, fill in values, click on UI elements, read the HTML that returns, etc.  The purpose of the application is to collection useful information regularily/automatically.

I know how to create a web service, but using the web control in such a service is problematic because the web browser control was meant to be placed on a windows form.  I am not able to create a new instance of it in a project designated as a windows service.



Andy

Thank you for the reply. Sorry it has taken me so long to respond.  I did not receive any notification that someone had replied!

I thought a web request was for web services (retrieving information from them).  I am trying to retreive useful information from a website designed for interaction by a human, such as selecting items from lists and clicking buttons.   I currently use a web browser control to programmatically do what a person would do and get the pages back which in turn get parsed.

Andy



Hi Andy,

There is a tool which could let you manipulate anything you want on the website. This agile HTML parser builds a read/write DOM and supports plain XPATH or XSLT. It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams). More information, please check:

http://htmlagilitypack.codeplex.com/

Have a nice day.

Best regards



Thanks for the suggestion.  I will go to that link and see if it will work.  I will update this post with what I find.

I am writing to check the status of the issue on your side. Would you mind letting us know the result of the suggestions? If you have any concerns, please feel free to follow up.

Have a nice day.



Best regards

Hi Liliane

Thank for the follow up reply.  I don't have an answer as of yet.  Implementing this is going to take time and I haven't been given the go-ahead by my boss to spend the time to pursue it.



Hi Andy,

Never minde. You could have a try when you feel free. If you have any further questions about this issue, please feel free to let us know. We will continue to work with you on this issue.

Have a nice day.

Best regards

Source: https://social.msdn.microsoft.com/Forums/vstudio/en-US/f5d565b1-236b-43c2-90c7-f5cc3b2c341b/scraping-a-website-from-a-windows-service

Sunday, 26 April 2015

Social Media Crawling & Scraping services for Brand Monitoring

Crawling social media sites for extracting information is a fairly new concept – mainly due to the fact that most of the social media networking sites have cropped up in the last decade or so. But it’s equally (if not more) important to grab this ever-expanding User-Generated-Content (UGC) as this is the data that companies are interested in the most – such as product/service reviews, feedback, complaints, brand monitoring, brand analysis, competitor analysis, overall sentiment towards the brand, and so on.

Scraping social networking sites such as Twitter, Linkedin, Google Plus, Instagram etc. is not an easy task for in-house data acquisition departments of most companies as these sites have complex structures and also restrict the amount and frequency of the data that they let out to crawlers. This kind of a task is best left to an expert, such as PromptCloud’s Social Media Data Acquisition Service – which can take care of your end-to-end requirements and provide you with the desired data in a minimal turnaround time. Most of the popular social networking sites such as Twitter and Facebook let crawlers extract data only through their own API (Application Programming Interface), so as to control the amount of information about their users and their activities.

PromptCloud respects all these restrictions with respect to access to content and frequency of hitting their servers to make sure that user information is not compromised and their experience with the site is unhindered.

Social Media Scraping Experts

At PromptCloud, we have developed an expertise in crawling and scraping social media data in real-time. Such data can be from diverse sources such as – Twitter, Linkedin groups, blogs, news, reviews etc. Popular usage of this data is in brand monitoring, trend watching, sentiment/competitor analysis & customer service, among others.

Our low-latency component can extract data on the basis of specific keywords, categories, geographies, or a combination of these. We can also take care of complexities such as multiple languages as well as tweets and profiles of specific users (based on keywords or geographies). Sample XML data can be accessed through this link – demo.promptcloud.com.

Structured data is delivered via a single REST-based API and every time new content is published, the feed gets updated automatically. We also provide data in any other preferred formats (XML, CSV, XLS etc.).

If you have a social media data acquisition problem that you want to get solved, please do get in touch with us.

Source: https://www.promptcloud.com/social-media-networking-sites-crawling-service/

Wednesday, 22 April 2015

SEO No No! Scraping & Splogging – Content Theft!

Until recently, you could as well as might possibly not have acknowledged how you can perform the earlier mentioned. Even so, the following element could be the really cool element.

Several. Get back to ScrapeBox Add-Ons and also down load your ScrapeBox Blog Analyzer add-on. Open it upwards, and transfer the actual .txt record you merely rescued. Struck start.

ScrapeBox goes through almost every back link you merely scraped and look these phones determine if these are your site that will ScrapeBox presently facilitates placing comments in. If it is, that turns environmentally friendly. If it isn’t, that turns reddish. Soon after it really is concluded, it is possible to “clean” the list insurance agencies the idea remove unsupported websites.

Just what you’re destined to be left with is ALL of the sites the competitor has back-links via, and most importantly, they all are capable of being mentioned in employing ScrapeBox!!

Help save that will “clean” listing with a report, import it this list involving websites you wish to touch upon, and then keep to the exact same steps you’d probably typically follow for you to touch upon websites. Inside of Ten mins you’ll have got all the comps website backlinks (which may be blocked by Public relations if you’d just like) along with you’ll be able to reply to every one of them inside a 20 min (because the list most likely won’t end up being Large).

Desire to force this specific even more?? Obviously you are doing, you’re in BHW

Each step is the same as over with the exception of one tiny issue as well as the addition of an extra step.

Instead of just employing a single foot print inside your first bounty (both from SB’s regular gui after which also the back link checker add-on) you’re likely to be using a A lot of open all of them. Here is what you do to consider this particular to a whole new amount.

Initial, you’re going to pick each of the URLs via AOL, Aol, Ask & Search engines using this footprint:

site:domainyourcompetingwith.org

That will go back ALL the at present found web pages in the area. Remove copy Web addresses along with save that will with a .TXT report.

Now, you’re planning to create the subsequent right in front of each of these URLs:

hyperlink:

Right now follow all of the steps while outlined above. Exactly what this may is actually obtain each of the backlinks to every single site of the rivals web site.

Because Google Back link Checker is simply capable of getting the first 1k Web addresses through Aol (while that’s all Google allows you to view) you could have missed out on a decent amount associated with website inbound links if they had been at night 1st 1k final results. Consequently performing the aforementioned further methods ensures that every brand new web site in the website anyone pay attention to backlinks implies a fresh and other pair of a listing of back-links that is possibly 1k back links long.

Now you understand how to locate, filtration along with take your competitors back links, stop looking at and also move and take action!

Source: https://freescrapeboxlist19.wordpress.com/

Saturday, 18 April 2015

How to Generate Sales Leads Using Web Scraping Services

The first stage of any selling process is what is popularly known as “lead generation”. This phase is what most businesses place at the apex of their sales concerns. It is a driving force that governs decision-making at its highest levels, and influences business strategy and planning. If you are about to embark on an outbound sales campaign and are in the process of looking for leads, you would acknowledge the fact that lead generation process is of extreme importance for any business.

Different lead generation techniques have been used over and over again by companies around the world to satiate this growing business need. Newer, more innovative methods have also emerged to help marketers in this process. One such method of lead generation that is fast catching on, and is poised to play a big role for businesses in the coming years, is web scraping. With web scraping, you can easily get access to multiple relevant and highly customized leads – a perfect starting point for any marketing, promotional or sales campaign.

The prominence of Web Scraping in overall marketing strategy

At present, levels of competition have risen sky high for most businesses. For success, lead generation and gaining insight about customer behavior and preferences is an essential business requirement. Web scraping is the process of scraping or mining the internet for information. Different tools and techniques can be used to harvest information from multiple internet sources based on relevance, and the structured and organized in a way that makes sense to your business. Companies that provide web scraping services essentially use web scrapers to generate a targeted lead database that your company can then integrate into its marketing and sales strategies and plans.

The actual process of web scraping involves creating scraping scripts or algorithms which crawl the web for information based on certain preset parameters and options. The scraping process can be customized and tuned towards finding the kind of data that your business needs. The script can extract data from websites automatically, collate and put together a meaningful collection of leads for business development.

Lead Generation Basics

At a very high level, any person who has the resources and the intent to purchase your product or service qualifies as a lead. In the present scenario, you need to go far deeper than that. Marketers need to observe behavior patterns and purchasing trends to ensure that a particular person qualifies as a lead. If you have a group of people you are targeting, you need to decide who the viable leads will be, acquire their contact information and store it in a database for further action.

List buying used to be a popular way to get leads, but their efficacy has dwindled over time. Web scraping is the fast coming up as a feasible lead generation technique, allowing you to find highly focused and targeted leads in short amounts of time. All you need is a service provider that would carry out the data mining necessary for lead generation, and you end up with a list of actionable leads that you can try selling to.

How Web Scraping makes a substantial difference

With web scraping, you can extract valuable predictive information from websites. Web scraping facilitates high quality data collection and allows you to structure marketing and sales campaigns better. To drive sales and maximize revenue, you need strong, viable leads. To facilitate this, you need critical data which encompasses customer behavior, contact details, buying patterns and trends, willingness and ability to spend resources, and a myriad of other aspects critical to ascertain the potential of an entity as a rewarding lead. Data mining through web scraping can be a great way to get to these factors and identifying the leads that would make a difference for your business.

web-scraping-service

Crawling through many different web locales using different techniques, web scraping services pick up a wealth of information. This highly relevant and specialized information instantly provides your business with actionable leads. Furthermore, this exercise allows you to fine-tune your data management processes, make more accurate and reliable predictions and projections, arrive at more effective, strategic and marketing decisions and customize your workflow and business development to better suit the current market.

The Process and the Tools

Lead generation, being one of the most important processes for any business, can prove to be an expensive proposition if not handled strategically. Companies spend large amounts of their resources acquiring viable leads they can sell to. With web scraping, you can dramatically cut down the costs involved in lead generation and take your business forward with speed and efficiency. Here are some of the time-tested web scraping tools which can come in handy for lead generation –

•    Website download software – Used to copy entire websites to local storage. All website pages are downloaded and the hierarchy of navigation and internal links preserve. The stored pages can then be viewed and scoured for information at any later time.     Web scraper – Tools that crawl through bulk information on the internet, extracting specific, relevant data using a set of pre-defined parameters.

•    Data grabber – Sifts through websites and databases fast and extracts all the information, which can be sorted and classified later.

•    Text extractor – Can be used to scrape multiple websites or locations for acquiring text content from websites and web documents. It can mine data from a variety of text file formats and platforms.

With these tools, web scraping services scrape websites for lead generation and provide your business with a set of strong, actionable leads that can make a difference.

Covering all Bases

The strength of web scraping and web crawling lies in the fact that it covers all the necessary bases when it comes to lead generation. Data is harvested, structured, categorized and organized in such a way that businesses can easily use the data provided for their sales leads. As discussed earlier, cold and detached lists no longer provide you with enough actionable leads. You need to look at various factors and consider them during your lead generation efforts –

•    Contact details of the prospect

•    Purchasing power and purchasing history of the prospect

•    Past purchasing trends, willingness to purchase and history of buying preferences of the prospect

•    Social markers that are indicative of behavioral patterns

•    Commercial and business markers that are indicative of behavioral patterns

•    Transactional details

•    Other factors including age, gender, demography, social circles, language and interests

All these factors need to be taken into account and considered in detail if you have to ensure whether a lead is viable and actionable, or not. With web scraping you can get enough data about every single prospect, connect all the data collected with the help of onboarding, and ascertain with conviction whether a particular prospect will be viable for your business.

Let us take a look at how web scraping addresses these different factors –

1. Scraping website’s


During the scraping process, all websites where a particular prospect has some participation are crawled for data. Seemingly disjointed data can be made into a sensible unit by the use of onboarding- linking user activities with their online entities with the help of user IDs. Documents can be scanned for participation. E-commerce portals can be scanned to find comments and ratings a prospect might have delivered to certain products. Service providers’ websites can be scraped to find if the prospect has given a testimonial to any particular service. All these details can then be accumulated into a meaningful data collection that is indicative of the purchasing power and intent of the prospect, along with important data about buying preferences and tastes.

2. Social scraping

According to a study, most internet users spend upwards of two hours every day on social networks. Therefore, scraping social networks is a great way to explore prospects in detail. Initially, you can get important identification markers like names, addresses, contact numbers and email addresses. Further, social networks can also supply information about age, gender, demography and language choices. From this basic starting point, further details can be added by scraping social activity over long periods of time and looking for activities which indicate purchasing preferences, trends and interests. This exercise provides highly relevant and targeted information about prospects can be constructively used while designing sales campaigns.

Check out How to use Twitter data for your business

3. Transaction scraping

Through the scraping of transactions, you get a clear idea about the purchasing power of prospects. If you are looking for certain income groups or leads that invest in certain market sectors or during certain specific periods of time, transaction scraping is the best way to harvest meaningful information. This also helps you with competition analysis and provides you with pointers to fine-tune your marketing and sales strategies.

get-results-from-your-lead-generation-campaign

Using these varied lead generation techniques and finding the right balance and combination is key to securing the right leads for your business. Overall, signing up for web scraping services can be a make or break factor for your business going forward. With a steady supply of valuable leads, you can supercharge your sales, maximize returns and craft the perfect marketing maneuvers to take your business to an altogether new dimension.

Source: https://www.promptcloud.com/blog/how-to-generate-sales-leads-using-web-scraping-services/