Website scraper

Status: CLOSED
Bids: 20
Avg Bid (USD): $1188
Project Budget (USD): $750 - $1500

Project Description:
We are looking for a scraper that extracts the main text content from a website and ignores things like ads, menus, etc. The scraper should be generic, so that we do not need to alter it in any way for each additional website. The goal is to build a scraper that is both smart and efficient. The details of the implementation are up to the developer, but at a high level we have a number of requests and recommendations regarding architecture and minimum requirements.

The main requirement is that we should be able to give this scraper the URL of almost any HTML/XHTML-based website and have it automatically scrape the entire site.
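To make "scrape the entire site" more concrete, here is a minimal Java sketch of the bookkeeping such a crawl needs: a queue of pages still to fetch, restricted to the starting host and de-duplicated. It is only an illustration under stated assumptions. `CrawlFrontier`, `offer`, and `next` are hypothetical names, the start URL is assumed to be an absolute http(s) URL, and fetching pages and extracting links are left out.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: tracks which same-site URLs still need to be fetched.
class CrawlFrontier {
    private final String host;                        // host of the start URL
    private final Deque<URI> queue = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>(); // URLs already queued

    // Assumes startUrl is an absolute http(s) URL such as "http://example.com/".
    CrawlFrontier(String startUrl) {
        URI start = URI.create(startUrl).normalize();
        this.host = start.getHost();
        offer(start, startUrl);
    }

    // Resolves a link found on `page`; queues it if it is new and on the same host.
    boolean offer(URI page, String href) {
        URI target;
        try {
            target = page.resolve(href.trim()).normalize();
        } catch (IllegalArgumentException e) {
            return false; // malformed link, skip it
        }
        String scheme = target.getScheme();
        if (scheme == null
                || !(scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https"))) {
            return false; // mailto:, javascript:, etc.
        }
        if (!host.equalsIgnoreCase(target.getHost())) {
            return false; // external site
        }
        // Drop the fragment so "page.html#top" and "page.html" count as one URL.
        String key = target.toString();
        int hashPos = key.indexOf('#');
        if (hashPos >= 0) {
            key = key.substring(0, hashPos);
        }
        if (!seen.add(key)) {
            return false; // already queued or visited
        }
        queue.add(URI.create(key));
        return true;
    }

    // Next URL to fetch, or null when the crawl is finished.
    URI next() {
        return queue.poll();
    }
}
```

A crawl loop would call `next()`, fetch and parse that page, and pass every link it finds to `offer(...)` until `next()` returns null.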

Relevant Pages
While doing so, it needs to understand which pages are relevant (contain primary/useful content), which pages are secondary (e.g. “about us”, “terms of service”, “careers”, etc.), and which are completely irrelevant. Most importantly, we need to know which pages are relevant, so that everything else can be filtered out.

Information Architecture
The scraper needs to make note of the hierarchy of data on the website and how it is organized. For example, it should take into account any hierarchical categorization and how pages on the site fit into those categories. Keep in mind that a page could be associated with multiple categories, and there could be more than one kind of taxonomy per site.
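One possible way to record this, sketched in Java, is to store each category as a path (e.g. Science > Physics) inside a named taxonomy, and to let a page be assigned to any number of paths in any number of taxonomies. The class name `SiteTaxonomy`, its methods, and the example taxonomy names are hypothetical and shown only to illustrate the data shape, not a required design.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical store: taxonomy name -> category path -> page URLs.
class SiteTaxonomy {
    private final Map<String, Map<List<String>, Set<String>>> byTaxonomy = new HashMap<>();

    // A page may be assigned to many category paths, in many taxonomies.
    void assign(String taxonomy, List<String> categoryPath, String pageUrl) {
        byTaxonomy.computeIfAbsent(taxonomy, t -> new HashMap<>())
                  .computeIfAbsent(new ArrayList<>(categoryPath), p -> new TreeSet<>())
                  .add(pageUrl);
    }

    // All pages in the given category or in any of its sub-categories.
    Set<String> pagesUnder(String taxonomy, List<String> prefix) {
        Set<String> result = new TreeSet<>();
        Map<List<String>, Set<String>> categories =
                byTaxonomy.getOrDefault(taxonomy, Collections.emptyMap());
        for (Map.Entry<List<String>, Set<String>> e : categories.entrySet()) {
            List<String> path = e.getKey();
            if (path.size() >= prefix.size()
                    && path.subList(0, prefix.size()).equals(prefix)) {
                result.addAll(e.getValue());
            }
        }
        return result;
    }

    // Every category a page belongs to, formatted as "taxonomy: A > B".
    Set<String> categoriesOf(String pageUrl) {
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Map<List<String>, Set<String>>> t : byTaxonomy.entrySet()) {
            for (Map.Entry<List<String>, Set<String>> c : t.getValue().entrySet()) {
                if (c.getValue().contains(pageUrl)) {
                    result.add(t.getKey() + ": " + String.join(" > ", c.getKey()));
                }
            }
        }
        return result;
    }
}
```

For example, a page could be filed under Science > Physics in a "topic" taxonomy and under an author name in an "author" taxonomy at the same time.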

Parsing Page Content
The scraper should determine which parts of the webpage are actual unique content, what the page title is, and which portions are “irrelevant” (sidebars, headers, footers, ads, miscellaneous widgets). This is highly important because we want to save only the title and the unique page content in our database (along with the categories the page fits into, etc.); everything else should be filtered out.

Suggested Architecture
We have a recommendation about how to parse page content as described above, that is, how to determine which parts of the webpage are unique and which should be filtered out (sidebars, headers, footers, ads, widgets). It would probably make most sense to maintain a hierarchical object of the DOM, so that every node in the HTML (and the HTML within it) has its own object inside this DOM array/object, recursively. Basically the equivalent of http://simplehtmldom.sourceforge.net (except we don’t want to use PHP because it’s very inefficient at parsing).

Our goal is then to determine which content is unique on each page and which content recurs regularly on other pages (sidebars, headers, etc.). One way to achieve this is to generate an MD5 hash of the HTML contained within each HTML node and to create a database of these hashes (and how many times each of them occurs). We can then look up each hash on the current page in that database to see whether the node is unique or commonly recurring.

More specifically, when generating MD5s (if that’s the method we use), we should strip most attributes from the HTML tags, because things like “class”, “id”, “style”, etc. might appear in menus of the site and could vary from page to page, such as when a menu item is marked as current/selected. To avoid these irregularities, we want to look mainly at the structure of the DOM and the content inside it, ignoring many or most of the HTML attributes inside tags.

However, this is just a suggestion, and other ideas are welcome if you can think of something better. In terms of architectural requirements, please remember that performance is a very high priority for us.
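As a rough illustration of the hashing suggestion above, here is a minimal Java sketch. It rests on several assumptions: `DomNode` is a hypothetical, already-parsed DOM node (HTML parsing is not shown), `BoilerplateDetector` and its methods are hypothetical names, the hash counts live in memory rather than in a database, and all attributes are ignored rather than only most of them. It also hashes bottom-up, deriving each node's hash from its tag name and its children's hashes, which is a variation on hashing the raw inner HTML that avoids re-serializing every subtree.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical minimal DOM node: an element (tag != null) or a text node.
class DomNode {
    final String tag;
    final String text;
    final Map<String, String> attrs = new HashMap<>();
    final List<DomNode> children = new ArrayList<>();

    private DomNode(String tag, String text) { this.tag = tag; this.text = text; }
    static DomNode element(String tag) { return new DomNode(tag.toLowerCase(), null); }
    static DomNode textNode(String text) { return new DomNode(null, text); }
    DomNode attr(String name, String value) { attrs.put(name, value); return this; }
    DomNode add(DomNode child) { children.add(child); return this; }
}

class BoilerplateDetector {
    // hash -> number of pages on which a node with that hash appeared
    private final Map<String, Integer> pageCounts = new HashMap<>();

    static String md5(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, d));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Hashes a node from its tag and its children's hashes; attributes are never read.
    static String hash(DomNode n, Set<String> collected) {
        String h;
        if (n.tag == null) {
            h = md5("#text:" + n.text.trim().replaceAll("\\s+", " "));
        } else {
            StringBuilder sb = new StringBuilder("<" + n.tag + ">");
            for (DomNode child : n.children) {
                sb.append(hash(child, collected));
            }
            h = md5(sb.toString());
        }
        collected.add(h);
        return h;
    }

    // Records every node hash of one page; each hash counts once per page.
    void learn(DomNode pageRoot) {
        Set<String> onThisPage = new HashSet<>();
        hash(pageRoot, onThisPage);
        for (String h : onThisPage) {
            pageCounts.merge(h, 1, Integer::sum);
        }
    }

    // True if this node's hash was seen on at least minPages learned pages.
    boolean isRecurring(DomNode n, int minPages) {
        return pageCounts.getOrDefault(hash(n, new HashSet<>()), 0) >= minPages;
    }
}
```

In this sketch a menu whose only difference between pages is a class="current" attribute hashes identically on every page and is flagged as recurring, while an article body that appears on a single page is not. The `minPages` threshold is an assumed tuning parameter.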

The choice of programming language is largely up to the developer. However, we highly recommend a lower-level language like C or C++, or maybe something a bit higher-level like Java or C#. PHP, Ruby, etc. probably wouldn't work. We need as much power as we can get.

Skills required:
C Programming, C++ Programming, Java
About the employer:
Verified


$1443 in 15 days
$1500 in 3 days
$773 in 10 days
$1000 in 7 days
$842 in 7 days
$1500 in 30 days
$1444 in 15 days
$1159 in 15 days
$1350 in 25 days
$888 in 10 days