We are looking for a scraper that extracts the main text content from a site and ignores things like ads, menus, etc. The scraper should be generic, so that we do not need to alter it in any way for each additional website. The goal is to build a scraper that is both smart and efficient. The details of the implementation are up to the developer, but at a high level we have a number of requests and recommendations regarding architecture and minimum requirements.
The main requirement is that we should be able to give this scraper almost any URL to an HTML/XHTML-based website and have it automatically scrape the entire site.
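The whole-site requirement above amounts to a crawl loop confined to one host. The following is a minimal sketch in Python, purely for illustration (a production version would likely be written in one of the compiled languages recommended later in this brief). The `fetch` callback and the example URLs are hypothetical, and the sketch deliberately ignores robots.txt, politeness delays, and non-HTML responses.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from collections import deque

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag encountered in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, fetch, max_pages=1000):
    """Breadth-first crawl confined to the start URL's host.
    `fetch(url) -> html or None` is injected so the crawler itself
    stays transport-agnostic (and testable without a network)."""
    host = urlparse(start_url).netloc
    seen, queue, pages = {start_url}, deque([start_url]), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href).split("#")[0]  # drop fragments
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages
```

Injecting `fetch` also makes it easy to later swap in a pool of concurrent fetchers without touching the crawl logic.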
While doing so, it needs to understand which pages are relevant (contain primary/useful content), which pages are secondary (e.g. "about us", "terms of service", "careers"), and which are completely irrelevant. Most importantly, we need to identify the relevant pages so that everything else can be filtered out.
The scraper needs to make note of the hierarchy of data on the website and how it is organized. For example, it should take into account any hierarchical categorization and how pages on the site fit into those categories. Keep in mind that a page could be associated with multiple categories, and there could be more than one kind of taxonomy per site.
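To make the multi-taxonomy requirement concrete, here is one possible shape for the stored records, sketched in Python. The names (`Page`, `Category`) and the example taxonomies are entirely hypothetical; the point is only that a page holds a list of category memberships, each of which records both which taxonomy it belongs to and its hierarchical path within it.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Category:
    taxonomy: str            # a site may have several, e.g. "topic", "region"
    path: Tuple[str, ...]    # hierarchical path within that taxonomy

@dataclass
class Page:
    url: str
    title: str = ""
    categories: List[Category] = field(default_factory=list)  # many allowed

# Hypothetical example: one page filed under two different taxonomies.
page = Page(url="http://example.com/item/42", title="Some item")
page.categories.append(Category("topic", ("Electronics", "Cameras")))
page.categories.append(Category("region", ("Europe", "Sweden")))
```

This maps straightforwardly onto a relational schema (a pages table plus a page-to-category join table) if the database ends up being SQL-based.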
Parsing Page Content
The scraper should determine which parts of the webpage are actual unique content, what the page title is, and what all the “irrelevant” portions are (sidebars, headers, footers, ads, miscellaneous widgets). This is highly important because we want to save only the title and unique page content in our database (and of course categories that it fits into, etc). In other words, we want to filter out sidebars, headers, footers, ads, etc.
We have a recommendation for how to parse page content as described above, i.e. how to determine which parts of the webpage are unique and which should be filtered out (sidebars, headers, footers, ads, widgets). It would probably make the most sense to maintain a hierarchical object model of the DOM, so that every node in the HTML (and the HTML within it) has its own object, recursively. Basically the equivalent of http://simplehtmldom.sourceforge.net (except we don't want to use PHP, because it is very inefficient at parsing).

The goal is then to determine which content is unique to each page, and which content recurs regularly across other pages (sidebars, headers, etc). One way to achieve this is to generate an MD5 hash of the HTML contained within each node and build a database of these hashes, along with how many times each of them occurs. Each hash on the current page can then be compared against the database to see whether that block is unique or commonly recurring.

More specifically, when generating the MD5s (if that is the method we use), most attributes should first be stripped from the HTML tags: attributes like "class", "id", and "style" appear in site menus and can vary from page to page, for example when a menu item needs to be marked as current/selected. To avoid these irregularities, we want to hash mainly the structure of the DOM and the content inside it, ignoring many or most of the HTML attributes.

However, this is just a suggestion, and other ideas are welcome if you can think of something better. In terms of architectural requirements, please remember that performance is a very high priority for us.
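The hash-and-count idea above can be sketched compactly. This is an illustrative Python prototype, not the recommended production implementation: the brief calls for a compiled language, and `xml.etree.ElementTree` is used here only because it is in the standard library and assumes well-formed (XHTML-like) markup; a real crawler would use an error-tolerant HTML parser. The fingerprint covers tag structure and text while ignoring all attributes, exactly as suggested.

```python
import hashlib
import xml.etree.ElementTree as ET
from collections import Counter

def node_fingerprint(node):
    """MD5 over a node's tag structure and text content only.
    Attributes (class/id/style) are deliberately ignored, since they
    may vary between pages, e.g. a menu item marked current/selected."""
    parts = [node.tag, (node.text or "").strip()]
    for child in node:
        parts.append(node_fingerprint(child))
        parts.append((child.tail or "").strip())
    return hashlib.md5("|".join(parts).encode("utf-8")).hexdigest()

def all_fingerprints(root):
    """Fingerprints of every subtree in a document, for the corpus counts."""
    result = [node_fingerprint(root)]
    for child in root:
        result.extend(all_fingerprints(child))
    return result

def strip_boilerplate(node, corpus_counts, threshold):
    """Remove child subtrees whose fingerprint recurs across the corpus
    at least `threshold` times; what survives is the unique content.
    In practice the threshold would be set relative to corpus size."""
    for child in list(node):
        if corpus_counts[node_fingerprint(child)] >= threshold:
            node.remove(child)
        else:
            strip_boilerplate(child, corpus_counts, threshold)
```

Note that MD5 is fine here because it is used purely for deduplication, not security; a production version would also memoize the per-node hashes so each subtree is fingerprinted once rather than recomputed at every level.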
The choice of programming language is largely up to the developer. However, we highly recommend a lower-level language like C or C++, or perhaps something a bit higher-level like Java or C#. PHP, Ruby, etc. probably wouldn't work; we need as much power as we can get.