Closed

Website scraper

We are looking for a scraper that extracts the main text content from a website and ignores things like ads and menus. The scraper should be generic, so that we do not need to alter it in any way for each additional website. The goal is to build a scraper that is both smart and efficient. The details of the implementation are up to the developer, but at a high level we have a number of requests and recommendations regarding architecture and minimum requirements.

The main requirement is that we should be able to give this scraper almost any URL to an HTML/XHTML-based website and have it automatically scrape the entire site.
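
As a rough illustration of "scrape the entire site", the sketch below does a breadth-first crawl that stays on the starting host. It is not a required design. Python is used only to keep the example short (the posting itself prefers a lower-level language), and `crawl_site`, `LinkCollector`, the `fetch` callable and the `max_pages` limit are all hypothetical names introduced here.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def crawl_site(start_url, fetch, max_pages=1000):
    """Breadth-first crawl of the same-host pages reachable from start_url.

    `fetch` is an assumed callable: URL -> HTML string, or None on failure.
    Returns a dict mapping each fetched URL to its HTML.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.hrefs:
            # Resolve relative links and drop any #fragment.
            absolute, _fragment = urldefrag(urljoin(url, href))
            parsed = urlparse(absolute)
            same_site = parsed.scheme in ("http", "https") and parsed.netloc == host
            if same_site and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages
```

Passing `fetch` in as a parameter keeps the traversal logic separate from networking, so it can be exercised against an in-memory dict of pages before a real HTTP client is plugged in.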

Relevant Pages

While scraping a site, the scraper needs to understand which pages are relevant (contain primary/useful content), which pages are secondary (e.g. “about us”, “terms of service”, “careers”), and which are completely irrelevant. Most importantly, we need to know which ones are relevant, so that everything else can be filtered out.

Information Architecture

The scraper needs to make note of the hierarchy of data on the website and how it is organized. For example, it should take into account any hierarchical categorization and how pages on the site fit into those categories. Keep in mind that a page could be associated with multiple categories, and there could be more than one kind of taxonomy per site.
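
One possible way to record this, sketched under assumptions: each page keeps a set of category paths per taxonomy, so a page can sit in several categories and a site can have several taxonomies. `PageRecord`, `add_category` and `pages_under` are hypothetical names, and Python is used only for brevity.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class PageRecord:
    url: str
    title: str
    content: str
    # taxonomy name -> set of category paths; each path is a tuple running
    # from the top-level category down to the most specific one
    categories: dict = field(default_factory=lambda: defaultdict(set))

    def add_category(self, taxonomy, path):
        self.categories[taxonomy].add(tuple(path))


def pages_under(records, taxonomy, prefix):
    """Return URLs of pages filed under `prefix` or any of its sub-categories."""
    prefix = tuple(prefix)
    return [
        r.url
        for r in records
        if any(p[:len(prefix)] == prefix for p in r.categories.get(taxonomy, ()))
    ]
```

For example, a recipe page could be filed under ("Recipes", "Desserts") and ("Holidays",) in a "topic" taxonomy and under ("Jane",) in an "author" taxonomy, and `pages_under(records, "topic", ["Recipes"])` would still find it.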

Parsing Page Content

The scraper should determine which parts of the webpage are actual unique content, what the page title is, and which portions are “irrelevant” (sidebars, headers, footers, ads, miscellaneous widgets). This is highly important because we want to save only the title and the unique page content in our database (along with the categories the page fits into, etc).

Suggested Architecture

We have a recommendation for how to parse page content as described above, that is, how to determine which parts of the webpage are unique and which should be filtered out (sidebars, headers, footers, ads, widgets). It would probably make the most sense to maintain a hierarchical object representing the DOM, so that every node in the HTML (and the HTML within it) has its own object inside this structure, recursively. This is basically the equivalent of [url removed, login to view] (except that we don’t want to use PHP because it is very inefficient at parsing).

Our goal is then to determine which content is unique to each page and which content recurs on other pages (sidebars, headers, etc). One way to achieve this is to generate an MD5 hash of the HTML contained within each node and to keep a database of these hashes together with the number of times each one occurs. We can then look up each hash on the current page in the database to see whether the node is unique or commonly recurring.

More specifically, when generating MD5s (if that is the method we use), we should strip most attributes from the HTML tags, because attributes like “class”, “id” and “style” may appear in the site's menus and can vary from page to page, for example to mark a menu item as current/selected. To avoid these irregularities, we want to look mainly at the structure of the DOM and the content inside it, ignoring many or most of the HTML attributes.

However, this is just a suggestion, and other ideas are welcome if you can think of something better. In terms of architectural requirements, please remember that performance is a very high priority for us.
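
The hashing idea described above can be sketched roughly as follows. This is an illustration under assumptions, not a finished design: it drops every attribute (the text says "most"), collapses whitespace, skips script and style text, and counts each hash once per page in an in-memory `Counter` rather than a database. All class and function names are hypothetical, and Python is used only for brevity given the stated preference for a lower-level language.

```python
import hashlib
from collections import Counter
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
SKIP_TEXT_IN = {"script", "style"}  # assumption: their text is never content


class Node:
    def __init__(self, tag):
        self.tag = tag
        self.children = []  # child Node objects and text strings, in order

    def signature(self):
        """Serialize tag structure and text with every attribute dropped."""
        inner = "".join(c if isinstance(c, str) else c.signature()
                        for c in self.children)
        return f"<{self.tag}>{inner}</{self.tag}>"

    def md5(self):
        return hashlib.md5(self.signature().encode("utf-8")).hexdigest()

    def walk(self):
        yield self
        for c in self.children:
            if isinstance(c, Node):
                yield from c.walk()


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)  # attrs are deliberately ignored
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and self.stack[-1].tag not in SKIP_TEXT_IN:
            self.stack[-1].children.append(text)


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def count_hashes(pages):
    """Count on how many pages each node hash occurs (once per page)."""
    counts = Counter()
    for html in pages:
        counts.update({node.md5() for node in parse(html).walk()})
    return counts


def unique_text(node, counts, max_pages=1):
    """Collect text not inside any subtree seen on more than max_pages pages."""
    if counts[node.md5()] > max_pages:
        return []  # recurring block such as a menu or footer
    out = []
    for c in node.children:
        if isinstance(c, str):
            out.append(c)
        else:
            out.extend(unique_text(c, counts, max_pages))
    return out
```

A caller would first run `count_hashes` over all crawled pages and then call `unique_text(parse(html), counts)` per page. The sketch recomputes each node's signature from scratch, which is wasteful on deep trees; a performance-minded version would compute hashes bottom-up and cache them.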

The choice of programming language is largely up to the developer. However, we highly recommend using a lower-level language like C or C++, or perhaps something a bit higher-level like Java or C#. PHP, Ruby, etc. probably wouldn't work. We need as much power as we can get.

Skills: C Programming, C++ Programming, Java

About the Employer:
( 0 reviews ) United States

Project ID: #5049946

16 freelancers are bidding on average $1184 for this job

samitXI

Hi Sir, I am ready to work for you. I have 9 years of experience in C/C++ and Java. Please see some of my work and check my reviews to get a better idea of my skills. I deliver quality work within the time frame. P More

$1443 USD in 15 days
(139 Reviews)
7.0
szymszteinsl

Hi! I am a professional C/C++/C#/Java programmer. I can do this project with the highest quality. Best regards, Szymszteinsl

$1500 USD in 3 days
(29 Reviews)
6.0
SigmaVisual

Dear Client, I can help in your project. We have already experience of working on similar projects. Please see below to get idea of our experience: Amazon/Ebay Bots: [url removed, login to view] More

$773 USD in 10 days
(20 Reviews)
5.9
nani01029x

I have done some web scraper projects and received positive feedback from clients. I have a very high completion rate. You can check my profile for more information. Let me help you. Tinh Nguyen.

$1000 USD in 7 days
(69 Reviews)
5.4
wbslivera

Hello, I am Oracle certified professional java programmer and have [url removed, login to view] and I have done many scrapers before. I use java and selenium/htmlunit/jsoup technologies together for scrappers, previously I have created scra More

$842 USD in 7 days
(39 Reviews)
4.9
dimplex

HI, Thank you for considering my bid. Based on my experience with YP across various countries, I can offer a proven pattern that solves a number of problems not mentioned here. The language is Java and I'd recommend More

$1500 USD in 30 days
(22 Reviews)
4.5
artursharipov

Hi, I'm a professional software developer with machine learning skills. I've created several generic web site scrapers and know what you want to achieve. I suggest to apply machine learning, in this case the scraper More

$1444 USD in 15 days
(7 Reviews)
4.3
irshadwazid

Hello, Thanks, for giving us a chance to bid on your project. Please check private message box for more details. We are WebGloabalIt is an (ISO 9001:2008) (ISO 27001:2005) certified company who have throught 5 More

$1159 USD in 15 days
(2 Reviews)
2.9
TIWORLD

Hi, We are very clear with the specification mentioned and ready to start your project immediately. We are pleased about having the opportunity to work together. Sir , TI World is Indian base company. You can see o More

$1350 USD in 25 days
(2 Reviews)
2.6
indianpws

Hi, I am an IT professional with more than 15 years of experience. I am a SCJP (Sun Certified Java Professional), OCEJWCD (Oracle Certified Enterprise Java Web Components Developer), SCEA (Sun Certified Enterprise Ar More

$888 USD in 10 days
(2 Reviews)
2.2
rameshsharma644

Hello There, Greetings to you from Globussoft, We are the masters of Scraper development and have over 350+ scrapers in our product list. We can definitely help you in your scraper directory development. We More

$1250 USD in 7 days
(0 Reviews)
0.0
kohlivirat511

Hello There, Greetings to you from Globussoft, We are the masters of Scraper development and have over 350+ scrapers in our product list. We can definitely help you in your scraper directory development. We More

$1111 USD in 30 days
(0 Reviews)
0.0
rohithr1990

I have already done something like this in Java using the Simplescrap library. I can modify the code according to your problem statement. I also have experience in scraping using Python with the Scrapy framework.

$1250 USD in 12 days
(0 Reviews)
1.0
mandarfreelancer

Hello Excellent description! All a programmer needs is a precise specification like you have given. Let me discuss key points that are relevant. C / C++ FOR EFFICIENCY Although we can use higher level language More

$833 USD in 30 days
(0 Reviews)
0.0
kishorsahu300

Hello There, Greetings to you from Globussoft, We are the masters of Scraper development and have over 350+ scrapers in our product list. We can definitely help you in your scraper directory development. We More

$1111 USD in 8 days
(0 Reviews)
0.0
skivsoft

Hi! This is quite an interesting problem. I can implement the requested features as a multi-threaded application in C# or Java. Best regards, Skiv

$1500 USD in 30 days
(0 Reviews)
0.0
Askarali82

Hi, I am a C++ programmer. I may try to write your web scraper in C++ using socket programming. I haven't written a scraper but I have experience in socket programming. I wrote Windows service using sockets. So More

$1333 USD in 30 days
(0 Reviews)
0.0
SonDangHong

Hello I has a PHP similar your requirement. Please see 2 link below 1. [url removed, login to view]://fileom.com/th65mfiptl7z/Daphne_setup-Downloader.exe.html&landing=[url removed, login to view] More

$1388 USD in 20 days
(0 Reviews)
0.0
omanasoft

APPLICATION: The application will implement scraping and extraction of relevant-information, from multiple Web-sites of different genres, use the data to populate database. Relevance to be determined on the basis of More

$1333 USD in 20 days
(0 Reviews)
0.0
damo303030

Hi, I am a Java and C# expert. I will have no problem developing a web scraper to extract the data you require. I can complete the project in less than 15 days. Let me know if you want to discuss the requirements further.

$750 USD in 15 days
(0 Reviews)
0.0