Web Data Extraction - repost


Scope:

Develop a system using Apache Nutch, Apache Hadoop, and Apache Solr to crawl pages @100 (configurable) for the given websites on a round-robin basis and store them automatically in a per-site folder on Hadoop, named after the website.
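As a sketch of the round-robin site scheduling and per-site folder naming (the `/crawl` root path and the function names are illustrative assumptions, not part of the spec):

```python
from itertools import cycle
from urllib.parse import urlparse

def site_folder(site_url, hdfs_root="/crawl"):
    """Derive the per-site HDFS folder name from the website's host."""
    host = urlparse(site_url).netloc or site_url
    return f"{hdfs_root}/{host}"

def schedule(sites, pages_per_site=100):
    """Yield (site, folder, page_budget) assignments in round-robin order."""
    for site in cycle(sites):
        yield site, site_folder(site), pages_per_site
```

The page budget per site is the configurable @100 from the scope; the scheduler simply cycles through the site list forever, so each site gets an equal turn.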

Some websites require authentication, i.e. a user ID and password. The system should therefore be able to supply credentials dynamically at runtime by reading them from a text file or an XML configuration file. It should be able to store multiple user credentials and provide them on a round-robin basis.
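A minimal sketch of reading credentials from an XML configuration file and serving them round-robin (the XML layout shown is an assumption; the actual file format would be agreed during the project):

```python
import itertools
import xml.etree.ElementTree as ET

def load_credentials(xml_text):
    """Parse <credentials><user id="..." password="..."/>...</credentials>."""
    root = ET.fromstring(xml_text)
    return [(u.get("id"), u.get("password")) for u in root.findall("user")]

def credential_cycle(creds):
    """Round-robin iterator over the stored credentials."""
    return itertools.cycle(creds)
```

Each authentication-protected fetch would then take the next credential pair from the cycle, so multiple accounts are used evenly.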

Crawled pages will be stored in the respective site folders on Apache Hadoop.

Crawled page contents and metadata will be stored and indexed in Solr with the following fields.

All documents (PDF, video, audio, DOC, DOCX, JPEG, PNG, etc.) will be stored in folders with clear identification, i.e. keyed by URL, so that each web page can be reconstructed from the stored content.

The crawling will be focused: first the metadata is extracted and passed to an API, which either passes or fails it. If passed, the whole page content is extracted and processed further. The API will be provided as part of the project.
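The metadata gate could look like the following sketch, where `relevance_api` stands in for the API to be provided with the project and `fetch_full_page` is a hypothetical callback:

```python
def should_crawl(metadata, relevance_api):
    """Pass the page metadata to the provided API; crawl only if it accepts."""
    return bool(relevance_api(metadata))

def process_page(metadata, fetch_full_page, relevance_api):
    """Fetch and process the full page only when the metadata gate passes."""
    if should_crawl(metadata, relevance_api):
        return fetch_full_page()
    return None  # metadata failed the gate: the page is discarded
```

The point of the gate is that the (cheap) metadata fetch happens for every candidate URL, while the (expensive) full-content extraction happens only for pages the API accepts.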

Solr Fields:

• Site

• Title

• Host

• Segment

• Boost

• Digest

• Time Stamp

• Url

• Site Content (Text)

• Site Content (HTML)

• Metadata (Keywords, Content)

• Metadata (Description, Content)
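As an illustration, a document carrying these fields might be assembled like this before posting to Solr. The field names and the MD5 digest follow common Nutch indexing conventions but are assumptions here; they should be adjusted to the actual Solr schema used:

```python
import hashlib
from datetime import datetime, timezone

def make_solr_doc(site, url, title, html, text, meta, segment, boost=1.0):
    """Assemble a Solr document with the fields listed above."""
    host = url.split("/")[2] if "//" in url else site
    return {
        "site": site,
        "title": title,
        "host": host,
        "segment": segment,
        "boost": boost,
        "digest": hashlib.md5(html.encode("utf-8")).hexdigest(),  # content checksum
        "tstamp": datetime.now(timezone.utc).isoformat(),
        "url": url,
        "content_text": text,
        "content_html": html,
        "meta_keywords": meta.get("keywords", ""),
        "meta_description": meta.get("description", ""),
    }
```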


[url removed, login to view] (URLs)

Typical Steps:

1. The first step is to load the URL State database with an initial set of URLs. These can be a broad set of top-level domains such as the 1.7 million web sites with the highest US-based traffic, or the results from selective searches against another index, or manually selected URLs that point to specific, high quality pages.

2. Once the URL State database has been loaded with some initial URLs, the first loop in the focused crawl can begin. The first step in each loop is to extract all of the unprocessed URLs, and sort them by their link score.

3. Next comes one of the two critical steps in the workflow. A decision is made about how many of the top-scoring URLs to process in this loop.

4. Once the set of accepted URLs has been created, the standard fetch process begins. This includes all of the usual steps required for polite & efficient fetching, such as [url removed, login to view] processing. Pages that are successfully fetched can then be parsed.

5. Typically fetched pages are also saved into the Fetched Pages database.

6. The decision on whether a page is to be crawled is made by the given object: the metadata is passed to the object, and if the object returns true the page is crawled; otherwise it is discarded.

7. Page rank computation: calculate the importance of the page based on the scoring algorithm provided by Nutch/Solr.

8. Once the page has been scored, each outlink found in the parse is extracted.

9. The score for the page is divided among all of the outlinks.

10. Finally, the URL State database is updated with the results of fetch attempts (succeeded, failed), all newly discovered URLs are added, and any existing URLs get their link score increased by all matching outlinks that were extracted during this loop.
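The loop in steps 2-10 can be sketched as follows. The data structures and function names are illustrative, not Nutch's actual APIs; `fetch` and `parse` stand in for the real fetcher and parser:

```python
def crawl_loop(url_state, fetch, parse, top_n=10):
    """One iteration of the focused-crawl loop over the URL State database.

    url_state maps url -> {"score": float, "processed": bool}.
    fetch(url) returns page content or None on failure; parse(page)
    returns (page_score, outlinks).
    """
    # Step 2: extract unprocessed URLs, sorted by link score (highest first).
    pending = sorted((u for u, s in url_state.items() if not s["processed"]),
                     key=lambda u: url_state[u]["score"], reverse=True)
    # Step 3: process only the top-scoring URLs in this loop.
    for url in pending[:top_n]:
        url_state[url]["processed"] = True
        page = fetch(url)                      # step 4: polite fetch
        if page is None:
            continue                           # fetch failed; state already recorded
        page_score, outlinks = parse(page)     # steps 7-8: score page, extract outlinks
        if not outlinks:
            continue
        share = page_score / len(outlinks)     # step 9: divide score among outlinks
        for link in outlinks:                  # step 10: update the URL State database
            entry = url_state.setdefault(link, {"score": 0.0, "processed": False})
            if not entry["processed"]:
                entry["score"] += share
    return url_state
```

Repeating this loop grows the frontier outward from the seed URLs while keeping the highest-scoring (most promising) URLs at the front of the queue.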

Part II.

Classification of extracted pages

1. Run the pages through the classification API.

2. Depending on the classification returned, store the page in the corresponding folder along with its relevance score.
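A sketch of the classify-and-store step. The `classify` callback stands in for the provided classification API, and the output path layout is an assumption:

```python
def store_classified(page, classify, out_root="/data/classified"):
    """Run a page through the classification API and place it under the
    returned class folder together with its relevance score."""
    label, score = classify(page["content"])        # e.g. ("news", 0.92)
    filename = page["url"].replace("/", "_") + ".txt"
    path = f"{out_root}/{label}/{filename}"
    # In the real system this record would be written to HDFS at `path`;
    # here we just return the target path and the record itself.
    return path, {"url": page["url"], "score": score, "content": page["content"]}
```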



Crawled page contents and metadata will be stored and indexed in Solr.

Tools and Techniques:

Apache Nutch, Solr, Apache Hadoop

Local system

Test Case:

1. Check the crawled data and XML files in their respective folders.

2. Search for the query parameters in the XML and text files.

Skills: Web Scraping


Project ID: #5190835
