Web Data Extraction

Project Budget (USD)
$30 - $250

Project Description:
Scope:
Develop a system using Apache Nutch, Apache Hadoop, and Apache Solr that crawls the given websites on a round-robin basis, 100 pages at a time (configurable), and automatically stores the crawled pages on Hadoop in a folder named after each website.
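The round-robin scheduling described in the scope could be sketched as follows. This is only an illustration of the visiting order, not Nutch code; `round_robin_batches`, `site_folder`, and the `/crawl` root are hypothetical names chosen for the example.

```python
from collections import deque
from urllib.parse import urlparse

PAGES_PER_TURN = 100  # configurable number of pages taken from a site per turn


def site_folder(url, root="/crawl"):
    """Derive a per-site folder from the URL's host (hypothetical layout)."""
    host = urlparse(url).netloc.lower()
    return f"{root}/{host}"


def round_robin_batches(pending, pages_per_turn=PAGES_PER_TURN):
    """Yield (site, batch) pairs, visiting sites in turn until every queue is empty.

    `pending` maps a site name to the list of URLs still to be fetched for it.
    """
    queue = deque((site, deque(urls)) for site, urls in pending.items())
    while queue:
        site, urls = queue.popleft()
        batch = [urls.popleft() for _ in range(min(pages_per_turn, len(urls)))]
        if batch:
            yield site, batch
        if urls:
            # The site still has work left, so it goes to the back of the line.
            queue.append((site, urls))
```

Each site gets at most `pages_per_turn` pages before the next site is served, so one large site cannot starve the others.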

Some websites require authentication (user ID and password), so the system must be able to supply these credentials dynamically at runtime by reading them from a text file or an XML configuration file. The system should be able to store multiple user credentials and use them on a round-robin basis.
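One possible way to read several credentials from an XML file and rotate through them is sketched below. The XML layout (`credentials`, `site`, `login` elements) is an assumption made for this example; the brief does not specify a file format beyond "text or XML".

```python
import itertools
import xml.etree.ElementTree as ET

# Hypothetical configuration layout; element and attribute names are assumptions.
SAMPLE_CONFIG = """
<credentials>
  <site host="example.com">
    <login user="alice" password="secret1"/>
    <login user="bob" password="secret2"/>
  </site>
</credentials>
"""


def load_credential_cycles(xml_text):
    """Return {host: endless iterator over (user, password) pairs}."""
    root = ET.fromstring(xml_text)
    cycles = {}
    for site in root.findall("site"):
        logins = [(l.get("user"), l.get("password")) for l in site.findall("login")]
        if logins:
            # itertools.cycle restarts from the first login after the last one.
            cycles[site.get("host")] = itertools.cycle(logins)
    return cycles
```

Calling `next(cycles["example.com"])` before each authenticated request would hand out the stored logins in turn.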

Crawled pages will be stored in the respective site folders on Apache Hadoop.

Crawled page contents and metadata will be stored and indexed in Solr with the fields listed under "Solr Fields" below.

All documents (PDF, video, audio, DOC, DOCX, JPEG, PNG, etc.) will be stored in folders with clear identification, i.e. keyed by URL, so that each web page can be reconstructed from the stored content.
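A minimal sketch of one way to key stored files by URL so the original address can be recovered later. The folder layout (`root/host/type/encoded-url`) is an assumption for illustration; the brief only asks that files be identifiable by URL.

```python
import posixpath
from urllib.parse import quote, unquote, urlparse


def storage_path(url, root="/crawl"):
    """Build a path of the form root/host/type/percent-encoded-url (hypothetical layout)."""
    parsed = urlparse(url)
    # Use the file extension as the type folder; pages without one count as HTML.
    ext = posixpath.splitext(parsed.path)[1].lstrip(".").lower() or "html"
    return f"{root}/{parsed.netloc.lower()}/{ext}/{quote(url, safe='')}"


def url_from_path(path):
    """Recover the original URL from a path produced by storage_path."""
    return unquote(posixpath.basename(path))
```

Percent-encoding keeps the mapping reversible, but very long URLs may exceed file-name limits, in which case a hash plus a lookup table would be a safer choice.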

The crawl will be a focused crawl: first the metadata is extracted and passed to an API, which either passes or fails the page. If the page passes, its whole content is extracted and processed further. The API will be provided as part of the project.
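The metadata-first gate could look roughly like this. The real pass/fail API is to be supplied with the project, so `accept` below is a stand-in callable, and `MetaExtractor`, `extract_metadata`, and `process_page` are hypothetical names.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collect <meta name="..." content="..."> pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attributes = dict(attrs)
            name, content = attributes.get("name"), attributes.get("content")
            if name and content is not None:
                self.meta[name.lower()] = content


def extract_metadata(html):
    parser = MetaExtractor()
    parser.feed(html)
    return parser.meta


def process_page(html, accept):
    """Return the page for further processing only if `accept(metadata)` is true."""
    meta = extract_metadata(html)
    if not accept(meta):
        return None  # failed the filter: discard without extracting the full content
    return {"metadata": meta, "html": html}
```

For example, `process_page(html, lambda m: "solr" in m.get("keywords", ""))` keeps only pages whose keywords mention Solr.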

Solr Fields:
• Site
• Title
• Host
• Segment
• Boost
• Digest
• Time Stamp
• Url
• Site Content (Text)
• Site Content (HTML)
• Metadata (Keywords, Content)
• Metadata (Description, Content)
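As an illustration, one document carrying the fields above might be assembled like this before being sent to Solr as JSON. The field names (`tstamp`, `content_text`, `meta_keywords`, and so on) are assumptions for the sketch and would have to match the actual Solr schema.

```python
import json
from datetime import datetime, timezone


def build_solr_doc(url, site, host, title, text, html, meta, segment, digest,
                   boost=1.0, fetched_at=None):
    """Assemble one Solr document as a plain dict (hypothetical field names)."""
    fetched_at = fetched_at or datetime.now(timezone.utc)
    return {
        "site": site,
        "title": title,
        "host": host,
        "segment": segment,
        "boost": boost,
        "digest": digest,
        # Solr date fields expect UTC timestamps such as 2024-01-02T03:04:05Z.
        "tstamp": fetched_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "url": url,
        "content_text": text,
        "content_html": html,
        "meta_keywords": meta.get("keywords", ""),
        "meta_description": meta.get("description", ""),
    }


def to_update_payload(docs):
    """Serialize documents as a JSON array, the shape Solr's JSON update handler accepts."""
    return json.dumps(docs)
```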

Seed.txt (URLs)

Typical Steps:

1. The first step is to load the URL State database with an initial set of URLs. These can be a broad set of top-level domains such as the 1.7 million web sites with the highest US-based traffic, or the results from selective searches against another index, or manually selected URLs that point to specific, high quality pages.
2. Once the URL State database has been loaded with some initial URLs, the first loop in the focused crawl can begin. The first step in each loop is to extract all of the unprocessed URLs, and sort them by their link score.
3. Next comes one of the two critical steps in the workflow. A decision is made about how many of the top-scoring URLs to process in this loop.
4. Once the set of accepted URLs has been created, the standard fetch process begins. This includes all of the usual steps required for polite & efficient fetching, such as robots.txt processing. Pages that are successfully fetched can then be parsed.
5. Typically fetched pages are also saved into the Fetched Pages database.
6. The decision on whether a page is crawled is made by the given object: the metadata is passed to the object, and if it returns true the page is crawled; otherwise the page is discarded.
7. Page rank computation: calculate the importance of the page using the algorithm provided by Nutch/Solr.
8. Once the page has been scored, each outlink found in the parse is extracted.
9. The score for the page is divided among all of the outlinks.
10. Finally, the URL State database is updated with the results of fetch attempts (succeeded, failed), all newly discovered URLs are added, and any existing URLs get their link score increased by all matching outlinks that were extracted during this loop.
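One pass of the loop in steps 2 to 10 can be sketched with an in-memory dictionary standing in for the URL State database. This is a simplified illustration, not Nutch's implementation: `fetch` and `accept` are placeholder callables, and the page score is assumed to be the URL's current link score rather than a computed page rank.

```python
def crawl_loop(url_state, fetch, accept, top_n):
    """Run one focused-crawl pass over `url_state` and return it.

    url_state: {url: {"score": float, "status": str}}
    fetch(url): returns (metadata, outlinks), or None when the fetch fails
    accept(metadata): True if the page should be kept
    """
    # Step 2: collect unprocessed URLs, highest link score first.
    pending = sorted(
        (u for u, s in url_state.items() if s["status"] == "unfetched"),
        key=lambda u: url_state[u]["score"],
        reverse=True,
    )
    # Step 3: only the top-scoring URLs are processed in this pass.
    for url in pending[:top_n]:
        result = fetch(url)  # step 4
        if result is None:
            url_state[url]["status"] = "failed"
            continue
        meta, outlinks = result
        if not accept(meta):  # step 6
            url_state[url]["status"] = "rejected"
            continue
        url_state[url]["status"] = "fetched"
        if outlinks:
            # Steps 8 and 9: split the page's score evenly among its outlinks.
            share = url_state[url]["score"] / len(outlinks)
            for link in outlinks:
                # Step 10: add new URLs and raise the score of known ones.
                entry = url_state.setdefault(link, {"score": 0.0, "status": "unfetched"})
                entry["score"] += share
    return url_state
```

Running the function repeatedly until no `unfetched` entries remain, or until a page budget is spent, gives the overall crawl.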

Part II.

Classification of extracted pages
1. Run the pages through the classification API.
2. Depending on the classification returned, store the page into that folder along with the relevance score.
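The two classification steps might be wired together as below. The classification API is not described in the brief, so `classify` is a stand-in callable assumed to return a `(label, score)` pair, and the example writes to a local directory rather than Hadoop.

```python
import json
from pathlib import Path


def store_classified(page_id, html, classify, root):
    """Save a page under root/<label>/ and record its relevance score beside it."""
    label, score = classify(html)
    folder = Path(root) / label
    folder.mkdir(parents=True, exist_ok=True)
    page_file = folder / f"{page_id}.html"
    page_file.write_text(html, encoding="utf-8")
    # Keep the score in a small JSON file next to the page.
    (folder / f"{page_id}.json").write_text(
        json.dumps({"label": label, "relevance": score}), encoding="utf-8"
    )
    return page_file
```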

Crawled page contents and metadata will be stored and indexed in Solr.

Tools and Techniques:
Apache Nutch, Solr, Apache Hadoop
Local system
Test Case:
1. Check the crawl data and XML file in the respective folders.
2. Search for the query parameter in the XML and text files.

Skills required:
Web Scraping