Web Crawler / Scraper for travel blogs and information


I need a crawler/scraper to scrape travel information from all available channels.

1. The crawler must accept the following inputs

-- A list of travel-specific keywords

-- A list of locations, e.g. Maharashtra, India or Pune, Maharashtra, India. Multiple locations may be supplied

-- An optional list of specific websites (wildcards allowed); if none is given, search the whole web

2. The output must consist of

-- A MySQL table with one record per crawled URL, containing:

   -- The URL of the crawled page

   -- Count of location-specific keywords found in the page content

   -- Count of travel-specific keywords found in the page content

   -- Author name and contact details (email ID) for the article if available, or a social media contact if available

   -- Heading of the article

   -- Crawled/scraped status

-- A text file for each record, containing:

   -- The full content of the article

   -- A file name tagged with the URL ID generated in the MySQL table for that record

3. The crawler must follow links to the end of each site's URL tree and search the content of every page for the keywords.

4. URLs that have already been crawled must be skipped when the crawler is started again.

5. A simple interface to start the crawler once the inputs have been uploaded to the server. This can be done manually.

6. The program must run on a hosted server.
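
As a rough illustration of items 2 and 4 (not a required implementation), the sketch below counts keyword hits in a page's text, stores one record per URL, and skips URLs that are already recorded. It is written in Python for brevity although the listed skills are PHP and MySQL, and it uses the built-in sqlite3 module as a stand-in for MySQL. The table name crawled_pages, its column names, the function names, and the choice of case-insensitive whole-word matching are all assumptions made for the example.

```python
import re
import sqlite3


def count_terms(text, terms):
    """Total case-insensitive, whole-word matches of every term in text."""
    total = 0
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        total += len(re.findall(pattern, text, flags=re.IGNORECASE))
    return total


def open_store(path=":memory:"):
    """Open the record store (sqlite3 here; MySQL in the real project)."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS crawled_pages (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               url TEXT UNIQUE NOT NULL,
               location_hits INTEGER NOT NULL,
               travel_hits INTEGER NOT NULL,
               heading TEXT,
               author_contact TEXT,
               status TEXT NOT NULL
           )"""
    )
    return conn


def already_crawled(conn, url):
    """True if the URL was recorded by this or an earlier run."""
    row = conn.execute(
        "SELECT 1 FROM crawled_pages WHERE url = ?", (url,)
    ).fetchone()
    return row is not None


def record_page(conn, url, heading, text, travel_terms, location_terms):
    """Insert one record and return its ID, or None if the URL is known."""
    if already_crawled(conn, url):
        return None
    cur = conn.execute(
        "INSERT INTO crawled_pages "
        "(url, location_hits, travel_hits, heading, status) "
        "VALUES (?, ?, ?, ?, ?)",
        (
            url,
            count_terms(text, location_terms),
            count_terms(text, travel_terms),
            heading,
            "crawled",
        ),
    )
    conn.commit()
    return cur.lastrowid


def text_file_name(page_id):
    """Name of the text file holding the article content for a record."""
    return f"{page_id}.txt"
```

Because the url column is declared UNIQUE and every page is checked with already_crawled before insertion, restarting the crawler against the same store skips pages recorded in an earlier run. The returned ID is what ties each text file to its table record.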

I will not be available for discussions during bidding. Any updates will be posted on the message board. A freelancer will not be selected if the criteria are not met.

Skills: MySQL, PHP, Web Scraping

Project ID: #4519341