Web scraper crawler software

CLOSED
Bids: 9
Avg Bid (USD): $200
Project Budget (USD): $30 - $250

Project Description:
+Start crawling from a list of URLs specified by the user.
+Support a wide range of character sets, with automatic character set and language detection. Provide phrase segmentation (tokenizing) for Chinese, Japanese, Korean and Thai. Both SGML entities like 'à' and ISO-Latin-1 characters must be indexable and searchable. The crawler must handle any Unicode or legacy character encoding without problems (Chinese, Japanese, Korean, Arabic, Hebrew, Turkish, Thai, Greek, Baltic, Cyrillic, UTF-8, Windows-12xx).
+Spider the source code of picture and video pages and export the extracted data to a valid MySQL file (including CREATE TABLE statements).
+Check each website's source code and return: site title, site meta description, site keywords, site page size, search term site URL and much more.
+Reasonable duplicate domain and duplicate content detection to avoid re-crawling of identical sites on different domains. (last.fm vs lastfm.com, and a million other sites that use multiple domains for the same content.)
+Understand GET parameters and recognize what counts as a "search result" across many site-specific search engines. For example, a page may link to a search result page on another site's internal search with some GET parameters. These result pages must not be crawled.
+Block unwanted content. Manage proxies and cookies for anonymous access, and cache crawled items; effective caching significantly reduces search times. Include an HTML cleaning algorithm.
+Detect broken links and ignore them automatically. Detect and remove duplicate data. Use duplicate detection to stop scraping once old data is reached.
+Crawling rules and multithreaded downloading (up to 50 threads). Support parallel, multithreaded indexing for faster updating.
+Apply regular expressions (RegEx) to the text or HTML source of web pages and scrape the matching portion. Also support extraction using XPath.
+Update every N minutes: the user specifies how often the program scrapes the target website.
+Export a configurable number of results per file (100; 1,000; 10,000; 100,000; ...).
+Export crawled information to SQL and MySQL files (automatic MySQL CREATE TABLE and INSERT INTO ... VALUES for title, meta, keywords, page size, search term site URL, etc., and much more functionality in SQL).
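
To illustrate the "checks website source code and returns" requirement, here is a minimal Python sketch that pulls the title, meta description, keywords and page size out of an HTML string. The names `PageInfoParser` and `extract_page_info` are hypothetical, and the posting does not prescribe a language or library; this version only uses the standard-library `html.parser` module and assumes the page has already been downloaded and decoded.

```python
from html.parser import HTMLParser


class PageInfoParser(HTMLParser):
    """Collects the <title> text and the description/keywords meta tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Meta names are matched case-insensitively ("Description", "KEYWORDS", ...).
            name = (attrs.get("name") or "").lower()
            if name in ("description", "keywords"):
                self.meta[name] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_page_info(url, html_text):
    """Return the fields listed in the posting for one already-downloaded page."""
    parser = PageInfoParser()
    parser.feed(html_text)
    return {
        "url": url,
        "title": parser.title.strip(),
        "description": parser.meta.get("description", ""),
        "keywords": parser.meta.get("keywords", ""),
        # Page size is measured here as the UTF-8 byte length of the source.
        "page_size": len(html_text.encode("utf-8")),
    }
```

The returned dictionary maps directly onto the columns that the SQL export requirement asks for.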
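
The search-result and duplicate-domain requirements could be approached along the following lines. This is only a sketch under stated assumptions: the parameter names in `SEARCH_PARAMS` are a guessed starting list rather than anything given in the posting, and exact-match hashing of whitespace-normalized text is a simple stand-in for real near-duplicate detection. All function and class names are hypothetical.

```python
import hashlib
import re
from urllib.parse import parse_qs, urlsplit

# Assumed list of query parameters that usually indicate an internal search page.
SEARCH_PARAMS = {"q", "query", "search", "s", "keyword"}


def looks_like_search_result(url):
    """True if the URL carries a GET parameter commonly used for site search."""
    params = parse_qs(urlsplit(url).query, keep_blank_values=True)
    return any(name.lower() in SEARCH_PARAMS for name in params)


def domain_key(url):
    """Lower-cased host name without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def content_fingerprint(text):
    """Hash of the text after collapsing whitespace and lower-casing."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class DuplicateTracker:
    """Remembers which domain first served each piece of content."""

    def __init__(self):
        self._first_domain = {}

    def seen_on_other_domain(self, url, text):
        """True if identical content was already recorded for a different domain."""
        host = domain_key(url)
        first = self._first_domain.setdefault(content_fingerprint(text), host)
        return first != host
```

A crawler loop would skip any URL for which `looks_like_search_result` is true, and stop following links from a page once `seen_on_other_domain` reports that the same content was already fetched elsewhere (the last.fm vs lastfm.com case).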
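
For the export requirements (a fixed number of results per file, with automatic CREATE TABLE and INSERT INTO statements), a rough sketch might look like this. The table name `pages`, the column types and the helper names are assumptions made for illustration only; the manual quoting is deliberately simple, and a production tool would more likely rely on a database driver's parameterized queries.

```python
COLUMNS = ("url", "title", "description", "keywords", "page_size")

# Assumed schema; the posting only names the fields, not their types.
CREATE_TABLE = (
    "CREATE TABLE IF NOT EXISTS pages (\n"
    "  url VARCHAR(2048) NOT NULL,\n"
    "  title TEXT,\n"
    "  description TEXT,\n"
    "  keywords TEXT,\n"
    "  page_size INT\n"
    ") DEFAULT CHARSET=utf8mb4;\n"
)


def sql_quote(value):
    """Render a Python value as a SQL literal (integers bare, strings quoted)."""
    if isinstance(value, int):
        return str(value)
    text = str(value).replace("\\", "\\\\").replace("'", "''")
    return "'" + text + "'"


def build_sql_chunks(rows, rows_per_file):
    """Split rows into SQL scripts holding at most rows_per_file INSERTs each."""
    if rows_per_file < 1:
        raise ValueError("rows_per_file must be at least 1")
    chunks = []
    for start in range(0, len(rows), rows_per_file):
        lines = [CREATE_TABLE]
        for row in rows[start:start + rows_per_file]:
            values = ", ".join(sql_quote(row[col]) for col in COLUMNS)
            lines.append(
                f"INSERT INTO pages ({', '.join(COLUMNS)}) VALUES ({values});\n"
            )
        chunks.append("".join(lines))
    return chunks
```

Each returned string can then be written to its own file, for example `export_001.sql`, `export_002.sql` and so on, so that every file is importable on its own.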

Skills required:
C Programming, C# Programming, C++ Programming, MySQL
About the employer:
Verified


$144 in 3 days
$250 in 15 days
$257 in 7 days
$250 in 3 days
$206 in 3 days
$237 in 3 days
$144 in 5 days
$200 in 5 days
$111 in 10 days