I need someone to build me a multithreaded Win32 web crawler. The crawler will search the web using whatever keywords I enter, then keep following links within each site to find all downloadable PDF files.
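To illustrate the link-following step, here is a minimal Python sketch using only the standard library. It is not a required design; the names `LinkCollector` and `split_links` are made up for illustration, and treating any path ending in `.pdf` as a PDF is an assumption (a real crawler might also check the Content-Type header).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects absolute link targets from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the page they were found on.
            self.links.append(urljoin(self.base_url, href))


def split_links(base_url, html):
    """Return (pdf_links, page_links) found in one HTML page."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    pdfs, pages = [], []
    for link in collector.links:
        parts = urlparse(link)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, etc.
        # Assumption: a path ending in .pdf means a downloadable PDF.
        if parts.path.lower().endswith(".pdf"):
            pdfs.append(link)
        else:
            pages.append(link)
    return pdfs, pages
```

The PDF links would go to the download queue and the page links back to the crawl queue.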
When it finds a PDF file, it will download the file and save it in a local directory. The application should optionally be able to run an external executable, passing the following as command-line arguments:
- the URL where the file was found,
- the text of the page where the PDF link was discovered, converted from HTML to plain text (or you could save it as a text file and pass that filename as an argument),
- the local path where the PDF was stored after it was downloaded
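As a rough sketch of the three arguments above, the following Python snippet converts a page to plain text and builds the argument list. The function names and the argument order are assumptions made for illustration; the final order is open to discussion.

```python
import subprocess
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Keeps visible text and drops <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Convert HTML to plain text with whitespace collapsed."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()  # flush any buffered text
    return " ".join(" ".join(extractor.parts).split())


def build_hook_command(exe_path, page_url, text_file, pdf_path):
    """Assumed argument order: page URL, text file, local PDF path."""
    return [exe_path, page_url, text_file, pdf_path]


def run_hook(command):
    """Run the external executable and return its exit code."""
    # Passing a list (not a shell string) avoids quoting problems
    # with spaces in paths or URLs.
    return subprocess.run(command, check=False).returncode
```

Saving the page text to a file and passing its filename avoids command-line length limits for large pages.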
The application should allow crawling to be paused and resumed from the GUI. It should have a grid or list view showing what it is currently doing, e.g. which URL it is searching, which PDF it is downloading, etc.
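One common way to get pause and resume across worker threads is a shared event flag that each worker checks between tasks. The Python sketch below shows the idea; `PauseControl` and `worker` are hypothetical names, and the GUI buttons would simply call `pause()` and `resume()`.

```python
import threading


class PauseControl:
    """Shared flag that lets a GUI pause and resume worker threads."""

    def __init__(self):
        self._running = threading.Event()
        self._running.set()  # start in the running state

    def pause(self):
        self._running.clear()

    def resume(self):
        self._running.set()

    def is_paused(self):
        return not self._running.is_set()

    def wait_if_paused(self, timeout=None):
        """Block while paused; returns True once running again."""
        return self._running.wait(timeout)


def worker(control, urls, visited):
    """Process URLs one at a time, stopping between items while paused."""
    for url in urls:
        control.wait_if_paused()
        visited.append(url)  # stand-in for fetching and parsing the page
```

Workers only stop between items, so a download already in progress finishes before the pause takes effect.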
The application should also log every PDF found, every URL visited, and the number of PDF files found and downloaded in a central MySQL database. Before downloading, it should check the database to see whether the file has been downloaded before and, if so, skip the download entirely. If I run multiple copies of this application on different machines, they should work together and not download redundant PDFs.
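One possible way to keep several machines from downloading the same PDF is to have each instance "claim" a URL by inserting a row under a unique key, and download only if the insert succeeded. The Python sketch below uses the built-in `sqlite3` module purely as a stand-in so it runs without a server. The table and column names are assumptions; against MySQL, the equivalent would be a UNIQUE key plus `INSERT IGNORE`.

```python
import hashlib
import sqlite3


def open_registry(path=":memory:"):
    """Open the registry (SQLite here as a stand-in for central MySQL)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pdf_files ("
        " url_hash TEXT PRIMARY KEY,"  # unique key prevents duplicates
        " url TEXT NOT NULL,"
        " found_on TEXT NOT NULL,"
        " status TEXT NOT NULL)"
    )
    return conn


def claim_pdf(conn, pdf_url, found_on):
    """Return True if this instance should download pdf_url."""
    url_hash = hashlib.sha256(pdf_url.encode("utf-8")).hexdigest()
    cur = conn.execute(
        "INSERT OR IGNORE INTO pdf_files (url_hash, url, found_on, status)"
        " VALUES (?, ?, ?, 'claimed')",
        (url_hash, pdf_url, found_on),
    )
    conn.commit()
    # rowcount is 1 when the row was inserted, 0 when it already existed.
    return cur.rowcount == 1
```

Because the check and the insert are a single statement, two instances racing on the same URL cannot both win the claim.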
The application must work on XP, Vista, and Windows 7, on both 32-bit and 64-bit editions. Please PM me which programming language you intend to develop this application in. Please make sure the application is bug-free and has no memory leaks, as it is intended to run for long periods.
Please try not to use any third-party components that I would have to pay license fees for. If you intend to use third-party components, please PM me and tell me what they are. Ultimately, cost, reliability, and licensing and distribution royalties are major factors.
I need all source code and rights to the source and binary code in the end.
Thank you for your interest in bidding on this project. Follow-on projects are possible if the work on this one is satisfactory. If you have any questions, please don't hesitate to ask. Thanks.