First of all: this should be written in ANSI C, compile with GCC, and be cross-platform.
We need a function that will take a web URL and download the page's HTML contents. (It should not download any pictures or any other external files.) It should then come up with a title, description and keywords based on the meta tags. If there are no meta tags, the title, keywords and description should be worked out the way Google or Yahoo would do it, in that it will ignore common words like 'a', 'the', and many others. It should also drop words that have been repeated too many times (more than 7, I think). It should also attempt to work out the last time the page was modified; if it can't, it should compare the page with an internal date in the database, and store it in the database only if it is newer. The URL, title, description and keywords should be saved in a database called "sites.dat" using a database function we have had developed for us.
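As a rough illustration of the fallback keyword step described above, here is a minimal ANSI C sketch. The function name extract_keywords, the table sizes, and the short stop-word list are assumptions made for this example only; a real stop list would be far longer, and the threshold of 7 is the tentative figure from this posting.

```c
#include <ctype.h>
#include <string.h>

#define MAX_WORDS   256   /* assumed cap on distinct words per page */
#define MAX_WORDLEN 32    /* longer words are truncated */
#define MAX_REPEATS 7     /* tentative threshold from the posting */

struct word_count {
    char word[MAX_WORDLEN];
    int  count;
};

/* Illustrative stop list only. */
static const char *stop_words[] = {
    "a", "an", "and", "the", "of", "to", "in", "is", "it", "for", "on", "or"
};

static int is_stop_word(const char *w)
{
    size_t i;
    for (i = 0; i < sizeof stop_words / sizeof stop_words[0]; i++)
        if (strcmp(w, stop_words[i]) == 0)
            return 1;
    return 0;
}

/* Splits text into lower-cased alphanumeric words, skips stop words,
 * counts the rest, and writes every word seen at most MAX_REPEATS
 * times into out, separated by spaces. Returns the number kept. */
int extract_keywords(const char *text, char *out, size_t outsize)
{
    struct word_count table[MAX_WORDS];
    char cur[MAX_WORDLEN];
    size_t len = 0, used = 0;
    int nwords = 0, kept = 0, i;
    const char *p;

    if (outsize == 0)
        return 0;
    out[0] = '\0';

    for (p = text; ; p++) {
        if (*p != '\0' && isalnum((unsigned char)*p)) {
            if (len < MAX_WORDLEN - 1)
                cur[len++] = (char)tolower((unsigned char)*p);
            continue;
        }
        if (len > 0) {                      /* a word just ended */
            cur[len] = '\0';
            len = 0;
            if (!is_stop_word(cur)) {
                for (i = 0; i < nwords; i++)
                    if (strcmp(table[i].word, cur) == 0)
                        break;
                if (i < nwords) {
                    table[i].count++;
                } else if (nwords < MAX_WORDS) {
                    strcpy(table[nwords].word, cur);
                    table[nwords].count = 1;
                    nwords++;
                }
            }
        }
        if (*p == '\0')
            break;
    }

    for (i = 0; i < nwords; i++) {
        size_t wl = strlen(table[i].word);
        if (table[i].count > MAX_REPEATS)
            continue;                       /* repeated too often */
        if (used + wl + 2 > outsize)
            break;                          /* output buffer is full */
        if (used > 0)
            out[used++] = ' ';
        memcpy(out + used, table[i].word, wl + 1);
        used += wl;
        kept++;
    }
    return kept;
}
```

A caller would pass the page's visible text and a buffer, then hand the resulting keyword string to whatever database function writes "sites.dat".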
At any point that it receives a 301 status (or any other kind of redirect), it should follow the link and then update the URL that was passed in.
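One possible way to detect a redirect is sketched below. It assumes the raw response headers are already in memory as a NUL-terminated string; the names http_status and redirect_target are hypothetical, and the set of status codes treated as redirects (301, 302, 303, 307, 308) is an assumption about what "any other redirect method" should cover.

```c
#include <ctype.h>
#include <string.h>

/* Case-insensitive prefix test. */
static int starts_with_nocase(const char *s, const char *prefix)
{
    while (*prefix != '\0') {
        if (tolower((unsigned char)*s) != tolower((unsigned char)*prefix))
            return 0;
        s++;
        prefix++;
    }
    return 1;
}

/* Returns the status code from "HTTP/1.x NNN ...", or -1 if malformed. */
int http_status(const char *response)
{
    const char *p = response;

    if (strncmp(p, "HTTP/", 5) != 0)
        return -1;
    while (*p != '\0' && *p != ' ' && *p != '\r' && *p != '\n')
        p++;
    while (*p == ' ')
        p++;
    if (!isdigit((unsigned char)p[0]) || !isdigit((unsigned char)p[1]) ||
        !isdigit((unsigned char)p[2]))
        return -1;
    return (p[0] - '0') * 100 + (p[1] - '0') * 10 + (p[2] - '0');
}

/* If the response is a redirect with a Location header, copies the
 * target into url and returns 1. Otherwise returns 0 with url empty. */
int redirect_target(const char *response, char *url, size_t urlsize)
{
    const char *line = response;
    int status = http_status(response);

    if (urlsize == 0)
        return 0;
    url[0] = '\0';
    if (status != 301 && status != 302 && status != 303 &&
        status != 307 && status != 308)
        return 0;

    /* Walk the header lines; a blank line ends the headers. */
    while (*line != '\0' && *line != '\r' && *line != '\n') {
        if (starts_with_nocase(line, "location:")) {
            const char *v = line + 9;
            size_t n = 0;
            while (*v == ' ' || *v == '\t')
                v++;
            while (v[n] != '\0' && v[n] != '\r' && v[n] != '\n')
                n++;
            while (n > 0 && (v[n - 1] == ' ' || v[n - 1] == '\t'))
                n--;
            if (n == 0 || n >= urlsize)
                return 0;
            memcpy(url, v, n);
            url[n] = '\0';
            return 1;
        }
        while (*line != '\0' && *line != '\n')
            line++;
        if (*line == '\n')
            line++;
    }
    return 0;
}
```

The download loop could call redirect_target after each response, refetch from the returned URL while it keeps returning 1, and cap the number of hops so a redirect loop cannot run forever. A relative Location value would still need to be resolved against the current URL, which this sketch does not attempt.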
If there is a 404 or any other error that prevents the page from being downloaded, it should return all blank values.
Any links that it finds should be stored in the file "links.dat" using a database function that we are having developed.
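For the link-gathering step, a tolerant scanner along the following lines might be a starting point. The name next_link is hypothetical, and the sketch simply looks for href attributes (in any letter case, with double quotes, single quotes, or no quotes) rather than parsing tags properly, so it will also pick up href values on elements other than anchors.

```c
#include <ctype.h>
#include <string.h>

/* Case-insensitive prefix test. */
static int starts_with_nocase(const char *s, const char *prefix)
{
    while (*prefix != '\0') {
        if (tolower((unsigned char)*s) != tolower((unsigned char)*prefix))
            return 0;
        s++;
        prefix++;
    }
    return 1;
}

/* Scans forward from p for the next href attribute and copies its
 * value into url. Returns a pointer just past the value so the caller
 * can resume from there, or NULL when no further link is found. */
const char *next_link(const char *p, char *url, size_t urlsize)
{
    int prev_space = 0;

    if (urlsize == 0)
        return NULL;
    for (; *p != '\0'; p++) {
        unsigned char c = (unsigned char)*p;
        /* Require whitespace before "href" so "xhref" is not matched. */
        if (prev_space && starts_with_nocase(p, "href")) {
            const char *q = p + 4;
            char quote = '\0';
            size_t n = 0;
            while (isspace((unsigned char)*q))
                q++;
            if (*q == '=') {
                q++;
                while (isspace((unsigned char)*q))
                    q++;
                if (*q == '"' || *q == '\'')
                    quote = *q++;
                /* '>' also ends the value, which recovers from a
                 * missing closing quote in sloppy markup. */
                while (*q != '\0' && *q != '>' && n < urlsize - 1) {
                    if (quote != '\0' ? *q == quote
                                      : isspace((unsigned char)*q))
                        break;
                    url[n++] = *q++;
                }
                url[n] = '\0';
                if (n > 0)
                    return q;
            }
        }
        prev_space = isspace(c);
    }
    return NULL;
}
```

A caller would loop with p = next_link(p, url, sizeof url) until it returns NULL, passing each url to the database function that writes "links.dat".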
This function should obey all robots meta tags, as well as [url removed, login to view] files.
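The meta-tag half of this requirement could be approached roughly as follows. The name robots_meta is hypothetical, and the matching is deliberately crude: any meta tag that mentions "robots" is searched for the directives noindex, nofollow and none by plain substring search, so it may occasionally refuse a page it was allowed to index.

```c
#include <ctype.h>
#include <string.h>

/* Case-insensitive search for needle within [s, end). */
static const char *find_nocase(const char *s, const char *end,
                               const char *needle)
{
    size_t n = strlen(needle);

    for (; (size_t)(end - s) >= n; s++) {
        size_t i;
        for (i = 0; i < n; i++)
            if (tolower((unsigned char)s[i]) !=
                tolower((unsigned char)needle[i]))
                break;
        if (i == n)
            return s;
    }
    return NULL;
}

/* Sets *may_index and *may_follow to 0 when a robots meta tag forbids
 * indexing the page or following its links; both default to 1. */
void robots_meta(const char *html, int *may_index, int *may_follow)
{
    const char *end = html + strlen(html);
    const char *p = html;

    *may_index = 1;
    *may_follow = 1;
    while ((p = find_nocase(p, end, "<meta")) != NULL) {
        const char *close = strchr(p, '>');
        if (close == NULL)
            close = end;        /* unterminated tag: scan to the end */
        if (find_nocase(p, close, "robots") != NULL) {
            if (find_nocase(p, close, "noindex") != NULL ||
                find_nocase(p, close, "none") != NULL)
                *may_index = 0;
            if (find_nocase(p, close, "nofollow") != NULL ||
                find_nocase(p, close, "none") != NULL)
                *may_follow = 0;
        }
        p = close;
    }
}
```

If may_index comes back 0 the page would not be saved to "sites.dat", and if may_follow comes back 0 its links would not be stored.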
When this is being coded, you should be aware that not all sites have perfect HTML, and some tags will be wrong or full of errors. Expect this function to encounter sites with badly formed HTML.
In most cases, this should act no differently from Googlebot, though when downloading a page it should identify itself as 'dCrawler'.
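Identifying as 'dCrawler' comes down to the User-Agent header of the request. A minimal sketch of building such a request is below; the name build_request, the choice of HTTP/1.0, and the Accept and Connection headers are assumptions for illustration, and the socket code that would actually send the request is left out.

```c
#include <stdio.h>
#include <string.h>

#define CRAWLER_AGENT "dCrawler"
#define REQUEST_FMT \
    "GET %s HTTP/1.0\r\n" \
    "Host: %s\r\n" \
    "User-Agent: %s\r\n" \
    "Accept: text/html\r\n" \
    "Connection: close\r\n" \
    "\r\n"

/* Writes a GET request for path on host into buf. Returns the request
 * length, or -1 if buf is too small. An empty or NULL path means "/". */
int build_request(const char *host, const char *path,
                  char *buf, size_t bufsize)
{
    size_t need;

    if (path == NULL || path[0] == '\0')
        path = "/";
    /* Slight overestimate: the "%s" markers are counted as well. */
    need = sizeof REQUEST_FMT + strlen(path) + strlen(host) +
           strlen(CRAWLER_AGENT);
    if (need > bufsize)
        return -1;
    return sprintf(buf, REQUEST_FMT, path, host, CRAWLER_AGENT);
}
```

The size check is done by hand because snprintf is not part of ANSI C (C89). Requesting only text/html is one way to signal that pictures and other external files are not wanted, although a server is free to ignore it.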
11 freelancers are bidding on average $251 for this job
We are a web development, search engine optimisation and BPO company from India. Kindly go through our site (http://www.infiniteoutsourcing.com/). We are interested in your project. Thanks. Regards, Anshu