Webcrawler / Spider - Data Extraction
We need a webcrawler / spider that can collect the technical specifications of a particular product.
•In essence, we want to input the name and/or model number of a particular product, and the spider should extract the technical specifications from multiple websites (10-20). You may want to query Google first for the top 10-20 results and then crawl those sites. The number of products could range from 100 to 1000s at a time, and we should be able to upload the list as a CSV or similar file.
•The next step in the process is some level of “fuzzy logic” that compares the specification names/fields across the different results and identifies those with a tolerable level of similarity; the matched name becomes the field label for that particular feature/specification. For example, certain key technical specifications are almost always mentioned for a particular type of product, such as megapixels for digital cameras.
•The next step is to apply the same kind of fuzzy logic to the actual specification values, as webmasters don’t always post data accurately or completely and often leave some specs out.
•All the data should then be stored in a database that is searchable. The data should be presented in a tabular format.
•Where possible, the application should supply URLs to PDFs containing the technical specifications and/or user manuals of the product. The source URLs of the data should be included as well.
•Our preference is for a web-based solution using open source software such as PHP and MySQL. The application must be secure and scalable.
•We will require a web-based front end to display the results to users, so integration into a CMS such as WordPress or Joomla would be preferable.
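To make the two “fuzzy logic” steps more concrete, here is a rough sketch of one way they could work; it is an illustration only, not a required design. It groups specification names that are similar enough, picks the most frequent spelling as the field label, and then takes the most frequently reported value for each field. Python is used purely for brevity (our preference remains PHP). The function names, the 0.8 similarity threshold, and the sample data are all assumptions made for this example, and the standard-library `difflib.SequenceMatcher` stands in for whatever matching technique a supplier proposes.

```python
from collections import Counter
from difflib import SequenceMatcher


def normalize_label(text):
    """Lowercase a label and keep only letters and digits."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def normalize_value(text):
    """Lowercase a value and remove whitespace (keeps '.' so 24.2 stays 24.2)."""
    return "".join(text.lower().split())


def similar(a, b, threshold=0.8):
    """True when two labels are close enough after normalization."""
    ratio = SequenceMatcher(None, normalize_label(a), normalize_label(b)).ratio()
    return ratio >= threshold


def merge_specs(scraped, threshold=0.8):
    """Merge per-site {label: value} dicts into a single {label: value} dict."""
    groups = []  # one entry per detected field
    for site_specs in scraped:
        for label, value in site_specs.items():
            for group in groups:
                # Compare with the group's currently most common label.
                representative = group["labels"].most_common(1)[0][0]
                if similar(label, representative, threshold):
                    break
            else:
                # No existing group matched: start a new field.
                group = {"labels": Counter(), "values": Counter(), "raw": {}}
                groups.append(group)
            key = normalize_value(value)
            group["labels"][label] += 1
            group["values"][key] += 1
            group["raw"].setdefault(key, value)  # remember first raw spelling

    merged = {}
    for group in groups:
        field_label = group["labels"].most_common(1)[0][0]
        best_value = group["values"].most_common(1)[0][0]
        merged[field_label] = group["raw"][best_value]
    return merged


if __name__ == "__main__":
    # Hypothetical results scraped from three sites for one camera.
    scraped = [
        {"Megapixels": "24.2 MP", "Weight": "450 g"},
        {"Mega pixels": "24.2MP", "weight": "450g"},
        {"Megapixel": "24 MP", "Sensor type": "CMOS"},
    ]
    print(merge_specs(scraped))
```

A simple majority vote like this tolerates a site that rounds or omits a value, but a real solution would probably also need unit handling and a way to flag fields where the sources disagree.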
We have many ideas about the logical flow for achieving the above, as well as the bigger picture for the entire project; however, these will be shared with those shortlisted as potential suppliers. The code must belong to us and you must be prepared to sign an NDA.
This is the initial project and, based on its success, there will be ongoing enhancements and features required. Please make sure to read the above properly and send through any questions you have as well as constructive responses.