Data Miner Spider Crawler
I need a PHP crawler / spider / data miner that will:
- scan a particular website
- regularly capture records from the website, which lists details of 900 companies that are further divided into a total of 50K sub-subpages
Preferably using PHP, MySQL, cURL or Perl CGI, but other web solutions are possible too.
I need the data structured in a logical MySQL database. I do not want just one table with 50K rows.
The spider must first check which sub-subpages are new (luckily each page has an ID), which have been modified (the modified date field has changed) and which have been removed from the original website. The script must then decide whether to add a new record to our db, modify an existing record or move a record to a history table.
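To illustrate the check described above, here is a minimal sketch in Python (used only for brevity; the same logic ports directly to PHP). It assumes the crawl yields a mapping of page ID to modified date, and that the database can supply the same mapping for stored records. The function name `classify_pages` and the result keys are hypothetical, not part of any specification.

```python
def classify_pages(remote, stored):
    """Compare crawled pages against stored records.

    remote: dict of page ID -> modified date as seen on the website (assumed shape)
    stored: dict of page ID -> modified date as recorded in our db (assumed shape)
    """
    # IDs on the website that we have never stored
    add = sorted(pid for pid in remote if pid not in stored)
    # IDs present on both sides whose modified date differs
    update = sorted(
        pid for pid in remote
        if pid in stored and remote[pid] != stored[pid]
    )
    # IDs we have stored that no longer exist on the website
    move_to_history = sorted(pid for pid in stored if pid not in remote)
    return {"add": add, "update": update, "move_to_history": move_to_history}


if __name__ == "__main__":
    remote = {1: "2008-01-01", 2: "2008-02-01", 4: "2008-03-01"}
    stored = {1: "2008-01-01", 2: "2008-01-15", 3: "2008-01-01"}
    print(classify_pages(remote, stored))
```

Comparing only IDs and modified dates keeps each run cheap: a full sub-subpage needs to be fetched and parsed only when it lands in the "add" or "update" list.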
You must understand db design, data grouping and db efficiency. I will provide you with a suggested db structure. You are expected to create the logical db out of the data.
Some text formatting will be needed: removing blank spaces, removing unwanted characters, etc.
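As a rough idea of the kind of cleanup meant here, the sketch below (Python, for illustration only) collapses runs of whitespace and strips a configurable set of characters. The function name `clean_text` and the `strip_chars` parameter are hypothetical; the actual rules would come from the detailed specification.

```python
import re


def clean_text(raw, strip_chars=""):
    """Normalize a scraped text field before it is stored."""
    # Treat non-breaking spaces (common in HTML) as ordinary spaces
    text = raw.replace("\u00a0", " ")
    # Drop every character listed in strip_chars
    if strip_chars:
        text = text.translate({ord(c): None for c in strip_chars})
    # Collapse any run of whitespace (spaces, tabs, newlines) to one space
    text = re.sub(r"\s+", " ", text)
    # Remove leading and trailing blanks
    return text.strip()


if __name__ == "__main__":
    print(clean_text("  Acme \u00a0 Ltd.\n\t* ", strip_chars="*"))
```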
A sub-subpage may contain JPG pictures. The JPG files and the URLs to them are not protected. I want to be able to choose whether the URL of the picture is recorded, the pictures are downloaded to our server, or no action is taken regarding pictures.
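The three picture options could be exposed as a single setting. The following is a minimal sketch in Python, for illustration only; the mode names `"url"`, `"download"` and `"none"`, the function `handle_picture` and the `pictures` directory are all assumptions rather than part of the specification.

```python
import os
from urllib.parse import urlsplit
from urllib.request import urlretrieve


def handle_picture(url, mode, download_dir="pictures", fetch=urlretrieve):
    """Return the value to store for a picture, depending on the chosen mode.

    mode "none":     ignore the picture, store nothing
    mode "url":      store the remote URL only
    mode "download": save the file locally and store the local path
    """
    if mode == "none":
        return None
    if mode == "url":
        return url
    if mode == "download":
        # Use the last path segment of the URL as the local file name
        filename = os.path.basename(urlsplit(url).path)
        if not filename:
            raise ValueError("cannot derive a file name from " + url)
        os.makedirs(download_dir, exist_ok=True)
        local_path = os.path.join(download_dir, filename)
        fetch(url, local_path)  # fetch is injectable so it can be replaced in tests
        return local_path
    raise ValueError("unknown picture mode: " + mode)
```

Keeping the mode as a plain configuration value means the behaviour can be switched between runs without touching the crawler itself.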
I live in Europe (GMT+1). I'd prefer someone in a nearby time zone who is reachable on ICQ.
Please bid for phase 1: the data mining and logical db structuring job. However, I would like to pick a programmer (or company) who could continue with phase 2: further data reporting (statistics and charts), plus translation of the grouped data, for which I will need a form (interface). The logical db must be prepared for phase 2. There will also be a phase 3, graphic design and AJAX programming, which can be done by someone else. Bid for phase 1 only.
I will provide you with more details by PM (high-level specification, crawled URLs, schemes, suggested db structure). You must have experience with similar jobs. I know exactly what I want!