Hi Guys,
I'm looking for someone to work on a Scrapy project for me. I need a simple generic crawler that will start at a given domain and, on each page (only within that domain), extract
->the anchor text (or anything between the <a> and </a> tags)
->and corresponding link
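
As a rough illustration of the extraction step, here is a minimal sketch that uses only Python's standard library rather than Scrapy. The names `AnchorExtractor` and `extract_links` are hypothetical, and the same-domain check assumes an exact host match (no subdomains).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class AnchorExtractor(HTMLParser):
    """Collect (absolute_url, anchor_text) pairs for links inside one domain."""

    def __init__(self, base_url, allowed_domain):
        super().__init__()
        self.base_url = base_url
        self.allowed_domain = allowed_domain
        self.links = []
        self._href = None   # URL of the <a> tag currently open, if any
        self._text = []     # text pieces seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Keep only links whose host matches the allowed domain exactly.
            if urlparse(self._href).netloc == self.allowed_domain:
                self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html, base_url, allowed_domain):
    parser = AnchorExtractor(base_url, allowed_domain)
    parser.feed(html)
    return parser.links
```

Calling `extract_links(page_html, page_url, "example.com")` would return a list of `(url, anchor)` tuples ready to be written to the database.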
The crawler needs to be able to pull the domain to start crawling from out of a MySQL database, along with one other variable that needs to be passed back as a value when the results are written back to the database.
It must also allow more than one spider to run at the same time, as I'll have it on a cron job. It should work something like this:
START SCRIPT
CONNECT TO DB
SELECT FROM TABLE
WHILE(TRUE)
GET URL PLUS CORRESPONDING DOMAIN-ID VARIABLE FROM TABLE
START NEW SPIDER
LOAD URLS
EXTRACT ALL URLS AND ANCHORS FOUND ON EACH PAGE
SAVE RESULT TO DB (insert into %s (myurl, myanchor, urlid) values (url, anchor, domain-id))
LOOP
When each spider is done crawling, I need it to update another table to say it's finished:
update crawldone where id = %s,domain-id
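
The workflow above could be sketched roughly as follows. This is only an illustration under several assumptions: it uses `sqlite3` as a stand-in for MySQL so that it runs without a server, the table names `domains` and `results` and the column `done` are hypothetical, and `crawl` is a placeholder for whatever function actually runs the spider and yields `(url, anchor)` pairs.

```python
import sqlite3


def run_crawl_jobs(conn, crawl):
    """For each domain row, crawl it, save the links, then mark it finished."""
    # Fetch every start URL together with its domain-id.
    jobs = conn.execute("SELECT id, url FROM domains").fetchall()
    for domain_id, start_url in jobs:
        for url, anchor in crawl(start_url):
            # Pass the domain-id back with every saved result.
            conn.execute(
                "INSERT INTO results (myurl, myanchor, urlid) VALUES (?, ?, ?)",
                (url, anchor, domain_id),
            )
        # Flag this domain as done once its spider has finished.
        conn.execute("UPDATE crawldone SET done = 1 WHERE id = ?", (domain_id,))
        conn.commit()
```

With a MySQL driver the parameter placeholder would usually be `%s` instead of `?`, but the values should still be passed as parameters rather than formatted into the SQL string.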
If you already have a Scrapy spider running and you can modify it to do something similar, that would be fine as well.
This can be done. It would be best to use a Python script. It should send a User-Agent header so that it looks like a web browser, for example IE or Firefox. It will then scrape the data from each page and insert it into a CSV file, a SQLite database, or another database. That's it. I've already written spiders in Python, Perl, and PHP, so I think I can help you.
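
For illustration, a browser-like User-Agent can be set with the standard library along these lines. This is a sketch, not the bidder's actual code: the helper names `build_request` and `fetch` are hypothetical, and the User-Agent string is just an example Firefox value.

```python
from urllib.request import Request, urlopen

# Example Firefox-style User-Agent string (an assumption, any browser string works).
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) "
    "Gecko/20100101 Firefox/115.0"
)


def build_request(url, user_agent=BROWSER_USER_AGENT):
    """Create a request that identifies itself as a web browser."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Download a page and return its body as text."""
    with urlopen(build_request(url), timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")
```

In a Scrapy project the equivalent would normally be the `USER_AGENT` setting in the project's settings file.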
Hi,
I will use the Scrapy tool in Python to complete the requirements. I have successfully delivered many projects using Python and MySQL. Please let me know when to start on this.
Thanks