I would like a web scraper that:
1. Retrieves a seed list of URIs from a MySQL database
2. Uses multiple threads (Twisted framework) and Scrapy to scrape each page for links (1 level deep only)
3. Validates each link to ensure it is a full URL
4. Gets the response from each scraped URL (i.e. redirect, OK, not found)
4a. If there is no response, tries a DNS lookup
5. Saves the root address and response results, then imports them into a MySQL table (this can be batched through a JSON file if required)
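To make steps 3, 4 and 4a concrete, here is a minimal sketch of the checks I have in mind, using only the Python standard library. The function names (`is_full_url`, `classify_status`, `dns_resolves`) and the exact status groupings are illustrative assumptions, not a required design; the real project would wire equivalent logic into Scrapy.

```python
import socket
from urllib.parse import urlparse


def is_full_url(url):
    """Step 3: accept only absolute http(s) URLs that include a host."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def classify_status(code):
    """Step 4: map an HTTP status code to a coarse label.

    The grouping below is an assumption made for illustration.
    """
    if 200 <= code < 300:
        return "OK"
    if 300 <= code < 400:
        return "redirect"
    if code == 404:
        return "not found"
    return "other"


def dns_resolves(hostname):
    """Step 4a: fallback check used when no HTTP response was received."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False
```

For example, `is_full_url("/about")` rejects a relative link, while `classify_status(301)` labels a permanent redirect as `"redirect"`.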
As this is being created as a proof of concept, it doesn't need to be built with Django unless doing so does not affect the price. It can be launched from a Linux console.
The most important part of this project is that the scraping is made efficient by using multiple threads and by eliminating duplicate URLs in step 4, so that no link is sent a request more than once.
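As a rough sketch of what I mean by de-duplicating before requesting: the example below normalizes each URL, drops repeats, and only then checks the unique set in parallel. It uses the standard library's `ThreadPoolExecutor` as a stand-in for Twisted/Scrapy concurrency, and `normalize`, `check_unique` and the `fetch_status` callback are hypothetical names introduced for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Lower-case scheme and host, default an empty path to "/", drop the fragment.

    Simplification: the whole netloc is lower-cased, including any userinfo.
    """
    parts = urlsplit(url.strip())
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


def check_unique(urls, fetch_status, max_workers=8):
    """Request each distinct URL exactly once, using a pool of worker threads.

    fetch_status is a caller-supplied function (hypothetical here) that takes
    a URL and returns its response status.
    """
    # dict.fromkeys removes duplicates while keeping first-seen order
    unique = list(dict.fromkeys(normalize(u) for u in urls))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        statuses = list(pool.map(fetch_status, unique))
    return dict(zip(unique, statuses))
```

With this approach, `http://Example.com`, `http://example.com/` and `http://example.com/#top` collapse into a single request.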
This project has the potential for additional development if the right developer is found.
Note: Well-commented code is expected.
24 freelancers are bidding on average $167 for this job
Hi. Expert web scraper/data miner here. I'm interested in your project and assure you of 100% accurate, good-quality work. Ready to start. Please have a look at my PM. Regards
Experienced Scrapy developer, working in the data scraping domain for more than 3 years. Ready to start immediately and build a long-term relationship. Kindly review my private message.
Hi, I have worked with crawlers for more than 4 years and I'm very confident I can finish this with high quality in a very short time. Please kindly check your inbox.