Web Scraper using Python, Scrapy, MySQL & JSON
- Status Closed
- Budget N/A
- Total Bids 24
I would like a web scraper that:
1. Retrieves a seed list of URIs from a MySQL database
2. Using multiple threads (Twisted framework) and Scrapy, scrapes each page for links (1 level deep only)
3. Validates each link to ensure it is a full URL
4. Gets the response from the scraped URL (i.e. redirect, OK, not found)
4a. If there is no response, tries a DNS lookup
5. Saves the root address and response results, then imports them into a MySQL table (this can be batched through a JSON file if required)
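To make steps 3, 4, 4a and 5 more concrete, here is a minimal sketch using only the Python standard library. The function names, the status labels and the JSON-lines batch format are assumptions made for illustration; they are not part of the requirements. In a real build these helpers would presumably be called from a Scrapy spider rather than run on their own.

```python
import json
import socket
from urllib.parse import urlparse


def is_full_url(link):
    """Step 3: accept only absolute http(s) URLs that include a host."""
    try:
        parts = urlparse(link)
    except ValueError:
        # urlparse rejects some malformed inputs (e.g. a broken IPv6 host).
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def classify_status(status_code):
    """Step 4: map an HTTP status code to a coarse label (labels are assumed)."""
    if 200 <= status_code < 300:
        return "ok"
    if 300 <= status_code < 400:
        return "redirect"
    if status_code == 404:
        return "not found"
    return "error"


def dns_resolves(hostname):
    """Step 4a: when no HTTP response arrives, check whether the host resolves."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except (socket.gaierror, UnicodeError):
        return False


def to_json_batch(results):
    """Step 5: serialise result rows as JSON lines for a later MySQL import."""
    return "\n".join(json.dumps(row, sort_keys=True) for row in results)
```

Note that `dns_resolves` blocks while it waits for the resolver, so inside a Twisted reactor it would likely need to run in a thread pool or be replaced with an asynchronous lookup.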
As this is being created as a proof of concept, it doesn't need to be created using Django unless this does not affect the price. It can be launched from a Linux console.
The most important part of this project is that the scraping is made efficient by using multiple threads and by eliminating duplicate URLs in step 4, so that the same link is not sent requests multiple times.
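One possible way to eliminate duplicates before step 4 is to normalise each URL and keep only the first occurrence. The sketch below is an illustration only: the helper names and the normalisation rules (lowercase scheme and host, empty path treated as "/", fragment dropped) are assumptions, and Scrapy's built-in duplicate request filter may already cover much of this.

```python
from urllib.parse import urlsplit, urlunsplit


def normalise(url):
    """Reduce a URL to a canonical form so trivial variants compare equal."""
    parts = urlsplit(url)
    path = parts.path or "/"  # treat "http://host" and "http://host/" alike
    # Lowercase the scheme and host; drop the fragment, which is never sent
    # to the server. (Assumes the URL carries no case-sensitive userinfo.)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def unique_urls(urls):
    """Return normalised URLs in first-seen order, without duplicates."""
    seen = set()
    unique = []
    for url in urls:
        key = normalise(url)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique
```

For example, `unique_urls(["http://Example.com", "http://example.com/#top"])` yields a single entry, so only one request would be sent for that page.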
This project has the potential for additional development if the right developer is found.
Note: Well-commented code is expected.