This is much simpler than a typical web crawler. It needs to run as cheaply as possible (preferably on AWS).
The software has 2 simple functions:
1. URLS: Fetch webpages with a multi-threaded approach; the URLs are pulled from the database along with the extraction class to use for each.
2. EXTRACTION CLASSES: Classes that extract data from the fetched HTML following a given pattern and insert the results into the database (also multi-threaded).
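The two functions above could be sketched roughly as follows. This is only an illustrative outline, not the requested deliverable: the names (`TitleExtractor`, `crawl`), the regex-based extraction, the in-memory SQLite table, and the thread-pool layout are all assumptions, and a real bid would swap in the actual fetch code, schema, and extraction patterns from the referenced Perl solution.

```python
# Hypothetical sketch of the two-function design: multi-threaded fetching
# plus pluggable extraction classes that write results to a database.
import re
import sqlite3
from concurrent.futures import ThreadPoolExecutor


class TitleExtractor:
    """Example extraction class: pulls the <title> text via a regex pattern.
    Real extraction classes would each encode their own pattern."""
    pattern = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)

    def extract(self, html):
        match = self.pattern.search(html)
        return match.group(1).strip() if match else None


def crawl(jobs, fetch, db, workers=4):
    """jobs: iterable of (url, extractor) pairs, as pulled from the db.
    fetch: callable mapping a URL to its HTML (injected so it can be
    an HTTP client in production or a stub in tests)."""
    cur = db.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS results (url TEXT, data TEXT)")

    def worker(job):
        url, extractor = job
        return url, extractor.extract(fetch(url))

    # Fetch/extract in parallel; insert from the main thread, since a
    # sqlite3 connection should not be shared across threads.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for url, data in pool.map(worker, jobs):
            cur.execute("INSERT INTO results VALUES (?, ?)", (url, data))
    db.commit()
```

Injecting the `fetch` callable keeps the threading and extraction logic testable without any network access.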
You should follow this Perl approach and make sure your solution achieves similar, if not better, results.
[url removed, login to view]
(Further reading: [url removed, login to view] )
For an experienced programmer I expect this to take no longer than a day, as the instructions are laid out above; the budget is therefore very low, so bid accordingly.