I need a piece of software that will scrape data from a website and extract all pertinent information. All the records on it are public information (about 600k records, I believe). Requirements:

- Take care not to overload the site while scraping.
- Update record data on subsequent runs; I plan to run it once a month, and a run time of 5-6 hours is fine.
- For version 1, filter out records where the location address does not match the mailing address, with logic that tolerates the two fields being formatted differently.
- Filter by "assessed" or "taxable" value; I will need the 2012 or 2013 assessed value, whichever is available.
- Store the data in Microsoft Access with a front end; that is probably the best fit.
- Export the filtered data to Microsoft Excel for mail merges, with the addresses formatted so the output doesn't look like it was pulled straight from a database.
- Build each website scraper as a "module" so I can add new websites to be scraped in the future.
- Add two date fields for my use: "DateUpdated", for when the record was last updated/downloaded, and "LastExport", so I know when I last exported it to Excel.

Please clarify with me the exact data to be extracted so we can be on the same page. Rough sketches of the address matching, monthly update stamping, Excel export, and module structure follow below.
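To illustrate the kind of address-matching logic I mean, here is a minimal sketch in Python. The abbreviation table is only an illustrative subset and the function names are placeholders, not a spec; a real version might use a dedicated address-parsing library instead:

```python
import re

# Common abbreviations to expand before comparing.
# (Illustrative subset only; a fuller USPS table would be needed in practice.)
_ABBREVIATIONS = {
    "st": "street", "ave": "avenue", "rd": "road", "dr": "drive",
    "ln": "lane", "blvd": "boulevard", "apt": "apartment",
    "n": "north", "s": "south", "e": "east", "w": "west",
}

def normalize_address(addr: str) -> str:
    """Lowercase, strip punctuation, and expand common abbreviations."""
    addr = re.sub(r"[^\w\s]", " ", addr.lower())
    return " ".join(_ABBREVIATIONS.get(w, w) for w in addr.split())

def addresses_match(location: str, mailing: str) -> bool:
    """True if the two addresses are the same after normalization."""
    return normalize_address(location) == normalize_address(mailing)

# These two differ only in formatting, so they should count as a match.
assert addresses_match("123 N. Main St., Apt 4",
                       "123 north main street apartment 4")
```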
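For the monthly re-runs, something along these lines is what I have in mind for updating versus inserting, with DateUpdated stamped either way. This is a sketch only, assuming the scraper talks to the Access database over ODBC via pyodbc; the table name, column names, and file path are all hypothetical:

```python
from datetime import datetime

import pyodbc

# Placeholder path and schema; the real ones come out of the
# "exact data to be extracted" discussion.
CONN_STR = (
    r"DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};"
    r"DBQ=C:\data\records.accdb"
)

def upsert_record(cur, record):
    """Update an existing record by parcel ID, or insert a new one.
    Either way, stamp DateUpdated so monthly runs can be audited."""
    now = datetime.now()
    cur.execute(
        "UPDATE Records SET LocationAddr=?, MailingAddr=?, AssessedValue=?, "
        "DateUpdated=? WHERE ParcelID=?",
        record["location"], record["mailing"], record["assessed"],
        now, record["parcel_id"],
    )
    if cur.rowcount == 0:  # no existing row was touched, so insert instead
        cur.execute(
            "INSERT INTO Records "
            "(ParcelID, LocationAddr, MailingAddr, AssessedValue, DateUpdated) "
            "VALUES (?, ?, ?, ?, ?)",
            record["parcel_id"], record["location"], record["mailing"],
            record["assessed"], now,
        )

with pyodbc.connect(CONN_STR) as conn:
    cur = conn.cursor()
    upsert_record(cur, {"parcel_id": "12-345", "location": "123 Main St",
                        "mailing": "PO Box 9", "assessed": 150000})
    conn.commit()
```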
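For the Excel export, the main work is making ALL-CAPS database text read like a hand-addressed letter. A rough sketch using the openpyxl library (column names are placeholders, and this step would also be the natural place to stamp LastExport on each exported record):

```python
from openpyxl import Workbook

def tidy(text: str) -> str:
    """Convert shouting database text ('JOHN SMITH') to title case.
    Crude: it gets 'PO BOX' and names like 'McDonald' wrong, so a real
    version needs a small exception list on top of this."""
    return text.title()

def export_for_mail_merge(rows, path):
    """Write a header row plus one tidied row per record, ready to be
    used as a mail-merge data source in Word."""
    wb = Workbook()
    ws = wb.active
    ws.append(["Owner", "Address", "City", "State", "Zip"])
    for r in rows:
        ws.append([tidy(r["owner"]), tidy(r["mailing"]),
                   tidy(r["city"]), r["state"].upper(), r["zip"]])
    wb.save(path)

export_for_mail_merge(
    [{"owner": "JOHN Q SMITH", "mailing": "123 MAIN STREET",
      "city": "SPRINGFIELD", "state": "il", "zip": "62704"}],
    "mail_merge.xlsx",
)
```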
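By "modules" I mean something like the following structure: each website gets its own scraper class that yields records in one common shape, so adding a site later is just writing and registering a new class. All names below are placeholders:

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterator

class SiteModule(ABC):
    """One self-contained scraper per website."""

    name: str

    @abstractmethod
    def fetch_records(self) -> Iterator[dict]:
        """Yield one dict per record in a site-independent shape
        (parcel_id, location, mailing, assessed, ...)."""

MODULES: Dict[str, SiteModule] = {}

def register(module: SiteModule) -> None:
    MODULES[module.name] = module

# A future site would plug in like this (placeholder name and data):
class CountyAssessorModule(SiteModule):
    name = "county_assessor"

    def fetch_records(self):
        yield {"parcel_id": "12-345", "location": "...", "mailing": "...",
               "assessed": 150000}

register(CountyAssessorModule())
```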
The website to be scraped is [url removed].
I understand that running this software and scraping may take many hours and will be limited by my internet connection. Maybe a multi-threaded design is a good idea?
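A small worker pool with a shared rate limit would let a few downloads overlap without hammering the server. A minimal sketch using Python's concurrent.futures and the requests library; the worker count and spacing are guesses and would need tuning against the 600k-record count and the 5-6 hour budget (for example, if each page returns many records at once, far fewer requests are needed):

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

import requests

MAX_WORKERS = 4       # a few parallel connections, not hundreds
MIN_INTERVAL = 0.5    # seconds between request *starts*, site-wide

_lock = threading.Lock()
_last_request = 0.0

def polite_get(url: str) -> requests.Response:
    """GET with a global rate limit: threads take turns starting requests,
    but the downloads themselves still overlap in flight."""
    global _last_request
    with _lock:
        wait = MIN_INTERVAL - (time.monotonic() - _last_request)
        if wait > 0:
            time.sleep(wait)
        _last_request = time.monotonic()
    return requests.get(url, timeout=30)

def fetch_all(urls):
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        return list(pool.map(polite_get, urls))
```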