I'm looking for expert consulting on the feasibility (hardware and development costs) of scanning several million webpages in real time or close to real time. Longer-term employment may result if you can provide a solution that is feasible and within my budget. If you're interested, you'd need to write a small mini-program to demonstrate the feasibility and speed of your proposed solution. Knowledge of Python and MySQL is a plus, but not required. This project may close as soon as I find someone suitable (it may not stay open the full 60 days).
We currently have several million HTML websites (no RSS feeds) to scan, stored in a database. I'd like them scanned on a real-time or near-real-time basis, because the market data needs to be updated as fast as possible. I'm wondering if Erlang can do this.
Currently, these websites are scanned via brute-force methods using approximately 1,000 IPs and three servers, each with a hex-core Xeon, all connected to a database server on the same network. With this setup, each site gets scanned roughly 1-3 times every 1-2 hours, depending on server load. This is too slow for our needs.
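To give a sense of the kind of mini-program I mean, here is a rough Python sketch of concurrent page scanning with a thread pool. Everything in it is hypothetical (the `fetch_url` function is a simulated stand-in for a real HTTP request so the sketch runs without network access); a real demo would fetch live pages and write results to the database.

```python
import concurrent.futures
import time

def fetch_url(url: str) -> str:
    """Hypothetical stand-in for a real HTTP fetch (e.g. urllib.request.urlopen).
    Simulated here with a short sleep so the sketch runs offline."""
    time.sleep(0.01)  # simulate network latency
    return f"<html>content of {url}</html>"

def scan_pages(urls, max_workers=100):
    """Fetch many pages concurrently; returns a {url: html} dict."""
    results = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every URL at once; the pool keeps max_workers requests in flight.
        future_to_url = {pool.submit(fetch_url, u): u for u in urls}
        for fut in concurrent.futures.as_completed(future_to_url):
            url = future_to_url[fut]
            try:
                results[url] = fut.result()
            except Exception:
                results[url] = None  # a real crawler would log and retry here
    return results

if __name__ == "__main__":
    urls = [f"http://example.com/page{i}" for i in range(500)]
    start = time.time()
    pages = scan_pages(urls, max_workers=50)
    print(f"fetched {len(pages)} pages in {time.time() - start:.2f}s")
```

The point of the sketch is throughput: with 50 workers, the 500 simulated 10 ms fetches complete in well under a second instead of ~5 s sequentially. An acceptable demo could use this approach, an async framework, or Erlang processes, as long as it shows measured pages-per-second numbers.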