I would like somebody to write a program that can parse [url removed, login to view]'s basketball play-by-play summaries into a database. I am open to parsing other sites (CBS, Yahoo, etc.) if they are easier to work with, as long as the data quality and data history are the same or better.
I would prefer if the program is written in Java along with SQL Server 2005 (although MySQL is fine as well). C# and Perl are also acceptable, although I would need more help setting those up on my machine.
As far as the database is concerned, I would like to be able to run fairly complex queries on the dataset, so the more that can be defined on each play, the better. I put together a spreadsheet that gives you a rough idea of what I'm looking for. I didn't put a lot of thought into it, so feel free to propose a better schema. I'm leaning toward a more compact setup, as opposed to one where even simple queries need lots of connecting tables.
I would like to be able to calculate +/- stats easily. That is, I'd like to be able to determine which players are on the floor whenever either team scores. Any schema that makes this easier to do would be ideal. I would like the database to be loaded with stats going back through all years available on ESPN's website.
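To make the +/- requirement a bit more concrete, here is a rough Java sketch of the calculation I have in mind. It assumes each scoring play is stored as one flat row that already lists the players on the floor for both teams. The class name, the field names, and the "HOME"/"AWAY" labels are all made up for illustration and are not a required design.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: all names here are hypothetical, not a required schema.
class PlusMinus {

    // One flat row per scoring play, with both lineups stored on the row itself.
    static final class Play {
        final String scoringTeam;        // assumed to be "HOME" or "AWAY"
        final int points;                // points scored on this play
        final List<String> homeOnFloor;  // home players on the floor
        final List<String> awayOnFloor;  // away players on the floor

        Play(String scoringTeam, int points,
             List<String> homeOnFloor, List<String> awayOnFloor) {
            this.scoringTeam = scoringTeam;
            this.points = points;
            this.homeOnFloor = homeOnFloor;
            this.awayOnFloor = awayOnFloor;
        }
    }

    // Adds the points to every player on the scoring team's side of the floor
    // and subtracts them from every player on the other side.
    static Map<String, Integer> plusMinus(List<Play> plays) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Play p : plays) {
            if (p.points == 0) {
                continue; // non-scoring plays do not affect +/-
            }
            int homeDelta = "HOME".equals(p.scoringTeam) ? p.points : -p.points;
            for (String player : p.homeOnFloor) {
                totals.merge(player, homeDelta, Integer::sum);
            }
            for (String player : p.awayOnFloor) {
                totals.merge(player, -homeDelta, Integer::sum);
            }
        }
        return totals;
    }
}
```

The point of the sketch is that if the lineup is stored on each play row, +/- becomes a single pass over the scoring plays with no joins, which is the kind of compact setup I'm leaning toward.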
Here's an example box score:
[url removed, login to view]
I've attached an Excel document showing one way to populate a simple one-table schema.
I'd like this project to be completed by Friday, November 20th.