I have a .tar.gz that I think contains all the pieces you need; if anything seems to be missing, please let me know. The rough workflow is as follows:
init.sql will set up a database for clustering in the correct format. Ideally I would like to leverage my existing database on GoDaddy, but I would be open to other suggestions. You will need to change the "data" table at the very bottom so that it is a view over your actual page data; it is expected to expose the page id (a unique identifier) and a hash of the DOM. When you run the script you can specify a database schema, and all of the tables will go in that schema.
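To illustrate the shape the "data" view needs, here is a minimal sketch. It uses an in-memory sqlite database as a stand-in for Postgres, and the "pages" table with its "id" and "dom" columns is purely hypothetical; substitute whatever your real crawl tables look like:

```python
import hashlib
import sqlite3

# In-memory sqlite stands in for the real Postgres database; the "pages"
# table and its columns are assumptions -- substitute your actual crawl data.
conn = sqlite3.connect(":memory:")
# sqlite has no built-in md5(), so register one (Postgres has its own).
conn.create_function("md5", 1,
                     lambda s: hashlib.md5(s.encode()).hexdigest())
cur = conn.cursor()
cur.execute("CREATE TABLE pages (id INTEGER PRIMARY KEY, dom TEXT)")
cur.executemany("INSERT INTO pages VALUES (?, ?)",
                [(1, "<html>a</html>"), (2, "<html>b</html>"),
                 (3, "<html>a</html>")])
# "data" is a view over the real page table, exposing the two columns the
# clustering expects: a unique page id and a hash of the DOM.
cur.execute("""CREATE VIEW data AS
               SELECT id AS page_id, md5(dom) AS dom_hash FROM pages""")
for row in cur.execute("SELECT page_id, dom_hash FROM data ORDER BY page_id"):
    print(row)
```

Note that pages 1 and 3 have identical DOMs, so they get the same hash — which is exactly what the clustering keys on.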
Compile qfp.c with "gcc -o qfp qfp.c".
Run get_clusters.py. This script takes a lot of options and will allow you to customize where the database and all of the tables live.
If you have done all this, congratulations, you have clusters in your database! The schema.cluster table contains the actual clusters: for each page it will have a (rep_id, page_id) pair, where rep_id is essentially the cluster id (it is actually just the id of the lowest-numbered page in the cluster). Depending on what you want to do with the clusters, this may be all you need.
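For example, once clustering has finished you can inspect cluster sizes with a simple GROUP BY over the (rep_id, page_id) pairs. A self-contained sketch, again using an in-memory sqlite database with made-up rows in place of your real schema.cluster table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# cluster holds (rep_id, page_id) pairs; rep_id doubles as the cluster id.
# These rows are made up for illustration.
cur.execute("CREATE TABLE cluster (rep_id INTEGER, page_id INTEGER)")
cur.executemany("INSERT INTO cluster VALUES (?, ?)",
                [(1, 1), (1, 4), (1, 9), (2, 2), (2, 7), (5, 5)])
# Count pages per cluster, largest cluster first.
cur.execute("""SELECT rep_id, COUNT(*) AS cluster_size
               FROM cluster
               GROUP BY rep_id
               ORDER BY cluster_size DESC""")
print(cur.fetchall())  # -> [(1, 3), (2, 2), (5, 1)]
```

The same GROUP BY query should work as-is against the real schema.cluster table in Postgres (prefixed with your schema name).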
You can compile WebReport.java with "javac -cp postgresql-8.3-603.jdbc3.jar:. web_clustering/WebReport.java". You may want to make a copy of this file for your modifications; that way you can refer back to the original if you delete too much and screw something up.
If you compile and run web_clustering/WebReport.java, it will generate a web site that shows your clusters, gives screenshots of common pages in the clusters (assuming you have screenshots enabled on Neha's crawler), and lets you look at their DOMs pretty easily. You have to compile and run it from the main folder, not from within web_clustering, because it is part of the web_clustering Java package. Unfortunately, you will need to dig into the Java file to change things like table names, the output location, and the location of your screenshot and DOM files. These are all hard-coded and spread through multiple files, so this part will be a little time-consuming. Run it with the "-M" flag and just delete any code that did not follow that execution path (there is a lot of it; the original author added lots of different options to this code as time went on). Then you will probably need to modify the SQL queries to grab the page data correctly; I am not certain how much work that will be, though.
If you can get this to compile and run, you should be left with an output directory that contains a bunch of folders and files, one of which is "report.html". Opening this file in a web browser will give you a main page that shows your 225 most common clusters, displays the most common screenshots of the pages in those clusters, and has links to more information about the clusters, DOMs, etc.
Let me know if you have any questions about any of this, I would be happy to answer them. Good luck! Happy Bidding!