The requirement is to build a process/pipeline that can take a table (literally a database table) of information about geographically located photos, and place them into meaningful but subjective groups or clusters.
There are many 'dimensions' to the data that could be used to perform the clustering, including geographical coordinates, locality (town/country etc), date taken, textual tags (folksonomy), and photographer. There are also freeform title and description fields, but we've already extracted automated terms from these, so the freeform text doesn't need processing.
All of these should/could be used to perform the clustering; e.g. "taken by Joe Bloggs in April 2012" could be an arbitrary cluster. Clustering should ideally make use of the geographical coordinates to create clusters of nearby photos (which share some other theme, such as being taken by a particular user), but it shouldn't be limited to them; where possible multiple dimensions should be used. The photographer is a good candidate for clustering because a given photographer will often take similar photos in the same geographical area on a given day.
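To illustrate combining several dimensions, here is a minimal Python sketch that buckets photos by photographer, month taken, and a coarse lat/lon grid cell. The field names (`photographer`, `date_taken`, `lat`, `lon`) and the 0.1-degree grid size are illustrative assumptions, not part of the spec:

```python
from collections import defaultdict

def composite_key(photo, grid=0.1):
    """Bucket a photo by photographer, month taken, and a coarse
    lat/lon grid cell (the grid size in degrees is an arbitrary choice)."""
    cell = (int(photo["lat"] // grid), int(photo["lon"] // grid))
    month = photo["date_taken"][:7]          # "YYYY-MM"
    return (photo["photographer"], month, cell)

def group_photos(photos, grid=0.1):
    clusters = defaultdict(list)
    for p in photos:
        clusters[composite_key(p, grid)].append(p)
    return clusters

photos = [
    {"id": 1, "photographer": "Joe Bloggs", "date_taken": "2012-04-03", "lat": 51.451, "lon": -0.97},
    {"id": 2, "photographer": "Joe Bloggs", "date_taken": "2012-04-05", "lat": 51.460, "lon": -0.98},
    {"id": 3, "photographer": "Jane Doe",   "date_taken": "2012-04-05", "lat": 51.451, "lon": -0.97},
]
groups = group_photos(photos)   # two clusters: one per photographer here
```

A real implementation would treat each dimension as a tunable feature rather than a fixed key, but the idea of intersecting photographer, time, and place is the same.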
It will require two modes: 1) 'priming', where a large number of photos (ultimately over 3 million!) are taken and put into clusters; and 2) 'updates', where batches of images are added (about 1,000 at a time), which require placing into the existing clusters or creating new ones.
The 'update' mode should, where possible, aim to add to current clusters; it could delete and recreate some clusters if that gives a better fit, but it also needs to be able to create new clusters where needed. In particular, it should be differential: most clusters will remain the same, with only a few changing; it shouldn't just delete all the clusters and start again. The two modes are closely related and will probably be largely similar (e.g. priming could just be lots of 'updates' starting with no clusters), but some optimization could be possible to tailor each mode.
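A minimal sketch of the differential 'update' mode in Python, assuming clusters are held in memory as a centroid plus member list, and using a plain distance threshold to decide "close enough to join". The 0.05-degree threshold, the Euclidean distance, and the data structures are all illustrative assumptions:

```python
import math

def dist(a, b):
    # Simple Euclidean distance on (lat, lon); a haversine
    # distance would be more accurate over large areas.
    return math.hypot(a[0] - b[0], a[1] - b[1])

def update_clusters(clusters, new_photos, threshold=0.05):
    """Assign each new photo to the nearest existing cluster if it lies
    within `threshold`, otherwise start a new cluster. Only the touched
    clusters change; all others are left exactly as they were."""
    changed = set()
    for photo in new_photos:
        pos = (photo["lat"], photo["lon"])
        best, best_d = None, threshold
        for cid, c in clusters.items():
            d = dist(pos, c["centroid"])
            if d < best_d:
                best, best_d = cid, d
        if best is None:                      # nothing close: new cluster
            best = max(clusters, default=0) + 1
            clusters[best] = {"centroid": pos, "members": []}
        c = clusters[best]
        c["members"].append(photo["id"])
        n = len(c["members"])
        # Update the centroid as a running mean of member positions.
        c["centroid"] = ((c["centroid"][0] * (n - 1) + pos[0]) / n,
                         (c["centroid"][1] * (n - 1) + pos[1]) / n)
        changed.add(best)
    return changed                            # ids of clusters that changed
```

Returning the set of changed cluster ids is what makes the mode differential: only those rows would need rewriting in the backing database.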
The aim would be to have every photo placed in one or more clusters, and ideally clusters should be somewhere on the order of 5-200 images. If a cluster grows much beyond 200 it should be a candidate for splitting. Ideally each cluster should have a label that describes it, e.g. "photos near Reading".
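One simple way to implement the splitting candidate rule, as a Python sketch. The 200-photo cap comes from the brief; splitting at the median of whichever coordinate axis has the greater spread is just one easy choice (a 2-means split would be another):

```python
def split_cluster(members, max_size=200):
    """If a cluster exceeds max_size, split it at the median of whichever
    axis (lat or lon) has the greater spread; recurse until all parts fit.
    `members` is a list of (lat, lon) pairs."""
    if len(members) <= max_size:
        return [members]
    lats = sorted(m[0] for m in members)
    lons = sorted(m[1] for m in members)
    axis = 0 if (lats[-1] - lats[0]) >= (lons[-1] - lons[0]) else 1
    pivot = sorted(members, key=lambda m: m[axis])[len(members) // 2][axis]
    low = [m for m in members if m[axis] < pivot]
    high = [m for m in members if m[axis] >= pivot]
    if not low or not high:   # degenerate: all points identical on this axis
        return [members]
    return split_cluster(low, max_size) + split_cluster(high, max_size)
```

Each resulting part would then get its own label, with the old oversized cluster deleted, consistent with the differential update behaviour described above.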
If K-means or similar is used to cluster geographically, it should be an adaptive algorithm that doesn't require K to be specified up front; i.e. it works out a good number of clusters to create, rather than aiming to create, say, 30 clusters. See http://www.cs.uic.edu/~wilkinson/Applets/cluster.html
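One common way to make K-means adaptive is to grow K until every cluster is tight enough, so K is discovered rather than specified. A pure-Python sketch under that assumption (the radius threshold, the first-K seeding, and basic Lloyd's iterations are all illustrative; established alternatives such as X-means or DBSCAN could stand in):

```python
import math

def kmeans(points, k, iters=20):
    """Basic Lloyd's k-means on (lat, lon) pairs, seeded with the
    first k points for determinism (a real run would seed better)."""
    centroids = list(points[:k])
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            groups[nearest].append(p)
        centroids = [
            (sum(p[0] for p in g) / len(g), sum(p[1] for p in g) / len(g))
            if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids, groups

def adaptive_kmeans(points, max_radius=0.1, k_max=50):
    """Increase K until every point lies within max_radius of its
    centroid, instead of fixing K in advance."""
    for k in range(1, min(k_max, len(points)) + 1):
        centroids, groups = kmeans(points, k)
        worst = max((math.dist(p, centroids[i])
                     for i, g in enumerate(groups) for p in g), default=0.0)
        if worst <= max_radius:
            break
    return centroids, groups
```

The `max_radius` parameter is exactly the kind of knob the brief asks to leave tweakable, since the "right" cluster size is subjective.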
A sample dataset can be supplied (say a table of 120,000 images), but the 'full' dataset of 3.4M images could be used too. For a tiny sample showing the range of columns available, see http://www.nearby.org.uk/geograph/viewsample.php
It can be written in any language (PHP, Python, Java etc), but it needs to be able to run fairly self-contained on a Linux server. MySQL would be the ideal backing database (reading the data from MySQL and writing the clusters to a MySQL table), but others can be considered if they offer a tangible benefit (e.g. PostgreSQL/PostGIS).
The full source code, and the means to compile/run it, will be required. The eventual aim would be to release the source as open source (keep the credit yourself, or assign it to us).
To be clear, the requirement is not to come up with the perfect clustering system; as noted, the clusters are subjective. It is to build the framework, with a working clustering method, such that the exact parameters can be tweaked as required.
Additional Project Description:
04/09/2013 at 16:36 BST
To clarify the closing statement: the project is to build a system capable of ingesting millions of records (over a gigabyte of raw data) and producing a large number of clusters, easily 300,000-400,000 of them. It must then, on a regular basis, update these with new images, creating and updating a relatively small number of those clusters.