To evaluate one or more ways of encoding unstructured text so that sensible reasoning can be done about page contents, or about the relationships between different web pages.
An increasing number of "web intelligence" research ideas (as well as existing applications) depend on being able to reason about the content of web pages based purely on the statistics of the words they contain. Understanding pages by full natural language processing is extraordinarily difficult and has so far had only limited success on unstructured free text. However, determining whether two different web pages are about similar topics *can* be done, based on "bag of words" statistics. There are many open research issues and unanswered questions here, and projects in this line will address them. In all cases the student will probably need to write a simple parser that finds all the words in a page and their frequencies; any given web page can then be converted to a real-number vector.
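As a minimal sketch of that conversion step (the function names and the toy vocabulary here are illustrative, not part of the project specification): count word frequencies in a page's text, then map the counts onto a fixed vocabulary to get a real-number vector.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Parse out lowercase alphabetic words and count their occurrences."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words)

def to_vector(freqs, vocabulary):
    """Map a page's word counts onto a fixed vocabulary, normalising
    so the entries are relative frequencies (they sum to at most 1)."""
    total = sum(freqs.values()) or 1
    return [freqs.get(w, 0) / total for w in vocabulary]

# Toy example: one "page" and a four-word vocabulary.
freqs = word_frequencies("The cat sat on the mat")
vec = to_vector(freqs, ["cat", "dog", "mat", "the"])
```

Here `vec` is `[1/6, 0, 1/6, 2/6]`: six words in total, of which "the" occurs twice and "dog" not at all. A real project would build the vocabulary from the whole page collection rather than fix it by hand.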
(1) Find good ways to visualise a set of pages in two dimensions, e.g. based on a Self-Organising Map, or by using a genetic algorithm to optimise the clustering of the pages.
(2) Build vectors based on *pairs* of words (bigrams) rather than single words, which may lead to better clustering of pages.
(3) Investigate the accuracy of a variety of machine learning methods (e.g. decision trees, and/or any of the many methods available off the shelf in Weka) for classifying pages into categories based on their vector encodings.
(4--10) Many other possibilities.
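For option (2), the change from single words to word pairs is small: instead of counting each word, count each adjacent pair. A minimal sketch (the function name is illustrative):

```python
from collections import Counter

def bigram_frequencies(words):
    """Count adjacent word pairs (bigrams) in the ordered word list
    parsed from a page, instead of counting single words."""
    return Counter(zip(words, words[1:]))

# Toy example: the pair ("web", "page") occurs twice below.
counts = bigram_frequencies(["web", "page", "web", "page"])
```

The resulting pair counts can be mapped onto a vocabulary of pairs exactly as in the single-word case, though the vocabulary is much larger, which is itself one of the research questions.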
For test data we will use categorised sets of pages from www.dmoz.org and/or similar directories.
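Because the test pages come pre-categorised, classification accuracy (as in option 3) can be measured directly: train on one subset, predict categories for the rest, and count the fraction predicted correctly. A hedged sketch using a simple nearest-centroid baseline (not a method the project prescribes; the function names and the tiny vectors are illustrative):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def nearest_centroid_accuracy(train, test):
    """train/test: lists of (vector, category) pairs.
    Classify each test vector by its most similar category centroid
    and return the fraction classified correctly."""
    # Average the training vectors of each category into one centroid.
    by_cat = {}
    for vec, cat in train:
        by_cat.setdefault(cat, []).append(vec)
    centroids = {
        cat: [sum(col) / len(vecs) for col in zip(*vecs)]
        for cat, vecs in by_cat.items()
    }
    correct = sum(
        1 for vec, cat in test
        if max(centroids, key=lambda c: cosine(vec, centroids[c])) == cat
    )
    return correct / len(test)

# Toy example with two categories and two-dimensional vectors.
train = [([1.0, 0.0], "a"), ([0.9, 0.1], "a"), ([0.0, 1.0], "b")]
test = [([1.0, 0.05], "a"), ([0.1, 1.0], "b")]
acc = nearest_centroid_accuracy(train, test)
```

The same train/test split and accuracy measure apply unchanged when the classifier is a decision tree or any Weka method; only the prediction step differs.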