Statistical Web Intelligence

This project received 15 bids from talented freelancers with an average bid price of $687 USD.

Project Description

The aim is to evaluate one or more ways of encoding unstructured text so that sensible reasoning can be done about page contents or the relationships between different web pages.

A growing number of "web intelligence" research ideas (as well as existing applications) depend on being able to reason about the content of web pages based purely on the statistics of the words they contain. Understanding pages through natural language processing is extraordinarily difficult and has so far had only minor success in the domain of unstructured free text. However, deciding whether two different web pages are about similar topics *can* be done, based on "bag of words" statistics. There are many research issues and unanswered questions here, and projects in this line will address them. In all cases the student will probably need to write a simple parser that finds all the words in a page and their frequencies. From those frequencies, any given web page can be converted to a real-valued vector.
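
As a rough illustration of the parser and vector encoding described above, here is a minimal sketch in Python. The function names, the three sample pages, the use of relative frequencies, and the choice of cosine similarity as the comparison measure are all assumptions made for this example; the project description does not prescribe any of them.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Lower-case the text and count every run of letters as a word."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def to_vector(counts, vocabulary):
    """Relative frequency of each vocabulary word, in vocabulary order."""
    total = sum(counts.values()) or 1  # avoid dividing by zero on empty pages
    return [counts[word] / total for word in vocabulary]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical page texts, standing in for pages already stripped of markup.
pages = {
    "a": "Cats and dogs are popular pets. Cats sleep a lot.",
    "b": "Dogs and cats make good pets.",
    "c": "The stock market fell sharply today.",
}

counts = {name: word_counts(text) for name, text in pages.items()}
# One shared vocabulary so that every page maps to a vector of the same length.
vocabulary = sorted(set().union(*counts.values()))
vectors = {name: to_vector(c, vocabulary) for name, c in counts.items()}

sim_ab = cosine(vectors["a"], vectors["b"])
sim_ac = cosine(vectors["a"], vectors["c"])
print(f"a vs b: {sim_ab:.3f}   a vs c: {sim_ac:.3f}")
```

Pages "a" and "b" share several words and so score higher than "a" and "c", which share none. A real parser would also need to strip HTML markup and would probably filter very common words, neither of which is attempted here.
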
Possible projects include:
(1) Find good ways to visualise a set of pages in two dimensions, based on a self-organising map or on a genetic algorithm that optimises the clustering of the pages.
(2) Build vectors based on *pairs* of words rather than single words, which may lead to better clustering of pages.
(3) Investigate the accuracy of a variety of machine learning methods (e.g. decision trees, and/or the many methods readily available by downloading Weka) for classifying pages into categories based on their vector encodings.
(4-10) Many other possibilities.
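
To make option (2) concrete, the sketch below counts pairs of words instead of single words. It assumes that a "pair" means two adjacent words in the text; the description does not say which kind of pair is intended, and co-occurrence within a sentence or a window would be equally valid readings. The function names and sample sentences are hypothetical.

```python
import re
from collections import Counter


def tokens(text):
    """Lower-case the text and return its words in order."""
    return re.findall(r"[a-z]+", text.lower())


def single_word_counts(text):
    """Bag-of-words counts: word order is discarded."""
    return Counter(tokens(text))


def word_pair_counts(text):
    """Counts of adjacent word pairs, which keep some local word order."""
    words = tokens(text)
    return Counter(zip(words, words[1:]))


first = "dog bites man"
second = "man bites dog"

# Single-word counts cannot tell these two texts apart ...
print(single_word_counts(first) == single_word_counts(second))
# ... but the pair counts differ, because the word order differs.
print(word_pair_counts(first) == word_pair_counts(second))
```

The trade-off is that the number of distinct pairs grows much faster than the number of distinct words, so pair-based vectors are far longer and sparser than single-word ones.
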

For test data, we will use categorised sets of pages from [url removed, login to view] and/or similar sources.
