Short version:
We're looking for a SQUID 3 content-filter extension. It should be connected via ICAP and is therefore an ICAP server. This ICAP server should be able to detect the language of a text or HTML document and check against a MySQL database whether that language is allowed for a specific user. Of course, this functionality doesn't have to be written from scratch: GPL code is allowed (some useful projects are listed below), although LGPL or a similar license is preferred.
Long version:
In this setup, the SQUID proxy server (version 3) functions as an ICAP client and feeds the ICAP server all data coming from the web. The ICAP server has to take a closer look at responses with text/* content types; only the body is of interest. All HTML tags should be stripped away (please check existing libraries), and 200 to 500 characters of the remaining content should be used to guess the language (see the libtextcat project mentioned below). If the language cannot be detected or guessed, it is 'unknown'.
The guessed language should then be compared with the user-specific allowed languages. If the language is not allowed, the ICAP server should send an HTTP redirect header pointing to a configurable URL, with the language and the requested URL passed as URL-encoded parameters. The user can be identified by IP address. The allowed languages can be retrieved via a MySQL query. The connection should be made at startup, and a reconnect should take place if the link is lost. The allowed languages per user should be cached for a configurable amount of time. Please find the database design below.
The ICAP server should be written in C and should be multi-threaded (see the c-icap project mentioned below). The MySQL connection parameters, the redirect URL, and the cache time should be read from a configuration file.
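To make the text-sampling and redirect steps concrete, here is a minimal sketch in plain C. It is an illustration under stated assumptions, not a required design: the function names, the query parameter names "lang" and "url", and the fixed buffer sizes are all hypothetical, and the tag stripper is deliberately naive (no entity decoding, no special handling of script or style blocks). An existing HTML library may well be the better choice.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy at most max_chars characters of visible text
 * from html into out, skipping everything between '<' and '>' and
 * collapsing whitespace runs. Returns the number of characters written. */
size_t strip_tags_sample(const char *html, char *out, size_t out_size,
                         size_t max_chars)
{
    size_t n = 0;
    int in_tag = 0;
    int pending_space = 0;

    if (out_size == 0)
        return 0;
    for (const char *p = html;
         *p != '\0' && n < max_chars && n + 1 < out_size; p++) {
        unsigned char c = (unsigned char)*p;
        if (in_tag) {
            if (c == '>') {          /* a tag also acts as a word break */
                in_tag = 0;
                pending_space = 1;
            }
        } else if (c == '<') {
            in_tag = 1;
        } else if (isspace(c)) {
            pending_space = 1;
        } else {
            if (pending_space && n > 0) {
                /* stop early rather than end the sample on a space */
                if (n + 2 > max_chars || n + 2 >= out_size)
                    break;
                out[n++] = ' ';
            }
            pending_space = 0;
            out[n++] = (char)c;
        }
    }
    out[n] = '\0';
    return n;
}

/* Percent-encode src into dst; only RFC 3986 unreserved characters pass
 * through unchanged. Returns 0 on success, -1 if dst is too small. */
int url_encode(const char *src, char *dst, size_t dst_size)
{
    static const char hex[] = "0123456789ABCDEF";
    size_t n = 0;

    if (dst_size == 0)
        return -1;
    for (; *src != '\0'; src++) {
        unsigned char c = (unsigned char)*src;
        int unreserved = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
                         (c >= '0' && c <= '9') || strchr("-_.~", c) != NULL;
        if (unreserved) {
            if (n + 1 >= dst_size)
                return -1;
            dst[n++] = (char)c;
        } else {
            if (n + 3 >= dst_size)
                return -1;
            dst[n++] = '%';
            dst[n++] = hex[c >> 4];
            dst[n++] = hex[c & 0x0F];
        }
    }
    dst[n] = '\0';
    return 0;
}

/* Build "<base>?lang=<language>&url=<original URL>". The parameter names
 * "lang" and "url" are assumptions; the brief does not name them. */
int build_redirect_url(const char *base, const char *lang, const char *url,
                       char *out, size_t out_size)
{
    char enc_lang[64];
    char enc_url[2048];

    if (url_encode(lang, enc_lang, sizeof enc_lang) != 0 ||
        url_encode(url, enc_url, sizeof enc_url) != 0)
        return -1;
    int len = snprintf(out, out_size, "%s?lang=%s&url=%s",
                       base, enc_lang, enc_url);
    return (len < 0 || (size_t)len >= out_size) ? -1 : 0;
}
```

In the actual server, the sample produced by strip_tags_sample would be handed to the language guesser, and the result of build_redirect_url would go into the Location header of the redirect response. Both helpers use only caller-supplied and stack buffers, so they can be called from several worker threads at once.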
The source code should be well documented and performance/throughput is very important.
Possibly useful projects (optional):
C-ICAP:
[login to view URL]
Language Guessing:
[login to view URL]
MySQL:
libmysqlclient
Database design:
users: user_id, ip_address
languages: language_id, language_name
allowed_languages: user_id, language_id
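As a rough illustration of how these three tables could be queried, the sketch below builds the lookup statement for one client and compares a guessed language against the returned names. It rests on several assumptions that the brief does not confirm: ip_address is stored as a dotted-quad IPv4 string, language_name holds the same identifiers the language guesser returns, and the function names are hypothetical. The address is validated with inet_pton before it is placed in the SQL text, so no untrusted characters can reach the query.

```c
#include <arpa/inet.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write the query that lists the allowed language
 * names for one client IPv4 address. Returns 0 on success, -1 if the
 * address is not valid IPv4 or the buffer is too small. */
int build_allowed_languages_query(const char *ip, char *out, size_t out_size)
{
    struct in_addr addr;
    char canonical[INET_ADDRSTRLEN];

    /* Reject anything that is not a plain dotted-quad address. */
    if (inet_pton(AF_INET, ip, &addr) != 1)
        return -1;
    if (inet_ntop(AF_INET, &addr, canonical, sizeof canonical) == NULL)
        return -1;

    int len = snprintf(out, out_size,
        "SELECT l.language_name "
        "FROM users u "
        "JOIN allowed_languages a ON a.user_id = u.user_id "
        "JOIN languages l ON l.language_id = a.language_id "
        "WHERE u.ip_address = '%s'", canonical);
    return (len < 0 || (size_t)len >= out_size) ? -1 : 0;
}

/* Return 1 if lang appears in the allowed list (for example the rows
 * fetched by the query above and kept in the per-user cache), else 0. */
int language_allowed(const char *lang, const char *const *allowed,
                     size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (strcmp(lang, allowed[i]) == 0)
            return 1;
    }
    return 0;
}
```

The generated string would then be passed to mysql_query() from libmysqlclient, and the resulting names stored in the per-user cache together with a fetch timestamp so that they can expire after the configured time. With this comparison, 'unknown' is blocked unless it is explicitly present in a user's allowed list; whether that is the desired behavior is left open by the brief.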