Squid log processing

Status: In progress
Bids: 5
Avg Bid (USD): $860
Project Budget (USD): $250 - $750

Project Description:
Currently we use a customized mysar data collector (a C program) to pump regular squid [url removed, login to view] data into MySQL for processing. Based on the mysar program reports, we are able to pump around 1000 - 1200 records per second.

This is almost at the limit. We are looking for suggestions on what kind of DB to use. We are currently looking into Hadoop, MongoDB and so on, ideally something as close to MySQL as possible (in terms of how the data is queried). The DB system should be able to extend to multiple servers (horizontal expansion).

What we are looking for in this job is:

1. A suggestion on what kind of DB to use, plus server requirements (processor + RAM)
2. A new data loader (script or C program) to load the default squid [url removed, login to view] into the DB. This should be a high-performance loader that handles at least 10k lines per second. We will provide data for testing. Currently our data is around 10GB-15GB per day.

Requirements - basically the same functions as mysar, as below
-> Cut the domain to at most 3 segments, e.g. [url removed, login to view] -> saved as [url removed, login to view], except for IPs
-> Tag the access as cached (TCP_HIT, xxx, xxx) or not cached (TCP_MISS, xxx, xxx) as in mysar
-> If possible, be able to stop and restart processing in the middle of a file.
-> If possible, no parameters should be saved in the DB. Mysar stores certain parameters in the DB; avoiding that is better. Only traffic info should be in the DB.
-> Store the whole URL for keyword searching, if possible. Sometimes we search for a certain keyword such as porn, and the application returns the result as
Host, URL, Byte. This search is usually for 1 particular day only.
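
To make the domain and cache-tag requirements above concrete, here is a minimal Python sketch, not the mysar implementation. It assumes squid's default native log layout (time, elapsed, client, result/status, bytes, method, URL, ...). The rule "any result code containing HIT counts as cached" is an assumption, since the exact code lists are not given here, and all function names are hypothetical.

```python
import ipaddress
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercase host of a logged URL (CONNECT lines log 'host:port')."""
    if "://" not in url:
        url = "//" + url  # let urlsplit treat 'host:port' as a network location
    return (urlsplit(url).hostname or "").lower()


def is_ip(host):
    """True if host is a literal IPv4/IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def trim_domain(host, max_segments=3):
    """Keep only the last max_segments labels of a domain; leave IPs untouched."""
    if is_ip(host):
        return host
    return ".".join(host.split(".")[-max_segments:])


def is_cached(result_field):
    """Assumption: a result code containing 'HIT' (e.g. TCP_HIT) means cached."""
    return "HIT" in result_field.split("/")[0]


def parse_line(line):
    """Parse one native-format access.log line into a dict, or None if malformed."""
    fields = line.split()
    if len(fields) < 7:
        return None
    return {
        "time": float(fields[0]),
        "client": fields[2],
        "cached": is_cached(fields[3]),
        "bytes": int(fields[4]),
        "site": trim_domain(host_of(fields[6])),
        "url": fields[6],  # full URL kept for keyword searching
    }
```

A loader built on this could also record the byte offset of the last committed line in a small side file, which would allow stopping and restarting in the middle of a log without storing parameters in the DB.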


Additional requirements (not available in mysar)
-> Translate/tag IP to zone. We have a table that maps IP subnets to zones. Data in the table looks like
172.30.10.0/23 Zone A
172.30.12.0/23 Zone A
172.30.14.0/23 Zone B

We have about 20 zones. A few IP subnets point to the same zone. The zone list must be expandable in the future.
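
The subnet-to-zone lookup could be sketched as below, using Python's standard ipaddress module. The three table rows are the ones listed above; the "unknown" fallback, the cache size, and the function name are assumptions for illustration. With about 20 zones a linear scan per new client IP is likely fast enough, and caching results per IP avoids repeating it for every log line.

```python
import ipaddress
from functools import lru_cache

# Rows copied from the sample mapping table; in production this would be
# loaded from the real table so new zones can be added without code changes.
ZONE_TABLE = [
    ("172.30.10.0/23", "Zone A"),
    ("172.30.12.0/23", "Zone A"),
    ("172.30.14.0/23", "Zone B"),
]

NETWORKS = [(ipaddress.ip_network(cidr), zone) for cidr, zone in ZONE_TABLE]


@lru_cache(maxsize=65536)
def zone_for(ip, default="unknown"):
    """Return the zone whose subnet contains ip, or default if none matches."""
    addr = ipaddress.ip_address(ip)
    for network, zone in NETWORKS:
        if addr in network:
            return zone
    return default
```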

-> Tag with the server IP. We have about 10 servers, so we need to know which server each log came from.

3. Sample PHP reports (how to query the data)
Samples -
1) Top 20 web sites (URL destination) for a particular month -> Site, No of Access, No of User, Total Byte, % cache hit, % cache byte hit
2) Top 20 hosts for a particular month -> IP, No of Site, Total Byte, % cache hit, % cache byte hit
3) Reports 1 & 2, but for each zone
4) Total User, Bytes, Request, % cache hit, % cache byte hit for a particular month
5) Report 4 for each squid server
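
As an illustration of report 1, here is a rough sketch of the aggregation in SQL, run through Python's built-in sqlite3 only so that it is self-contained. The table name traffic, its columns, the demo rows, and ranking by number of accesses are all assumptions; the real schema and SQL dialect depend on the DB that is chosen.

```python
import sqlite3

TOP_SITES_SQL = """
SELECT site,
       COUNT(*)                                   AS accesses,
       COUNT(DISTINCT client)                     AS users,
       SUM(bytes)                                 AS total_bytes,
       100.0 * SUM(cached) / COUNT(*)             AS pct_cache_hit,
       100.0 * SUM(CASE WHEN cached = 1 THEN bytes ELSE 0 END)
             / SUM(bytes)                         AS pct_cache_byte_hit
FROM traffic
WHERE day >= ? AND day < ?
GROUP BY site
ORDER BY accesses DESC
LIMIT 20
"""


def make_demo_db():
    """Build an in-memory table with a few made-up rows (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE traffic (day TEXT, site TEXT, client TEXT, "
        "bytes INTEGER, cached INTEGER)"
    )
    conn.executemany(
        "INSERT INTO traffic VALUES (?, ?, ?, ?, ?)",
        [
            ("2013-01-05", "www.example.com", "172.30.10.5", 1000, 1),
            ("2013-01-06", "www.example.com", "172.30.10.6", 3000, 0),
            ("2013-01-07", "cdn.example.net", "172.30.10.5", 500, 0),
            ("2013-02-01", "www.example.com", "172.30.10.5", 999, 1),
        ],
    )
    return conn


def top_sites(conn, month_start, next_month_start):
    """Top 20 sites by access count for days in [month_start, next_month_start)."""
    return conn.execute(TOP_SITES_SQL, (month_start, next_month_start)).fetchall()
```

Reports 2 to 5 would follow the same pattern, grouping by client IP, zone, or server instead of site.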

Additional Project Description:
01/20/2013 at 9:00 IST
When you bid, please message me what kind of DB you want to use. We might need your help to tune the DB installation later on if the performance is much lower than claimed or demonstrated.

Sample access.log as below

http://v.netboxs.com/access.log.bz2

The data loader should be a complete product that can be used in production and runs on FreeBSD 8.1.
The PHP file can be simple PHP that demonstrates the DB queries.

Skills required:
Big Data, Hadoop, NoSQL Couch & Mongo
About the employer:
Verified


Bids:
ngcomp: $1500 in 14 days
varun580: $799 in 13 days
glmanoj: $500 in 10 days
zeke: $500 in 5 days
$1000 in 14 days