In Progress

Squid log processing

Currently we are using a customized MySAR data collector (a C program) to pump regular Squid [url removed, login to view] data into MySQL for processing. Based on the MySAR program's reports, we are able to pump around 1000-1200 records per second.

This is close to the limit, so we are looking for suggestions on what kind of DB to use. We are currently looking into Hadoop, MongoDB and so on, staying as close to MySQL as possible (in terms of querying the data). The DB system should be able to extend across multiple servers (horizontal expansion).

What we are looking for in this job is:

1. A suggestion on what kind of DB to use, plus server requirements (processor + RAM).

2. A new data loader (script or C program) to load the default Squid [url removed, login to view] into the DB. This should be a high-performance loader handling at least 10k lines per second. We will provide data for testing; currently our data is around 10GB-15GB per day.

Requirements - basically the same functions as MySAR, as below (a small parsing sketch follows this list):

-> Cut the domain down to up to 3 segments, e.g. [url removed, login to view] -> save as [url removed, login to view], except for IPs.

-> Tag the access as cached (TCP_HIT, xxx, xxx) or not cached (TCP_MISS, xxx, xxx), as in MySAR.

-> If possible, be able to stop and restart processing in the middle of a file.

-> If possible, no parameters should be saved in the DB (unlike MySAR, which stores certain parameters in the DB). Avoiding this is better; only traffic info should be in the DB.

-> Store the whole URL for keyword searching, if possible. Sometimes we search for a certain keyword such as porn, and the application will return the result as Host, URL, Bytes. This search is usually for 1 particular day only.
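
For illustration only, here is a minimal Python sketch of that parsing step (the posting asks for a script or C program, so treat this as pseudocode for whichever language is chosen). Field positions assume Squid's default native access.log format, and the cached/not-cached rule simply checks for HIT in the result code, which approximates MySAR's behaviour:

# Minimal sketch, not the final loader: parse one native-format Squid access.log
# line, cut the domain to its last 3 labels (except plain IPs) and tag cache hits.
import ipaddress
from urllib.parse import urlsplit

def cut_domain(host, max_labels=3):
    """Keep only the last max_labels dot-separated labels; leave IP addresses alone."""
    try:
        ipaddress.ip_address(host)
        return host                               # plain IP: store unchanged
    except ValueError:
        return ".".join(host.split(".")[-max_labels:])

def is_cached(result_code):
    """Treat any Squid result code containing HIT (TCP_HIT, TCP_MEM_HIT, ...) as cached."""
    return "HIT" in result_code

def parse_line(line):
    # Default "squid" log format:
    # time elapsed client code/status bytes method URL user hierarchy/peer type
    parts = line.split()
    if len(parts) < 7:
        return None                               # skip malformed / truncated lines
    ts, elapsed, client, code_status, size, method, url = parts[:7]
    result_code = code_status.split("/", 1)[0]    # e.g. TCP_MISS/200 -> TCP_MISS
    if "://" in url:
        host = urlsplit(url).hostname or ""
    else:
        host = url.split(":", 1)[0]               # CONNECT lines look like host:port
    return {
        "ts": float(ts),
        "client_ip": client,
        "host": cut_domain(host),
        "url": url,                               # whole URL kept for keyword search
        "bytes": int(size),
        "cached": is_cached(result_code),
    }

A production loader would batch the resulting records into bulk inserts and remember the current file offset, so processing can stop and resume in the middle of a file.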

Additional requirements (not available in MySAR) - a zone-mapping sketch follows this list:

-> Translate/tag IPs to zones. We have a table mapping IPs to zones, with entries like Zone A, Zone A, Zone B.

We have about 20 zones, and a few IP subnets point to the same zone. The zones must be expandable in the future.

-> Tag with the server IP. Since we have about 10 servers, we need to know which server each log line came from.
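
A possible shape for the zone tagging, again only a sketch: the subnets and zone names below are made-up placeholders standing in for the real mapping table, which would simply gain rows as new subnets or zones are added.

import ipaddress

# Placeholder mapping table: several subnets may point at the same zone,
# and adding a row is all it takes to extend the scheme.
ZONE_TABLE = [
    (ipaddress.ip_network("10.1.0.0/16"), "Zone A"),
    (ipaddress.ip_network("10.2.0.0/16"), "Zone A"),
    (ipaddress.ip_network("10.3.0.0/16"), "Zone B"),
]

def zone_for(ip):
    """Return the zone for a client IP, or 'Unknown' if no subnet matches."""
    addr = ipaddress.ip_address(ip)
    for network, zone in ZONE_TABLE:
        if addr in network:
            return zone
    return "Unknown"

At 10k+ lines per second a linear scan over ~20 subnets is still cheap; for the server tag, the loader could simply be started with the originating Squid server's IP as a parameter and stamp it on every record.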

3. A sample PHP report (how to query the data).

Samples (a query sketch follows this list):

1) Top 20 web sites (URL destinations) for a particular month -> Site, No. of Accesses, No. of Users, Total Bytes, % cache hit, % cache byte hit

2) Top 20 hosts for a particular month -> IP, No. of Sites, Total Bytes, % cache hit, % cache byte hit

3) Reports 1 & 2, but for each zone

4) Total Users, Bytes, Requests, % cache hit, % cache byte hit for a particular month

5) Report 4 for each Squid server
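
To make the expected query shape concrete, here is a sketch of report 1 (Top 20 web sites for a particular month) written as a MongoDB aggregation pipeline via pymongo. MongoDB is only one of the candidate DBs, and the collection and field names (access, host, client_ip, bytes, cached, ts) are assumptions carried over from the loader sketch above; the same pipeline could be issued from a simple PHP script using the MongoDB driver.

# Sketch of report 1: top 20 sites by traffic for one month, with user counts
# and cache-hit percentages. Collection/field names are assumed, not final.
from datetime import datetime, timezone
from pymongo import MongoClient

def top_sites(db, year, month, limit=20):
    start = datetime(year, month, 1, tzinfo=timezone.utc).timestamp()
    end = datetime(year + (month == 12), month % 12 + 1, 1, tzinfo=timezone.utc).timestamp()
    pipeline = [
        {"$match": {"ts": {"$gte": start, "$lt": end}}},
        {"$group": {
            "_id": "$host",
            "accesses": {"$sum": 1},
            "users": {"$addToSet": "$client_ip"},
            "total_bytes": {"$sum": "$bytes"},
            "hits": {"$sum": {"$cond": ["$cached", 1, 0]}},
            "hit_bytes": {"$sum": {"$cond": ["$cached", "$bytes", 0]}},
        }},
        {"$project": {
            "site": "$_id",
            "accesses": 1,
            "user_count": {"$size": "$users"},
            "total_bytes": 1,
            "pct_cache_hit": {"$multiply": [{"$divide": ["$hits", "$accesses"]}, 100]},
            "pct_cache_byte_hit": {"$cond": [
                {"$eq": ["$total_bytes", 0]}, 0,
                {"$multiply": [{"$divide": ["$hit_bytes", "$total_bytes"]}, 100]},
            ]},
        }},
        {"$sort": {"total_bytes": -1}},
        {"$limit": limit},
    ]
    return list(db.access.aggregate(pipeline))

# Usage example: top_sites(MongoClient()["squidlogs"], 2012, 10)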

When you bid, please message me what kind of DB you want to use. We might need your help to tune the DB installation later on if the performance is much lower than claimed or demonstrated.

A sample access.log is below.

The data loader should be a complete product - usable in production and running on FreeBSD 8.1.
The PHP file can be simple PHP to demonstrate the DB queries.

Skills: Big Data, Hadoop, NoSQL Couch & Mongo

About the Employer:
( 0 reviews ) Bandar Baru Bangi, Malaysia

Project ID: #4138088

Awarded to:


I have hands-on experience in building end-to-end NoSQL solutions. I also have clusters set up for different DBs, so I would be able to give you benchmark comparisons between different DB options. I have sent ...

$799 USD in 13 days
(0 Reviews)

4 freelancers are bidding on average $825 for this job


Certified Hadoop developer.

$1500 USD in 14 days
(1 Review)

Hi, I'm interested in that project - please check PM for more details.

$1000 USD in 14 days
(0 Reviews)

Available to start immediately and finish as soon as possible.

$500 USD in 5 days
(0 Reviews)

I have worked with data processing systems that have processed 70M transactions in 20 minutes using multi-threading and multi-processing. Currently working on Hadoop and big data technologies for analytics.

$500 USD in 10 days
(0 Reviews)