We currently use a customized mysar data collector (a C program) to pump regular squid [url removed, login to view] data into MySQL for processing. Based on the mysar program's reports, we are able to load around 1,000 - 1,200 records per second.
This is close to the limit, so we are looking for suggestions on what kind of DB to use. We are currently looking into Hadoop, MongoDB, and so on; ideally something as close to MySQL as possible in terms of querying the data. The DB system should be able to scale out to multiple servers (horizontal expansion).
What we are looking for in this job is:
1. A suggestion on what kind of DB to use + server requirements (processor + RAM)
2. A new data loader (script or C program) to load the default squid [url removed, login to view] into the DB. This should be a high-performance loader handling at least 10k lines per second. -> We will provide data for testing. Our data is currently around 10GB-15GB per day.
Requirements - basically the same functions as mysar, as below:
-> Cut the domain down to at most 3 segments, e.g. [url removed, login to view] -> save as [url removed, login to view]; IP addresses are stored unchanged.
-> Tag each access as cached (TCP_HIT, xxx, xxx) or not cached (TCP_MISS, xxx, xxx), as in mysar.
-> If possible, be able to stop and resume processing in the middle of a file.
-> If possible, no configuration parameters should be saved in the DB (unlike mysar, which stores certain parameters there). Avoid this if you can; only traffic info should be in the DB.
-> Store the whole URL for keyword searching, if possible. Sometimes we search for a certain keyword such as porn, and the application should return the results as
Host, URL, Bytes. Such a search usually covers one particular day only.
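As a minimal sketch of the two per-record transformations above (domain truncation and cache tagging), assuming the loader ends up as a script; the function names are my own, not mysar's:

```python
import re

def truncate_domain(host: str, max_labels: int = 3) -> str:
    """Keep only the last `max_labels` labels of a hostname,
    e.g. www.sub.example.com -> sub.example.com.
    IP addresses are stored unchanged, as the requirement states."""
    # crude IPv4 check; a production loader would also handle IPv6
    if re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host):
        return host
    labels = host.split(".")
    return ".".join(labels[-max_labels:])

def is_cache_hit(result_code: str) -> bool:
    """Squid result codes look like TCP_HIT/200 or TCP_MISS/200;
    any code whose first part contains HIT counts as served from cache
    (this also covers TCP_MEM_HIT, TCP_IMS_HIT, etc.)."""
    return "HIT" in result_code.split("/")[0]
```

Storing the hit flag as a 0/1 column at load time makes the "% cache hit" report columns a simple SUM/COUNT at query time.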
Additional requirements (not available in mysar):
-> Translate/tag IPs to zones. We have a table mapping IP subnets to zones; the data looks like:
172.30.10.0/23 Zone A
172.30.12.0/23 Zone A
172.30.14.0/23 Zone B
We have about 20 zones, and a few IP subnets can point to the same zone. The zone list must be expandable in the future.
-> Tag each record with the server IP. Since we have about 10 servers, we need to know which server each log entry came from.
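The zone tagging above is a longest-prefix-style CIDR lookup. A minimal sketch, assuming the zone table is loaded from a file or DB into memory at loader start (the example entries mirror the table in this posting):

```python
import ipaddress

# Example zone table copied from the posting; in production this
# would be loaded from the mapping table, not hard-coded.
ZONE_TABLE = [
    (ipaddress.ip_network("172.30.10.0/23"), "Zone A"),
    (ipaddress.ip_network("172.30.12.0/23"), "Zone A"),
    (ipaddress.ip_network("172.30.14.0/23"), "Zone B"),
]

def ip_to_zone(ip: str):
    """Return the zone for an IP, or None if no subnet matches.
    A linear scan is fine for ~20 zones; a radix/LPM trie would
    scale better if the table grows substantially."""
    addr = ipaddress.ip_address(ip)
    for net, zone in ZONE_TABLE:
        if addr in net:
            return zone
    return None
```

Resolving the zone once at load time and storing it as a column keeps the per-zone reports cheap, instead of recomputing subnet membership at query time.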
3. Sample PHP reports (showing how to query the data):
1) Top 20 web sites (URL destinations) for a particular month -> Site, No. of Accesses, No. of Users, Total Bytes, % cache hit, % cache byte hit
2) Top 20 hosts for a particular month -> IP, No. of Sites, Total Bytes, % cache hit, % cache byte hit
3) Reports 1 & 2, but broken down by zone
4) Total Users, Bytes, Requests, % cache hit, % cache byte hit for a particular month
5) Report 4 for each squid server
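To show the shape of report 1, here is a sketch of the query against an assumed flat `access` table (the schema, column names, and use of SQLite are all my assumptions for illustration; the real schema depends on the DB you propose, and the same SQL would sit behind the PHP page):

```python
import sqlite3

# Assumed flat schema: one row per access.log line after loading.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE access (
    ts INTEGER, host_ip TEXT, site TEXT, bytes INTEGER,
    cache_hit INTEGER, server_ip TEXT, zone TEXT)""")

# Report 1: Top 20 sites for a month.
TOP_SITES = """
SELECT site,
       COUNT(*)                AS accesses,
       COUNT(DISTINCT host_ip) AS users,
       SUM(bytes)              AS total_bytes,
       100.0 * SUM(cache_hit) / COUNT(*)  AS pct_cache_hit,
       100.0 * SUM(CASE WHEN cache_hit THEN bytes ELSE 0 END)
             / SUM(bytes)                 AS pct_cache_byte_hit
FROM access
WHERE ts BETWEEN :month_start AND :month_end
GROUP BY site
ORDER BY total_bytes DESC
LIMIT 20
"""
```

Reports 2-5 follow the same pattern with different GROUP BY columns (host_ip, zone, server_ip); partitioning the table by day or month would keep the single-day keyword searches and monthly rollups from scanning the full history.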
When you bid, please message me what kind of DB you want to use. We might need your help to tune the DB installation later on if the performance turns out much lower than claimed or demonstrated.
A sample access.log is provided below.
The data loader should be a complete product - usable in production and running on FreeBSD 8.1.
The PHP file can be a simple script that demonstrates the DB queries.