I've recently moved from web site testing to business data analysis.
One of my taks involved searchig for number of visits in apache access.log. Gzipped. For period of a year. On a popular MLM (multi-level marketing) website.
It took 10h to complete a search for a single pattern. I said we can do better and went for writting multi-threaded grep. I used bash because it was simples thing to compated with expected improvement. I was also immediately notified that hard drives won't handle much (searches are run on machine for logs backup) so there is no point in squizing power of CPUs, because I/O wait will kill any effort.
How do you do multi-threading in bash?
You have a loop over thousands of log files in which you run zgrep command. Just send it into background, right?
No - that will create so much processes that task switching will kill any improvements.
You can't really run more than 4 searches per CPU. That is still a lot if you have 8 CPUS :)
I've created a variable that incremented++ every time a new file was processed.
After reaching a limit of 4xCPU count, the script would wait
for all of zgreps to finish.
Then we start from beginning - another 32 processes.
This is suboptimal and does not even touch things like thread pool.
But it still improved search time 6 times.
With that solution I've hit hard drives limit (I/O wait was cause of load) and no further optimization was possible.
Further steps
I'm thinking about indexing logs with number of visits per month per URL.
Out of curiosity I'm tempted to write a thread pool based solution in python.
When searching for several patterns, I would immediately benefit from finding a common part and searching for that part - so that the heaviest part of looking at every line of access.log is not repeated.