Friday, April 17, 2015

Multi-threaded grep in bash

I've recently moved from web site testing to business data analysis.
One of my tasks involved searching for the number of visits in Apache access.log files. Gzipped. Covering a whole year. On a popular MLM (multi-level marketing) website.

It took 10 hours to complete a search for a single pattern. I said we can do better and set out to write a multi-threaded grep. I used bash because it was the simplest tool compared with the expected improvement. I was also immediately warned that the hard drives won't handle much (the searches run on the log-backup machine), so there is no point in squeezing power out of the CPUs - I/O wait will kill any effort.

How do you do multi-threading in bash?

You have a loop over thousands of log files in which you run a zgrep command. Just send it to the background, right?
No - that would create so many processes that task switching would kill any improvement.
You can't really run more than 4 searches per CPU. That is still a lot if you have 8 CPUs :)

I created a counter that was incremented every time a new file was dispatched.
After reaching the limit of 4x the CPU count, the script would wait for all the zgreps to finish.
Then it starts over from the beginning - another 32 processes.
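
A minimal sketch of that batching trick (the log file naming, the /tmp/counts output directory and the single-pattern handling are illustrative assumptions, not the original script):

#!/bin/bash
# Count matches of one pattern across gzipped logs, at most 4 zgreps per CPU at a time.
PATTERN="$1"
LIMIT=$(( 4 * $(nproc) ))        # e.g. 32 on an 8-CPU box
mkdir -p /tmp/counts
running=0

for f in access.log-*.gz; do
    zgrep -c "$PATTERN" "$f" > "/tmp/counts/$(basename "$f").cnt" &
    running=$((running + 1))
    if [ "$running" -ge "$LIMIT" ]; then
        wait                     # let the whole batch finish before starting the next one
        running=0
    fi
done
wait                             # catch the last, incomplete batch
cat /tmp/counts/*.cnt | awk '{ sum += $1 } END { print sum }'

GNU xargs -P would give a simpler bounded worker pool (ls access.log-*.gz | xargs -P "$LIMIT" -n 1 zgrep -c "$PATTERN"), but the explicit counter mirrors the batching described above.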

This is suboptimal and does not even come close to a proper thread pool.

But it still made the search 6 times faster.
With that solution I hit the hard drives' limit (I/O wait was the cause of the load) and no further optimization was possible.
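
For the record, that I/O bottleneck is easy to confirm while the searches are running, with standard tools (iostat ships with the sysstat package):

vmstat 1      # the "wa" column: percentage of CPU time spent waiting for I/O
iostat -x 1   # per-device stats; %util close to 100 means the disks are saturated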

Further steps

I'm thinking about indexing the logs with the number of visits per month per URL.
Out of curiosity, I'm also tempted to write a thread-pool-based solution in Python.
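
For the indexing idea, even a dumb pre-aggregated text file would do. A sketch, assuming the standard combined log format (timestamp in field 4, request path in field 7 - worth verifying against the real logs):

# one-off aggregation: visit count per month per URL
zcat access.log-*.gz \
  | awk '{ split($4, d, "/"); year = d[3]; sub(/:.*/, "", year);
           print year "-" d[2], $7 }' \
  | sort | uniq -c > /tmp/visits_per_month_per_url.idx

# later queries hit the small index instead of a year of gzipped logs
grep " 2015-Apr " /tmp/visits_per_month_per_url.idx | sort -rn | head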

When searching for several patterns, I would immediately benefit from finding a common part and searching for that first - so that the heaviest part, scanning every line of access.log, is not repeated.
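
A hedged sketch of that idea (the patterns and the shared fragment are invented for illustration): pay the decompress-and-scan cost once for the common part, then run the cheap per-pattern counts on the much smaller intermediate file.

# the one expensive pass: every gzipped line is inflated and scanned just once
zgrep -h "campaign_id=" access.log-*.gz > /tmp/campaign_lines.txt

# the cheap passes: per-pattern counts over the pre-filtered file
grep -c "campaign_id=1234" /tmp/campaign_lines.txt
grep -c "campaign_id=5678" /tmp/campaign_lines.txt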

Sunday, October 12, 2014

Saving a screenshot after each step in the py.test framework that runs Selenium

Testing websites with Selenium is fun. The py.test framework is really cool.
Joining those two and running tests inside the PyCharm IDE makes you wanna cry with joy. That is, if you previously worked with iMacros :)

But sometimes testing takes hours, runs from crontab, and you only check the results afterwards. A TimeoutException or NoSuchElementException tells you nothing if you haven't seen what the browser state was. A screenshot would really be helpful.

Actually, a screenshot taken after each test, regardless of the result, would be helpful too - for example to detect layout errors or errors not covered by asserts.
I found a pretty simple way to do this. We will need a fixture that

  • is used automatically with every test step
  • calls the selenium driver to make a screenshot

I have not yet found a way to learn the test outcome and use it when taking the screenshot (like naming the file with a ``FAIL`` prefix or so on).

The key feature here is that our screenshot-making fixture uses the selenium driver fixture - and it is the very same fixture instance that is provided to the test functions.

Say we have a selenium driver fixture that is module-scoped and parametrized with a list of countries. That is, a new browser session is delivered to the module's functions once per country. Each test module gets a browser and is run in a loop over the countries given in params=:

__author__ = 'sigviper'

import os
import pytest
import datetime

@pytest.fixture(scope="module", params=["pl", "de"])
def driver(request):
    """Selenium driver that creates a loop by countries"""
    class FirefoxImprovedBySigviper(object):
        def screenshot(self, request_for_test_function, session_test_timestamp):
            country = request.param     # module-wise request (each param creates new)
            out_dir = "/tmp/shutter-{timestamp}/{country}/".format(timestamp=session_test_timestamp, country=country)
            try:
                os.makedirs(out_dir)
            except OSError:
                pass

            fname = "{out_dir}/{country}_{module}_{function}_{timestamp}.log".format(
                country=country,
                out_dir=out_dir,
                module=request_for_test_function.module.__name__,
                function=request_for_test_function.function.__name__,
                timestamp=get_timestamp()
            )
            # sandbox stand-in for a real screenshot: just record which driver instance took it
            with open(fname, "ab") as f:
                f.write(repr(self) + "\n")

    return FirefoxImprovedBySigviper()


@pytest.fixture(scope="function", autouse=True)
def shutter(request, driver, session_test_timestamp):
    """Screenshot-making fixture"""
    def fin():
        driver.screenshot(request, session_test_timestamp)

    request.addfinalizer(fin)

def get_timestamp():
    """Provide formatted current timestamp"""
    return datetime.datetime.now().strftime("%Y-%m-%d_%H:%M:%S")

@pytest.fixture(scope="session", autouse=True)
def session_test_timestamp(request):
    """Provide timestamp of test start time - fixed for a session"""
    return get_timestamp()

Now the driver must implement the screenshot logic. Please take a careful look at the usage of the request and request_for_test_function variables - they are not the same. The request is the argument of the driver fixture, captured (via closure) inside the FirefoxImprovedBySigviper class. The request_for_test_function parameter is passed to the screenshot method by the shutter fixture, which runs for every test function and takes the screenshot in its finalizer. The scope of those variables within the test session is different. The module-scoped request knows nothing about the current function. On the other hand, the function-scoped request_for_test_function knows nothing about the current param of the loop in driver. Consequently, we need to use both.

Order of execution:

create session-wide timestamp and provide to all modules and functions
for each module create a Selenium driver:
    for each parameter in driver run all test functions in a module:
       save screenshot after each function call

Let's say we have two test modules, TestCatching and TestCopycatch, each with three similarly named functions: *_ok, *_failed and *_raises.
The test session in PyCharm looks like this:

The disk output with the screenshots (here .log files, because I needed a simple and quick sandbox for testing) looks like this:

viper@OptiPlex780:/tmp$ ls -lrctR shutter-2014-10-12_16\:16\:07/
shutter-2014-10-12_16:16:07/:
razem 8
drwxrwxr-x 2 viper viper 4096 paź 12 16:16 pl
drwxrwxr-x 2 viper viper 4096 paź 12 16:16 de

shutter-2014-10-12_16:16:07/pl:
razem 24
-rw-rw-r-- 1 viper viper 57 paź 12 16:16 pl_test_catching_test_1ok_2014-10-12_16:16:08.log
-rw-rw-r-- 1 viper viper 57 paź 12 16:16 pl_test_catching_test_2assert_failed_2014-10-12_16:16:09.log
-rw-rw-r-- 1 viper viper 57 paź 12 16:16 pl_test_catching_test_3raises_2014-10-12_16:16:10.log
-rw-rw-r-- 1 viper viper 57 paź 12 16:16 pl_test_copycatch_test_1copy_ok_2014-10-12_16:16:16.log
-rw-rw-r-- 1 viper viper 57 paź 12 16:16 pl_test_copycatch_test_2copy_assert_failed_2014-10-12_16:16:18.log
-rw-rw-r-- 1 viper viper 57 paź 12 16:16 pl_test_copycatch_test_3copy_raises_2014-10-12_16:16:19.log

shutter-2014-10-12_16:16:07/de:
razem 24
-rw-rw-r-- 1 viper viper 57 paź 12 16:16 de_test_catching_test_1ok_2014-10-12_16:16:11.log
-rw-rw-r-- 1 viper viper 57 paź 12 16:16 de_test_catching_test_2assert_failed_2014-10-12_16:16:12.log
-rw-rw-r-- 1 viper viper 57 paź 12 16:16 de_test_catching_test_3raises_2014-10-12_16:16:13.log
-rw-rw-r-- 1 viper viper 57 paź 12 16:16 de_test_copycatch_test_1copy_ok_2014-10-12_16:16:22.log
-rw-rw-r-- 1 viper viper 57 paź 12 16:16 de_test_copycatch_test_2copy_assert_failed_2014-10-12_16:16:24.log
-rw-rw-r-- 1 viper viper 57 paź 12 16:16 de_test_copycatch_test_3copy_raises_2014-10-12_16:16:25.log

All screenshots saved. Easy to find and sort. Names match the country and the test name. In case of an error, the timestamp is saved (it sometimes helps developers narrow down time periods in the logs).

I like it. I'll just copy&paste it into my working code on Monday morning! :)

Thursday, July 31, 2014

Quality of software vs. quality of trucks

When a company has to choose between two trucks (a cheaper and a more expensive one), the decision is pretty straightforward. One is cheaper, of lower quality - the other offers quality and reliability. Some companies will choose the cheaper one, most will probably go with reliability.

This is an easy decision, a basic competence of any company owner or manager - ask them about such a choice and within a second they start talking.

This is not the case with software

Most people ordering software cannot even grasp what quality and reliability of software are. With trucks you have publicly available reliability reports.
Evaluating truck quality is common knowledge. Evaluating the reliability of software is an area clouded with misconceptions and poorly chosen measures.

Site availability in % per month is one of the greatest tools for evaluating quality. But it does not tell you whether a user could even log in to the system during all that time.

You then move on to measuring process availability in % per month, which is difficult and needs special agreements between companies.

The deeper you go into trying to understand quality and reliability, the bigger the obstacles you face.

This is why we stick with known errors or problems

Simple accounting tells us some fixes are too expensive to be made. Too expensive means: it's cheaper to keep fixing the bugs caused by that problem than to fix the problem itself.

Simple accounting cannot account for software reliability, the end-user experience and so on.

Possible measures

By investing in Real User Monitoring we could reach a real level of confidence in software quality. The number of error statuses per hour gives you a pretty good picture of whether the processes work or not.

Other ideas:

  1. RUM: number of error pages per hour for logged-in users
  2. window.onerror -> log the problem via AJAX for reporting and analysis
  3. number of errors in the system logs (assuming Java here, you need to count stacktraces) - a rough sketch follows below
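
A back-of-the-envelope, server-side take on measures 1 and 3 (real RUM runs in the browser; the file names and formats below are assumptions - an Apache-style access log with the status code in field 9, and a Java application log):

# HTTP 5xx responses per hour (field 4 holds a timestamp like [12/Oct/2014:16:00:00)
awk '$9 ~ /^5/ { print substr($4, 2, 14) }' access.log | sort | uniq -c

# very rough: count lines mentioning an exception in the application log
grep -c "Exception" application.log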

Processes for quality

No logging or reporting will improve the software if you do nothing with the gained knowledge. Logs need to be inspected and bugs need to be reported.
Then those bugs need to be fixed, and your client will have to agree to pay for this.

Information about found and fixed bugs needs to be available and transparent for both the development company and its clients.

Third-party testing raids and code inspections should be as common and obvious as asking another doctor for a second opinion or visiting an independent mechanic's workshop.
Are we ready for this?


Paying more for better, more reliable software is a difficult choice. Let's not blame managers for their inability to understand our point of view.

Yes, quality is expensive, as the saying goes. Let's forge this slogan into real procedures and measures.

Maybe one day simple accounting will tell the manager: you should pay more, because you'll earn more with better software.

Tuesday, February 11, 2014

Retest and close what you reported

A tester's work is to find bugs, right?
Not exactly, in my opinion: our job is to get rid of bugs and promote software quality.

Our job is to make sure bugs were fixed properly, with no new bugs introduced.
We should also make sure the fix is not lost, but is:
  • deployed to production environments
  • pushed to the right branch

Resolution:Fixed

I've often been misled by Resolution:Fixed and the "Fix version" field in JIRA. The bug might have the status "fixed", and:

  • .. is not fixed at all, or not good enough. Retest and Reopen!
  • .. is fixed, but not deployed (not committed/pushed etc.). Retest and Reopen!
  • .. is fixed by changing the system or deleting functionality. Retest and consult the system architects.
  • .. is really fixed. Retest and Close!

Please keep in mind that the issue is fixed when it's fixed on production and the development of further system versions includes that fix. What good is a fix if end users never see it, or the next system version does not include it (i.e. reintroduces the bug)?

How do we get rid of bugs then?

  1. We retest the fix on the test/staging environment and Close the issue in the bug tracker.
  2. We make sure that the proper repository branches include our fix. They might not if:
    1. The client requests that only selected fixes from the branch deployed to test are deployed to production.
    2. The fix is committed on a branch that is never merged into the branch that the next system version is built from.
  3. We periodically retest important fixes to make sure no regression has occurred.
  4. We need to understand the cause of the error and expand the test cases based on that knowledge - talk to the programmers, and demand an analysis/cause description in the issue comments. A "done" comment is not good enough.

Periodic retesting?

With 10 bugs every month, we have a whopping 120 test cases for each year of the project. That's quite a heap of things to retest, right?

To manage such a pile of important knowledge, we should document the test cases that need to be run regularly.
Some of them should enter our automated regression testing scenarios.

Automation of the regression testing should be based on an analysis of the causes of errors in the system. For example, if the search engine has trouble when filtering is employed - we should add such tests and run them often.

If we don't do this - nobody will

Because only we know in detail what was broken - only we are able to make sure it's fixed.
No programmer, chief programmer, project manager or developer is able to take up the task of getting rid of detailed system problems.

Monday, September 16, 2013

One node error is hard to detect by external monitoring

I logged in to our pingdom.com account to check whether the scheduled maintenance had been performed without problems. It had, but I noticed several 404 errors.

I dumped the error statuses for the last week into a .csv and, after a short analysis, found that there had been a great number of 404 errors since the maintenance period. Clearly something was wrong.

$ head /tmp/down_prod_analysys.csv 
Status;Date and Time;Error
unconfirmed_down;2013-09-16 09:05:29;HTTP Error 404
unconfirmed_down;2013-09-16 09:02:29;HTTP Error 404
unconfirmed_down;2013-09-16 08:56:29;HTTP Error 404
unconfirmed_down;2013-09-16 08:55:29;HTTP Error 404
down;2013-09-16 08:52:41;HTTP Error 404

Analysis

$ for i in `seq -w 8 16`; do 
       ## useful tip: -w outputs numbers padded with zeros for equal width (depends on seq end)
       echo -n "$i September: "; 
       ## -n: do not print \n
       grep "09-$i" /tmp/down_prod_analysys.csv | wc -l;
       ## wc -l: count how many lines
  done

08 September: 0
09 September: 4
10 September: 1
11 September: 5
12 September: 5
13 September: 2
14 September: 0
15 September: 237
16 September: 126 

Some requests to the homepage were returning 404 and some were not.
The odd part is that pingdom.com set the status to unconfirmed - while clearly there was some kind of problem.

Unconfirmed

Pingdom immediately retests from a different node to confirm an error. Most of these retries returned OK results. How is that possible? I made a quick local check myself to reproduce it:

$ for i in `seq 1 20`; do
     wget -S 'http://our_home.page' -a log.txt; 
     ## -S shows the server response headers, my favorite option in wget
     ## -a appends to logfile so that all output is stored
  done
$ grep ERROR log.txt 
2013-09-16 09:44:22 ERROR 404: Not Found.
2013-09-16 09:44:23 ERROR 404: Not Found.
2013-09-16 09:44:25 ERROR 404: Not Found.
2013-09-16 09:44:27 ERROR 404: Not Found.
$ 

20% of the requests got a 404 error. Maybe one application node or one of the HTTP servers has a problem?

It turns out it was indeed so: one of the application servers was misconfigured and always returned 404.

Lessons learned

  • An unconfirmed error does not mean the service works - not in round-robin load-balanced architectures.
  • Take time to go through your logs.
  • If in doubt, investigate.

Monday, September 9, 2013

Weird IllegalArgumentException in HashMap constructor

Exception in thread "main" java.lang.IllegalArgumentException: Illegal load factor: 0.0
        at java.util.HashMap.<init>(HashMap.java(Compiled Code))
        at java.util.HashMap.<init>(HashMap.java(Inlined Compiled Code))
        at pl.my_emploee_data.Minute.<init>(Minute.java(Compiled Code))
....

This is a bug I was assigned to fix. First I looked at the Minute.java code and found this in the constructor:

this.servers = new HashMap(Limits.SERVERS);

What could be wrong with suggesting an initial hash map size, right?

the JVM

My initial thought was: maybe it's not Sun (Oracle) Java and some incompatibilities occur. Maybe the HashMap constructor parameter was interpreted differently? I found IBM Java 1.4.
It is known to have incompatibilities, but at least it's not the OpenJVM, which simply does not work with most of our code.

the Fix

I just removed the initial size parameter, since the default size (16) isn't that much different.

I was tempted to say 'premature optimization is the root of all evil', but actually this isn't the case.
Something must have broken the working code. Probably some fix pack, patch or change in the system.
Somehow the default 0.75 load factor must have been overridden to 0.0.

There are some hints about -Djdk.map.althashing.threshold in http://www-01.ibm.com/support/docview.wss?uid=swg21610313
Since the code works now and the issue seems maintenance-related, I only notified the hosting staff about the problem so they can think it over.

If, however, someone has an idea where we might have made such a mistake, please let me know.
(grep for jdk.map.althashing.threshold yields nothing)


UPDATE: It seems that using the default constructor (without a size) was not a very good idea. The application ran out of memory and I had to go back to the initial size. I simply specified the load factor, overriding the problematic parameter:

this.servers = new HashMap(Limits.SERVERS, 0.75f);

Lessons learned:

  • running code on a different JVM than the one usually used for testing makes errors more probable
  • fix the code with minimum impact
  • admit to yourself that things are not always that simple

Friday, July 12, 2013

Do not assume the existence of any data when creating an uptime monitoring sensor.

AlertFox

I've been evaluating the AlertFox monitoring service lately, and I like it a lot.
It has awesome features that instantly kill services like pingdom.com.
I'm able to do anything on my site that a real user can - javascript pitfalls are not a problem.

I also get a screenshot of the problematic situation, which is priceless in case of a 500 error (it contains the error_id that leads programmers to the stacktrace. Pretty useful, right?).

To monitor whether the website was working properly, I created a script that:
  • enters the website and uses the search bar
  • evaluates whether the product was found

It worked like a charm until yesterday at 10:00 AM, when I got an alert e-mail saying that the site had been down 50% of the time. So I went to customer service with that info, to notify them of the problem.
Sadly, it turned out to be a false alarm. The product was no longer available - it had been deleted.

Lessons learned

  • Do not assume the existence of data or (editable) labels to be constant when creating an uptime monitoring sensor
  • Rely instead only on code features, and even then - watch out for system updates
  • Simpler is, again, better