Monday, September 16, 2013

A single-node error is hard to detect with external monitoring

I logged in to our pingdom.com account to check whether the scheduled maintenance had gone through without problems. It had, but I noticed several 404 errors.

I dumped the error statuses for the last week into a .csv file, and a short analysis showed a large number of 404 errors since the maintenance window. Clearly something was wrong.

$ head /tmp/down_prod_analysys.csv 
Status;Date and Time;Error
unconfirmed_down;2013-09-16 09:05:29;HTTP Error 404
unconfirmed_down;2013-09-16 09:02:29;HTTP Error 404
unconfirmed_down;2013-09-16 08:56:29;HTTP Error 404
unconfirmed_down;2013-09-16 08:55:29;HTTP Error 404
down;2013-09-16 08:52:41;HTTP Error 404

Analysis

$ for i in $(seq -w 8 16); do
       ## useful tip: -w pads numbers with leading zeros to equal width (the width depends on the seq end value)
       echo -n "$i September: ";
       ## -n: do not print the trailing \n
       grep -c "09-$i" /tmp/down_prod_analysys.csv;
       ## -c: count matching lines (same result as piping grep into wc -l)
  done

08 September: 0
09 September: 4
10 September: 1
11 September: 5
12 September: 5
13 September: 2
14 September: 0
15 September: 237
16 September: 126 
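
The loop above scans the file once per day. As a minimal sketch, the same tally can be done in a single pass with awk, assuming the semicolon-separated layout shown in the head output (status; date and time; error). The function name count_per_day is made up for this illustration. Unlike the loop, it only lists days that have at least one entry.

```shell
# Hypothetical helper: count log entries per day in a Pingdom-style CSV export.
# Assumes ';' as the separator and "YYYY-MM-DD HH:MM:SS" in the second column.
count_per_day() {
    awk -F';' '
        NR > 1 {                      # skip the header line
            split($2, dt, " ")        # dt[1] = date, dt[2] = time
            count[dt[1]]++
        }
        END {
            for (d in count) print d, count[d]
        }
    ' "$1" | sort                     # awk iteration order is unspecified, so sort by date
}

# Usage (path taken from the post):
#   count_per_day /tmp/down_prod_analysys.csv
```

Because the date is taken from the second column only, the pattern cannot accidentally match other parts of a line.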

Some requests to the homepage returned 404 and some did not.
What is wrong is that pingdom.com set the status to unconfirmed, although there clearly was some kind of problem.

Unconfirmed

Pingdom retests immediately from a different node to confirm an error. Most of these retries returned OK. How is that possible? I ran a quick local check myself to reproduce it:

$ for i in `seq 1 20`; do
     wget -S 'http://our_home.page' -a log.txt; 
     ## -S prints the server's response headers, my favorite option in wget
     ## -a appends to logfile so that all output is stored
  done
$ grep ERROR log.txt 
2013-09-16 09:44:22 ERROR 404: Not Found.
2013-09-16 09:44:23 ERROR 404: Not Found.
2013-09-16 09:44:25 ERROR 404: Not Found.
2013-09-16 09:44:27 ERROR 404: Not Found.
$ 

20% of the requests (4 out of 20) got a 404 error. Maybe one application node or one of the HTTP servers has a problem?

It turned out to be exactly that: one of the application servers was misconfigured and always returned 404.
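
For repeating this kind of check, here is a rough sketch that tallies the status codes instead of grepping a log. It assumes curl is available. The function names probe and tally_status_codes are invented for this example, and the URL is the placeholder used above.

```shell
# Hypothetical helper: request a URL N times and print one HTTP status code per line.
probe() {
    url=$1
    n=$2
    i=0
    while [ "$i" -lt "$n" ]; do
        # -s: silent, -o /dev/null: discard the body, -w: print only the status code
        curl -s -o /dev/null -w '%{http_code}\n' "$url"
        i=$((i + 1))
    done
}

# Hypothetical helper: read status codes from stdin and print "code count" per code.
tally_status_codes() {
    sort | uniq -c | awk '{ printf "%s %d\n", $2, $1 }'
}

# Usage:
#   probe 'http://our_home.page' 20 | tally_status_codes
```

With round-robin balancing, a steady fraction of failures that roughly matches one over the number of backends is a hint that a single node is at fault.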

Lessons learned

  • In a round-robin load-balanced architecture, an unconfirmed error does not mean the service works.
  • Take time to go through your logs.
  • If in doubt, investigate.

Monday, September 9, 2013

Weird IllegalArgumentException in HashMap constructor

Exception in thread "main" java.lang.IllegalArgumentException: Illegal load factor: 0.0
        at java.util.HashMap.<init>(HashMap.java(Compiled Code))
        at java.util.HashMap.<init>(HashMap.java(Inlined Compiled Code))
        at pl.my_emploee_data.<init>(Minute.java(Compiled Code))
....

This is a bug I was assigned to fix. I first looked at the Minute.java code and found this in the constructor:

this.servers = new HashMap(Limits.SERVERS);

What could be wrong with suggesting an initial hash map size, right?

The JVM

My initial thought was: maybe it's not SUN (Oracle) Java and some incompatibilities occur. Maybe the HashMap constructor parameter was misinterpreted? I checked and found IBM Java 1.4.
It is known to have incompatibilities, but at least it's not the OpenJVM, which simply does not work with most of our code.

The fix

I just removed the initial size parameter, since the default capacity (16) isn't that much different.

I was tempted to say 'premature optimization is the root of all errors', but actually this isn't the case.
Something must have broken the working code. Probably some fix pack, patch or change in the system.
Somehow the default 0.75 load factor must have been overridden to 0.0.

There are some hints in http://www-01.ibm.com/support/docview.wss?uid=swg21610313 about -Djdk.map.althashing.threshold
Since the code works now and the issue seems to be maintenance-related, I only notified the hosting staff so they can look into it.

If, however, someone has an idea where we might have made such a mistake, please let me know.
(grep jdk.map.althashing.threshold yields nothing)


UPDATE: It seems that using the default constructor (without a size) was not a very good idea. The application ran out of memory and I had to go back to specifying the initial size. This time I also specified the load factor explicitly, overriding the problematic value:

this.servers = new HashMap(Limits.SERVERS, 0.75f);

Lessons learned:

  • running code on a different JVM than the one normally used or tested against makes errors more likely
  • fix the code with minimum impact
  • admit to yourself that things are not always that simple

Friday, July 12, 2013

Do not assume the existence of any data when creating an uptime monitoring sensor.

AlertFox

I've been evaluating the AlertFox monitoring service lately, and I like it a lot.
It has awesome features that instantly beat services like pingdom.com.
I'm able to do anything on my site that a real user can, so JavaScript pitfalls are not a problem.

I also get a screenshot of the problematic situation, which is priceless in the case of a 500 error: it contains the error_id that leads programmers to the stack trace. Pretty useful, right?

To monitor whether the website was working properly I created a script that:
  • opens the website and uses the search bar
  • checks whether the product was found

It worked like a charm until yesterday at 10:00 AM. Then I got an alert e-mail saying that the site was down 50% of the time, so I went to customer service to notify them of the problem.
Sadly, it turned out to be a false alarm: the product was no longer available because it had been deleted.

Lessons learned

  • Do not assume that data or (editable) labels will stay constant when creating an uptime monitoring sensor
  • Rely only on code features instead, and even then watch out for system updates
  • Simpler is again better

Friday, March 22, 2013

The universe works against us: entropy!

I've been reading Stephen Hawking's The Theory of Everything this morning. He explains the entropy of black holes.

Reading between the lines, I understood that entropy, taken as chaos or lack of order, rises constantly. It rises because time elapses...

I was enlightened: a project or codebase left alone will get worse over time if we do nothing.
The simple act of abstaining from action, a lack of management, a lack of trying to bring order, makes things worse.
This of course is just an analogy, not a law. But let's examine it.


Lack of action = lack of order

Example 1
The team works hard on developing the system. In the meantime, the acceptance test phase takes place, and 50 bugs are reported.
The team continues development and neglects the bugs: "we'll do it later".

That simple decision makes things worse. How?
* broken windows (The Pragmatic Programmer)
* the programmer no longer feels responsible for delivering working code, since some things do not work already
* overall quality drops rapidly because of that attitude
Lack of constant quality requirements (lack of order) makes things worse.

Example 2
The team has been working on a project for several months now, and 150 bug/improvement issues are due. The project is near the deadline. The huge amount of work is discouraging: no light at the end of the tunnel, no hope of doing a good job.
For political reasons, dropping some functionality in a trade-off for quality is not going to happen. That would be a great, wise decision, but such wisdom would require a single, strong leader. This isn't happening in big bank corporations (our client is one).

I proposed a rearrangement of tasks for developers in yesterday's article; here is the summary:
A developer is required to finish a whole process or part of the system: develop all changes, fixes and improvements. He/she then signals: "that part of the system is done".
The amount of work does not change, but the "getting work done" attitude gets a huge positive kick. Hope is restored.
Moreover, even if not everything can be done before the deadline, at least most parts of the system will work perfectly.

This is another example of how the simple act of ordering tasks brings quality to the project.

And of how abstaining from action brings more trouble.

Entropy is your enemy.

Do something, manage some change, bring order, rethink tasks... or face failure.

Thursday, March 21, 2013

Improving productivity when a project gets messed up

My team is in the middle of a several-month-long development of a website for a bank client. The project had several stages; we are currently on the last one.

The system is soon to be opened to the world, yet quality is still poor.
There is not a single part of the system of which we could say "it works".
As a tester, I feel it's my duty to improve overall quality.

Overloaded team

The team seems to be overloaded with Jira tasks. There are three kinds of them:
* totally new features (agreed upon with our client, and paid for)
* bug fixes
* improvements to existing features

The current development mode could be summarised as: "develop new features, and we'll get back to bugs later".

My first approach (after raising an eyebrow and doing some breathing exercises to calm myself) was: "Let's not break the system. Please, let's have overall quality as the first goal". This was rejected by the team.
Mainly because the project would be a political failure, should we fail to deliver 100% of the requested functionality. I asked several times whether 100% of the functionality must actually work, and it seemed that "it should" :-)

Getting parts of the system done

Today I proposed another approach. When a developer changes part of the system (a screen, or a process), he/she should:
* read the specification (the official document detailing how the system works, the design) and make sure that the particular feature works exactly as described
* look through the Jira issues, then find and resolve all tasks related to the given feature/screen/process

After that, no improvements or changes are allowed. That particular feature is finished. Sure, there might be bugs, but no changes are allowed.

This way, we could get small but important quality improvements with each new system version (every 2 days). This way, the system would finally work properly someday.

This is only a change of view

Developers still have the same amount of work to do. But my approach fixes the "context switch" problem and, even more importantly, leaves a feeling of a job being done. Some parts of the system may now be ticked off as done.
The team gets visibly closer and closer to the final goal.

My hope is that this method gets accepted...