Monday, September 16, 2013

One node error is hard to detect by external monitoring

I logged in into account to checkout if scheduled maintenance was performed without problems. It was, but I noticed several 404 errors.

I have dumped eror statuses into .csv for last week, and after short analysis found out that there is great number of 404 errors since the maintenance period. Clearly something was wrong.

$ head /tmp/down_prod_analysys.csv 
Status;Date and Time;Error
unconfirmed_down;2013-09-16 09:05:29;HTTP Error 404
unconfirmed_down;2013-09-16 09:02:29;HTTP Error 404
unconfirmed_down;2013-09-16 08:56:29;HTTP Error 404
unconfirmed_down;2013-09-16 08:55:29;HTTP Error 404
down;2013-09-16 08:52:41;HTTP Error 404


$ for i in `seq -w 8 16`; do 
       ## useful tip: -w outputs numbers padded with zeros for equal width (depends on seq end)
       echo -n "$i September: "; 
       ## -n: do not print \n
       grep "09-$i" /tmp/down_prod_analysys.csv | wc -l;
       ## wc -l: count how many lines

08 September: 0
09 September: 4
10 September: 1
11 September: 5
12 September: 5
13 September: 2
14 September: 0
15 September: 237
16 September: 126 

Some requests to homepage were returning 404 and some not.
What is wrong is that set the status to unconfirmed - clearly there is some kind of problem.


Pingdom retests immediately from different node to confirm a error. Most of there retries returned OK results. How is that possible? I made a quick local check myself to reproduce:

$ for i in `seq 1 20`; do
     wget -S '' -a log.txt; 
     ## -S shows request headers, my favorite function in wget
     ## -a appends to logfile so that all output is stored
$ grep ERROR log.txt 
2013-09-16 09:44:22 ERROR 404: Not Found.
2013-09-16 09:44:23 ERROR 404: Not Found.
2013-09-16 09:44:25 ERROR 404: Not Found.
2013-09-16 09:44:27 ERROR 404: Not Found.

20% of requests got 404 error. Maybe one application node or one of http servers has a problem?

Turns out it is indeed so, one of application servers is misconfigured and always returns 404.

Lessons learned

  • Unconfirmed error does not mean service works in round-robin load balancing architectures.
  • Take time to go through your logs.
  • If in doubt, investigate.

Monday, September 9, 2013

Weird IllegalArgumentException in HashMap constructor

Exception in thread "main" java.lang.IllegalArgumentException: Illegal load factor: 0.0
        at java.util.HashMap.( Code))
        at java.util.HashMap.( Compiled Code))
        at pl.my_emploee_data.( Code))

This is a bug I was assigned to fix. First tried to look at the code and found this in the constructor:

this.servers = new HashMap(Limits.SERVERS);

What could be wrong with suggesting initial hash map size, right?

the JVM

My initial thought was: maybe it's not SUN (Oracle) Java and some incompatibilities occur. Maybe the HashMap constructor parameter was misunderstood? Found IBM Java 1.4.
This is known to be incompatible, but it's not the OpenJVM that simply does not work with most of our code.

the Fix

Just removed the initial size parameters, since default (16) size isn't that much different.

I was tempted to say 'premature optimization is the root of all errors', but actually this isn't the case.
Something must have broken the working code. Probably some fix pack, patch or change in the system.
Somehow the default 0.75 load factor must have been overriden to 0.0.

There are some hints in about
Since the code works now and issue seems to be maintanence-like, I only notified hosting staff about the problem to think about.

If however someone has some idea where we might have made such a mistake, please let me know.
(grep yelds nothing)

UPDATE: It seems that using default constructor (without size) was not a very good idea. Application went out of memory and I had to go back to the initial size. Simply specified the load factor, overriding the problematic parameter:

this.servers = new HashMap(Limits.SERVERS, 0.75f);

Lessons learned:

  • running code on different JVM than usual or used for testing, makes error more probable
  • fix the code with minimum impact
  • admit to yourself that things are not always that simple