Monday, September 16, 2013

A single-node error is hard to detect with external monitoring

I logged in to our pingdom.com account to check whether the scheduled maintenance had gone through without problems. It had, but I noticed several 404 errors.

I dumped the error statuses for the last week into a .csv file, and a short analysis showed a large number of 404 errors since the maintenance window. Clearly something was wrong.

$ head /tmp/down_prod_analysys.csv 
Status;Date and Time;Error
unconfirmed_down;2013-09-16 09:05:29;HTTP Error 404
unconfirmed_down;2013-09-16 09:02:29;HTTP Error 404
unconfirmed_down;2013-09-16 08:56:29;HTTP Error 404
unconfirmed_down;2013-09-16 08:55:29;HTTP Error 404
down;2013-09-16 08:52:41;HTTP Error 404

Analysis

$ for i in `seq -w 8 16`; do 
       ## useful tip: -w outputs numbers padded with zeros for equal width (depends on seq end)
       echo -n "$i September: "; 
       ## -n: do not print \n
       grep "09-$i" /tmp/down_prod_analysys.csv | wc -l;
       ## wc -l: count how many lines
  done

08 September: 0
09 September: 4
10 September: 1
11 September: 5
12 September: 5
13 September: 2
14 September: 0
15 September: 237
16 September: 126 
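
The per-day counts can also be produced in a single pass over the file. This is a minimal sketch, assuming the export keeps the semicolon-separated layout shown by head above. The name count_errors_per_day is made up for this post, and unlike the seq loop it only lists days that have at least one entry:

```shell
# Count logged errors per calendar day in one pass over a Pingdom-style
# export (semicolon-separated: Status;Date and Time;Error).
# count_errors_per_day is a hypothetical helper name.
count_errors_per_day() {
    awk -F';' '
        NR > 1 && NF >= 2 {        # skip the header line and blank lines
            split($2, ts, " ")     # "2013-09-16 09:05:29" -> ts[1] is the date
            count[ts[1]]++
        }
        END {
            for (day in count) print day, count[day]
        }
    ' "$1" | sort                  # awk array order is unspecified, so sort by date
}
```

$ count_errors_per_day /tmp/down_prod_analysys.csv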

Some requests to the homepage returned 404 and some did not.
What is troubling is that pingdom.com marked these as unconfirmed, although there clearly was some kind of problem.

Unconfirmed

Pingdom immediately retests from a different node to confirm an error. Most of these retests returned OK. How is that possible? I ran a quick local check to reproduce it:

$ for i in `seq 1 20`; do
     wget -S 'http://our_home.page' -a log.txt; 
     ## -S prints the server response headers, my favorite option in wget
     ## -a appends to logfile so that all output is stored
  done
$ grep ERROR log.txt 
2013-09-16 09:44:22 ERROR 404: Not Found.
2013-09-16 09:44:23 ERROR 404: Not Found.
2013-09-16 09:44:25 ERROR 404: Not Found.
2013-09-16 09:44:27 ERROR 404: Not Found.
$ 
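
The same check could be tallied per status code instead of grepping a log. This is a sketch, assuming curl is installed; tally_codes is a hypothetical helper, not part of any tool:

```shell
# Read one HTTP status code per line on stdin and print, for each code,
# how many times it was seen and its share of all requests.
# tally_codes is a hypothetical helper name.
tally_codes() {
    awk '
        NF { seen[$1]++; total++ }     # ignore empty lines
        END {
            for (code in seen)
                printf "%s %d/%d (%.0f%%)\n", code, seen[code], total, 100 * seen[code] / total
        }
    ' | sort
}
```

$ for i in $(seq 1 20); do curl -s -o /dev/null -w '%{http_code}\n' 'http://our_home.page'; done | tally_codes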

20% of the requests (4 out of 20) got a 404 error. Maybe one application node or one of the HTTP servers has a problem?

It turned out to be exactly that: one of the application servers was misconfigured and always returned 404.
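
One way to find such a node is to bypass the load balancer and ask every backend directly. This is only a sketch: it assumes the backends are reachable over plain HTTP from where you run it, the node addresses below are placeholders, and fetch_status and check_nodes are names made up for this post:

```shell
# Hypothetical helper: print only the HTTP status code returned by one
# backend ($1), sending the public hostname ($2) in the Host header so
# name-based virtual hosting still matches.
fetch_status() {
    curl -s -o /dev/null --max-time 10 -w '%{http_code}' -H "Host: $2" "http://$1/"
}

# Print "<node> <status>" for every node given after the hostname.
# Returns non-zero if any node answered with something other than 200.
check_nodes() {
    host=$1; shift
    rc=0
    for node in "$@"; do
        code=$(fetch_status "$node" "$host")
        echo "$node $code"
        [ "$code" = "200" ] || rc=1
    done
    return $rc
}
```

$ check_nodes our_home.page 10.0.0.1 10.0.0.2 10.0.0.3   ## placeholder addresses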

Lessons learned

  • In round-robin load-balanced architectures, an unconfirmed error does not mean the service works.
  • Take time to go through your logs.
  • If in doubt, investigate.