I logged into our pingdom.com account to check whether the scheduled maintenance had been performed without problems. It had, but I noticed several 404 errors.
I dumped the error statuses for the last week into a .csv file, and a short analysis showed a large number of 404 errors since the maintenance period. Clearly something was wrong.
$ head /tmp/down_prod_analysys.csv
Status;Date and Time;Error
unconfirmed_down;2013-09-16 09:05:29;HTTP Error 404
unconfirmed_down;2013-09-16 09:02:29;HTTP Error 404
unconfirmed_down;2013-09-16 08:56:29;HTTP Error 404
unconfirmed_down;2013-09-16 08:55:29;HTTP Error 404
down;2013-09-16 08:52:41;HTTP Error 404
Analysis
$ for i in `seq -w 8 16`; do ## useful tip: -w outputs numbers padded with zeros for equal width (depends on seq end)
    echo -n "$i September: "; ## -n: do not print \n
    grep "09-$i" /tmp/down_prod_analysys.csv | wc -l; ## wc -l: count how many lines
  done
08 September: 0
09 September: 4
10 September: 1
11 September: 5
12 September: 5
13 September: 2
14 September: 0
15 September: 237
16 September: 126
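The same count can be done in one pass and split by status, which shows how many of the failures were only "unconfirmed". This is a minimal sketch, assuming the export keeps the `Status;Date and Time;Error` column layout shown above; the function name `count_per_day_status` is made up for illustration. Unlike the loop, it does not print days with zero errors.

```shell
#!/bin/sh
# count_per_day_status FILE: print "date status count" for every day and
# status found in a semicolon-separated export (Status;Date and Time;Error).
# Assumes the first line of FILE is the header row.
count_per_day_status() {
    awk -F';' '
        NR > 1 {                        # skip the header row
            split($2, dt, " ")          # dt[1] = date, dt[2] = time
            count[dt[1] " " $1]++       # key: "date status"
        }
        END { for (k in count) print k, count[k] }
    ' "$1" | sort
}

# Example call (path as used above):
# count_per_day_status /tmp/down_prod_analysys.csv
```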
Some requests to the homepage were returning 404 and some were not.
What bothered me is that pingdom.com marked these failures as unconfirmed, although there clearly is some kind of problem.
Unconfirmed
Pingdom retests immediately from a different node to confirm an error. Most of these retries returned OK results. How is that possible? I made a quick local check myself to reproduce it:
$ for i in `seq 1 20`; do
    wget -S 'http://our_home.page' -a log.txt; ## -S shows the server response headers, my favorite function in wget
    ## -a appends to logfile so that all output is stored
  done
$ grep ERROR log.txt
2013-09-16 09:44:22 ERROR 404: Not Found.
2013-09-16 09:44:23 ERROR 404: Not Found.
2013-09-16 09:44:25 ERROR 404: Not Found.
2013-09-16 09:44:27 ERROR 404: Not Found.
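If only the status codes matter, the check can be reduced to a tally of codes. The sketch below swaps wget for curl and is not what I ran at the time; `probe` and `tally` are names invented for this example, and `our_home.page` is the same placeholder host as above.

```shell
#!/bin/sh
# probe URL N: request URL N times and print one HTTP status code per line.
probe() {
    url=$1
    n=$2
    i=0
    while [ "$i" -lt "$n" ]; do
        # -s: silent, -o /dev/null: discard the body,
        # -w: print only the status code followed by a newline
        curl -s -o /dev/null -w '%{http_code}\n' "$url"
        i=$((i + 1))
    done
}

# tally: read status codes on stdin and print "code count", sorted by code.
tally() {
    sort | uniq -c | awk '{ print $2, $1 }'
}

# Example call:
# probe 'http://our_home.page' 20 | tally
```

A healthy site should print a single line for code 200; a second line for 404 points at an intermittent failure like the one here.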
4 of 20 requests (20%) got a 404 error. Maybe one application node or one of the HTTP servers has a problem?
It turns out that is indeed the case: one of the application servers is misconfigured and always returns 404.
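One way to find the broken node is to bypass the load balancer and ask each backend directly. This is only a sketch under assumptions: the backend names `app1` to `app3` and port 8080 are hypothetical, and it assumes the backends answer plain HTTP when sent the public `Host` header.

```shell
#!/bin/sh
# check_nodes HOST NODE...: request "/" from each backend directly, sending
# the public Host header, and print "node status_code" per line.
check_nodes() {
    host=$1
    shift
    for node in "$@"; do
        code=$(curl -s -o /dev/null -w '%{http_code}' -H "Host: $host" "http://$node/")
        echo "$node $code"
    done
}

# failing: read "node status_code" lines and print the nodes whose
# status code is not 200.
failing() {
    awk '$2 != 200 { print $1 }'
}

# Example call; app1..app3 are hypothetical backend names:
# check_nodes our_home.page app1:8080 app2:8080 app3:8080 | failing
```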
Lessons learned
- In round-robin load-balancing architectures, an unconfirmed error does not mean the service works.
- Take time to go through your logs.
- If in doubt, investigate.