Monitoring: better safe than sorry…

Stumbling upon the Holy time-travellin’ DRBD, batman! blog post there’s only one thing to be said …

Be strict in what you emit, liberal in what you accept1

is simply not true when dealing with mission-critical systems.

It’s ok to be alerted on upgrading a machine because the “old, working” RegEx that did the parsing doesn’t match anymore2; it’s not a problem to get an email when someone adds the 100th DRBD resource and causes the grep to fail; and so on.

Better to have a few false positives when you’re actively changing things than to get a false negative that costs you months of data; that’s what an assert (and monitoring isn’t that different) is for, after all.

Keep monitoring strict, and let it fail loudly on unexpected things – after the first few occurrences they’re not unexpected anymore and can be dealt with.

  1. Thanks, Larry
  2. eg. because /proc/drbd got an additional field

Leave a Reply

Your email address will not be published. Required fields are marked *

What is 11, multiplied by 4 ?
Please leave these two fields as-is:
IMPORTANT! To be able to proceed, you need to solve the following simple math (so we know that you are a human) :-)