System Monitoring on the Cheap with TAP and Smolder

Like any self-respecting geek, I have a small network at home. It’s fairly well-behaved and stable, so I never really felt the burning urge to install a monitoring system. However, as I’ve been bitten by the full-partition surprise at 9:30 on a Saturday morning a few times lately, I’ve… come to reconsider that position a little bit.

Of course, the right solution would be to install a real monitoring system like, say, Nagios or Zabbix. Trying to reinvent the wheel, and in this case a fairly beefy wheel, would be thoroughly silly. But it’d also be fun and educative. So I decided to do it anyway.

It must be said that my simple setup helps. I don’t need by-the-minute monitoring and notification. I don’t need escalation procedures (the only escalation procedure we have here is my lovely dragon climbing the stairs to tell me the wireless network is down). I don’t need history graphs. I don’t need tungsten-coated SLA-enforcing mechanisms. I just need a battery of checks to run every day and something to notify me if one of them fails.

Part I – the Checks

First thing, we need something that is going to verify stuff, and is going to give us a report on whether the checks succeeded or failed. Sounds familiar, doesn’t it? So, why not leverage the good ol’ Perl testing ecosystem to do the deed?

One shortcoming of regular tests, though, is that a TAP test can only pass or fail; there’s no room for gradation of b0rkedness. Well, officially there’s no room. For my little system I’m bending the conventions a little bit and using TODO tests as a warning level: a failing TODO test is still reported, but it doesn’t make the whole run fail.

For convenience, I’m creating a test function that I’ll use to report the status of the checks:

package Dumuzi;
use 5.10.0;
use strict;
use warnings;
use parent qw/ Test::Builder::Module Exporter/;
our ( $OK, $WORRY, $PANIC ) = 0..2;
our @EXPORT = qw/ $OK $WORRY $PANIC &check /;
our $TODO;
sub check($$@) {
    my ( $title, $result, @diag ) = @_;
    my $tb = __PACKAGE__->builder;
    given ( $result ) {
        when ( $OK ) {
            $tb->ok( 1, $title );
        }
        when ( $WORRY ) {
            local $TODO = "test has crossed the worrying line";
            $tb->ok( 0, $title );
        }
        when ( $PANIC ) {
            $tb->ok( 0, $title );
        }
    }
    $tb->diag( @diag );
}
1;

(If you are curious, the name Dumuzi comes from the Babylonian theme of my local network.)
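A caveat about the warning level: in the sample run further down, the partition at 85% comes out as a plain "not ok", with no "# TODO" marker. A likely explanation is that Test::Builder looks for $TODO in the package of the calling test script rather than in Dumuzi, so the "local $TODO" above goes unnoticed. As a minimal sketch of one way around that, the variation below uses Test::Builder's todo_start and todo_end methods instead. To stay self-contained it defines check directly in a test script rather than in the module, and the check titles and diagnostics are made up for illustration.

```perl
use strict;
use warnings;
use Test::More;

my ( $OK, $WORRY, $PANIC ) = 0 .. 2;

sub check {
    my ( $title, $result, @diag ) = @_;
    my $tb = Test::More->builder;

    if ( $result == $WORRY ) {
        # Everything between todo_start and todo_end is treated as TODO,
        # so this failure is reported without failing the run.
        $tb->todo_start('check has crossed the worrying line');
        $tb->ok( 0, $title );
        $tb->todo_end;
    }
    else {
        # $OK passes, anything else (i.e. $PANIC) is a real failure.
        $tb->ok( $result == $OK, $title );
    }

    $tb->diag(@diag) if @diag;
    return;
}

# Made-up checks, for illustration only.
check( 'partition /example/ok',    $OK );
check( 'partition /example/worry', $WORRY, 'usageper => 85' );

done_testing;
```

Run under prove, the second check should show up with a "# TODO" marker and should not turn the overall result into a FAIL.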

With that, I can now write some test files. Say, one for the partitions of the local machine:

use 5.10.0;
use strict;
use warnings;
use Test::More tests => 5;                      # last test to print
use Dumuzi;
use Sys::Statistics::Linux::DiskUsage;
my $lxs  = Sys::Statistics::Linux::DiskUsage->new;
my $stat = $lxs->get;
my $worry_percent = 80;
my $panic_percent = 90;
while( my ( $partition, $stats ) = each %$stat ) {
    check_partition( $partition, $stats );
}
sub check_partition {
    my $partition = shift;
    my %stats = %{ shift( @_ ) };
    my $result = $stats{usageper} < $worry_percent ? $OK
               : $stats{usageper} < $panic_percent ? $WORRY
               :                                     $PANIC
               ;
    check "partition $partition", $result, explain \%stats;
}

and one that verifies that all my websites are alive:

use strict;
use warnings;
use Test::More;
use Test::WWW::Mechanize;
my %links = (
    'https://babyl.dyndns.org'          => 'nAB-zONE',
    'https://babyl.dyndns.org/techblog' => 'Hacking Thy Fearful Symmetry',
    'https://kontext.ca'                => 'Kontext.ca',
    'https://academiedeschasseursdeprimes.ca' =>
      "Acad\x{e9}mie des chasseurs de prime",
    'https://michel-lacombe.dyndns.org' => 'Michel Lacombe, cartoonist',
);
plan tests => 2 * keys %links;
my $mech = Test::WWW::Mechanize->new;
while ( my ( $url, $title ) = each %links ) {
    $mech->get_ok($url);
    $mech->title_is( $title, "title of $url" );
}

Assuming that our files are arranged in a pseudo-distribution fashion (utility module under lib and tests under t) we can now run all our tests with prove:

$ prove -v -m  -l t
t/partition-usage.t ..
1..5
ok 1 - partition /dev/sdb10
# {
#   'free' => '2730692',
#   'mountpoint' => '/home/yanick/Pictures/OOS',
#   'total' => '4922124',
#   'usage' => '1941400',
#   'usageper' => 42
# }
not ok 2 - partition /dev/sdb9
#   Failed test 'partition /dev/sdb9'
#   at t/partition-usage.t line 31.
# {
#   'free' => '692284',
#   'mountpoint' => '/home/yanick',
#   'total' => '9843184',
#   'usage' => '8650884',
#   'usageper' => 93
# }
ok 3 - partition none
# {
#   'free' => '739308',
#   'mountpoint' => '/lib/init/rw',
#   'total' => '739308',
#   'usage' => '0',
#   'usageper' => 0
# }
ok 4 - partition /dev/sda1
# {
#   'free' => '25729788',
#   'mountpoint' => '/',
#   'total' => '36827144',
#   'usage' => '9226592',
#   'usageper' => 27
# }
not ok 5 - partition /dev/sdb11
#   Failed test 'partition /dev/sdb11'
#   at t/partition-usage.t line 31.
# {
#   'free' => '747364',
#   'mountpoint' => '/home/yanick/Pictures',
#   'total' => '4922124',
#   'usage' => '3924728',
#   'usageper' => 85
# }
ok 6 - partition /dev/sdb14
# {
#   'free' => '7466888',
#   'mountpoint' => '/home/yanick/music',
#   'total' => '10915320',
#   'usage' => '2893960',
#   'usageper' => 28
# }
# Looks like you planned 5 tests but ran 6.
# Looks like you failed 2 tests of 6 run.
Dubious, test returned 2 (wstat 512, 0x200)
Failed 2/5 subtests
t/websites.t .........
1..10
ok 1 - GET https://michel-lacombe.dyndns.org
ok 2 - title of https://michel-lacombe.dyndns.org
ok 3 - GET https://kontext.ca
ok 4 - title of https://kontext.ca
ok 5 - GET https://babyl.dyndns.org/techblog
ok 6 - title of https://babyl.dyndns.org/techblog
ok 7 - GET https://academiedeschasseursdeprimes.ca
ok 8 - title of https://academiedeschasseursdeprimes.ca
ok 9 - GET https://babyl.dyndns.org
ok 10 - title of https://babyl.dyndns.org
ok
Test Summary Report
-------------------
t/partition-usage.t (Wstat: 512 Tests: 6 Failed: 3)
  Failed tests:  2, 5-6
  Non-zero exit status: 2
  Parse errors: Bad plan.  You planned 5 tests but ran 6.
Files=2, Tests=16,  3 wallclock secs ( 0.03 usr  0.01 sys +  0.36 cusr  0.06 csys =  0.46 CPU)
Result: FAIL
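The "Bad plan" complaint in that report comes from the hard-coded "tests => 5" in partition-usage.t: this machine has six partitions, so a sixth test runs, and it ends up listed among the failed tests even though that partition sits at a healthy 28%. A minimal sketch of deriving the plan from the data instead, the same way websites.t does. The %usage hash is hypothetical sample data standing in for what Sys::Statistics::Linux::DiskUsage returns, and the threshold is the 90% panic line used above.

```perl
use strict;
use warnings;
use Test::More;

# Hypothetical sample data: partition => percentage used.
my %usage = (
    '/dev/sda1'  => 27,
    '/dev/sdb10' => 42,
    '/dev/sdb14' => 28,
);

my $panic_percent = 90;

# One test per partition, so the plan always follows the data.
plan tests => scalar keys %usage;

for my $partition ( sort keys %usage ) {
    cmp_ok( $usage{$partition}, '<', $panic_percent, "partition $partition" );
}
```

Calling done_testing at the end of the script, with no plan at all, would be another way to get the same effect.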

Part II – Gathering and Broadcasting the Results

For that part, I’m using Smolder. I create a project for each machine on which I want to run tests, and use the following script in a cronjob:

cd /home/dumuzi
prove -l -m -v --archive test_run.tar.gz
smolder_smoke_signal --server enkidu:8085 --file test_run.tar.gz --project `hostname`

Et voilà. I can now be notified of the checks via email, RSS feed or from Smolder’s web interface.
