The 'Armoured Sickener' Effect: Why Your Monitoring System Is Breaking Your On-Call Team

Reading the recent news about Britain’s £6 billion Ajax armored vehicles gave me flashbacks to my time managing NOC floors. The report is damning: the vehicles cause debilitating physical symptoms—nausea, tinnitus, and vibration injury—in the troops riding them. The official investigation’s conclusion? There is no single cause. It’s just a combination of bad bolts, cold air, and noise.

The solution? The Army has been told to accept the vehicles back and “grin and bear it.”

In the IT world, we see the “Ajax” scenario play out every single day. You buy expensive RMM platforms (Ninja, ConnectWise, Datto) and standalone monitoring tools. You deploy them across your infrastructure. But instead of protecting your team, these tools bombard them with vibration and noise—useless alerts, 3 AM pages for non-issues, and false positives that cause actual physical burnout.

And when your sysadmins ask for relief, the answer is often, “That’s just how the tool works. Deal with it.”

The Problem in Depth: When Your RMM Makes You Sick

If you are an MSP technician or an internal IT sysadmin, you know the feeling. You are carrying the on-call pager. It’s 2:00 AM. The phone buzzes. Your heart rate spikes. You roll over, unlock your laptop, and log into three different portals to figure out what’s happening.

The issue? A Windows Server spiked CPU to 100% for 15 seconds during a scheduled backup. The monitoring tool saw a spike, generated a “Critical” alert, and routed it to you.

This is the signal quality problem. Your monitoring tools are acting like the Ajax turret—spinning wildly and generating deafening noise without hitting a target.

Why Existing Tools Fail

Siloed Architecture: Your RMM knows the device is down, but it doesn’t know that ServiceNow has a maintenance window scheduled for that server. Your Helpdesk knows the user is complaining, but it doesn’t know that a network link is flapping. You have the data, but it’s scattered across four different consoles that don’t talk to each other.
The “No Single Cause” Trap: Just like the Ajax investigation, legacy monitoring tools struggle to correlate root cause. Did the server crash because of hardware, a patch, or an application error? Without context, every alert is a cold start: the on-call engineer has to investigate each incident from scratch.
The Real Cost: Alert Fatigue:
- Staff Morale: When 90% of your alerts are noise, your team stops trusting the system. They start ignoring pages, which means they miss the one critical alert that matters (like the Exchange database actually going down).
- SLA Misses: Time spent investigating false positives is time not spent resolving actual user issues.
- Tool Sprawl: You are paying for an RMM, a separate Helpdesk, a separate APM tool, and a separate paging system (like PagerDuty). You are paying a premium to give your team a headache.

How AlertMonitor Solves This: Signal Over Noise

At AlertMonitor, we built our platform on a simple premise: Alert fatigue isn’t a volume problem; it’s a signal quality problem.

We don’t just notify you that something happened; we tell you what happened, where it happened, and what healthy looks like for that specific device.
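One way to picture "what healthy looks like for that specific device" is a per-device baseline bucketed by weekday and hour. The sketch below is illustrative only and is not AlertMonitor's implementation: the `Baseline` class, its method names, the three-sigma tolerance, and the five-sample minimum are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, pstdev


class Baseline:
    """Hypothetical per-device CPU baseline, bucketed by (weekday, hour)."""

    def __init__(self, tolerance_sigmas: float = 3.0, min_samples: int = 5):
        self.tolerance_sigmas = tolerance_sigmas  # assumed tolerance, tune per fleet
        self.min_samples = min_samples            # assumed minimum history
        # (device, weekday, hour) -> observed CPU percentages
        self._samples = defaultdict(list)

    def record(self, device: str, when: datetime, cpu_percent: float) -> None:
        """Store one observation in the bucket for that device, weekday, and hour."""
        self._samples[(device, when.weekday(), when.hour)].append(cpu_percent)

    def is_anomalous(self, device: str, when: datetime, cpu_percent: float) -> bool:
        """Return True when the reading is above the usual range for this time slot."""
        history = self._samples.get((device, when.weekday(), when.hour), [])
        if len(history) < self.min_samples:
            # Too little history to judge, so do not hide the reading.
            return True
        limit = mean(history) + self.tolerance_sigmas * pstdev(history)
        return cpu_percent > limit
```

With a few weeks of Tuesday 2 AM backup readings recorded, a similar reading on the next Tuesday at 2 AM would come back as not anomalous, while the same reading at an hour with no backup history would still be flagged.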

1. Full Context Payloads

Unlike a standard Nagios or SolarWinds alert that says little more than “Host: Server1 - Status: Down”, AlertMonitor enriches every alert with full context:

- Device Details: OS version, patch level, and client name.
- Topology: Is this a dependency failure? If the switch is down, we suppress the alerts for the 50 workstations behind it automatically.
- Historical Baseline: Is this CPU spike normal for Tuesday at 2 AM, or is it an anomaly?
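The topology point can be sketched as a walk up a child-to-parent map: if anything upstream of a device is also down, the device's alert is a symptom and gets suppressed. This is a minimal illustration under stated assumptions, not the product's actual logic. The `Alert` dataclass, the `suppress_downstream` function, the `parent_of` map, and the `suppressed_by` context key are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    device: str
    status: str                                   # e.g. "Down"
    context: dict = field(default_factory=dict)   # enrichment added along the way


def suppress_downstream(alerts, parent_of):
    """Split alerts into (to_page, suppressed) using a child -> parent map.

    An alert is suppressed when any device upstream of it also has a "Down" alert.
    """
    down = {a.device for a in alerts if a.status == "Down"}
    to_page, suppressed = [], []
    for alert in alerts:
        root_cause = None
        seen = set()                              # guards against loops in the map
        ancestor = parent_of.get(alert.device)
        while ancestor is not None and ancestor not in seen:
            seen.add(ancestor)
            if ancestor in down:
                root_cause = ancestor             # keep walking to find the topmost failure
            ancestor = parent_of.get(ancestor)
        if root_cause is None:
            to_page.append(alert)
        else:
            alert.context["suppressed_by"] = root_cause
            suppressed.append(alert)
    return to_page, suppressed
```

Given one switch alert and alerts from the workstations behind it, only the switch alert lands in `to_page`; the workstation alerts are kept, tagged with the device that explains them.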

2. Smart Deduplication and Suppression

If a server is in a maintenance window for patching, AlertMonitor suppresses the alerts automatically. No manual “mute” required.
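At its core, a maintenance-window check is a time-range lookup per device. The sketch below assumes the windows have already been synced from wherever they are scheduled; `MaintenanceWindow` and `in_maintenance` are hypothetical names, and treating the end time as exclusive is a design choice made for the example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class MaintenanceWindow:
    device: str
    start: datetime
    end: datetime    # exclusive: alerts at exactly `end` are no longer suppressed


def in_maintenance(device, fired_at, windows):
    """Return True when the alert time falls inside a window for that device."""
    return any(
        w.device == device and w.start <= fired_at < w.end
        for w in windows
    )
```

An alert pipeline would call `in_maintenance` before routing and drop or log the alert when it returns True, so nobody has to remember to mute the server by hand.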

3. Unified Workflow

When an alert fires, the on-call tech gets a notification with a link that opens a single pane of glass. They can see:

- The alert.
- The topology map.
- The ticket in the integrated Helpdesk.
- Remote control access (RMM integration) to fix it immediately.

No more tab-switching. No more logging into three different tools to triage one incident.

Practical Steps: Improving Signal Quality Today

You can't fix the Ajax tank, but you can fix your monitoring. The key is to stop sending raw data to your on-call team and start sending context.

If you are using a monitoring tool that allows custom scripts, stop using simple “Is Process Running?” checks. Instead, use scripts that verify the health of the service before generating an alert.

Here is a PowerShell example for a Windows environment. Instead of just checking if the Print Spooler is running (which causes alerts if you restart it during maintenance), this script checks if the service is stopped and if it has been stopped for more than 2 minutes. This prevents transient blips from waking up your team.

PowerShell

# Advanced Service Check - AlertMonitor Example
# Alerts only when a service has been stopped for a sustained period,
# so transient restarts do not page anyone.

$ServiceName      = "Spooler"
$ThresholdMinutes = 2
$LookbackHours    = 24   # How far back to search the System log; adjust as needed

$Service = Get-Service -Name $ServiceName -ErrorAction SilentlyContinue

if (-not $Service) {
    # Get-Service returned nothing, so the service is not installed on this host
    Write-Output "UNKNOWN: Service $ServiceName was not found."
    exit 3 # Nagios-style UNKNOWN
}

if ($Service.Status -eq 'Running') {
    Write-Output "OK: Service $ServiceName is running."
    exit 0
}

# Event 7036 messages use the display name ("Print Spooler"), not the service name ("Spooler").
# Note: the message text is localized; this match assumes an English-language OS.
$Filter = @{
    LogName      = 'System'
    ProviderName = 'Service Control Manager'
    Id           = 7036
    StartTime    = (Get-Date).AddHours(-$LookbackHours)
}

# Get-WinEvent returns the newest events first, so the first match is the most recent stop.
# SilentlyContinue keeps the script quiet when no events match the filter.
$StoppedEvent = Get-WinEvent -FilterHashtable $Filter -ErrorAction SilentlyContinue |
    Where-Object { $_.Message -like "*$($Service.DisplayName)*stopped*" } |
    Select-Object -First 1

if (-not $StoppedEvent) {
    # Fallback if no event was found (logs cleared, or the stop is older than the lookback)
    Write-Output "WARNING: Service $ServiceName is not running, could not determine duration."
    exit 1
}

$Minutes = [math]::Round(((Get-Date) - $StoppedEvent.TimeCreated).TotalMinutes, 1)

if ($Minutes -gt $ThresholdMinutes) {
    Write-Output "CRITICAL: Service $ServiceName has been stopped for $Minutes minutes."
    exit 2 # Standard Nagios/AlertMonitor Critical Code
}

Write-Output "OK: Service $ServiceName is stopped, but only for $Minutes minutes (within threshold)."
exit 0

Integrating with AlertMonitor

When you integrate scripts like this into AlertMonitor, you drastically reduce the “vibration” of false alerts.

- Ingest the Output: Configure AlertMonitor to accept the exit code and the text output.
- Apply Escalation Logic: Set the escalation policy to page the On-Call Sysadmin only if the state persists for more than 5 minutes. If it resolves within 2 minutes, log it as an auto-resolved ticket for visibility during business hours, but don't page anyone.
- Topology Awareness: Link this service check to the server it runs on. If the server agent reports “Offline,” AlertMonitor will automatically mark this dependent service check as “Unreachable” rather than “Critical,” saving your team from investigating symptoms of a root cause they already know about.
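The escalation and topology rules above boil down to a small decision function. The following is a rough sketch of that logic, not AlertMonitor configuration syntax: `decide_action`, its parameters, and the returned labels are hypothetical, and only the 5-minute paging threshold comes from the policy described here.

```python
from datetime import datetime, timedelta
from typing import Optional

PAGE_AFTER = timedelta(minutes=5)   # paging threshold from the policy above


def decide_action(first_failed: datetime, now: datetime,
                  resolved_at: Optional[datetime] = None,
                  agent_online: bool = True) -> str:
    """Return "unreachable", "page", "log_only", or "wait" for one failing check."""
    if not agent_online:
        # The server itself is offline, so this check is a symptom, not a new incident.
        return "unreachable"
    end = resolved_at if resolved_at is not None else now
    lasted = end - first_failed
    if lasted > PAGE_AFTER:
        return "page"               # persisted past the threshold
    if resolved_at is not None:
        return "log_only"           # auto-resolved: ticket for business hours, no page
    return "wait"                   # still failing, but not long enough to page yet
```

A check that fails at 2:00 AM and recovers at 2:02 AM comes back as `log_only`, while one that is still failing at 2:06 AM comes back as `page`.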

Don't let your monitoring platform be an Armoured Sickener for your team. Give them the context they need to do their jobs without the burnout.
