Think Like a CISO: Alerts That Matter (and the Ones That Don't)

It is 02:14 on a Tuesday. The pager is buzzing. You unlock the phone, squinting, and the notification bar reads FIRING: AlertmanagerFailedReload, AlertmanagerFailedToSendAlerts, KubeAggregatedAPIDown, NodeFilesystemAlmostOutOfFiles (warning), NodeFilesystemAlmostOutOfFiles (info), TargetDown. Six pages. You start scrolling. By the time you have read the third one (was that the staging cluster or production?), you have already half-acknowledged the rest without reading them. You roll back over. The pager goes quiet. Two hours later, your customers find the real outage before you do.

This is the alert problem. It is not that we have too little monitoring. It is the opposite. Most teams have so much monitoring that the signal is buried under its own noise, and the human who is supposed to act on it has learned, correctly, to ignore most of what their tools tell them. Alert fatigue is not a personal failing. It is a design failure.

If there is a single CISO-mindset shift that separates teams who recover from incidents from teams who get destroyed by them, it is this. Treat every alert as a cost, not a feature. Every alert spends a finite resource: a human's attention at 02:14, when their judgment is degraded, their context is gone, and their willingness to act is inversely proportional to how often they have been woken up for nothing.

The trap: every alert looks useful in isolation

Here is how most monitoring stacks end up the way they do. Someone deploys a new service. The default Helm chart, the default Datadog integration, the default SIEM ruleset, whatever it happens to be, ships with two dozen alerts already configured. They look reasonable. HighMemoryUsage, HighCPUThrottling, PodCrashLooping. Nobody removes them. Six months later, someone adds a custom alert because they had an incident: DatabaseSlowQueries. A year after that, a different person adds DiskAlmostFull at 80%. Then somebody else adds it again at 85%, and again at 90%, because the 80% version was too noisy and nobody deleted it when they raised the threshold. The always-firing Watchdog heartbeat now lands in the same feed as the alert for an actually broken cluster, and nothing on the page tells them apart.

The result is that your alert system is doing exactly what it was designed to do. The trouble is that what it was designed to do, accreted alert by alert over years, was never coherent. Nobody sat down and said "these are the things I want to be paged for, at 02:00, with my phone on the bedside table." It just grew.

This is why alerts are technical debt. Every alert in your ruleset either earns its place every time it fires, or it actively makes the next real incident slower to detect.

The actionability test

The single most useful filter I know, borrowed nearly verbatim from Google's SRE workbook, is the actionability test. For every alert in your system, ask:

  • If this fires at 02:00, what specific action does the on-call human take?
  • Could that action be automated? If yes, automate it instead of paging.
  • If the action is "investigate," is there a documented next step? If no, the alert is a research project, not an alert.
  • If the action is "do nothing because it will self-resolve," delete the alert. It is lying to you about urgency.

An alert that survives this test gets paged at the severity that matches its action. An alert that doesn't survive belongs in one of three places: a ticket queue for daytime review, a dashboard for trend-watching, or the bin. There is nothing in between.
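As a rough illustration, the test can be written down as a small triage function. This is a minimal sketch, not a reference implementation: the AlertReview fields and the four outcome names are my own assumptions, and sending an undocumented "investigate" to the ticket queue is one reasonable reading of "research project."

```python
from dataclasses import dataclass


@dataclass
class AlertReview:
    """Answers to the actionability questions for one alert (hypothetical schema)."""
    name: str
    action: str                         # what the on-call would do at 02:00
    automatable: bool = False           # could a script take that action instead?
    documented_next_step: bool = False  # is there a written step for "investigate"?
    self_resolves: bool = False         # does it usually clear with no action taken?


def triage(review: AlertReview) -> str:
    """Return one of 'delete', 'automate', 'ticket', or 'page'."""
    if review.self_resolves:
        # "Do nothing, it will self-resolve": the alert lies about urgency.
        return "delete"
    if review.automatable:
        # A machine can take the action, so no human needs waking.
        return "automate"
    if review.action == "investigate" and not review.documented_next_step:
        # A research project: review it in daytime, not on the pager.
        return "ticket"
    return "page"


print(triage(AlertReview("CheckoutErrors", action="roll back the last deploy")))
print(triage(AlertReview("DatabaseSlowQueries", action="investigate")))
```

The order of the checks is a design choice: deletion is tested first so that a self-resolving alert never survives just because someone wrote a runbook for it.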

The taxonomy most teams converge on, once they think about it explicitly, has three tiers.

  • Page. The on-call wakes up. Reserved for customer-visible impact or imminent risk of it. Symptoms, not causes. "Website returns 5xx for 5 minutes" is a page. "One node has high CPU" is not.
  • Ticket. Somebody looks at it during business hours. For things that need human attention but not right now. Capacity planning lives here. Most warnings live here.
  • Dashboard. The metric exists, you can graph it, you can build SLOs on top of it, but nobody is notified. Most of your observability data should live here.

The universal, persistent mistake is putting "ticket"-class problems on the pager. Every time you do that, you train your future self to ignore the pager.

A worked example: the looping self-alert

Here is a failure pattern you will eventually meet, in some flavor, whether you run an enterprise SOC or a homelab.

Imagine your alerting stack routes through two paths. One is a paging integration that hits the on-call's phone. The other is a webhook integration that hands events off to a downstream automation: a chat bot, a runbook executor, a ticketing system, something. Standard setup.

The downstream system dies. You don't notice immediately, because the only thing that pages you when it dies is, yes, itself. The webhook starts failing. Your alerting platform has a meta-alert for that condition: SendingFailed, WebhookUnhealthy, whatever it is called in your tool. The meta-alert fires. Where does it route? Through the same alerting platform. To the phone and to the failing webhook. The webhook fails again. The meta-alert re-fires. Loop.

By the time anyone notices, the phone has received the same self-alert dozens of times in a short window. And during that window the on-call missed two real alerts, including one that mattered, because they were buried in the self-loop noise.

The fix is small: a routing rule that keeps self-alerts away from the failing receiver, plus a separate external canary that watches the alerting system itself from outside. The lesson, though, is bigger than the fix. The alert system designed to protect you became the thing that hid the real incident from you, because every alert in it was treated as equally important. A self-loop of one alert pushed down two real ones.
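Both halves of that fix can be sketched in a few lines. This is an illustration under assumptions, not any platform's real API: the receiver names, the failing_receiver label, and the five-minute heartbeat window are all hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, List

# Hypothetical receiver names for the two delivery paths.
RECEIVERS = ["pager", "automation-webhook"]


def route(alert: Dict[str, str]) -> List[str]:
    """Return the receivers an alert should be delivered to.

    A meta-alert names the broken receiver in a 'failing_receiver' label
    (an assumed convention) and is never delivered through that receiver,
    which is what breaks the loop.
    """
    failing = alert.get("failing_receiver")
    return [r for r in RECEIVERS if r != failing]


def heartbeat_is_stale(last_seen: datetime, now: datetime,
                       max_age: timedelta = timedelta(minutes=5)) -> bool:
    """External canary: true when the alerting system's heartbeat has gone quiet.

    Intended to run outside the alerting platform, so it still works
    when the platform itself is the thing that broke.
    """
    return now - last_seen > max_age


meta = {"alertname": "WebhookUnhealthy", "failing_receiver": "automation-webhook"}
print(route(meta))  # the webhook is skipped; only the pager remains
```

The canary is deliberately dumb. It knows nothing about routing or severity, only whether a heartbeat arrived recently, which is why it can live somewhere the main system's failures cannot reach.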

This is the failure mode that should haunt you when you design alerting. Not "what if we miss a fire." "What if the smoke detector goes off so often during burnt toast that we sleep through the actual fire."

Noise-reduction primitives nobody uses enough

Whatever you are running, whether Prometheus + Alertmanager, PagerDuty, OpsGenie, or a SIEM with built-in routing, the same four primitives exist for a reason, and most teams underuse all four.

  • Grouping. If twelve services on the same node go unhealthy because the node died, you want one page that says "node X is down, 12 services affected." Not twelve pages. Group by the dimensions that make the on-call's life easier (usually alertname plus scope: namespace, host, cluster) so the page tells a story instead of just listing symptoms.
  • Inhibition. When a critical fires, suppress the warnings and infos for the same scope. A node-down alert should silence every "high latency from that node" warning it would otherwise generate. This is the single highest-leverage configuration knob most people never touch.
  • Severity-aware routing. The phone, the chat channel, and the dashboard should not all receive everything. Critical goes to the page. Warning goes to the daytime channel. Info gets logged only. Your routing tree is your noise filter.
  • Silences. When you are doing planned work, silence the alerts before you start, with an expiry. Don't acknowledge the same flap fifteen times. Don't train yourself to ignore the pager just because you knew this one was coming.
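To make grouping and inhibition concrete, here is a minimal sketch that treats alerts as plain label dictionaries. The label names (alertname, node, severity) and the example alerts are assumptions for illustration; real platforms express the same ideas in their own configuration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Alert = Dict[str, str]


def inhibit(alerts: List[Alert], scope: Tuple[str, ...] = ("node",)) -> List[Alert]:
    """Drop non-critical alerts whose scope already has a critical alert firing."""
    critical_scopes = {
        tuple(a.get(k) for k in scope)
        for a in alerts
        if a.get("severity") == "critical"
    }
    return [
        a for a in alerts
        if a.get("severity") == "critical"
        or tuple(a.get(k) for k in scope) not in critical_scopes
    ]


def group(alerts: List[Alert],
          by: Tuple[str, ...] = ("alertname", "node")) -> Dict[tuple, List[Alert]]:
    """Bucket alerts by the chosen labels so each bucket becomes one notification."""
    groups = defaultdict(list)
    for a in alerts:
        groups[tuple(a.get(k) for k in by)].append(a)
    return dict(groups)


# One dead node, twelve services on it complaining, one unrelated warning.
alerts = [{"alertname": "NodeDown", "node": "node-a", "severity": "critical"}]
alerts += [
    {"alertname": "HighLatency", "node": "node-a", "service": f"svc-{i}", "severity": "warning"}
    for i in range(12)
]
alerts.append({"alertname": "HighLatency", "node": "node-b", "severity": "warning"})

remaining = inhibit(alerts)
print(len(alerts), "alerts in,", len(remaining), "after inhibition")
```

In this example inhibition does most of the work: the twelve node-a warnings disappear behind the NodeDown critical, and grouping then decides how whatever is left gets bundled into notifications.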

None of these are exotic. Every serious alerting platform has had them for years. The reason they are underused is that they require thinking about your alerts as a system rather than as a checklist of bad things to detect.
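As a rough sketch of severity-aware routing and expiring silences: the destination names, the Silence class, and the choice to send unknown severities to the daytime channel are all assumptions of mine, not any particular product's behavior.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List

# Hypothetical destinations; substitute whatever your platform calls them.
ROUTES = {"critical": "pager", "warning": "daytime-channel", "info": "log-only"}


@dataclass
class Silence:
    matchers: Dict[str, str]  # labels an alert must carry to be silenced
    starts_at: datetime
    ends_at: datetime         # every silence expires; none are open-ended

    def covers(self, alert: Dict[str, str], now: datetime) -> bool:
        active = self.starts_at <= now < self.ends_at
        return active and all(alert.get(k) == v for k, v in self.matchers.items())


def destination(alert: Dict[str, str], silences: List[Silence], now: datetime) -> str:
    """Return where an alert goes, or 'silenced' if planned work covers it."""
    if any(s.covers(alert, now) for s in silences):
        return "silenced"
    # Assumption: an unknown severity goes to the daytime channel so a human sees it.
    return ROUTES.get(alert.get("severity", ""), "daytime-channel")


start = datetime(2024, 1, 1, 22, 0)
maintenance = Silence({"node": "node-a"}, start, start + timedelta(hours=2))
alert = {"alertname": "NodeDown", "node": "node-a", "severity": "critical"}
print(destination(alert, [maintenance], start + timedelta(hours=1)))
print(destination(alert, [maintenance], start + timedelta(hours=3)))
```

The expiry matters as much as the silence. Once the maintenance window ends, the same alert pages again without anyone having to remember to switch it back on.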

Symptoms versus causes

The other rule I keep returning to: alert on symptoms, not causes. If the database is slow, that is a cause. If the user-facing checkout returns 5xx, that is a symptom. The symptom is what your customer experiences. The cause is what you investigate after the page.

The problem with cause-based alerting is that there is no end to it. A slow database can be caused by IO contention, lock contention, a bad query plan, an exhausted connection pool, a failover in progress, a noisy neighbor on the hypervisor. Each one could be its own potential alert. Each one could be a 02:00 wake-up if it crosses some threshold. None of them, individually, mean a customer is being hurt.

The symptom does. "Checkout latency p99 exceeds 2 seconds for 5 minutes" means a customer is being hurt. And it fires exactly once, whether the underlying cause is one of those or all of them at once. The cause-based metrics still exist. They go on the dashboard, where the on-call looks at them once they have been paged. That is the right ordering.
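That symptom rule is easy to state precisely. The sketch below assumes latency samples arrive in one-minute windows and uses a nearest-rank percentile; both are illustrative choices, and a real monitoring system would compute this from its own histograms.

```python
from typing import List, Sequence


def p99(latencies_s: Sequence[float]) -> float:
    """Nearest-rank 99th percentile of a non-empty sample, in seconds."""
    ordered = sorted(latencies_s)
    rank = (99 * len(ordered) + 99) // 100  # integer ceil(0.99 * n), 1-based
    return ordered[rank - 1]


def should_page(per_minute_latencies: List[Sequence[float]],
                threshold_s: float = 2.0,
                for_minutes: int = 5) -> bool:
    """True only if p99 exceeded the threshold in each of the last `for_minutes` windows."""
    recent = per_minute_latencies[-for_minutes:]
    if len(recent) < for_minutes:
        return False  # not enough history to call it sustained
    # Assumption: a window with no requests does not count as a breach.
    return all(len(w) > 0 and p99(w) > threshold_s for w in recent)


slow_minute = [0.2] * 98 + [3.0, 3.5]  # two of a hundred requests are slow
fast_minute = [0.2] * 100

print(should_page([slow_minute] * 5))                  # sustained: page
print(should_page([slow_minute] * 4 + [fast_minute]))  # recovered: no page
```

The "for 5 minutes" clause is what keeps a single bad minute from paging anyone. The function says nothing about why checkout is slow, which is the point.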

Ownership, or: alerts without owners are landmines

Every alert in your system should have an owner. Not "the team," not "ops," not the implicit "whoever is on-call this week." A named individual or a small, identifiable group whose job it is to do three things.

  • Decide, every six months, whether this alert still earns its place.
  • Update the runbook when the alert evolves.
  • Delete it once it no longer fires for the right reasons.

An unowned alert eventually becomes a folklore alert. It fires occasionally, nobody remembers why it was added, nobody is sure if it is safe to delete, so it stays. After a few years the ruleset is 60% folklore. The on-call ignores the folklore alerts because they have been wrong before. The day a folklore alert is actually right, the on-call ignores it on reflex.

The fix is governance, which is a word that should make you suspicious in this context but which earns its keep here. Treat your alert config like code. It lives in version control. It is reviewed when it changes. It is owned by someone whose name is in the file. If you cannot point to the human who is on the hook for a given alert, that alert is technical debt by definition.
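Once the config lives in version control, "owned by someone whose name is in the file" can be enforced by a check that runs on every change. Here is a minimal sketch, assuming each rule has already been parsed into a dictionary; the field names (owner, runbook, last_reviewed) and the list of too-generic owners are hypothetical conventions, not a standard.

```python
from typing import Dict, List

REQUIRED_FIELDS = ("owner", "runbook", "last_reviewed")  # hypothetical metadata keys
GENERIC_OWNERS = {"ops", "the team", "on-call"}          # owners that name nobody


def lint_rules(rules: List[Dict[str, str]]) -> List[str]:
    """Return one human-readable problem per missing or meaningless field."""
    problems = []
    for rule in rules:
        name = rule.get("alert", "<unnamed>")
        for field in REQUIRED_FIELDS:
            if not rule.get(field):
                problems.append(f"{name}: missing {field}")
        owner = rule.get("owner", "").strip().lower()
        if owner in GENERIC_OWNERS:
            problems.append(f"{name}: owner '{rule['owner']}' is not a named person or group")
    return problems


rules = [
    {"alert": "CheckoutErrors", "owner": "alice@example.com",
     "runbook": "runbooks/checkout.md", "last_reviewed": "2024-01-15"},
    {"alert": "DiskAlmostFull", "owner": "ops",
     "runbook": "runbooks/disk.md", "last_reviewed": "2023-02-01"},
    {"alert": "FolkloreAlert"},
]

for problem in lint_rules(rules):
    print(problem)
```

Failing the review when this list is non-empty is a cheap way to stop new folklore alerts from being merged in the first place.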

The CISO mindset: attention is the budget

None of this is really about Prometheus or PagerDuty or any specific tool. The CISO-mindset framing is simple.

Your on-call human has an attention budget. Every alert spends some of it. The budget is not infinite, and it is smaller at 02:00 than at 14:00. Design your alert system as if attention were the scarce resource it actually is.

This reframes a lot of decisions. Fewer alerts at higher quality beat more alerts at lower quality, every time. A 95%-accurate alert that fires twice a week is more valuable than a 60%-accurate alert that fires twice a day. Quietness on the pager is a feature, not a bug. It means that when something does fire, the human picks up the phone with their full attention, not a thumb-flick to dismiss.
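The arithmetic behind that comparison is worth doing once. The sketch below assumes "accuracy" means the fraction of firings that pointed at a real problem, which is my reading rather than a formal definition.

```python
def false_pages_per_week(fires_per_week: float, accuracy: float) -> float:
    """Expected wake-ups per week that turn out to be nothing."""
    return fires_per_week * (1.0 - accuracy)


quiet = false_pages_per_week(fires_per_week=2, accuracy=0.95)       # twice a week
noisy = false_pages_per_week(fires_per_week=2 * 7, accuracy=0.60)   # twice a day

print(round(quiet, 1))  # 0.1
print(round(noisy, 1))  # 5.6
```

Under that reading, the quiet alert cries wolf about once every ten weeks, while the noisy one does it nearly every day. The second is the alert the on-call learns to dismiss.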

The same framing applies, with even sharper edges, to security alerts. SIEM products are built to detect everything that might be an attack. If you do not apply the actionability test to security alerts as ruthlessly as you would to infrastructure alerts, your SOC will drown, and the actual intrusion will be the unread alert at position 4,127.

Five questions for every alert in your system

This is the audit worth running quarterly, whether your scope is a corporate SOC or a single Kubernetes cluster you maintain alone. For each alert, write down the answer.

  1. Who is paged when this fires, and at what severity? If nobody, why does it exist? If everyone, why is it a page and not a ticket?
  2. What is the documented action? If the action is "investigate," what specifically? First command, first dashboard, first hypothesis. If you cannot write the runbook, the alert is not ready.
  3. When did this alert last fire, and was the human's action correct? If "did nothing because it self-resolved," delete it or downgrade to a dashboard. If "investigated and found nothing," your threshold is wrong or your symptom is wrong.
  4. What does this alert inhibit, and what inhibits it? An alert with no inhibition relationships is probably either too coarse or duplicating something else.
  5. If I deleted this alert tomorrow, what would I miss? If the honest answer is "nothing customer-visible for at least 30 minutes," it does not belong on the pager.

An alert that survives all five questions is one you can trust at 02:00. An alert that doesn't, and most won't the first time you run this audit, is one that is making your real incidents harder to detect.
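If you want the audit to leave a paper trail, the five answers fit in a small record. This is one possible encoding, not a standard: the field names, the outcome strings, and the pass/fail thresholds are assumptions taken from the questions as written.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AuditAnswer:
    alert: str
    paged_receiver: Optional[str]     # Q1: who is paged; None if nobody
    documented_action: Optional[str]  # Q2: first command, dashboard, or hypothesis
    last_outcome: Optional[str]       # Q3: "acted", "self_resolved", "found_nothing"; None if never fired
    inhibition_links: int             # Q4: alerts it inhibits plus alerts that inhibit it
    minutes_to_customer_impact: Optional[int]  # Q5: None if deleting it loses nothing customer-visible


def failed_questions(a: AuditAnswer) -> List[int]:
    """Return the numbers of the audit questions this alert fails."""
    failed = []
    if a.paged_receiver is None:
        failed.append(1)
    if not a.documented_action:
        failed.append(2)
    if a.last_outcome in ("self_resolved", "found_nothing"):
        failed.append(3)
    if a.inhibition_links == 0:
        failed.append(4)
    if a.minutes_to_customer_impact is None or a.minutes_to_customer_impact >= 30:
        failed.append(5)
    return failed


checkout = AuditAnswer("CheckoutErrors", "on-call-primary",
                       "open the checkout dashboard, check the last deploy",
                       "acted", 2, 5)
folklore = AuditAnswer("FolkloreAlert", None, None, "self_resolved", 0, None)

print(checkout.alert, failed_questions(checkout))
print(folklore.alert, failed_questions(folklore))
```

A failed question 4 is only a hint, since the question itself says "probably," so treat it as a prompt to look rather than a verdict.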

Quietness is the goal. A pager that only goes off when something real is happening is not a sign that nothing is happening. It is a sign that you have done the work.