Alert counts are a trap. If you can’t explain what an alert means, how fast you saw it, and whether it would still fire under stress, you are not measuring detection quality.
## The Problem With Counting Alerts
A high number of alerts may indicate noisy rules, not better detection. A low number can mean effective filtering—or a blind SOC. Alert volume is a vanity metric without context.
## Observable Symptoms
- Weekly reports celebrate “alerts processed” but investigations stall on missing context.
- Teams fight over whether a rule “works” because no shared definitions exist for coverage or fidelity.
- After a schema change, detection “success” remains flat while true positives drop.
## Hidden Failure Modes
- Coverage illusion: rules match a narrow subset of a TTP and miss realistic variants.
- Time distortion: ingestion lag makes MTTD look fine when measured at the SIEM, not at the edge.
- Fragility blindness: rules pass a clean test but collapse with 30–60s log delay.
## Dimensions That Actually Describe Quality
### Coverage & Completeness
- Definition: % of prioritized TTP steps that produce at least one reliable alert.
- Nuance: weight by risk and environmental prevalence; include data source health.
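As a concrete illustration, here is a minimal sketch in Python, assuming a hypothetical list of prioritized TTP steps with illustrative risk weights and a per-step flag derived from your replay results; the names and numbers are placeholders, not a standard.

```python
from dataclasses import dataclass

@dataclass
class TTPStep:
    technique: str            # e.g. "T1059.001"
    risk_weight: float        # higher = more important to cover
    has_reliable_alert: bool  # at least one rule fired reliably during replay

def coverage_score(steps: list[TTPStep]) -> float:
    """Risk-weighted coverage: share of prioritized TTP steps with a reliable alert (0-1)."""
    total = sum(s.risk_weight for s in steps)
    covered = sum(s.risk_weight for s in steps if s.has_reliable_alert)
    return covered / total if total else 0.0

steps = [
    TTPStep("T1059.001", 3.0, True),   # PowerShell execution
    TTPStep("T1003.001", 5.0, False),  # credential dumping - missed
    TTPStep("T1547.001", 2.0, True),   # run-key persistence
]
print(f"coverage = {coverage_score(steps):.2f}")  # 0.50
```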
### Precision & Fidelity
- Definition: signal purity (1 − FP rate) and actionability (context attached).
- Evidence: triage time distributions, analyst “would page again” votes, duplicate suppression rate.
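A minimal sketch of the signal-purity side, assuming a hypothetical export of dispositions and triage minutes from the case queue; field names and values are illustrative.

```python
from statistics import quantiles

# Hypothetical triage records: disposition plus minutes spent per alert.
alerts = [
    {"disposition": "true_positive", "triage_minutes": 12},
    {"disposition": "false_positive", "triage_minutes": 35},
    {"disposition": "true_positive", "triage_minutes": 9},
    {"disposition": "false_positive", "triage_minutes": 41},
    {"disposition": "true_positive", "triage_minutes": 15},
]

fp = sum(a["disposition"] == "false_positive" for a in alerts)
precision = 1 - fp / len(alerts)  # signal purity, 0-1

# p90 triage time as a rough proxy for actionability (lower = better context attached).
p90_triage = quantiles([a["triage_minutes"] for a in alerts], n=10)[-1]

print(f"precision = {precision:.2f}, p90 triage = {p90_triage:.0f} min")
```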
### Timeliness & Velocity
- Definition: MTTD measured at the point of action (kernel, EDR, API), not post‑aggregation.
- Include: end‑to‑end latency, clock skew, and backlog depth.
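A minimal sketch of measuring MTTD at the point of action versus end to end, assuming hypothetical timestamps and a known host clock skew; every value is illustrative.

```python
from datetime import datetime, timedelta, timezone

# Illustrative timestamps for one replayed action.
action_at_endpoint = datetime(2024, 5, 1, 10, 0, 0, tzinfo=timezone.utc)   # process start on the host
edge_event_time    = datetime(2024, 5, 1, 10, 0, 2, tzinfo=timezone.utc)   # EDR event, same host clock
siem_alert_time    = datetime(2024, 5, 1, 10, 1, 40, tzinfo=timezone.utc)  # rule fired in the SIEM
host_clock_skew    = timedelta(seconds=-3)  # host minus reference clock; negative = host runs behind

mttd_at_edge    = edge_event_time - action_at_endpoint   # both stamped by the host clock, skew cancels
action_true     = action_at_endpoint - host_clock_skew   # correct skew before comparing across clocks
mttd_end_to_end = siem_alert_time - action_true          # what the analyst actually experiences
pipeline_lag    = mttd_end_to_end - mttd_at_edge         # ingestion plus rule-evaluation delay

print(f"MTTD at edge: {mttd_at_edge.total_seconds():.0f}s, "
      f"end-to-end: {mttd_end_to_end.total_seconds():.0f}s, "
      f"pipeline lag: {pipeline_lag.total_seconds():.0f}s")
```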
### Resilience & Robustness
- Definition: retained coverage/precision when chaos is injected (drops, jitter, burst load).
- Include: parser survivability across vendor updates and app deploys.
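A minimal sketch of resilience as score retention, assuming baseline and chaos-run measurements over the same replay set; the numbers and the worst-case aggregation are illustrative choices.

```python
# Hypothetical baseline vs. chaos-run results for the same replay set.
baseline    = {"coverage": 0.74, "precision": 0.91}
under_chaos = {"coverage": 0.52, "precision": 0.83}  # e.g. 45 s log delay + 5% event drops injected

def retention(dim: str) -> float:
    """Fraction of the baseline score retained when chaos is injected (0-1)."""
    return under_chaos[dim] / baseline[dim] if baseline[dim] else 0.0

resilience = min(retention("coverage"), retention("precision"))  # worst case is what pages you
print(f"resilience = {resilience:.2f}")  # 0.70
```

Taking the minimum rather than an average is a deliberate choice here: a pipeline that keeps precision but loses half its coverage under jitter is still fragile.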
## Computing a Composite Detection Quality Score (CDQS)
- Normalize each dimension to 0–100.
- Example weights: Coverage 0.4, Precision 0.25, Timeliness 0.2, Resilience 0.15.
- Report the score and the four components; never the aggregate alone.
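A minimal sketch of the aggregation, assuming the four component scores have already been normalized to 0–100 as described above and using the example weights from the list.

```python
# Component scores already normalized to 0-100 (see the dimension sketches above).
components = {"coverage": 68, "precision": 84, "timeliness": 72, "resilience": 55}
weights    = {"coverage": 0.40, "precision": 0.25, "timeliness": 0.20, "resilience": 0.15}

assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"

cdqs = sum(components[k] * weights[k] for k in components)

# Report the breakdown alongside the aggregate, never the aggregate alone.
print(f"CDQS = {cdqs:.1f}")
for name, score in components.items():
    print(f"  {name:<11} {score:>3}  (weight {weights[name]:.2f})")
```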
## Data You Need (and How to Get It)
- Ground truth traces from emulations and past incidents.
- Health metrics per source: on‑time %, parse success, field completeness (see the sketch after this list).
- Replay harness to measure true latency and rule triggering under stress.
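A minimal sketch of the per-source health metrics, assuming hypothetical per-event records (ingestion lag, parse status, field counts) exported from the pipeline; the 60-second on-time threshold is an illustrative SLA, not a recommendation.

```python
# Hypothetical per-event records for one log source.
events = [
    {"lag_seconds": 12, "parsed": True,  "fields_present": 9,  "fields_expected": 10},
    {"lag_seconds": 95, "parsed": True,  "fields_present": 10, "fields_expected": 10},
    {"lag_seconds": 8,  "parsed": False, "fields_present": 0,  "fields_expected": 10},
    {"lag_seconds": 30, "parsed": True,  "fields_present": 7,  "fields_expected": 10},
]
ON_TIME_SLA_SECONDS = 60  # illustrative threshold

n = len(events)
on_time_pct  = 100 * sum(e["lag_seconds"] <= ON_TIME_SLA_SECONDS for e in events) / n
parse_pct    = 100 * sum(e["parsed"] for e in events) / n
completeness = 100 * sum(e["fields_present"] / e["fields_expected"] for e in events) / n

print(f"on-time {on_time_pct:.0f}%, parse success {parse_pct:.0f}%, "
      f"field completeness {completeness:.0f}%")
```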
## Dashboard Anatomy
- Top: CDQS trend with seasonal annotations (releases, migrations).
- Middle: Coverage by ATT&CK tactic with gaps heatmap.
- Bottom: Latency waterfall (edge → bus → SIEM) and resilience scatter under injected chaos.
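A minimal sketch of the latency waterfall, assuming hypothetical per-stage durations sampled from pipeline tracing; stage names and timings are illustrative.

```python
# Hypothetical average duration of each hop for one detection, in seconds.
stages = [
    ("action -> edge sensor",    2.1),
    ("edge sensor -> log bus",  14.8),
    ("log bus -> SIEM index",   41.3),
    ("SIEM index -> rule fire",  9.6),
]

total = sum(seconds for _, seconds in stages)
print(f"end-to-end detection latency: {total:.1f}s")
for name, seconds in stages:
    bar = "#" * round(40 * seconds / total)  # crude text waterfall
    print(f"{name:<26} {seconds:>6.1f}s  {bar}")
```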
## Implementation Pitfalls
- Over‑averaging: a few high‑volume, low‑risk detections can drown out high‑risk misses.
- Ignoring benign replay: precision cannot be inferred without clean negative tests (see the sketch after this list).
- Skipping resilience runs: your “green” becomes red the moment logs jitter.
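A minimal sketch of the benign-replay check, assuming a hypothetical summary of rule hits from a trace set known to contain no attack activity; rule names and counts are illustrative.

```python
# Hypothetical result of replaying benign-only traces against the rule pack:
# rule name -> alerts raised on traffic known to contain no attack.
benign_run_hits = {
    "susp_powershell_encoded": 0,
    "lsass_access_nonsystem":  3,  # fires on a backup agent: precision problem
    "new_service_install":     1,
}
benign_event_count = 250_000

noisy_rules = {rule: hits for rule, hits in benign_run_hits.items() if hits > 0}
fp_per_million = 1_000_000 * sum(benign_run_hits.values()) / benign_event_count

print(f"FP rate on benign replay: {fp_per_million:.1f} per million events")
print("rules needing tuning:", ", ".join(sorted(noisy_rules)) or "none")
```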
## A Minimal Playbook
1) Pick 12 TTPs tied to your crown jewels.
2) Define per‑TTP coverage, fidelity, and latency acceptance criteria.
3) Run monthly replays; once a quarter, add chaos (delay, drops, format flip), as sketched below.
4) Publish CDQS with before/after deltas and specific remediations.
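A minimal sketch of the chaos step, assuming replay events are plain dicts with a numeric `ts` field; the delay, drop rate, and rename mapping are illustrative defaults, not recommendations.

```python
import random

def inject_chaos(events, delay_s=45, drop_rate=0.05, rename_fields=None, seed=7):
    """Degrade a replay stream: shift timestamps, drop events, rename fields."""
    rng = random.Random(seed)
    rename_fields = rename_fields or {}
    degraded = []
    for event in events:
        if rng.random() < drop_rate:                 # simulate forwarder loss
            continue
        e = dict(event)
        e["ts"] = e["ts"] + rng.uniform(0, delay_s)  # simulate ingestion delay and jitter
        for old, new in rename_fields.items():       # simulate a schema or tag rename
            if old in e:
                e[new] = e.pop(old)
        degraded.append(e)
    return degraded

baseline = [{"ts": i, "proc_name": "powershell.exe"} for i in range(1000)]
chaotic = inject_chaos(baseline, rename_fields={"proc_name": "process.name"})
print(f"kept {len(chaotic)}/{len(baseline)} events; sample degraded event: {chaotic[0]}")
```

Feed the degraded stream back through the same rules and recompute coverage, precision, and MTTD; the delta against the clean run is your resilience score.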
## Case Sketch
A retailer saw “detection OK” dashboards while card‑skimming rules silently failed after a tag rename in the app logs. Coverage fell from 74% to 41%. After adding data‑health metrics, resilient coverage recovered to 68% under chaos and MTTD improved by 40%.
## Where Tools Can Help
You can stitch this together with your own harnesses. If you want an integrated path for replay, chaos, and measurement, VanatorX offers one, but the model works anywhere.