Detection is an inference game. If your data feed stutters or lies, the game is unwinnable. The most expensive detection rule is the one fed by missing or malformed logs.
A Taxonomy of Log Collection Failures:#
- Source darkouts: agents crash, tokens expire, or a VLAN blocks egress—no events at all.
- Latency and jitter: events arrive minutes late; correlation windows miss.
- Timestamp drift: NTP skew scrambles ordering and creates false negatives.
- Parse failures: vendor format changes or unhandled variants zero out key fields.
- Cost‑driven sampling: intentional gaps erase exactly the low‑frequency, high‑value signals.
Symptoms in the SOC:#
- “It fired yesterday” with identical inputs today—except the log came 90s late.
- Duplicate events collapse; the unique one carrying the decisive field didn’t parse.
- Analysts pivot to an entity and realize last‑seen timestamps are days old.
Quick Diagnostics You Can Automate:#
- Time‑since‑last‑log: alert if a critical source is silent beyond an SLO (e.g., 5 minutes).
- On‑time rate: % of events arriving within an acceptable latency budget.
- Parse success: % of logs that populate decisive fields per source and version.
- Field completeness: distribution of non‑null values for entities you correlate on.
Minimum Viable SLOs:#
- Critical source on‑time ≥ 99% within 60s; parse success ≥ 98%.
- Clock skew < 250ms across collectors and aggregators.
- Backlog depth < 1 minute during peak hours.
Architectural Fixes That Actually Work:#
- Agents as code: declarative rollout, health probes, and auto‑remediation.
- Buffered ingestion: local queues; broker tiers to ride through bursts.
- Schema evolution: versioned contracts and compatibility tests for parsers.
- Tiered routing: send high‑value slices to hot storage; archive the rest cheaply.
Operational Playbook:#
1) Establish health dashboards tied to paging for darkouts and parse drops. 2) Run “tracer bullets” (synthetic events) every 10–15 minutes; verify end‑to‑end. 3) Add chaos drills: inject delay and drop on a canary path; measure detection resilience. 4) Quarterly parser fire drills: flip optional fields and validate survivability.
Case Sketch: HealthCorp discovered 5% of critical servers were dark due to agent drift and blocked VLANs. Coverage for lateral movement was effectively 0% on those nodes. After instrumenting health checks, buffered pipelines, and tiered routing, blind spots closed and audit trails improved.
Where tools can help: Any stack can implement synthetic tracing and chaos. If you prefer an integrated harness, VanatorX includes them—but the principles stand on their own.