Logging

Log Collection Issues and Their Impact on Detection: Common Pitfalls and Fixes

2025-08-14

Detection is an inference game. If your data feed stutters or lies, the game is unwinnable. The most expensive detection rule is the one fed by missing or malformed logs.

A Taxonomy of Log Collection Failures:#

  • Source darkouts: agents crash, tokens expire, or a VLAN blocks egress—no events at all.
  • Latency and jitter: events arrive minutes late; correlation windows miss.
  • Timestamp drift: NTP skew scrambles ordering and creates false negatives.
  • Parse failures: vendor format changes or unhandled variants zero out key fields.
  • Cost‑driven sampling: intentional gaps erase exactly the low‑frequency, high‑value signals.

Symptoms in the SOC:#

  • “It fired yesterday”: the same inputs arrive today, but the rule stays silent because the log landed 90 seconds late.
  • Deduplication collapses duplicate events; the one unique copy that carried the decisive field never parsed.
  • Analysts pivot to an entity and realize last‑seen timestamps are days old.

Quick Diagnostics You Can Automate:#

  • Time‑since‑last‑log: alert if a critical source is silent beyond an SLO (e.g., 5 minutes).
  • On‑time rate: % of events arriving within an acceptable latency budget.
  • Parse success: % of logs that populate decisive fields per source and version.
  • Field completeness: distribution of non‑null values for entities you correlate on. (All four checks are sketched in code after this list.)
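A minimal sketch of these four checks, assuming events arrive as plain dicts with a source name, emit and ingest timestamps, a parsed flag, and a fields map; the shape, field names, and thresholds are illustrative rather than a prescribed schema.

```python
from datetime import datetime, timedelta, timezone

# Illustrative event shape: {"source": ..., "emitted_at": datetime, "ingested_at": datetime,
# "parsed": bool, "fields": {...}} — adapt to your pipeline's actual schema.

def silent_sources(last_seen: dict[str, datetime], slo: timedelta = timedelta(minutes=5)) -> list[str]:
    """Return sources whose most recent event is older than the silence SLO."""
    now = datetime.now(timezone.utc)
    return [src for src, ts in last_seen.items() if now - ts > slo]

def on_time_rate(events: list[dict], budget: timedelta = timedelta(seconds=60)) -> float:
    """Fraction of events ingested within the latency budget of their emit time."""
    if not events:
        return 0.0
    on_time = sum(1 for e in events if e["ingested_at"] - e["emitted_at"] <= budget)
    return on_time / len(events)

def parse_success(events: list[dict]) -> float:
    """Fraction of events whose parser populated the decisive fields."""
    if not events:
        return 0.0
    return sum(1 for e in events if e.get("parsed")) / len(events)

def field_completeness(events: list[dict], field: str) -> float:
    """Fraction of events carrying a non-null value for a correlation field."""
    if not events:
        return 0.0
    return sum(1 for e in events if e.get("fields", {}).get(field) is not None) / len(events)
```

Page when silent_sources() returns a critical source, and trend the three ratios per source and parser version so a regression shows up as a slope, not a surprise.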

Minimum Viable SLOs:#

  • Critical source on‑time ≥ 99% within 60s; parse success ≥ 98%.
  • Clock skew < 250ms across collectors and aggregators.
  • Backlog depth < 1 minute during peak hours. (A check against these thresholds is sketched below.)
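To make these SLOs machine-checkable rather than aspirational, they can be encoded next to the health metrics above. The threshold names and observed-metric keys below are illustrative and assume those metrics are already being collected.

```python
# Illustrative SLO thresholds mirroring the list above; tune per source tier.
SLOS = {
    "on_time_rate_min": 0.99,   # critical sources, 60s latency budget
    "parse_success_min": 0.98,
    "clock_skew_max_ms": 250,
    "backlog_depth_max_s": 60,  # peak-hour backlog under one minute
}

def slo_breaches(observed: dict) -> list[str]:
    """Compare observed pipeline metrics against the SLO table and list any breaches."""
    breaches = []
    if observed["on_time_rate"] < SLOS["on_time_rate_min"]:
        breaches.append("on-time rate below 99%")
    if observed["parse_success"] < SLOS["parse_success_min"]:
        breaches.append("parse success below 98%")
    if observed["clock_skew_ms"] > SLOS["clock_skew_max_ms"]:
        breaches.append("clock skew above 250ms")
    if observed["backlog_depth_s"] > SLOS["backlog_depth_max_s"]:
        breaches.append("backlog depth above 1 minute")
    return breaches
```

Wire the returned breach list into the same paging path as darkout alerts so SLO erosion is treated as an incident, not a dashboard curiosity.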

Architectural Fixes That Actually Work:#

  • Agents as code: declarative rollout, health probes, and auto‑remediation.
  • Buffered ingestion: local queues; broker tiers to ride through bursts.
  • Schema evolution: versioned contracts and compatibility tests for parsers (one such test is sketched after this list).
  • Tiered routing: send high‑value slices to hot storage; archive the rest cheaply.
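As a sketch of the schema-evolution point, a compatibility test can pin the decisive fields that detections correlate on, so a vendor format change fails CI instead of silently zeroing fields in production. The field maps, sample payloads, and parse() helper below are illustrative stand-ins for your real parser and fixtures.

```python
import json

# Decisive fields that downstream detections correlate on.
DECISIVE_FIELDS = {"user", "src_ip", "action", "timestamp"}

# Versioned field maps: the "contract" between each vendor format and the internal schema.
FIELD_MAPS = {
    "v1": {"usr": "user", "ip": "src_ip", "act": "action", "ts": "timestamp"},
    "v2": {"user": "user", "src_ip": "src_ip", "action": "action", "timestamp": "timestamp"},
}

def parse(raw: str, version: str) -> dict:
    """Normalize a raw vendor payload into the internal schema for its version."""
    record = json.loads(raw)
    return {dst: record.get(src) for src, dst in FIELD_MAPS[version].items()}

# Sample payloads captured per schema version and kept as test fixtures.
SAMPLES = {
    "v1": '{"usr": "alice", "ip": "10.0.0.5", "act": "login", "ts": "2025-08-14T12:00:00Z"}',
    "v2": '{"user": "alice", "src_ip": "10.0.0.5", "action": "login", "timestamp": "2025-08-14T12:00:00Z"}',
}

def test_parser_keeps_decisive_fields():
    """Fail loudly if any supported schema version drops a field detections rely on."""
    for version, raw in SAMPLES.items():
        parsed = parse(raw, version)
        missing = DECISIVE_FIELDS - {k for k, v in parsed.items() if v is not None}
        assert not missing, f"{version} lost decisive fields: {missing}"
```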

Operational Playbook:#

1) Establish health dashboards tied to paging for darkouts and parse drops.
2) Run “tracer bullets” (synthetic events) every 10–15 minutes and verify them end to end (a minimal sketch follows these steps).
3) Add chaos drills: inject delay and drop on a canary path; measure detection resilience.
4) Hold quarterly parser fire drills: flip optional fields and validate survivability.
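A minimal tracer-bullet loop under stated assumptions: send_to_ingest() and search_siem() are hypothetical hooks for your pipeline's intake and query API; the point is to emit a uniquely tagged synthetic event and confirm it becomes searchable within the latency budget.

```python
import time
import uuid
from datetime import datetime, timezone

# Hypothetical hooks: wire these to your pipeline's ingest endpoint and SIEM query API.
def send_to_ingest(event: dict) -> None: ...
def search_siem(query: str) -> list[dict]: ...

def tracer_bullet(timeout_s: int = 60, poll_s: int = 5) -> float | None:
    """Emit a uniquely tagged synthetic event and measure end-to-end latency.

    Returns the observed latency in seconds, or None if the event never
    became searchable within the timeout (page on that).
    """
    marker = f"tracer-{uuid.uuid4()}"
    sent = time.monotonic()
    send_to_ingest({
        "source": "synthetic-health",
        "marker": marker,
        "emitted_at": datetime.now(timezone.utc).isoformat(),
    })
    while time.monotonic() - sent < timeout_s:
        if search_siem(f'marker="{marker}"'):
            return time.monotonic() - sent
        time.sleep(poll_s)
    return None  # darkout or excessive latency on this path
```

The same harness doubles as a chaos drill: route the tracer through a canary path with injected delay or drop, and compare the observed latency against the SLO.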

Case Sketch: HealthCorp discovered 5% of critical servers were dark due to agent drift and blocked VLANs. Coverage for lateral movement was effectively 0% on those nodes. After instrumenting health checks, buffered pipelines, and tiered routing, blind spots closed and audit trails improved.

Where tools can help: Any stack can implement synthetic tracing and chaos. If you prefer an integrated harness, VanatorX includes them—but the principles stand on their own.