Postmortem: 14 minutes in eu-fra-1
On March 3rd at 09:14:22 UTC, packet loss between two zones in eu-fra-1 climbed to 38% and stayed there for 14 minutes. This post is the unedited postmortem we wrote internally. We're publishing it because we said we would, and because the cause is the kind of thing every operator should know about.
The summary: a configuration change to BFD multipath aggregation timers was deployed to the Frankfurt route reflectors. The change was correct on three of the four reflectors and one digit off on the fourth. That fourth reflector started declaring sessions down whenever a single-packet loss spike occurred, which caused route flapping that propagated into the FIB. Customer-visible symptoms were intermittent timeouts and slow responses for traffic transiting between zones 1a and 1c.
We owe affected accounts an SLA credit; those credits were applied automatically by March 5th. If yours wasn't, please email hello@arveon.space and we'll backdate it.
Timeline
All times in UTC. Annotations after // are commentary added during this writeup.
09:11:00 Change CR-2026-3148 deployed to reflectors fra-rr-{1,2,3,4}. BFD multipath timer adjusted from 1500ms to 150ms.
09:14:22 fra-rr-4 declares 28 BFD sessions down within 200ms.
09:14:24 Routes withdrawn; flap-damping kicks in but with a too-low penalty.
09:14:30 Customer impact begins. Inter-AZ p99 latency: 2.4s.
09:15:11 PagerDuty alert: "fra anycast egress: 38% loss"
09:18:42 On-call (yours truly) acks. // I had logged off at 09:09 to brew coffee.
09:21:03 Hypothesis 1 (line card): ruled out. No card alarms.
09:24:18 Hypothesis 2 (peering point): ruled out. DE-CIX healthy.
09:25:50 Hypothesis 3 (config drift): confirmed via etcd diff.
09:26:14 Rollback applied to fra-rr-4.
09:28:02 All BFD sessions stable. p99 latency back to 0.42ms.
09:42:00 All customer traffic confirmed healthy. Incident closed.
14 minutes from impact to recovery. Three of those minutes are the on-call ack delay, which I'll come back to.
Root cause
Our BFD multipath timer was set to 1500ms across the fleet: generous by network standards, but a deliberate choice to absorb microbursts at our peering points. The intent of CR-2026-3148 was to bring it down to a more aggressive 150ms on route reflectors, where microbursts shouldn't matter and faster fault detection should help.
The change was rolled out via an Ansible playbook that templates a Jinja file. The playbook ran cleanly on three of four hosts. On the fourth, an unrelated copy-paste mistake from the previous quarter had introduced a stray {{ inventory_hostname }} in the multipath.j2 template, where it should have been {{ multipath_timer_ms }}. The rendered config ended up with a 15ms timer, not 150ms. (How fra-rr-4 stringified to 15 is a separate, sad story. TL;DR: hostname-based routing in our fleet's config has trailing-digit overflow guards that are too clever.)
The proximate cause was a too-aggressive BFD timer. The deeper cause was that we treated the templating layer as static enough not to test, on a hot path that was being changed.
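A post-render sanity check is one cheap way to test the templating layer. The sketch below is illustrative only: the config line syntax (`bfd multipath-timer <n>ms`), the function name, and the allow-list are assumptions made for this example, not our production tooling. It rejects any rendered config whose timer is not one of the two values named in this post.

```python
import re

# Hypothetical rendered-config syntax; the real format is not shown in this post.
TIMER_LINE = re.compile(r"^bfd multipath-timer (\d+)ms$", re.MULTILINE)

# Assumed policy: only the fleet default and the new reflector value are valid.
ALLOWED_TIMERS_MS = {150, 1500}


def check_rendered_timer(host: str, rendered: str) -> int:
    """Return the timer in ms, or raise ValueError if it is missing or unexpected."""
    match = TIMER_LINE.search(rendered)
    if match is None:
        raise ValueError(f"{host}: no BFD multipath timer in rendered config")
    timer_ms = int(match.group(1))
    if timer_ms not in ALLOWED_TIMERS_MS:
        raise ValueError(f"{host}: unexpected BFD multipath timer {timer_ms}ms")
    return timer_ms
```

Run against a config rendered with a 15ms timer, a check like this would raise before anything reached the device.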
Why it took 14 minutes
Three things contributed to the 14-minute window, with diagnosis the largest share:
- Detection lag (1m 49s). Our anycast loss probe samples once per second and fires at 60s of sustained loss. We've since dropped that to 20s.
- Acknowledgement delay (3m 31s). The on-call (me) was AFK for a personal coffee break of ~6 minutes, longer than the page acknowledgement window. The escalation policy worked correctly, but we didn't have a secondary on-call configured for SEV-2s during that window.
- Diagnosis (7m 5s). We chased two wrong hypotheses before identifying the config drift. Each hypothesis took longer than it should have because our runbook for "BFD-induced flapping" lived in a Notion document that was last updated in 2024.
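To make the detection-lag item concrete, here is a minimal sketch of a sustained-loss alert fed by a once-per-second probe. The class name, the 5% loss threshold, and the helper function are assumptions for illustration; the real probe is not shown in this post. The point is only that the sustain window puts a floor under time-to-page.

```python
class SustainedLossAlert:
    """Fires once loss has stayed at or above a threshold for N consecutive samples."""

    def __init__(self, loss_threshold: float, sustain_samples: int):
        self.loss_threshold = loss_threshold    # assumed value, e.g. 0.05 for 5%
        self.sustain_samples = sustain_samples  # 60 before the incident, 20 after
        self._consecutive = 0

    def observe(self, loss_fraction: float) -> bool:
        """Feed one per-second sample; return True while the alert condition holds."""
        if loss_fraction >= self.loss_threshold:
            self._consecutive += 1
        else:
            self._consecutive = 0  # any healthy sample resets the window
        return self._consecutive >= self.sustain_samples


def seconds_until_alert(alert: SustainedLossAlert, samples):
    """Return the 1-based second at which the alert first fires, or None."""
    for second, loss in enumerate(samples, start=1):
        if alert.observe(loss):
            return second
    return None
```

With a steady 38% loss stream, a 60-sample window cannot fire before second 60, while a 20-sample window fires at second 20.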
What we changed
The fix for the immediate bug was three lines of YAML and shipped within an hour. The structural changes took longer.
Templating coverage
We added a CI step that renders every host's full network config from inventory and diffs it against a checked-in golden file. Every PR that touches templates now requires the golden file to be regenerated and reviewed. We landed this on March 7th.
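The shape of that render-and-diff step, as a rough sketch. It uses Python's standard-library `string.Template` as a stand-in for Jinja so the example is self-contained; the template text, file paths, and function names are hypothetical and not our actual CI code.

```python
import difflib
from string import Template

# Stand-in for multipath.j2; the real Jinja template is not shown in this post.
TEMPLATE = Template("bfd multipath-timer ${multipath_timer_ms}ms\n")


def render(host_vars: dict) -> str:
    """Render one host's config from its inventory variables."""
    return TEMPLATE.substitute(host_vars)


def diff_against_golden(host: str, rendered: str, golden: str) -> list:
    """Return unified-diff lines; an empty list means the render matches."""
    return list(difflib.unified_diff(
        golden.splitlines(keepends=True),
        rendered.splitlines(keepends=True),
        fromfile=f"golden/{host}.conf",      # hypothetical path
        tofile=f"rendered/{host}.conf",      # hypothetical path
    ))


def check_fleet(inventory: dict, goldens: dict) -> dict:
    """Return {host: diff lines} for every host whose render differs from golden."""
    failures = {}
    for host, host_vars in inventory.items():
        diff = diff_against_golden(host, render(host_vars), goldens[host])
        if diff:
            failures[host] = diff
    return failures
```

A CI job would fail the build whenever `check_fleet` returns a non-empty dict and print the diffs for review. Rendering every host, not a sample, matters here: the March 3rd bug only showed up on one of four reflectors.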
Probe sensitivity
The 60s sustained-loss alert threshold dropped to 20s. We also added a secondary alert that fires on BFD session-count delta > 10 within a rolling 5-minute window, independent of customer-visible loss. This would have paged us at 09:14:24, 47 seconds before the loss alert fired at 09:15:11.
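One possible reading of the session-count rule, sketched in Python: count BFD session-down events in a rolling window and alert when the count exceeds the limit. The class and method names are assumptions for this example, and the real alert may compute its delta differently (for instance, from periodic samples of the up-session count).

```python
from collections import deque


class SessionDeltaAlert:
    """Alerts when more than max_delta session-down events fall inside the window."""

    def __init__(self, max_delta: int = 10, window_s: float = 300.0):
        self.max_delta = max_delta
        self.window_s = window_s
        self._events = deque()  # timestamps (seconds) of recent session-down events

    def session_down(self, ts: float) -> bool:
        """Record one session-down event at time ts; return True if the alert fires."""
        self._events.append(ts)
        # Drop events that have aged out of the rolling window.
        while self._events and self._events[0] <= ts - self.window_s:
            self._events.popleft()
        return len(self._events) > self.max_delta
```

Fed 28 session-down events inside 200ms, as fra-rr-4 produced, this fires on the 11th event. Ordinary churn of one session drop per minute never trips it, because at most five events sit in the window at once.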
Secondary on-call for SEV-2
The escalation policy now requires a secondary on-call for SEV-2 incidents (previously: only SEV-1). Acknowledgement deadline dropped from 5 to 2 minutes. Yes, this is a bigger ask of the on-call rotation; we've also expanded that rotation by two engineers to absorb it.
Runbook reorganization
Our network runbooks now live in the same Git repo as the network code itself, with required reviewers from SRE on every change. The "BFD flapping" runbook was rewritten on March 11th and tested in a tabletop on March 18th.
Things I want to be clear about
- This wasn't caused by a peering issue, a hyperscaler dependency, or a hardware failure. It was a config bug we introduced.
- No customer data was lost or exposed. Connectivity was degraded for in-flight requests; idempotent operations succeeded on retry.
- SLA credits were applied automatically. If your account didn't receive one and you believe you were affected, email us. Receipts attached.
- The engineer who deployed CR-2026-3148 followed the change-management process correctly. The process didn't catch the bug. We've fixed the process. We have not punished the engineer.
What I'm taking from this
Two things, plainly.
First: config is code. We've said this for years and still didn't have a render-and-diff CI step on the most load-bearing template in the network. We do now.
Second: microbursts are real, and tight BFD timers are an attractive nuisance. Aggressive fault detection is a good idea on internal route reflectors, but only when paired with damping that's calibrated to actual link behavior. We've published our new BFD/damping config in a follow-up post if you operate something similar.
Thanks to the customers who reached out during the incident — your reports helped us correlate symptoms faster than our internal probes did. Thanks to the on-call who took over from me when I realized I was the bottleneck. And thanks to the small Frankfurt team who stayed late after the incident to write the runbooks we should have written months ago.
If you have questions or want to dig deeper into any of the technical details, my email is in the author bio below.
— Mikko