Device Offline Incident RCA
Incident date: 1 May 2026
Summary
Control Center generated a burst of false device-offline alerts after heartbeat data became stale. The stale heartbeat data was caused by repeated timeouts in Control Center queries to Microsoft Service Log Analytics.
Based on available operational evidence, devices and the Data Center remained fully operational throughout the incident window. The incident affected monitoring and alerting accuracy, not device or Data Center operation.
Impact
- Multiple devices were incorrectly marked as offline in Control Center.
- Device-offline notifications were generated and later resolved automatically.
- No evidence in the reviewed logs indicates an actual device fleet outage or Data Center outage.
Timeline
- 08:55 CEST: Uptime metric queries to Microsoft Service Log Analytics started timing out.
- 08:57 CEST: Monitoring status queries also started timing out.
- 09:43 CEST: Control Center began generating device-offline notifications.
- 10:40 CEST: Uptime metric queries recovered and fresh heartbeat data was received.
- 10:45 CEST: Devices were automatically marked online again and resolve notifications were generated.
Root Cause
Control Center treated missing fresh heartbeat telemetry as evidence that devices had disconnected. During the incident window, queries to Microsoft Service Log Analytics repeatedly timed out, so Control Center did not receive fresh uptime metrics for affected devices. Once the last known heartbeat data became stale, the offline detection logic classified multiple devices as disconnected and triggered device-offline alerts.
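The failure mode described above can be illustrated with a minimal sketch. It is not Control Center's actual code: the function name `classify` and the 30-minute `STALE_AFTER` threshold are hypothetical, since the real staleness threshold is not stated in this report. The sketch only shows why a staleness-only check cannot tell a disconnected device from a telemetry source that has stopped answering.

```python
from datetime import datetime, timedelta

# Hypothetical threshold; the value Control Center actually uses is not stated here.
STALE_AFTER = timedelta(minutes=30)


def classify(last_heartbeat: datetime, now: datetime) -> str:
    """Staleness-only check: it has no knowledge of whether the
    telemetry query that supplies heartbeats succeeded or timed out."""
    if now - last_heartbeat > STALE_AFTER:
        return "offline"
    return "online"


# Illustrative timestamps: the last heartbeat was read before the timeouts
# began, and no newer value could be fetched while queries were failing.
last_seen = datetime(2026, 5, 1, 8, 50)
status_during_incident = classify(last_seen, datetime(2026, 5, 1, 9, 43))
```

With these illustrative values the device is classified as offline even though nothing about the device itself has changed; only the age of the last retrieved heartbeat has grown.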
Resolution
The Microsoft Service Log Analytics queries recovered, fresh uptime metrics were received, and Control Center automatically resolved the affected device-offline states.
No device-side or Data Center recovery action was identified in the reviewed logs.
Follow-Up Actions
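- Improve resilience of offline detection when telemetry data is delayed or unavailable, so that a failed or timed-out telemetry query is not treated as evidence that a device has disconnected.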
- Investigate the cause of the Microsoft Service Log Analytics query timeouts with Microsoft.
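One possible direction for the first follow-up action is sketched below, purely as an assumption and not as an agreed design. The idea is to track when the telemetry query last succeeded and to report an "unknown" state, rather than "offline", whenever that query result is itself too old to trust. All names (`DeviceState`, `classify_device`) and both thresholds are hypothetical.

```python
from datetime import datetime, timedelta
from enum import Enum


class DeviceState(Enum):
    ONLINE = "online"
    OFFLINE = "offline"
    UNKNOWN = "unknown"  # telemetry source unavailable; no offline alert


# Hypothetical thresholds, chosen only for illustration.
STALE_AFTER = timedelta(minutes=30)    # heartbeat age that indicates a disconnect
QUERY_MAX_AGE = timedelta(minutes=5)   # how old a successful query may be


def classify_device(
    last_heartbeat: datetime,
    last_successful_query: datetime,
    now: datetime,
) -> DeviceState:
    # If the telemetry source has not answered recently, a stale heartbeat
    # says nothing about the device, so no offline decision is made.
    if now - last_successful_query > QUERY_MAX_AGE:
        return DeviceState.UNKNOWN
    # Judge staleness as of the last successful query, so the decision is
    # based only on data that was actually retrieved.
    if last_successful_query - last_heartbeat > STALE_AFTER:
        return DeviceState.OFFLINE
    return DeviceState.ONLINE


day = datetime(2026, 5, 1)
# Illustrative incident-like case: queries have been failing since before 08:55.
during_incident = classify_device(
    last_heartbeat=day.replace(hour=8, minute=50),
    last_successful_query=day.replace(hour=8, minute=54),
    now=day.replace(hour=9, minute=43),
)
```

Under these assumptions the incident-like case yields `DeviceState.UNKNOWN`, which could drive a single "telemetry unavailable" alert instead of one offline alert per device. A device whose heartbeat is stale in a fresh, successful query result is still classified as offline.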