2020-08-14 - Portal, API, SNAPMobile Outage (Resolved)
Event Description: Portal, API, SNAPMobile Outage
Event Start Time: 2020-08-14 5:05 EST
Event End Time: 2020-08-14 7:23 EST
RFO Issue Date: 2020-08-19
Affected Services
Some phones registered to Atlanta lost registration. Devices configured with SRV or UDP failed over. Devices manually registered to core1-atl did not regain registration.
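The difference between SRV-configured devices and devices pinned to a single host can be illustrated with a minimal sketch. It is not the logic used by any particular phone firmware. It assumes a simplified SRV selection that walks records in ascending priority order and ignores weights. The domain names, the backup host, and the `pick_registrar` helper are hypothetical; only the `core1-atl` label comes from this report.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class SrvRecord:
    priority: int  # lower value is tried first (RFC 2782)
    weight: int    # ignored in this simplified sketch
    port: int
    target: str


def pick_registrar(records: Iterable[SrvRecord],
                   is_reachable: Callable[[str], bool]) -> Optional[str]:
    """Return the first reachable target in ascending priority order, or None."""
    for rec in sorted(records, key=lambda r: r.priority):
        if is_reachable(rec.target):
            return rec.target
    return None


# Hypothetical records: the backup host and the domain are invented for illustration.
records = [
    SrvRecord(priority=10, weight=100, port=5060, target="core1-atl.example.net"),
    SrvRecord(priority=20, weight=100, port=5060, target="core2-backup.example.net"),
]

# Simulate the primary host being unavailable.
down = {"core1-atl.example.net"}


def reachable(host: str) -> bool:
    return host not in down


# A device using SRV sees both records and can move to the backup.
srv_choice = pick_registrar(records, reachable)

# A device manually pointed at one host has no alternative to try.
static_choice = pick_registrar(records[:1], reachable)
```

In this sketch the SRV-configured device selects the backup target, while the manually configured device gets no registrar at all until the primary host is reachable again.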
Event Summary
On July 8th, 2020, the ACS cluster crashed on three separate occasions, each causing a 30-second outage. Several phones lost registration and the ability to make or receive calls.
Event Timeline (All times 24-hour format, EST)
July 8th, 2020
- 14:19 ACS Cluster crashed
- 14:19 NOC team was notified
- 14:19 ACS Cluster Restored after 30 seconds
- 14:19 ACS Cluster crashed
- 14:19 NOC team was notified
- 14:19 ACS Cluster Restored after 30 seconds
- 14:20 NOC team was notified
- 14:20 First report identified by partner
- 14:21 ACS restored, along with device registration
- 14:22 Second report identified by partner
- 15:39 Tier 3 & 4 engineers began an investigation. A report was posted to Discord.
- 15:40 ACS Cluster crashed
- 15:40 NOC team was notified
- 15:40 ACS Cluster Restored after 30 seconds
Root Cause Analysis
While troubleshooting with our senior engineers, we determined that the crash occurred inside the RTP layer of the switch, specifically in an object called CRTPRelayTap. This suggests the crash was likely triggered while an audio tap was in use for audio monitoring.
The issue has been logged as a bug and will be corrected via a patch.
Future Preventative Action
The patch is undergoing testing and is due to be installed on 07/30/2020 during our maintenance window.
Update: The patch was applied on 07/30/2020 and confirmed to resolve the issue.