Elevated System Latency
Incident Report for Sertifi
Postmortem

On Wednesday, September 21, a production load balancer and an internal router began to exhaust their resources, causing dropped connections that increased latency on calls within our application. Our response was slowed because we first investigated whether the latency was attributable to an earlier production deployment or to new threat monitoring rules recently added to our firewall.

We iteratively turned off the threat monitoring rules to isolate any that were introducing latency. This brought some improvement but not full site recovery. We then elected to run a full resource cleanup job at 6:00 p.m. CDT, when site traffic is low, to reduce the risk of users experiencing delays while the load balancer rebooted. Following the cleanup job, the remaining latency cleared.

As a near-term mitigation, we are increasing the frequency of resource cleanup jobs on the impacted load balancer. The permanent fix is to complete a phased migration onto new, more performant load balancers. This work is actively in progress.

Posted Sep 23, 2022 - 14:07 CDT

Resolved
After monitoring the application through our high-volume periods today, we have validated that the changes made yesterday evening resolved the latency issues.
Posted Sep 22, 2022 - 14:36 CDT
Update
We implemented an additional change to a production load balancer yesterday evening to address the latency experienced on Wednesday, 9/21. We will keep this incident in the Monitoring state through the morning to validate that the application remains performant before declaring the incident resolved.
Posted Sep 22, 2022 - 09:10 CDT
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Sep 21, 2022 - 17:03 CDT
Investigating
We are seeing an increase in latency again and are investigating.
Posted Sep 21, 2022 - 16:44 CDT
Monitoring
After implementing a fix, site performance has returned to normal levels.
Posted Sep 21, 2022 - 14:57 CDT
Identified
We have identified a contributing issue and have implemented a fix.
Posted Sep 21, 2022 - 14:56 CDT
Investigating
We are experiencing higher system latency and are investigating.
Posted Sep 21, 2022 - 13:33 CDT
This incident affected: Web Portal and API.