On Tuesday, October 15th at approximately 6:58pm MST, our Blend Integration began processing events from Blend more slowly than normal. Features impacted were:
We identified and communicated the issue around 7:06pm MST and started troubleshooting. Our investigation found that our MongoDB cluster was maxed out on CPU and the system was not responsive. Historically, the cluster does not exceed 40% utilization (50-60% normalized process CPU) during peak times of the day.
Even though the cluster was not responsive, we continued to receive events / webhooks from Blend and captured 100% of them during the incident. We reset the cluster and resumed processing the queued events. We confirmed that all outstanding events / tasks were completed and that the service was processing events in real time at 9:26pm MST. In the days following the incident we worked with MongoDB engineers on a plan to reduce the likelihood of this occurring again. See below for actions taken and changes made since the incident.
Internal alerting systems need to be enhanced to account for this scenario.
SMS alerts sent when the following occurs
These alerts will allow us to triage or prevent an outage or performance incident before systems become non-responsive.
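As an illustration of this kind of early-warning alerting, the sketch below flags sustained high CPU before a cluster saturates. It is a minimal example under stated assumptions: the class name `CpuAlertMonitor`, the 60% threshold, and the three-sample window are all hypothetical and chosen only for illustration; they are not actual alert settings. Sending the SMS itself is left out.

```python
from collections import deque


class CpuAlertMonitor:
    """Signal an alert when CPU stays above a threshold for N consecutive samples.

    Both defaults are hypothetical values used only for illustration.
    """

    def __init__(self, threshold_pct: float = 60.0, sustained_samples: int = 3):
        self.threshold_pct = threshold_pct
        self.sustained_samples = sustained_samples
        # Keep only the most recent N samples; older ones fall off automatically.
        self._recent = deque(maxlen=sustained_samples)

    def observe(self, cpu_pct: float) -> bool:
        """Record one CPU sample and return True if an alert should fire."""
        self._recent.append(cpu_pct)
        window_full = len(self._recent) == self.sustained_samples
        return window_full and all(s > self.threshold_pct for s in self._recent)


if __name__ == "__main__":
    monitor = CpuAlertMonitor()
    for sample in [35, 70, 75, 80, 40]:
        if monitor.observe(sample):
            print(f"ALERT: CPU above {monitor.threshold_pct}% (latest sample {sample}%)")
```

Requiring several consecutive samples above the threshold, rather than a single spike, is one way to reduce false alarms while still alerting well before the cluster reaches 100% CPU.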
Queue separation / isolation / cleaning
The core queuing technology has been moved to a new cluster, isolating it from the primary cluster that is responsible for processing each event / task.
The queue collection will be maintained each week to keep its size to a minimum and keep queries against it fast.
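The weekly queue maintenance described above can be sketched as a filter that selects finished queue entries older than a retention window. This is a hedged example, not the actual maintenance job: the field names `status` and `completedAt`, the `"completed"` value, and the 7-day retention are hypothetical assumptions.

```python
from datetime import datetime, timedelta, timezone


def build_cleanup_filter(now: datetime, retention_days: int = 7) -> dict:
    """Build a MongoDB-style query matching completed queue entries past retention.

    Field names and the retention default are hypothetical, for illustration only.
    """
    cutoff = now - timedelta(days=retention_days)
    return {
        "status": "completed",            # only entries that finished processing
        "completedAt": {"$lt": cutoff},   # and finished before the cutoff
    }


if __name__ == "__main__":
    cleanup_filter = build_cleanup_filter(datetime.now(timezone.utc))
    print(cleanup_filter)
```

With PyMongo, a filter like this could be passed to a collection's `delete_many` method. Restricting the filter to completed entries means pending events are never removed by the cleanup.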
We are continuing to work with MongoDB and their executives on the product side to push for enhancements and preventative solutions, while also taking steps internally to reduce the risk of failures or performance issues.
We appreciate your business and are committed to continuing to improve the performance and stability of the Encompass / Blend Integration. If you have any questions or would like further details, please email [email protected].