Blend Integration Performance Issues

Incident Report for Mace Innovations

Postmortem

10/15/2019 - Blend Integration Performance Issues Postmortem

Overview

On Tuesday, October 15th at approximately 6:58pm MST, our Blend Integration began processing events from Blend more slowly than normal. Features impacted were:

Auto-push-loans
Auto-push-docs
Disclosure tracking updates
eConsent updates

We identified and communicated the issue around 7:06pm MST and started troubleshooting. Our investigation found that our MongoDB cluster was maxed out on CPU and the system was not responsive. Historically, the cluster does not exceed 40% utilization (50-60% normalized process CPU) during peak times throughout the day.

Even though the cluster was not responsive, we continued to receive events / webhooks from Blend and were able to capture 100% of them during the incident. We were able to reset the cluster and resume processing the queued events. At 9:26pm MST we confirmed that all outstanding events / tasks were completed and that the service was processing events in real time. In the days following the incident we worked with MongoDB engineers and came up with a plan to reduce the likelihood of this occurring again. See below for actions taken and changes made since the incident.

Changes Implemented

Internal alerting systems need to be enhanced to account for this scenario.
- SMS alerts sent when the following occurs
  - Normalized CPU goes above 80%
  - Disk write queue exceeds 20
- Alerts will allow us to triage/prevent an outage or performance incident before systems are non-responsive.
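
As a rough illustration of the alert conditions listed above, the following is a minimal sketch in Python. Only the two thresholds (80% normalized CPU, disk write queue of 20) come from this report; the `ClusterSample` type, its field names, and the `alerts_for` function are hypothetical and do not describe the actual monitoring system.

```python
from dataclasses import dataclass
from typing import List

# Thresholds taken from the alert conditions listed in this report.
CPU_ALERT_PERCENT = 80.0
DISK_WRITE_QUEUE_ALERT = 20.0


@dataclass
class ClusterSample:
    """One metrics reading for the cluster (hypothetical shape)."""
    normalized_cpu_percent: float
    disk_write_queue_depth: float


def alerts_for(sample: ClusterSample) -> List[str]:
    """Return one message per threshold that the sample exceeds."""
    alerts = []
    if sample.normalized_cpu_percent > CPU_ALERT_PERCENT:
        alerts.append(
            f"Normalized CPU at {sample.normalized_cpu_percent:.0f}% "
            f"(threshold {CPU_ALERT_PERCENT:.0f}%)"
        )
    if sample.disk_write_queue_depth > DISK_WRITE_QUEUE_ALERT:
        alerts.append(
            f"Disk write queue at {sample.disk_write_queue_depth:.0f} "
            f"(threshold {DISK_WRITE_QUEUE_ALERT:.0f})"
        )
    return alerts
```

In a setup like this, each returned message would be handed to whatever SMS gateway is in use. Both comparisons are strict, so a reading exactly at a threshold does not raise an alert.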
Queue separation / isolation / cleaning
- Core queuing technology has been moved to a new cluster to isolate it from the primary cluster that is responsible for processing each event / task.
  - Allows us to scale the queue and address issues independently of the primary cluster / worker.
- The queue collection will be cleaned up each week to keep its size to a minimum and keep queries against it fast.
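
The weekly queue maintenance described above could be sketched as follows, assuming it works by deleting completed tasks older than a retention window. That mechanism is an assumption, not something this report specifies. The field names `status` and `completed_at`, the seven-day window, and the function name are all hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def stale_task_filter(now: Optional[datetime] = None,
                      retention: timedelta = timedelta(days=7)) -> dict:
    """Build a MongoDB query filter matching completed tasks older than `retention`.

    The `status` and `completed_at` field names are assumptions for this sketch.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    cutoff = now - retention
    return {"status": "completed", "completed_at": {"$lt": cutoff}}


# With PyMongo, a filter like this could be passed to Collection.delete_many, e.g.:
#   queue_collection.delete_many(stale_task_filter())
# where `queue_collection` is a hypothetical handle to the queue collection.
```

Matching only completed tasks means pending or in-flight work is never removed by the cleanup, whatever its age.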

Closing

We are continuing to work with MongoDB and their executives on the product side to push for enhancements and preventative solutions, while also taking steps internally to reduce the risk of failures or performance issues.

We appreciate your business and are committed to continuing to improve the performance and stability of the Encompass / Blend Integration. If you have any questions or would like further details, please email [email protected].

Posted Oct 23, 2019 - 07:28 MDT

Resolved

This incident has been resolved. Should you have any questions or need help with a specific issue, please email [email protected].

Posted Oct 15, 2019 - 21:26 MDT

Monitoring

Performance issues have been resolved and queued tasks are beginning to catch up. We are continuing to monitor systems closely. Another update will be provided once the queue(s) for all clients have been fully processed.

Posted Oct 15, 2019 - 21:00 MDT

Identified

We have identified the cause and are working toward resolution.

Posted Oct 15, 2019 - 20:06 MDT

Investigating

We are currently investigating this issue.

Posted Oct 15, 2019 - 19:06 MDT

This incident affected: Blend (Integration).