Blend Integration - Partial Outage
Incident Report for Mace Innovations
Postmortem

Overview

The full timeline of the outage and status updates can be found here.

https://status.maceinnovations.com/incidents/rjvhsvqskyw8

Throughout the day yesterday, there were periods when certain parts of our Blend integration were not functioning in a timely manner, or at all. We worked closely with our database vendor, MongoDB, and their MongoDB Atlas support team, as the issue was related to our Blend MongoDB cluster. The cluster has 3 nodes: a Primary, a Secondary, and another Secondary for backup. Due to increased volume and a particular error that started occurring, the primary node was getting overloaded and failing over to the secondary. The same thing then happened to the secondary; it usually took an hour or two for it to reach max capacity. While the nodes were bouncing back and forth and failing over, Blend events such as eConsent, sending disclosures, and importing new loan applications were processing successfully for periods of time, but not in near real-time as usual.

As the day progressed, we decided to upgrade the cluster, essentially “throwing hardware at it” until we found the root cause. During that time, we temporarily paused events coming in from Blend to give the cluster time to recover. The cluster recovered with the increased capacity, and we started processing the backlog of events and got mostly caught up. However, as we were catching up, what we now know to be the root cause put the cluster back into the same position: it started failing over again, and we were back where we had been before. It took longer to reach that point this time because of the increased capacity/hardware.

MongoDB, the database engine, uses what's called an OPLOG: a log of every update, insert, or delete that happens within any collection or document in MongoDB. For more information and details on the OPLOG, see https://docs.mongodb.com/manual/core/replica-set-oplog/. Due to the error below (the root cause), the OPLOG was growing extremely quickly, at over 500GB per hour, which effectively crashed the nodes within the cluster.

Our database triggers were trying to insert a document that was over 16MB, which exceeds MongoDB's maximum document size, and kept retrying endlessly. This caused the OPLOG to grow at a very fast rate.
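To illustrate the failure mode, here is a minimal sketch of a guard that rejects an oversized document up front and caps retries for everything else. It is not our production trigger code. The names `insert_with_guard`, `approx_size_bytes`, and `MAX_ATTEMPTS` are hypothetical, the `insert` callable stands in for whatever actually writes to the database, and the JSON-encoded length is only a rough proxy for the real BSON size.

```python
import json

MAX_DOC_BYTES = 16 * 1024 * 1024  # MongoDB's documented BSON document size limit
MAX_ATTEMPTS = 3                  # hypothetical retry cap, chosen for illustration


class DocumentTooLarge(Exception):
    """The document can never be inserted, so retrying is pointless."""


def approx_size_bytes(doc):
    # Assumption: JSON length as a rough stand-in for BSON size.
    # A driver's BSON encoder would give the exact figure.
    return len(json.dumps(doc).encode("utf-8"))


def insert_with_guard(doc, insert, max_attempts=MAX_ATTEMPTS):
    """Insert `doc` through the caller-supplied `insert` callable.

    Oversized documents are rejected before any attempt is made;
    other failures are retried at most `max_attempts` times.
    """
    size = approx_size_bytes(doc)
    if size > MAX_DOC_BYTES:
        raise DocumentTooLarge(
            f"document is ~{size} bytes, limit is {MAX_DOC_BYTES}"
        )

    last_error = None
    for _ in range(max_attempts):
        try:
            return insert(doc)
        except Exception as exc:  # sketch only; real code would catch narrower errors
            last_error = exc
    raise RuntimeError(
        f"insert failed after {max_attempts} attempts"
    ) from last_error
```

The point of the design is that a permanent failure (a document that is too large) is separated from a transient one, so only the latter is retried, and even then only a bounded number of times.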

We resolved the issue with the documents and added extra capacity to the cluster (25x more than usual), because we knew we needed to get through the backlog of events quickly. Things are processing as usual today without issue, and we are back to the same capacity we had before this issue arose.

Learnings and next steps

To prevent this from happening again, we added a new alert to the more than 20 alerts that already monitor the health of the cluster. It fires when the OPLOG grows faster than what we consider reasonable over a period of 1 hour. Also, as we continue to build new features and functionality, we will be putting in checks (a software solution) that minimize the risk of this error occurring.
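The logic behind an alert of this kind can be sketched as follows. This is an illustration only, not the alert we configured. It assumes we have periodic samples of the cumulative bytes written to the OPLOG; the function names and the threshold value in the usage example are hypothetical.

```python
from datetime import datetime, timedelta

GB = 1024 ** 3


def growth_over_window(samples, window=timedelta(hours=1)):
    """Bytes grown between the oldest sample inside the window and the newest.

    `samples` is a list of (timestamp, cumulative_bytes) sorted by timestamp.
    """
    if len(samples) < 2:
        return 0
    latest_time, latest_size = samples[-1]
    cutoff = latest_time - window
    # Keep only samples that fall inside the trailing window.
    in_window = [(t, s) for t, s in samples if t >= cutoff]
    oldest_size = in_window[0][1]
    return latest_size - oldest_size


def should_alert(samples, threshold_bytes, window=timedelta(hours=1)):
    """True when growth over the trailing window exceeds the threshold."""
    return growth_over_window(samples, window) > threshold_bytes


# Usage with made-up numbers: 600GB written in one hour against a
# hypothetical 500GB/hour threshold.
start = datetime(2020, 3, 18, 12, 0)
example = [
    (start, 0),
    (start + timedelta(minutes=30), 10 * GB),
    (start + timedelta(minutes=60), 600 * GB),
]
alert = should_alert(example, threshold_bytes=500 * GB)
```

Measuring growth over a trailing window rather than a single reading keeps a brief burst from firing the alert while still catching sustained runaway writes.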

Closing

We appreciate your business and are committed to improvements that will ensure an outage of this nature doesn’t happen again. If you have any questions or would like further details, please email [email protected].

Posted Mar 19, 2020 - 14:52 MDT

Resolved
We have identified the root cause and resolved the issue. We will monitor this integration closely throughout the night and tomorrow. An incident post-mortem will be completed within 24 hours.
Posted Mar 18, 2020 - 23:33 MDT
Update
All loan applications that originated in Blend have been synced to Encompass at this time. We are processing new applications as they come in, and there is no queue of applications to process at this time. Disclosures (sending to Blend) and Encompass changes pushed to Blend are also caught up and processing in near real-time. eConsent, signed disclosures (sent to Encompass), and documents (synced to Encompass) still have a queue and are not near real-time yet. We are monitoring the queue and will provide additional updates as the queue is processed.
Posted Mar 18, 2020 - 19:51 MDT
Update
During the time since the last update, we processed a significant amount of the events. We are continuing to monitor and process the backlog of events.
Posted Mar 18, 2020 - 15:50 MDT
Monitoring
The integration is operating now and we are starting to process the queue of tasks and events. You will start to see disclosures being sent to Blend, eConsent syncing, and loan applications being imported over the next 30-60 minutes. We have roughly 20,000 events to catch up on from Blend. No data was lost, and this doesn't require our customers to do any additional work; we just ask for your patience. We will provide additional updates as we catch up.
Posted Mar 18, 2020 - 13:35 MDT
Update
We are continuing to work on a fix for this issue.
Posted Mar 18, 2020 - 12:52 MDT
Identified
We have identified the issue and are working towards a resolution.
Posted Mar 18, 2020 - 12:18 MDT
This incident affected: Blend (Integration).