The full timeline of the outage and status updates can be found here:
https://status.maceinnovations.com/incidents/rjvhsvqskyw8
Throughout the day yesterday, there were periods when certain parts of our Blend integration were not functioning in a timely manner, or at all. We worked closely with our database vendor, MongoDB, and their MongoDB Atlas support team because the issue was related to our Blend MongoDB cluster. The cluster has 3 nodes: a Primary, a Secondary and another Secondary for backup. Due to increased volume and a particular error that started happening, the primary node was getting overloaded and failing over to the secondary. The same thing then happened to the secondary after a period of time; it usually took an hour or two to reach max capacity. While the nodes were bouncing back and forth and falling over, Blend events such as eConsent, sending disclosures and importing new loan applications were processing successfully for periods of time, but not in near real-time as usual.
As the day progressed we decided to upgrade the cluster, essentially “throwing hardware at it” until we could find the root cause. During that time we temporarily paused events coming in from Blend to give the cluster time to recover. The cluster recovered and, with the increased capacity, we started processing the backlog of events and got mostly caught up. However, as we were catching up, what we now know to be the root cause put the cluster into the same position: it started failing over again and we were back where we were before. It took longer to get there this time because of the increased capacity/hardware.
MongoDB, the database engine, uses what's called an OPLOG: it's a log of every update, insert or delete that happens within any collection or document in MongoDB. For more information and details on the OPLOG, see https://docs.mongodb.com/manual/core/replica-set-oplog/. Due to the error below (the root cause), the OPLOG was growing at over 500GB per hour, which effectively crashed the nodes within the cluster.
Our database triggers were trying to insert a document that was over 16MB, which is MongoDB's maximum document size, into a collection, and kept retrying endlessly. This caused the OPLOG to grow at a very fast rate.
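To illustrate the kind of failure described above, here is a minimal sketch of an insert guard that rejects oversized documents up front and caps the number of retries. It is not our production trigger code. The names `insert_with_bounded_retry`, `insert_fn`, `approximate_size` and `DocumentTooLargeError` are hypothetical, and the JSON-based size estimate is an assumption standing in for a measurement of the real encoded document.

```python
import json

# MongoDB's documented maximum size for a single document.
MAX_DOCUMENT_BYTES = 16 * 1024 * 1024


class DocumentTooLargeError(ValueError):
    """The document can never be inserted, so retrying is pointless."""


def approximate_size(document):
    # Assumption: JSON length stands in for the encoded size;
    # a real check would measure the encoded BSON document.
    return len(json.dumps(document, default=str).encode("utf-8"))


def insert_with_bounded_retry(insert_fn, document, max_attempts=3,
                              limit=MAX_DOCUMENT_BYTES):
    """Call insert_fn(document), retrying only transient errors.

    insert_fn is a hypothetical callable that performs the write.
    """
    size = approximate_size(document)
    if size > limit:
        # Permanent failure: fail fast instead of retrying forever.
        raise DocumentTooLargeError(
            f"document is about {size} bytes, limit is {limit}")

    last_error = None
    for _ in range(max_attempts):
        try:
            return insert_fn(document)
        except ConnectionError as exc:
            # Treated as transient here; a real system would also back off.
            last_error = exc
    raise last_error
```

The design choice is to separate permanent failures (a document that is too large will never fit) from transient ones, so that only the latter are retried, and only a fixed number of times.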
We resolved the issue with the documents and added extra capacity to the cluster (25x more than usual) because we knew we needed to get through the backlog of events quickly. Events are processing as usual today without issue, and we are back to the same capacity we had before this issue arose.
To prevent this from happening again, we added a new alert to the more than 20 alerts that already monitor the health of the cluster. It fires when the OPLOG grows faster than what we consider reasonable over a period of 1 hour. In addition, as we continue to build new features and functionality, we will put checks in place (a software solution) to minimize the risk of this error occurring.
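As a rough sketch of how a growth-rate alert like this can work, the example below compares OPLOG growth over the trailing hour against a threshold. It is an illustration only, not the alert we configured. `OplogSample`, `growth_over_window` and `should_alert` are hypothetical names, the samples are assumed to be sorted by time, and the threshold is whatever value is considered reasonable for the cluster.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OplogSample:
    timestamp_s: float   # when the sample was taken, in seconds
    bytes_written: int   # cumulative bytes written to the OPLOG (assumed metric)


def growth_over_window(samples, window_s=3600.0):
    """Bytes written between the oldest sample in the window and the latest."""
    if len(samples) < 2:
        return 0
    latest = samples[-1]
    cutoff = latest.timestamp_s - window_s
    # Samples are assumed sorted by timestamp, oldest first.
    in_window = [s for s in samples if s.timestamp_s >= cutoff]
    return latest.bytes_written - in_window[0].bytes_written


def should_alert(samples, threshold_bytes, window_s=3600.0):
    """True when growth over the trailing window exceeds the threshold."""
    return growth_over_window(samples, window_s) > threshold_bytes
```

For example, with samples taken every 30 minutes and a threshold of 500GB, an hour in which 600GB were written would trigger the alert, while an hour with 100GB would not.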
We appreciate your business and are committed to improvements that will ensure an outage of this nature doesn’t happen again. If you have any questions or would like further details, please email [email protected].