Blend Integration Down

Incident Report for Mace Innovations

Postmortem

10/3/2019 - Blend Integration Outage

Overview

On Thursday, October 3rd, at approximately 10:50am EST, our solution stopped processing the following within our Blend Integration:

Auto-push-loans
Auto-push-docs
Disclosure tracking updates
eConsent Updates
All other Blend webhook events

We identified the issue around 11:24am EST. However, we also identified a separate issue with a single client, and spent the next few hours determining whether the outage was impacting all Blend Integration customers or was isolated. Once we determined the scope of the impact, we reported the outage via our status center at 1:51pm EST.

While troubleshooting, it became clear that a code change had been made within the service provider's platform that impacted all aspects of the Blend Integration, specifically the updating of tasks within the queue.

A ticket was issued to the service provider, escalated through our account manager, phone calls, and email, and upgraded to S1 Production Down severity. Incident response from the service provider:

On Friday we rolled out a release which impacted application functions using updateOne running as System users
Initial diagnosis led us to believe the reported behavior was a result of a fix for a bug where a typically non-permitted command was allowed against the database service
Subsequent investigation led us to identify the underlying issue and we rolled back the code impacting your application.
To rectify this we are adding testing for this case and will default to rolling back updates more quickly if a similar case arises. We have also committed to providing a testing environment. This will allow you to continually test against new application code and help us identify any issues pre-release.

During the incident we tested various application code changes in an isolated instance to try to provide a temporary workaround and restore service. None of those attempts were successful, and we had to wait until the service provider rolled back the code change.

Around 7:00pm EST the code was successfully reverted and we started to see events being processed again. Approximately 13,000 events had been captured during the outage and still needed to be processed. Around 7:30pm EST all of those events had been processed and we verified we were caught up. We then took some time to verify our findings and ensure we were processing all incoming events in real time. We communicated restoration of the Blend Integration at 7:52pm EST.
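The distinction drawn above between "stopped processing", "working through the backlog", and "caught up" can be expressed as a simple queue check. The following is only a minimal sketch under stated assumptions: `QueueSnapshot`, `queue_status`, and the five-minute `max_idle` threshold are hypothetical names and values chosen for illustration, not part of the integration or of the service provider's platform.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class QueueSnapshot:
    """Hypothetical point-in-time view of the webhook event queue."""
    taken_at: datetime
    pending: int                            # events captured but not yet processed
    last_processed_at: Optional[datetime]   # None if nothing has been processed yet


def queue_status(snap: QueueSnapshot,
                 max_idle: timedelta = timedelta(minutes=5)) -> str:
    """Classify the queue as 'caught_up', 'draining', or 'stalled'.

    The threshold is an assumption; a real deployment would tune it.
    """
    if snap.pending == 0:
        return "caught_up"
    if snap.last_processed_at is None:
        return "stalled"
    idle = snap.taken_at - snap.last_processed_at
    return "stalled" if idle > max_idle else "draining"


# Illustrative values only, loosely following the timeline in this report.
during_outage = QueueSnapshot(
    taken_at=datetime(2019, 10, 3, 18, 0),
    pending=13_000,
    last_processed_at=datetime(2019, 10, 3, 10, 50),
)
print(queue_status(during_outage))  # stalled
```

A check of this shape covers both directions: it flags a queue that holds events but has gone idle, and it gives a concrete definition of "caught up" before restoration is announced.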

Learnings and next steps

A few areas need immediate attention, and in some cases the changes are already implemented.

Internal alerting systems need to be enhanced to account for the scenarios that occurred during this outage. While we do have extensive automated and manual tests and reviews, additional alerts and logging will be added to improve how quickly we can diagnose issues.
Partnership with service provider
- Faster escalation channel(s)
  - We can now fast-track any issue we may be having to the VP of Product Development for the application platform they provide.
    - Slack
    - Email
- Code changes and reverting
  - They have committed to improving their internal process to improve the speed at which they can identify and roll back code changes.
  - We now have access to their development code branch / testing environment.
    - This will allow us to write automated tests that will be scheduled to run multiple times per day.
    - In the event of a breaking code change we can identify and escalate through the product manager and halt any releases from going into production that could impact our integration.
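The scheduled testing described in the list above could take roughly the following shape. This is a hedged sketch rather than the actual test suite: `run_checks`, `run_on_schedule`, and `update_task_as_system_user` are hypothetical names, and the check body is a stub standing in for a real call against the provider's testing environment.

```python
import time
from typing import Callable, List, Tuple

# A check is a name plus a zero-argument callable returning True on success.
Check = Tuple[str, Callable[[], bool]]


def run_checks(checks: List[Check]) -> List[str]:
    """Run every check and return the names of those that failed.

    A check that raises is treated the same as one that returns False.
    """
    failures = []
    for name, check in checks:
        try:
            ok = check()
        except Exception:
            ok = False
        if not ok:
            failures.append(name)
    return failures


def run_on_schedule(checks: List[Check],
                    escalate: Callable[[List[str]], None],
                    interval_seconds: float,
                    runs: int,
                    sleep: Callable[[float], None] = time.sleep) -> None:
    """Run the checks `runs` times, escalating whenever any check fails."""
    for i in range(runs):
        failures = run_checks(checks)
        if failures:
            escalate(failures)
        if i < runs - 1:
            sleep(interval_seconds)


def update_task_as_system_user() -> bool:
    # Stub (assumption): a real check would update a queued task in the
    # provider's testing environment and verify that the write took effect.
    return True


# One immediate run; escalation here is just print, as a placeholder.
run_on_schedule([("update_task", update_task_as_system_user)],
                escalate=print, interval_seconds=0, runs=1)
```

Passing `sleep` and `escalate` in as parameters keeps the loop testable without waiting on a real clock or paging anyone; in practice the scheduling would more likely be handed to cron or a CI scheduler than to a long-running loop.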

Closing

We appreciate your business and are committed to improvements that will ensure an outage of this duration doesn’t happen again. If you have any questions or would like further details, please email [email protected].

Posted Oct 07, 2019 - 17:10 MDT

Resolved

All events have been processed at this time. Items such as auto pushed docs, eConsent, disclosure tracking and auto push of loans are current and Encompass should reflect this now. We are continuing to monitor the solution closely and will be working with our service provider over the next few days to outline and provide an incident post mortem by EOB Monday, 10/7/2019.

If you have any loans that are missing any of the items mentioned above, please submit a support ticket using the mi.BlendIntegration form within Encompass.

Please email [email protected] should you need further details, have questions or concerns about today's outage before the incident post mortem is delivered.

Posted Oct 03, 2019 - 17:52 MDT

Monitoring

The service provider has deployed a fix and our service is now working through the backlog of Blend events sent to us for processing. During the outage, we captured events and data throughout the day so no data should be lost. We currently have about 13,000 events in the queue to process. We will update everyone again once we are caught up. Items such as auto pushed docs, eConsent, disclosure tracking and auto push of loans are being processed now.

Posted Oct 03, 2019 - 17:09 MDT

Update

We are continuing to work towards resolution. The nature of this issue stems from a service provider that provides application and database services for our solutions. We are working closely with them to get resolution to this issue. We will provide further updates as they become available.

Posted Oct 03, 2019 - 15:45 MDT

Update

ETA approximately 1 hour or less to resolution. We are monitoring and will communicate more information as it becomes available.

Posted Oct 03, 2019 - 13:15 MDT

Update

We are continuing to work towards resolution. We do not have an ETA yet but will provide one as more information becomes available.

Posted Oct 03, 2019 - 12:47 MDT

Update

We are continuing to work on a fix for this issue.

Posted Oct 03, 2019 - 11:52 MDT

Identified

We identified an issue that has impacted our Blend integration. We are working towards resolution and additional updates will be provided as they become available.

Posted Oct 03, 2019 - 11:51 MDT

This incident affected: Blend (Integration).