Data Collection and App Unavailable
Incident Report for Heap
Postmortem

Outage: On Friday, August 21st, Heap experienced an outage during which app functionality and data collection were unavailable. This outage lasted from 10:53am to 12:15pm PDT.

Root cause: We deployed a bad configuration change to our request proxying layer. Our deployment process is built to abort deploys that contain faulty configuration changes. However, due to a bug in our deployment tool, the bad configuration was allowed to roll out to our full fleet of web-tier machines.

We have modified our deployment process so that we are no longer impacted by this issue and are investigating further preventative action.
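
For readers unfamiliar with this kind of safeguard, the following is a minimal sketch of a staged rollout that aborts on a failed health check. It is an illustration of the general technique only, not a description of Heap's deployment tool, whose internals are not covered in this report. The names `staged_rollout`, `apply_config`, `is_healthy`, `rollback`, and `DeployAborted` are all hypothetical.

```python
from typing import Callable, List, Sequence


class DeployAborted(Exception):
    """Raised when a rollout stops because a host failed its health check."""


def staged_rollout(
    hosts: Sequence[str],
    apply_config: Callable[[str], None],  # hypothetical: pushes the new config to one host
    is_healthy: Callable[[str], bool],    # hypothetical: checks the host after the change
    rollback: Callable[[str], None],      # hypothetical: restores the previous config
    batch_size: int = 1,
) -> List[str]:
    """Apply a config change batch by batch, aborting on the first unhealthy batch."""
    updated: List[str] = []
    for start in range(0, len(hosts), batch_size):
        batch = list(hosts[start:start + batch_size])
        for host in batch:
            apply_config(host)
            updated.append(host)
        # Validate the batch before touching any more of the fleet.
        failed = [host for host in batch if not is_healthy(host)]
        if failed:
            # Undo every host changed so far, most recent first, then stop.
            for host in reversed(updated):
                rollback(host)
            raise DeployAborted(f"unhealthy after config change: {failed}")
    return updated
```

With a check like this in place, a faulty configuration would reach only the first batch of hosts before the deploy stops, rather than the whole fleet.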

Remediation:

We are able to reprocess the following requests made to Heap’s pixel endpoint in order to recover data:

  • All client-side web events
  • All iOS SDK events (SDK versions 6.8.1 and below)
  • iOS SDK sessions and pageviews (SDK versions 7.0.0 and above)
  • Android SDK events retried by the client after the outage window

Unfortunately we are unable to recover the following events:

  • iOS SDK non-session and non-pageview events (SDK versions 7.0.0 and above)
  • Events sent via server-side Track API; please ensure your system has retried all failed requests
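
For server-side senders, one way to retry failed requests is a loop with exponential backoff. The sketch below assumes your system has kept a list of the events that failed during the outage window. `send` is a hypothetical callable wrapping whatever client you already use to call the Track API, and `retry_failed_events` and its parameters are illustrative names, not part of any Heap library.

```python
import time
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]


def retry_failed_events(
    events: List[Event],
    send: Callable[[Event], None],  # hypothetical: raises an exception if the request fails
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Event]:
    """Resend each event with exponential backoff; return the ones that never succeeded."""
    undelivered: List[Event] = []
    for event in events:
        for attempt in range(max_attempts):
            try:
                send(event)
                break  # delivered, move on to the next event
            except Exception:
                # Broad catch for brevity; narrow this to your client's error types.
                if attempt + 1 < max_attempts:
                    # Wait 1x, 2x, 4x, ... the base delay between attempts.
                    sleep(base_delay * 2 ** attempt)
        else:
            # The loop finished without a successful send.
            undelivered.append(event)
    return undelivered
```

Any events the function returns still need attention, for example by logging them or queueing them for a later run.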

We are in the process of launching this data recovery effort, and expect it to be finished no later than Friday, September 4th. We will be sure to update you in the event of any changes to this timeline! Sincere apologies from the entire Heap team for any inconvenience this has caused.

Posted Aug 26, 2020 - 09:59 PDT

Resolved
App and data collection have been stable for over an hour. We will investigate the extent to which we’re able to recover data lost due to unavailability and reach out to customers directly with more information.
Posted Aug 21, 2020 - 13:54 PDT
Monitoring
We've identified a likely root cause, deployed a fix, and are monitoring to confirm resolution.
Posted Aug 21, 2020 - 12:15 PDT
Update
The team is about to deploy a potential fix. We will provide an update about the status of this fix in 15 minutes.
Posted Aug 21, 2020 - 11:57 PDT
Update
The team has identified a possible cause but is continuing to investigate the issue. We will provide an update in 20 minutes.
Posted Aug 21, 2020 - 11:36 PDT
Investigating
Data collection and the app are currently unavailable. The team is investigating, and will provide an update in 30 minutes.
Posted Aug 21, 2020 - 11:06 PDT
This incident affected: Data Collection and App.