It’s timely for us to write about this now, off the back of a global outage for Facebook last week (which included Instagram and the other platforms in its portfolio).
Behind all these apps and platforms is code, running on servers and other infrastructure. There’s a lot of code giving instructions: “when someone clicks this button, show them this content”. As we develop new features, new code is written, and sometimes that code has an error or causes an unintended result. Most of the time this is caught in testing, but if it slips through the cracks, it can impact live users.
Last week this happened to us, and it happened out of business hours, so our development team was not online. Fortunately, our amazing CTO Dan had coincidentally provided us with an escalation process (literally hours before). Our development team works so hard that we want to avoid waking them in a panic in the middle of the night unless it’s necessary, and when we do, we want them to have the information they need to resolve things as quickly as possible. This is how that night unfolded for Larissa:
10:45pm - Larissa receives a message from a user that the app is down
10:46pm - Checks the app to see if it is down for her too. It is.
10:47pm - Reviews the CODE RED process and starts the escalation process
10:50pm - Slacks CODE RED to our whole team… no answer
10:52pm - Texts our CTO and Architect… no answer
10:54pm - Stresses out and asks her family what she should do
10:55pm - Calls Sarah in New York
10:56pm - Speaks to our Architect Chris - he’s on it!!
10:58pm - Larissa lets our social community know everything is under control.
11:10pm - Chris has fixed the issue and all is good in the Mys Tyler world again.
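The timeline above is really just a loop: try each contact method in order until someone answers. Here’s a minimal sketch of that idea; the notify steps are simulated placeholders, not real Slack or SMS integrations.

```python
from typing import Callable

def escalate(steps: list[tuple[str, Callable[[], bool]]]) -> str:
    """Try each (name, notify) step in order; return who answered."""
    for name, notify in steps:
        if notify():
            return name
    return "no one answered: keep trying and post a status update"

# Simulated run matching the night above: Slack and texts go
# unanswered, the phone call gets through.
steps = [
    ("Slack CODE RED", lambda: False),          # 10:50pm, no answer
    ("Text CTO and Architect", lambda: False),  # 10:52pm, no answer
    ("Call Sarah in New York", lambda: True),   # 10:55pm, answered
]
print(escalate(steps))  # prints "Call Sarah in New York"
```

The point of writing the chain down in advance is exactly this: at 10:50pm you don’t want to be deciding who to contact next, you want to be working through a list.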
Part of being a startup is learning as we go and developing processes along the way. And as we grow our team, we need to get better at organizing our communications for easy reference and access.
Here’s how things work in general:
Testing - each time we create a new feature, it is tested by the engineer who built it, then it goes through QA (quality assurance) by another engineer. Then it’s deployed to a staging build where our team can test it. Through this process we pick up most bugs.
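As a toy illustration of that first layer, here’s the kind of self-check an engineer might write before handing a feature to QA. The `follower_count_label` function is hypothetical, not actual Mys Tyler code.

```python
def follower_count_label(count: int) -> str:
    """Format a count the way an app might display it, e.g. 1500 -> '1.5k'."""
    if count < 1000:
        return str(count)
    return f"{count / 1000:.1f}k"

# Simple self-checks the engineer runs before handing off to QA.
assert follower_count_label(950) == "950"
assert follower_count_label(1500) == "1.5k"
assert follower_count_label(12_340) == "12.3k"
```

QA and staging then repeat this kind of checking with fresh eyes and in an environment closer to production, which is how most bugs get caught before users ever see them.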
Monitoring - this is something we don’t have set up as robustly as we’ll need to in the future. This is where a big company will have a 24/7 NOC (network operations center) watching data and usage, looking out for anything unusual to investigate proactively. They’ll also take incoming support tickets and reactively investigate any issues.
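A full NOC is a big investment, but even a tiny script covers some of the gap. Here’s a minimal health-check sketch; the URL and the alerting hook are placeholders, and a real setup would run this on a schedule and page someone instead of printing.

```python
import urllib.request

def check_health(url: str, timeout: float = 5.0) -> bool:
    """Return True if the service responds with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers connection errors, DNS failures, timeouts
        return False

# A cron job or monitoring loop would call this regularly, e.g.:
# if not check_health("https://example.com/health"):
#     page_on_call()  # hypothetical alerting hook
```

Even something this simple would have flagged our outage before the first user message arrived.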
Triage - to fix the issue you need to understand it, and you also want to get a sense of the level of urgency. Is this impacting one customer, or everyone? Is it OS-dependent, country-dependent, etc.? Can we fix this ourselves, or does it require engineers? How big is the impact: can we fix it within the next week, or do we need to fix it now?
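Those questions can be rolled up into a rough severity call. This is just a sketch with made-up thresholds and labels, not our actual policy:

```python
def triage_severity(users_affected: int, total_users: int,
                    core_feature_down: bool) -> str:
    """Classify an incident: fix now, fix this week, or backlog."""
    share = users_affected / total_users if total_users else 0.0
    if core_feature_down or share > 0.5:
        return "CODE RED: fix now"
    if share > 0.05:
        return "HIGH: fix this week"
    return "LOW: add to backlog"

# The outage above: everyone affected, the whole app down.
print(triage_severity(10_000, 10_000, core_feature_down=True))
# prints "CODE RED: fix now"
```

The exact thresholds matter less than agreeing on them in advance, so whoever is awake at 10:45pm doesn’t have to make that judgment call alone.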
Escalation - follow the process to escalate to the right person, with the right information, to get it resolved as quickly as possible.
Resolution - some things (like our CODE RED above) can be fixed on the backend. If a fix requires an app update, we have to do a hotfix and release it to the App Store and Google Play; they need to review and approve any new update, so this can take a while.
This latest CODE RED was a reminder to ensure we have our processes documented and easily accessible to everyone on our team.
It was also a reminder that we have an incredible team who all jumped into action.
Larissa followed the escalation process, collected the information needed, and quickly whipped up some “App is down, don’t worry” graphics (thanks, Canva😜) and posted them on every social channel we have, to make sure our users were kept in the loop as well! Tip: don’t forget your users when the app is down! And Chris jumped straight on to fix it. Overall it was a little stressful, but the process was smooth and shows that we have a committed team willing to jump into action no matter the time of day.
Check out Larissa’s experience with this CODE RED situation here.