On Monday, April 20, 2020, from 10:45 AM until 7:20 PM Eastern Time (EDT), Kami users experienced a major incident as the result of a huge spike in demand for Kami’s services. Kami server APIs were largely unresponsive for much of this period.
The Kami service is now stable again with all features functioning, including real-time synchronization of annotations, automatic saving of Google Drive changes (nb: manually saving will continue to work), and email notifications of annotations. Our teams continue to work on different solutions to ensure that further disruption is minimized.
This was the most significant incident we have ever experienced at Kami, and we are deeply sorry for the confusion and disruption this caused many of our users around the world. We particularly apologize to those who are already engaged in the all-consuming and often fraught process of remote learning during school closures. Our mission is to support teachers and students during this period, and we sincerely regret adding any stress to your online teaching experience.
On behalf of the team here at Kami, we want to thank all our users for their patience and understanding, not only during this downtime but throughout this journey with remote learning.
Outage Timeline: What happened?
Monday, April 20
10:45 AM Eastern Time: Kami servers were overloaded as the result of an unexpectedly high spike in demand. Our users experienced up to 90-second delays in features involving server communication. This affected our main websites: api.kamihq.com, kamiapp.com, and blog.kamiapp.com. The Kami app automatically switched to offline mode in order to save and preserve any work being completed within Kami while the servers were unreachable. This situation lasted approximately 7 hours.
11:09 AM Eastern Time: Our engineers were alerted to the issue with auto-scaling and began to scale up servers manually. We quickly realized that the demand for our service had outstripped our COVID-19 projected auto-scaling server limits. We placed a request with our cloud infrastructure provider to lift all limits.
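To illustrate how a fixed auto-scaling ceiling turns a demand spike into an overload, here is a minimal sketch in Python. The per-server capacity, the server limit, and the request rates are invented numbers for illustration only; they are not Kami's actual figures, and the function names are hypothetical.

```python
import math


def desired_servers(requests_per_sec: float, capacity_per_server: float) -> int:
    """Number of servers needed to absorb the load, with no headroom."""
    return max(1, math.ceil(requests_per_sec / capacity_per_server))


def scale(requests_per_sec: float, capacity_per_server: float, max_servers: int) -> dict:
    """Clamp the desired server count to a provider-imposed limit."""
    wanted = desired_servers(requests_per_sec, capacity_per_server)
    granted = min(wanted, max_servers)
    # Load beyond the granted capacity is what users experience as delays.
    unserved = max(0.0, requests_per_sec - granted * capacity_per_server)
    return {"wanted": wanted, "granted": granted, "unserved_rps": unserved}


# Assumed numbers: 100 requests/sec per server, and a limit of 50 servers.
normal = scale(3_000, 100, max_servers=50)
spike = scale(12_000, 100, max_servers=50)
print(normal)
print(spike)
```

In this sketch the auto-scaler behaves correctly in both cases; during the spike it simply cannot be granted more servers than the limit allows, which is why the limit itself had to be lifted.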
12:50 PM Eastern Time: The limits were lifted and we began to scale up our infrastructure. Access to our blog (blog.kamiapp.com) and website (kamiapp.com) was immediately fully restored.
1:00 PM Eastern Time: We began to see the restoration of the rest of the Kami service. Although it was processing as many requests as fast as possible, it appeared slow or down to everyone because it continued to be heavily overloaded.
2:00 PM Eastern Time: Our engineers realized that one of our database services, the Redis cache, had become the new cause of service degradation. With the influx of requests from offline mode coming through, the datastore quickly exhausted available memory and in turn impacted upstream processes. We re-initialized the Redis server with double the memory.
3:00 PM Eastern Time: Our engineers began to disable services that rely on Redis in order to restore service.
To this end, we disabled our Google Drive auto-save feature, email notifications, and real-time annotation synchronization, and advised users to manually save their work. But this was not enough to prevent Redis from continuing to be overloaded.
We then also disabled the processing of attachments to annotations. This impacted new videos, screen recordings, audio and image annotations. These new annotations will be re-processed once our service is stabilized so that users will not lose these.
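The approach described above, switching off a feature while holding its work for later, can be sketched as a kill switch with a deferred queue. This is a simplified illustration in Python, not Kami's implementation: the flag names, the in-memory queue, and the function names are all assumptions made for the example.

```python
from collections import deque


class FeatureFlags:
    """In-memory kill switches (a real system would read shared configuration)."""

    def __init__(self, enabled):
        self._enabled = set(enabled)

    def disable(self, name):
        self._enabled.discard(name)

    def enable(self, name):
        self._enabled.add(name)

    def is_enabled(self, name):
        return name in self._enabled


# Hypothetical flag names mirroring the features mentioned in the timeline.
flags = FeatureFlags({"drive_autosave", "email_notifications",
                      "realtime_sync", "attachment_processing"})
deferred = deque()   # attachments waiting for the feature to come back
processed = []       # stands in for the real processing pipeline


def process_attachment(attachment_id):
    processed.append(attachment_id)


def handle_attachment(attachment_id):
    """Process now if the feature is on; otherwise remember it for later."""
    if flags.is_enabled("attachment_processing"):
        process_attachment(attachment_id)
        return "processed"
    deferred.append(attachment_id)
    return "deferred"


def reprocess_deferred():
    """Drain the deferred queue once the feature has been re-enabled."""
    count = 0
    while deferred and flags.is_enabled("attachment_processing"):
        process_attachment(deferred.popleft())
        count += 1
    return count
```

With the flag disabled, `handle_attachment` returns "deferred" and nothing is lost; after the flag is re-enabled, `reprocess_deferred` works through the backlog in the order it arrived.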
4:00 PM Eastern Time: Our engineers began working on moving Redis dependencies to our other datastore, Postgres, then on sharding the Redis datastore. Sharding allows us to separate out the different parts of code that rely on Redis onto different Redis servers. This means that even if one of the Redis servers goes down tomorrow, it won’t impact the other Redis servers.
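The isolation that this kind of sharding provides can be sketched as follows. This is an illustrative Python example under stated assumptions: `FakeRedis` is a stand-in class so the sketch runs without a server, and the feature names and shard names are hypothetical rather than Kami's actual layout.

```python
class FakeRedis:
    """Stand-in for a Redis client so the sketch runs without a server."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self._data = {}

    def set(self, key, value):
        if not self.up:
            raise ConnectionError(f"{self.name} is unavailable")
        self._data[key] = value

    def get(self, key):
        if not self.up:
            raise ConnectionError(f"{self.name} is unavailable")
        return self._data.get(key)


# Assumed mapping: each feature that relies on Redis gets its own server.
SHARDS = {
    "realtime_sync": FakeRedis("redis-realtime"),
    "email_notifications": FakeRedis("redis-notifications"),
    "drive_autosave": FakeRedis("redis-autosave"),
}


def client_for(feature):
    """Route a feature to its dedicated Redis server."""
    return SHARDS[feature]


# Simulate one shard failing; the other features keep working.
client_for("realtime_sync").up = False
client_for("email_notifications").set("pending:42", "annotation added")
print(client_for("email_notifications").get("pending:42"))
```

Routing by feature rather than by key hash is a deliberate choice in this sketch: it does not balance load evenly, but it limits the blast radius of an overloaded server to a single feature.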
6:00 PM Eastern Time: Our API response times began to return to normal levels again and will continue to stabilize over the next hour. Kami engineers have identified the highest-impact areas that can reduce the load on our datastore, and have begun working on these optimizations. Our goal is to release these changes overnight before the start of the next school day, Tuesday, April 21, 2020.
7:00 PM Eastern Time: Our engineers continue to develop and implement resilience improvements to our Redis datastore.
Tuesday, April 21
1:00 AM Eastern Time: We have reactivated real-time sync, auto Google Drive syncing, email notifications, and reprocessed any multimedia annotations. We have also rolled out optimizations to the app to improve our Redis reliability. Our engineering team should be online at 10am Eastern to monitor the ramp-up of activity.
11:00 AM Eastern Time: All systems stable.
More updates to come.
Key activities that were impacted by the outage:
[Resolved] Built-in ‘Turn In’ button for LMS integrations: Normally, this button enables students to submit their Kami assignments back to their LMS (Learning Management System), such as Google Classroom, Schoology, or Canvas. During downtime, students were unable to use this button to submit work. This issue has been resolved. Students can now go back to their Kami assignment, wait 2-3 minutes for all work to finish syncing, and then continue with the Turn In process.
[Resolved] Creating Kami assignments on Google Classroom: Normally, creating Kami assignments in Google Classroom is as easy as a click of a button. This feature stopped working during downtime but has since been fixed and is back to normal.
[Resolved] General slow performance: Using any part of the Kami service, including access to our blog (blog.kamiapp.com) and main website (kamiapp.com), became very slow, though it would eventually work. Any slowness should now be resolved.
[Resolved] Sign up and login issues: You may have noticed that signing in using Google resulted in a “Popup was blocked” message. We’re not 100% sure why this happened, but we noticed a spike in this issue for some users. Now that our service is restored, we believe this error should no longer occur.
[Resolved] Real-time syncing of Kami annotations: Normally, when multiple Kami users are viewing the same shared file, they can interact and see other annotations added in real-time. During downtime, this automatic feature stopped working, and annotations were not able to sync through our server. We have now resolved this issue.
[Resolved] Automatic saving (auto-save) to Google Drive: Normally, Kami will auto-save file edits to Google Drive roughly every 30 seconds. This feature was disabled during downtime, but Kami’s offline mode helped to hold any changes in your browser. Now that it’s resolved, please open Kami and all annotations will automatically sync back to your Google Drive. Users can still save their work manually by going to Save (disk icon in the menu bar) and clicking “Save Now.”
[Resolved] Processing of video, audio, and screen recording annotations: As we tried to stabilize Kami, we disabled the processing of these new annotations during the 7 hours of downtime. You may have noticed that the annotation exists, but Kami failed to load the content. We have reprocessed all of these attachments, so you should not encounter any issues moving forward.
If you continue to encounter any errors, please email us at firstname.lastname@example.org. To help us provide you with the best solution, please specify the error you’re seeing and provide us with screenshots or a screencast of the issue, if possible.