During August and September, Kami experienced a number of service disruptions and performance delays. Most of these incidents involved loading problems, slow performance, and errors caused by outages at our cloud storage providers.
We understand that these incidents have impacted our valued users, and we are deeply sorry to everyone affected, especially those already facing the challenges of this new kind of back-to-school. We sincerely regret adding any more stress to your teaching or learning experience.
Kami is committed to quickly and steadily improving our services to prevent further disruptions. Our team wants to reassure you that we are doing everything we can to make Kami as stable and reliable as possible. We promise to prioritize supporting the Kami community and to communicate quickly and transparently about any changes or similar events that may arise.
We thank you for your patience and continued support as we scale. We appreciate all that you do to help us amplify our mission and make Kami better each and every day.
The Kami Team
All dates and times are US/Pacific Time
September 16: Conversion Provider Outage
On Wednesday, September 16 at 7:24 am, our conversion provider experienced an outage that affected some Kami users’ ability to convert certain file types, including Word and PowerPoint documents; PDFs and images were unaffected. Our conversion provider reported that the issue lasted under two hours and was resolved at 9:19 am, after which Kami users were able to convert files again.
September 15: Issue Linked to Google Drive Service Disruption
On Tuesday, September 15 at 6:40 am, Google Drive had a service disruption that affected Kami users’ ability to access Kami files from their Google Drive. G Suite reported that affected users were unable to complete tasks that required accessing a Drive folder, including loading the Drive home page. G Suite reported that the Google Drive problem was resolved at 9:10 am, and Kami users’ Google Drive access was restored. Read more here: Google Drive Service Details
September 9: Kami Firewall Outage
Time and Description of Issue:
6:00 am – The remaining schools across the USA returned to class. Because many devices behind the same IP addresses were sending requests to Kami simultaneously, our web application firewall (WAF) interpreted the traffic as a Distributed Denial of Service (DDoS) attack and began automatically blocking specific IP ranges.
7:30 am – Our on-call team was alerted to unhealthy performance metrics and began diagnosing the issue.
7:35 am – We identified that the WAF was blocking schools and began rolling out firewall configuration changes to raise the DDoS detection thresholds.
7:55 am – The configuration changes took effect, and the amount of traffic being blocked dropped.
Root Cause Analysis:
This issue was caused by a sudden rise in application traffic as schools commenced classes, sharply increasing the number of Kami sessions.
Fix and Prevention:
To reduce the chances of the issue recurring and to minimize the impact of similar events, our engineers added WAF metrics to our monitoring system so that we are alerted to anomalies sooner. We have also added rules that allow known-good traffic through without triggering the DDoS thresholds that would cause it to be blocked.
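The report doesn’t describe the specific WAF rules, but the general shape of an allowlist combined with a rate threshold can be sketched in a few lines of Python. The thresholds, IP ranges, and lack of a time window below are simplifications invented for illustration; a real WAF (and Kami’s actual configuration) is far more involved:

```python
import ipaddress
from collections import defaultdict

# Hypothetical values -- Kami's real thresholds and allowlisted ranges are not public.
DEFAULT_THRESHOLD = 100          # requests allowed per IP for unknown traffic
TRUSTED_THRESHOLD = 10_000       # raised limit for known-good ranges
TRUSTED_RANGES = [ipaddress.ip_network("203.0.113.0/24")]  # e.g. a school district NAT

request_counts = defaultdict(int)

def allow_request(ip: str) -> bool:
    """Return True if the request should pass the DDoS filter."""
    addr = ipaddress.ip_address(ip)
    trusted = any(addr in net for net in TRUSTED_RANGES)
    limit = TRUSTED_THRESHOLD if trusted else DEFAULT_THRESHOLD
    request_counts[ip] += 1
    return request_counts[ip] <= limit
```

The key idea is that traffic from recognized school ranges is judged against a much higher threshold, so many students behind one NAT address no longer look like an attack.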
September 8: Issue Linked to Google Drive Service Disruption
On Tuesday, September 8 at 7:30 am, Google Drive had a service disruption that affected Kami users’ ability to access Kami files from their Google Drive. G Suite reported that affected users were able to access Google Drive, but were seeing error messages, high latency, and/or other unexpected behavior. G Suite reported that the Google Drive problem was resolved at 10:23 am, and Kami users’ Google Drive access was restored. Read more here: Google Drive Service Details
September 3: Kami Service Outage
Time and Descriptions of Issue:
7:05 am – Our systems recorded increased latency, with response times of up to 12 seconds.
8:30 am – A database performance issue caused a partial outage: HTTP 500 errors affected 25% of requests for 15 minutes. Our engineers were woken at 8:42 am (3:42 am NZST) and began working on the issue. Latency continued to increase.
9:00 am – Additional New Zealand team members were woken up to address the issue.
10:40 am to 11:03 am – The service ran slowly, with added latency of 20 to 45 seconds. Drawing on the previous downtime, we had increased our frontend timeouts so that even when the service was slow, Kami would eventually load.
11:03 am to 11:13 am – We saw a spike of 500 errors again affecting 10-25% of requests.
11:42 am – A fix was rolled out, and the database performance issue was immediately resolved.
Root Cause Analysis:
The root cause was a complex interaction involving a transaction lock spanning two databases, with the slower database holding back the production database. The increased request load from 10:40 am triggered a secondary issue: our content delivery network returned an error message for almost all requests when its request queues backed up, causing unexpected error messages throughout the entire app. We have prepared a fix that we will roll out should this happen again.
Fix and Prevention:
Kami engineers were alerted at 8:42 am (3:42 am NZST) and began investigating the issue. More New Zealand-based engineers were woken at 9:00 am (4:00 am NZST) to further investigate and mitigate the incident. To reduce the chances of the issue recurring and to minimize the impact of similar events, our engineers have taken or are taking the following actions:
- Added automatic retries for network requests that fail with 500 errors.
- Built Kami uptime status information into error dialogs and the help dropdown, making it easier to notify users of issues.
- Continued optimizing the main database and validating the “100x” database as a replacement for the existing production database.
- Improved Redis performance by sharding load across two Redis nodes.
- Replaced the monitoring service used across Kami infrastructure to provide earlier insight into issues.
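As an illustration of the automatic-retry item above, a retry wrapper with exponential backoff might look like the following sketch. The `send` callable and its `.status` attribute are hypothetical stand-ins for Kami’s real HTTP client, which the report doesn’t describe:

```python
import time

def retry_on_500(send, max_attempts=3, backoff=0.5):
    """Retry a request that returns HTTP 500, with exponential backoff.

    `send` is any zero-argument callable returning an object with a
    `.status` attribute (a stand-in for a real HTTP client call).
    """
    response = None
    for attempt in range(max_attempts):
        response = send()
        if response.status != 500:
            return response          # success or a non-retryable status
        if attempt < max_attempts - 1:
            time.sleep(backoff * (2 ** attempt))  # 0.5s, 1s, 2s, ...
    return response                  # give up after the final attempt
```

Retrying only on 500-class failures, with a capped number of attempts and growing delays, smooths over brief database hiccups without amplifying load during a sustained outage.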
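For the Redis sharding item, the usual client-side approach is to hash each key to deterministically pick one of the nodes. The node addresses below are made up for illustration; Kami’s actual sharding scheme is not described in the report:

```python
import hashlib

# Hypothetical node addresses -- two shards, as mentioned in the report.
NODES = ["redis-node-0:6379", "redis-node-1:6379"]

def node_for_key(key: str) -> str:
    """Pick a Redis node deterministically by hashing the key.

    Hashing (rather than, say, round-robin) guarantees that reads and
    writes for the same key always land on the same node.
    """
    digest = hashlib.sha1(key.encode()).digest()
    return NODES[digest[0] % len(NODES)]
```

Spreading keys this way roughly halves the load on each node, at the cost of resharding work if the node count ever changes (which consistent hashing or Redis Cluster’s hash slots address).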
August 19: Issue Linked to Google Drive Service Disruption
Starting at 8:55 pm on August 19, 2020, multiple G Suite and Google Cloud Platform products experienced errors, unavailability, and delivery delays. Most of these issues involved creating, uploading, copying, or delivering content. The disruption affected Kami users’ ability to access their files from Google Drive, which also impacted Kami’s integration with Google Classroom. Google reported that the issues were resolved at 3:30 am on August 20, 2020. Read more here: Google Drive Service Details
Multiple Incidents: Kami Reaching Google Quota Limit
Between August 17 and September 2, Kami experienced multiple incidents in which we exceeded our quota limit with Google Classroom and saw an elevated number of API errors. These events occurred on various dates and resulted in loading issues and partial service outages.
Fix and Prevention:
Each time we exceeded our quota, the Kami team was able to get in touch with the Google team to increase the limit. We have also requested substantially higher quota limits to reduce the chances of the issue recurring. Below are the dates and times of the incidents:
Google Classroom Quota Exceeded
- September 2 – 2:47 pm (lasted under 3 hours)
- August 26 – 2:58 pm (lasted under 2 hours)
- August 25 – 11:57 am (lasted under 2 hours)
- August 25 – 7:55 pm – 12:04 pm (August 26)
- August 18 – 1:18 pm – 3:39 pm
- August 18 – 5:17 pm – 12:00am (August 19)
- August 17 – 2:30 pm – 4:07 pm
Again, on behalf of the team at Kami, we thank you for bearing with us during these service disruptions. If you have any questions about this report, please reach out to our team at firstname.lastname@example.org