We had a pretty big incident today which resulted in the app being down for 4 hours, and some data being lost. I want to go over what happened, what was recovered, and what was sadly lost for good.
A month ago, we introduced entity recovery for boosted campaigns. This means that any deleted entity is hidden from the application, regardless of boosted status, and stays recoverable for up to 30 days. At 2 AM this morning, the first batch of entities flagged for deletion was permanently removed from the database.
What we did not catch during testing is that nested entities (for example, locations inside parent locations) would also be deleted by this script. The app faced the same problem two years ago when nested entities were introduced, but we no longer had it in mind while testing entity deletion.
However, this is only one part of the problem. The other part comes from how the nested trees work. Each campaign has its own "tree" of locations, organisations, families, etc., but the removal script did not take that into account. When deleting a location tree, it deleted all children that shared a similar identifying number, regardless of whether they were in the same campaign or not.
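For the more technically minded, here is a minimal Python sketch of the two problems described above. It is not Kanka's actual code: the `Entity` fields, the `tree_id` number that repeats across campaigns, and both function names are assumptions made up purely for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

RETENTION = timedelta(days=30)  # recovery window described above


@dataclass(frozen=True)
class Entity:
    id: int
    campaign_id: int
    tree_id: int  # hypothetical: the same number can appear in several campaigns
    deleted_at: Optional[datetime] = None  # set when the entity is flagged


def buggy_selection(entities: List[Entity], root: Entity) -> List[Entity]:
    # Matches on the tree number alone, so unflagged children and
    # rows from other campaigns are selected as well.
    return [e for e in entities if e.tree_id == root.tree_id]


def safe_selection(entities: List[Entity], root: Entity, now: datetime) -> List[Entity]:
    # Stays inside the root's campaign, and only picks rows that were
    # themselves flagged and have passed the recovery window.
    return [
        e for e in entities
        if e.campaign_id == root.campaign_id
        and e.tree_id == root.tree_id
        and e.deleted_at is not None
        and now - e.deleted_at >= RETENTION
    ]


now = datetime(2020, 1, 31, 2, 0)  # arbitrary illustrative date
flagged_at = now - timedelta(days=31)
rows = [
    Entity(1, campaign_id=1, tree_id=7, deleted_at=flagged_at),  # flagged parent
    Entity(2, campaign_id=1, tree_id=7),                         # live child
    Entity(3, campaign_id=2, tree_id=7),                         # other campaign
]

print([e.id for e in buggy_selection(rows, rows[0])])      # [1, 2, 3]
print([e.id for e in safe_selection(rows, rows[0], now)])  # [1]
```

In this toy data, the unscoped version removes a live child and another campaign's entity, while the scoped version only removes the one entity that was actually flagged.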
What we restored
We were able to restore the database as it was on Saturday the 2nd of May 2020 at 6:00 AM UTC. The reason we could not recover the backup from Sunday is that the removal script was set to run before the backup.
This morning, over the course of four hours, we manually restored the data that was deleted in the past 24 hours to the way it was on Saturday morning, rebuilding the various entity trees.
All in all, we were able to restore:
What was lost
Sadly, this brings us to what was lost. If an entity was created in the last 24 hours AND was impacted by the bug, it is lost. Any entity that was created earlier, edited during the last 24 hours, and inadvertently deleted has been reset to the state it was in 24 hours ago. This does not impact attributes, entity files, entity reminders, relations or entity abilities.
Looking at the difference between the backup and the live database, we can see the following entities were definitively lost:
What we'll change
To begin with, we have deactivated the entity deletion script for now, until a proper fix is ready. Secondly, when it is re-activated, it'll take place after the daily database backup, just in case.
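As a rough idea of what "after the backup" could look like once the script is re-activated, here is a small Python sketch. The function name and the two callables are hypothetical stand-ins, not our real maintenance code; the only point is that the removal step never runs unless the backup reported success first.

```python
from typing import Callable


def nightly_maintenance(
    run_backup: Callable[[], bool],  # hypothetical: returns True on success
    run_purge: Callable[[], int],    # hypothetical: returns rows removed
) -> str:
    # The backup always goes first; a failed backup cancels the purge,
    # so there is always a restore point taken before any removal.
    if not run_backup():
        return "backup failed, purge skipped"
    removed = run_purge()
    return f"backup ok, purged {removed} entities"


print(nightly_maintenance(lambda: True, lambda: 12))
print(nightly_maintenance(lambda: False, lambda: 12))
```

Passing the two steps in as callables keeps the ordering rule in one place and makes it easy to test without touching a real database.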
Regarding database backups, we are going to move to a new server provider as soon as we can. Our current one limits us to 1 backup a day, while the new one will allow us complete freedom. Database backups are always system intensive and impact all users while they are running, but we feel confident that with new servers we can provide at least 2 daily backups with minimal disruption.
Lastly, we will learn from this experience and change the way we test new features. For example, instead of having the entity removal script run after 30 days on the test server, we will have it run more frequently. We'll also set up bigger testing campaigns to make missing data easier to notice.
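One simple way to get a shorter cycle on the test server is to make the recovery window depend on the environment. The sketch below is only an assumption about how this could be done: the environment names, the 10-minute test value and the `retention_for` helper are all invented for illustration.

```python
from datetime import timedelta

PRODUCTION_RETENTION = timedelta(days=30)

# Hypothetical per-environment settings; the test value is an arbitrary example.
RETENTION_BY_ENVIRONMENT = {
    "production": PRODUCTION_RETENTION,
    "testing": timedelta(minutes=10),
}


def retention_for(environment: str) -> timedelta:
    # Unknown environments fall back to the longest window, so a typo
    # in the configuration can never shorten it by accident.
    return RETENTION_BY_ENVIRONMENT.get(environment, PRODUCTION_RETENTION)


print(retention_for("testing") < retention_for("production"))  # True
```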
Words cannot describe how awful we feel about this. Waking up to your emails and messages on Discord about something having gone wrong was by far the lowest point for Kanka this year. We do not underestimate how frustrating it must be for each and every one of you who has lost any data due to this issue, and we can only apologize deeply for all inconvenience caused.
While these teething issues are bound to happen for any growing software, we do not take your time or commitment lightly. We will continue to do our best to improve and to be fully transparent along the way.
Once again we are truly sorry, and we hope that this won't prevent you from enjoying Kanka.