Cloud Meltdown - Ready?

Cloud Meltdown refers to a scenario in which applications suffer widespread impact due to an outage of physical hardware, software issues, or other causes. It is well known that cloud vendors use commodity hardware that is prone to failure, and sometimes those failures are far more widespread than initially thought. Software is the heart of the cloud, and sometimes bugs and behaviors show up unexpectedly. In the last couple of years we have witnessed this happen many times. Here are some of the recent ones:

Let's face it: these are not going to be the last ones. So instead of looking for a more stable cloud provider, organizations should invest in building their applications with a mindset that "the cloud is fragile," and the goal should be to minimize the Mean Time To Recovery (MTTR) in case of a meltdown.

A mature cloud-native application already takes care of local failure scenarios by following best practices such as 12factor.net and the practices suggested by cloud vendors. But that will NOT protect the application from a cloud meltdown; more needs to be done. Here are some thoughts on what that looks like:

  • Adopt a multi-cloud strategy

A multi-cloud strategy means adopting multiple cloud platforms as opposed to one. If one cloud platform has a meltdown, your applications can be ported to another platform to ensure availability. From a cost perspective this may not be feasible, so at a minimum your strategy should include a way of quickly moving your applications within the same cloud platform or to your own data center. Recently the FAA signed a contract with two cloud vendors, Microsoft and Amazon.

  • Be prepared

There is nothing like being prepared.

Netflix's business model depends on the availability of its applications and servers, and it invests heavily in ensuring it can recover from any failure scenario. During the recent Amazon AWS meltdown, Netflix was able to prevent any significant impact on its service. The reason it performed so well is its production testing practice known as chaos engineering: the Netflix team injects chaotic scenarios into the production environment to ensure there is a recovery mechanism planned for any situation. This is obviously an extreme level of preparedness, and most organizations will shy away from it for various reasons, lack of funding being the most common. No matter what, you need to give some thought to how your organization will react in a doomsday scenario; even if you have no strategy, you will at least be setting the right expectations with your business sponsors and stakeholders. A minimal fault-injection sketch follows.
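Below is a minimal sketch in the spirit of chaos engineering (Netflix's actual tooling, such as Chaos Monkey, is far more sophisticated). It wraps a downstream call and randomly injects latency or failure so you can verify that your retry and fallback paths actually work. The function names and failure rates are illustrative assumptions, not anyone's production configuration.

```python
import random
import time

def chaotic(failure_rate=0.05, max_delay_s=2.0):
    """Decorator that randomly injects an error or latency into a call.

    Intended for controlled experiments; a real setup needs a kill switch
    and blast-radius limits before it goes anywhere near production.
    """
    def wrap(fn):
        def inner(*args, **kwargs):
            roll = random.random()
            if roll < failure_rate:
                # Simulate a dependency outage.
                raise RuntimeError("chaos: injected failure")
            if roll < failure_rate * 2:
                # Simulate a slow dependency.
                time.sleep(random.uniform(0, max_delay_s))
            return fn(*args, **kwargs)
        return inner
    return wrap

@chaotic(failure_rate=0.1)
def fetch_recommendations(user_id):
    # Hypothetical downstream call; replace with a real service client.
    return ["movie-1", "movie-2"]

if __name__ == "__main__":
    for _ in range(5):
        try:
            print(fetch_recommendations("u42"))
        except RuntimeError as exc:
            # This is the path your fallback logic should cover.
            print("fallback:", exc)
```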

  • Portable applications

Build your applications such that they can be moved to any cloud platform within your target MTTR. Easier said than done, but very much doable. I would not suggest taking this to heart for all your applications, but it MUST be considered for at least your business-critical ones. Building such applications means you will not be able to take advantage of some features offered by the cloud platform, which may increase the application's complexity and also result in relatively higher cost. For example, if you are building an application on AWS you may use SQS, but that would make your application non-portable to Microsoft Azure or even to an on-premise environment. To address the portability issue you may have to implement a solution that uses a COTS messaging service that you manage yourself, as in the sketch below.
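As a hedged illustration, here is one way to isolate the messaging dependency behind a small interface so the SQS-specific code stays in one place. The class, queue, and host names are hypothetical; the boto3 and pika calls are standard, but this is a sketch of the pattern, not a production-ready client.

```python
from abc import ABC, abstractmethod

class MessageQueue(ABC):
    """Hypothetical abstraction the rest of the application codes against."""

    @abstractmethod
    def send(self, body: str) -> None: ...

class SqsQueue(MessageQueue):
    """AWS-specific backend, used while running on AWS."""

    def __init__(self, queue_url: str):
        import boto3  # AWS SDK
        self._sqs = boto3.client("sqs")
        self._queue_url = queue_url

    def send(self, body: str) -> None:
        self._sqs.send_message(QueueUrl=self._queue_url, MessageBody=body)

class RabbitMqQueue(MessageQueue):
    """COTS backend (e.g., self-managed RabbitMQ) usable on any platform."""

    def __init__(self, host: str, queue: str):
        import pika  # RabbitMQ client
        self._conn = pika.BlockingConnection(pika.ConnectionParameters(host))
        self._channel = self._conn.channel()
        self._channel.queue_declare(queue=queue)
        self._queue = queue

    def send(self, body: str) -> None:
        self._channel.basic_publish(exchange="", routing_key=self._queue, body=body)

def make_queue(provider: str) -> MessageQueue:
    # The switch point: configuration decides which backend is active.
    if provider == "aws":
        return SqsQueue(queue_url="https://sqs.example/orders")  # hypothetical URL
    return RabbitMqQueue(host="mq.internal.example", queue="orders")  # hypothetical host
```

The application only ever calls `make_queue(...).send(...)`, so porting it is a configuration change rather than a code change.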

  • It's doomsday, do you know where your data is?

If your data is in the cloud and there is a meltdown, then no matter how well you planned for the availability of your application, your most critical resource, the data, is now at the mercy of the cloud vendor's tech staff. Plan on building applications such that the data needed to get them up and running is available to you. For reasons such as security and visibility, many enterprises already keep their data in their own data centers, so this may not be an issue. A simple sketch of pulling copies of critical data out of the cloud follows.
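One low-tech option, sketched below under assumed bucket and path names, is to periodically copy your critical data sets out of the cloud to storage you control, so a recovery can start even while the vendor is still firefighting. The boto3 call is standard; the bucket, keys, and destination are illustrative.

```python
import os
import boto3

# Hypothetical names: replace with your bucket, keys, and on-premise path.
BUCKET = "example-critical-data"
KEYS = ["customers/latest.parquet", "orders/latest.parquet"]
LOCAL_DIR = "/mnt/onprem-backup"

def pull_critical_data():
    """Copy the data needed for recovery to storage outside the cloud vendor."""
    s3 = boto3.client("s3")
    os.makedirs(LOCAL_DIR, exist_ok=True)
    for key in KEYS:
        dest = os.path.join(LOCAL_DIR, key.replace("/", "_"))
        s3.download_file(BUCKET, key, dest)
        print(f"copied s3://{BUCKET}/{key} -> {dest}")

if __name__ == "__main__":
    # Typically run on a schedule (cron or similar) aligned with your recovery point objective.
    pull_critical_data()
```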

  • Flick of a switch release

So even if you adopted the above suggestions, you are NOT ready to roll out your recovery strategy unless you build a switch that will automagically port your application. The switch here refers to the automation that builds and releases your application to the target cloud platform (or your data center). There are many tools available for automating Infrastructure as Code, but there is no standard way of doing it (yet), so you would need to define the code for multiple cloud vendors, as in the sketch below.
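A minimal sketch of such a switch, assuming you maintain one Terraform definition per provider (the directory names below are hypothetical): a single argument chooses the target, and the same pipeline applies the matching definition.

```python
import argparse
import subprocess

# Hypothetical layout: one Terraform root module per target platform.
TARGETS = {
    "aws": "infra/aws",
    "azure": "infra/azure",
    "onprem": "infra/onprem",
}

def deploy(target: str) -> None:
    """Apply the infrastructure definition for the chosen platform."""
    workdir = TARGETS[target]
    subprocess.run(["terraform", "init"], cwd=workdir, check=True)
    subprocess.run(["terraform", "apply", "-auto-approve"], cwd=workdir, check=True)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Flick-of-a-switch release")
    parser.add_argument("target", choices=sorted(TARGETS))
    args = parser.parse_args()
    deploy(args.target)
```

The application release itself (containers, configuration, DNS cut-over) would hang off the same switch; the point is that the decision to move is one command, not a week of manual work.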

  • Exercise, Exercise, Exercise

Your preparedness will be measured not by how well you have organized all the elements needed for dealing with the meltdown, but by (a) how quickly you are able to get the applications up and running and (b) how much the outage impacted your stakeholders. To ensure you are ready on that unfortunate day of meltdown, exercise these scenarios just as you carried out disaster recovery (DR) exercises in years past.

Let's not JUST keep our fingers crossed; let's be prepared. Being prepared does not have to mean spending $$$; it may just mean setting your stakeholders' expectations, even if the probability of such an event occurring is once in two years.

Please do share your thoughts and comments.

PS: The term "Cloud Meltdown" was coined by me and is not standard terminology.