There are very few of us who aren’t impacted when major cloud service providers experience failures. Either our productivity is affected when cloud-based business tools go down, or we are inconvenienced when consumer services like video or music streaming are interrupted.
Businesses and individuals today are more dependent than ever on cloud services. And the companies that deliver them — SaaS platforms that perform business-critical functions for many companies, and IoT companies that deliver services like home automation — rely on cloud platforms like AWS S3 storage to generate revenue and deliver 24/7, uninterrupted services to their customers. There’s no good time for downtime, especially in the cloud.
While we might not be able to plan for everything, there are steps we can take to mitigate the impact on our customers.
Top 5 things you can do now:
1. Understand all cloud services that are being used by your business
You have chosen the cloud services used by your business carefully. But has your reliance on them grown over time? Do you have a complete picture of how heavily you rely on them, and what would happen if you had to do without them?
It’s a good idea to keep track of all cloud services used by each application and to understand the impact on the business if a given service fails. Review those records regularly, because as your business changes, so will its reliance on those services.
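One way to keep such a record is as a small, structured inventory that can be queried. The sketch below is only an illustration: the `CloudDependency` fields, the three-level impact scale, the application names, and the review dates are all made up for the example, not a prescribed format.

```python
from dataclasses import dataclass


@dataclass
class CloudDependency:
    application: str     # hypothetical application name
    service: str         # cloud service the application relies on
    impact_if_down: str  # "high", "medium" or "low" (illustrative scale)
    last_reviewed: str   # ISO date (YYYY-MM-DD) of the last review


# Illustrative entries only; replace with your own inventory.
INVENTORY = [
    CloudDependency("web-store", "AWS S3", "high", "2017-03-01"),
    CloudDependency("web-store", "email delivery", "medium", "2017-01-15"),
    CloudDependency("reporting", "AWS S3", "low", "2016-11-30"),
]


def services_by_impact(inventory, impact):
    """Return the sorted, de-duplicated services rated at the given impact."""
    return sorted({d.service for d in inventory if d.impact_if_down == impact})


def applications_using(inventory, service):
    """Return the sorted applications that depend on the given service."""
    return sorted({d.application for d in inventory if d.service == service})


def stale_entries(inventory, cutoff):
    """Return entries last reviewed before `cutoff`.

    ISO dates in YYYY-MM-DD form compare correctly as plain strings.
    """
    return [d for d in inventory if d.last_reviewed < cutoff]


if __name__ == "__main__":
    print(applications_using(INVENTORY, "AWS S3"))
    print(services_by_impact(INVENTORY, "high"))
    print(stale_entries(INVENTORY, "2017-01-01"))
```

Even a list this simple answers the two questions that matter during an outage: which applications are affected, and how badly.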
2. Avoid single points of failure
For all layers of an architecture, avoid single points of failure. This is the golden rule for designing for high availability in our applications, databases and networking.
And this is no different for the cloud services that your business uses every day. Most cloud services have high availability built in, and many back it with an SLA. For example, AWS S3’s SLA is 99.9% for standard S3, and 99.0% for infrequent access storage.
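To put those percentages in perspective, it helps to convert them into hours. The sketch below assumes a 365-day year purely for illustration; a provider's SLA may measure availability over a different period, so check the actual terms before relying on these numbers.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(sla_percent, period_hours=HOURS_PER_YEAR):
    """Hours of downtime that still meet an SLA over the given period."""
    return period_hours * (1 - sla_percent / 100)


if __name__ == "__main__":
    for sla in (99.9, 99.0):
        hours = allowed_downtime_hours(sla)
        print(f"{sla}% -> {hours:.2f} hours per year")
    # 99.9% -> 8.76 hours per year
    # 99.0% -> 87.60 hours per year
```

In other words, a service can be down for most of a working day each year and still be within a 99.9% commitment.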
But be careful of complacency here. Cloud services are black boxes, and if you rely only on the high availability built into the service, you are exposing yourself to a single point of failure. And this leads us to the next step.
3. Utilize multiple regions
One primary benefit of cloud providers is that they are able to operate their infrastructure in many geographically distributed data centers much more cost-effectively than most businesses could on their own.
This is an important tool to use when planning for disaster recovery and business continuity, in case a natural or man-made disaster takes out an entire region.
Plan to be able to use services in a different region, even if the switch is not automatic.
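As a rough illustration of what a region fallback can look like in application code, here is a minimal sketch. It assumes your data is already replicated to the second region, and it takes the actual service call as a parameter: `fetch` is a hypothetical stand-in for whatever client call you use, and `fake_fetch` simply simulates an outage in one region.

```python
def fetch_with_region_failover(fetch, regions):
    """Call fetch(region) for each region in order.

    Returns (region, result) for the first region that succeeds and
    raises RuntimeError if every region fails.
    """
    errors = {}
    for region in regions:
        try:
            return region, fetch(region)
        except Exception as exc:  # deliberately broad for a sketch
            errors[region] = exc
    raise RuntimeError(f"all regions failed: {errors}")


def fake_fetch(region):
    """Hypothetical stand-in for a real service call."""
    if region == "us-east-1":
        raise ConnectionError("simulated regional outage")
    return f"object from {region}"


if __name__ == "__main__":
    print(fetch_with_region_failover(fake_fetch, ["us-east-1", "us-west-2"]))
    # ('us-west-2', 'object from us-west-2')
```

A manual switch works the same way in principle: the important part is that the second region is known, populated, and tested before it is needed.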
4. Test for failure
Netflix is one company that does a great job of testing for failure in its cloud environment. Using the tools it developed as the Simian Army, which includes Chaos Monkey, Netflix is able to test and therefore improve the resiliency of its cloud-based architecture.
While the approach might seem reckless for your business, there are places where you can begin:
- With your list of important services, develop a list of failure scenarios.
- Test failure scenarios on a regular basis, even if that means scheduling during less-busy periods.
- Ensure that your monitoring can catch these failures. Monitoring that runs inside the cloud environment works only as well as the cloud environment works, so make sure your operational visibility strategy includes a way to monitor the environment from outside it.
- As your team becomes proficient at handling these scheduled failures, consider introducing some controlled randomness into the testing.
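The "controlled randomness" idea can start very small. The sketch below only picks which scenario to rehearse; it does not inject any failure itself. The scenario names are invented examples, `seed` makes a pick repeatable so a drill can be replayed, and `allowed` is a hypothetical guard that limits which scenarios may be chosen.

```python
import random

# Invented example scenarios; build your own from your service inventory.
SCENARIOS = [
    "object storage unavailable",
    "primary database failover",
    "loss of one availability zone",
    "DNS resolution failure",
]


def pick_scenario(scenarios, seed=None, allowed=None):
    """Pick one failure scenario to rehearse.

    seed    -- fixes the random choice so the drill can be repeated
    allowed -- optional collection restricting which scenarios may be picked
    """
    candidates = [s for s in scenarios if allowed is None or s in allowed]
    if not candidates:
        raise ValueError("no scenarios are allowed")
    return random.Random(seed).choice(candidates)


if __name__ == "__main__":
    # Same seed, same scenario: useful when a drill needs to be re-run.
    print(pick_scenario(SCENARIOS, seed=7))
    # Restrict the pick while the team is still building confidence.
    print(pick_scenario(SCENARIOS, allowed={"DNS resolution failure"}))
```

Starting with a restricted `allowed` set and widening it over time keeps the randomness controlled while the team gains experience.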
5. Don’t rely on a single cloud provider
In the end, even a single cloud provider can be considered a single point of failure. While extremely unlikely, the cloud provider could stop services tomorrow. What would you do?
Don’t put all of your eggs in one basket. Consider having a backup strategy that utilizes a different cloud provider or a non-cloud data center.
Do not underestimate the complexity of such a task. Many environments are too complex to maintain similar infrastructures across multiple providers. And the risk of losing an entire cloud provider likely will not justify the expense.
However, at the very least, it is highly recommended to store your mission-critical data in an ‘offsite’ location. And going through the process of understanding what it would look like to lift and shift your entire infrastructure would provide valuable insight into the time and effort such a move would require.
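An offsite copy is only useful if it matches the original, so it is worth verifying each copy as it is made. The following is a minimal sketch under a big assumption: the ‘offsite’ location is reachable as a directory path (for example a mounted volume from another provider or data center). The function names are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_offsite(source, offsite_dir):
    """Copy `source` into `offsite_dir` and verify the copy by checksum.

    `offsite_dir` is assumed to be a path outside your primary provider.
    Returns the path of the verified copy.
    """
    source = Path(source)
    offsite_dir = Path(offsite_dir)
    offsite_dir.mkdir(parents=True, exist_ok=True)
    target = offsite_dir / source.name
    shutil.copy2(source, target)
    if sha256_of(source) != sha256_of(target):
        raise IOError(f"checksum mismatch for {target}")
    return target
```

Calling `copy_offsite("backup.tar.gz", "/mnt/offsite")` (both paths are placeholders) would copy the file and raise an error if the copy does not match the original.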
Making use of cloud services has major advantages in allowing you to scale your applications easily and cost-effectively. But this can leave you vulnerable.
Understanding your reliance on cloud services and knowing the potential impact of losing access to each service is the first important step in being able to survive cloud outages.
Pythian can help you mitigate your risks in the cloud. Our cloud experts are experienced with all of the top cloud providers and are ready to assist you with developing strategies to ensure resiliency in your cloud environment. Contact us to get the cloud expertise you need to protect yourself before the next outage.