Cloud Services and Availability

From the way that Cloud vendors promote their wares and how apologists fawn over these services, one would think that Cloud was the be all and end all of computing. The fact is that we’ve been here before and Cloud is just a more advanced form of central or mainframe computing from the 60s and 70s. And I’m talking about the concept, not the practical implementation.

There are a few requirements that need to be in place for Cloud to work, and work well. For example:

customers need a reliable and fast connection to the internet
cloud vendors need to provide a feature rich product set
world class security
predictable TCO or costing
a usable toolset for managing services easily and quickly
high performance
data control and privacy

Most importantly, cloud vendors need to have ‘perfect’ availability. It serves no one if customers move their workloads to the cloud but then are unable to operate 24×7 or when required due to cloud accessibility issues.
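To put ‘perfect’ availability into perspective, a quick calculation shows how much downtime a given availability percentage still allows over a year. This is a minimal sketch; the percentages are illustrative figures, not any vendor's actual SLA, and the function name is made up for this example.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(availability_percent: float) -> float:
    """Return the hours per year a service may be down at the given availability."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Illustrative availability levels only.
    for pct in (99.0, 99.9, 99.99, 99.999):
        print(f"{pct}% -> {allowed_downtime_hours(pct):.2f} hours/year")
```

Even at 99.9%, a service can be down for the better part of a working day each year, which is why the gap between ‘high’ and ‘perfect’ availability matters.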

Unfortunately, cloud service outages are more common than one might think. And the impact of these outages can be severe: cloud customers who depend (often entirely) on these services are left in the dark, operating at reduced or zero availability.

The real impact of cloud outages is distorted because they often go unreported. Companies proudly publicise their low number of outages, but the absence of a total shutdown doesn’t mean that no outage has occurred – only that they managed to keep services running at reduced capacity.

And all cloud services have had issues over the years, including AWS, Google Cloud, Azure and Oracle Cloud. There have been some pretty severe outages over the last year as well:

IBM’s cloud went down entirely on the 9th of June – no word from them yet as to a proper event report
IBM’s cloud went partly down on 24th June (their cloud service history makes interesting reading) with service outages of up to 19 hours
Salesforce had a 12-hour outage in May last year, which eventually took 12 days to completely sort out
Almost 10% of EC2 instances and EBS volumes in the AWS US-EAST-1 region were affected by a power incident, leading to permanent data loss for some clients
Global iCloud users got “Service Unavailable” messages in July last year, resulting in outages across most of Apple’s online services
In May, Microsoft suffered an outage lasting more than an hour, with network connectivity errors in Microsoft Azure that badly affected cloud services including Office 365, Microsoft Teams, Xbox Live and several others widely used by Microsoft’s commercial customers
In July 2019, Cloudflare visitors received 502 errors for about 30 minutes. The company attributed the outage to a massive spike in CPU utilization on its network, caused by a bad software deploy that was then rolled back
Facebook and Instagram had issues earlier this year, caused by a server configuration change. During the outage, users faced issues with Facebook-owned properties Instagram and WhatsApp for around 14 hours

This is just a short list of major outages occurring over the last year. If you take a look at the historical views of cloud service issue trackers, you’ll see that issues are occurring all the time. It’s just a matter of the nature/severity of the incident and how much redundancy the cloud provider has in their system.

Availability is not the only issue to consider – data integrity and access are as important, if not more so. Cloud outages have proven that data corruption and/or loss is possible too.

So do we just stop using cloud then for critical and important workloads?

For many that have moved services into the cloud, this is not an option – they moved to the cloud for a reason, often one which precludes moving away. But you can mitigate the effects of outages by using tools that the cloud providers themselves may offer.

Such as …

Regions

A “cloud region” describes a real-life geographic location where your public cloud resources are located.

Regions allow you to locate your cloud resources close to your customers, both internal and external. The closer your customers are to the region where your cloud resources are located, the faster and better their experience will be.

Regions are also commonly used as part of a disaster recovery (DR) strategy. While many public cloud users depend on the reliability and redundancy of resources within a single region for DR, some use multiple regions to achieve the same result. Sometimes this is required for regulatory or compliance reasons, but often it is done to make sure services remain accessible even if the cloud provider has an issue in a particular region.
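The multi-region idea can be sketched in a few lines: keep an ordered list of regions and send traffic to the first one that passes a health check. This is only an illustration under stated assumptions. The region names and the `is_healthy` callable are hypothetical, and a real deployment would normally lean on the provider's own DNS failover or load balancing features rather than hand-rolled logic.

```python
from typing import Callable, Optional, Sequence


def pick_region(
    regions: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first region, in preference order, that passes the health check."""
    for region in regions:
        try:
            if is_healthy(region):
                return region
        except Exception:
            # A health check that raises is treated as an unhealthy region.
            continue
    return None  # no region is currently usable


if __name__ == "__main__":
    # Hypothetical region names and a stubbed health check for illustration.
    regions_with_outage = {"region-a"}
    chosen = pick_region(
        ["region-a", "region-b"],
        lambda region: region not in regions_with_outage,
    )
    print(chosen)  # region-b
```

The preference order matters: the first entry would typically be the region closest to your customers, with the others acting as DR targets.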

Availability zones (AZs)/Zones/Domains

AZs are a subdivision of a region, providing redundancy services within a region.

Geography

Azure maintains a superset of regions called a geography, mainly used for data residency and compliance.

Different data storage tiers

AWS, as an example, offers a variety of tiers in its S3 storage product:

standard

reduced redundancy

Glacier

Each successive tier provides a lower level of redundancy and/or availability and/or restore capability, so you can choose a configuration that suits a particular requirement.

Direct Connect/ExpressRoute

There are varying configurations of the above direct, point-to-point cloud connectors (running from the client’s onsite network to a local ISP that connects directly to the cloud provider), providing different levels of redundancy and performance. In addition, one can provide additional backup routes via internet-based managed VPN connections.

Security Features

Security is another critical component of availability – all cloud providers have tools for protecting the services that you consume in the cloud, from security groups to WAFs. These tools need to be configured in conjunction with your availability design to provide a holistic solution that is protected from attacks. This is no different from any other type of solution, such as hybrid-cloud or on-premises.

When planning a cloud deployment, it’s important that you take availability and redundancy into account. By spreading your workload across increasing supersets of provider infrastructure and connectivity, you improve your chances of surviving outages.

Disaster Recovery (DR) and Business Continuity (BC) are as important here as in any other design.

The ultimate availability choice is then to operate your cloud services across multiple cloud providers. A pricey option, but one that should provide almost 100% availability.
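A rough calculation shows why spreading a service across providers pushes availability so close to 100%. The sketch below assumes the service stays up as long as at least one provider is up, and that provider failures are fully independent. That second assumption is optimistic, since shared dependencies and your own failover logic can fail too, and the availability figures used are illustrative rather than real SLA numbers.

```python
from math import prod
from typing import Iterable


def combined_availability(availabilities: Iterable[float]) -> float:
    """Availability of a service that is up whenever at least one provider is up.

    Each input is a fraction between 0 and 1. Assumes independent failures.
    """
    # The service is down only when every provider is down at the same time.
    return 1 - prod(1 - a for a in availabilities)


if __name__ == "__main__":
    single = 0.999  # illustrative figure for one provider
    dual = combined_availability([single, single])
    print(f"one provider:  {single:.4%}")
    print(f"two providers: {dual:.6%}")
```

Under these assumptions, two providers at 99.9% each are both down together for only about one millionth of the time. In practice the gain is smaller, which is one reason the extra cost needs to be weighed carefully.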

In the end, you have to remember that cloud providers are not infallible, and that you need to design carefully, work around potential issues and make use of the provided tools and services that can improve availability.