Achieving High Availability in the Cloud: The Quest for “Five Nines”

It’s no secret that downtime is costly for cloud-based platforms, which is why most SaaS companies offer a service level agreement (SLA) that specifies a guaranteed percentage of uptime. In the data center industry and at other IT companies, SLA figures should be verified publicly by multiple third-party monitoring groups, because companies may misstate or misrepresent their uptime, which amounts to false advertising. Consequently, there is considerable debate over how to calculate data center uptime and what standards of representation are required.

Many business owners engaged in ecommerce and SaaS operations cannot afford to lose a single customer or sale, because any downtime damages the brand reputation of a software product in the marketplace. Many are therefore willing to pay extra for an SLA that offers 99.999% guaranteed uptime. Achieving such high uptime, however, requires more than just technology resources. It also requires careful planning and design of applications across public cloud networks.

This article will discuss the difficulties that software publishers face in attaining the “Five Nines” standard, as well as how cloud solutions like elastic web server platforms and container virtualization are combined with hardware resources from multiple data centers and service providers to meet the highest standards of online operations.

Defining the System Variables

The High Availability (HA) debate primarily centers on the gap between “Two Nine Five” (99.5%) and “Five Nines” (99.999%) uptime guarantees in an SLA. A “Five Nines” guarantee allows only about 5.26 minutes of downtime per year for a SaaS product, website, or mobile application. That’s roughly 26 seconds per month. This is one of the most difficult problems in online application support for cloud businesses, as even minor security upgrades and patches may require minutes of downtime per month. Maintenance is usually scheduled for times when most customers are offline, but this is not possible for 24/7/365 applications that are used globally by millions of users at a time.

Online companies that suffer downtime lose brand reputation quickly and must bear significant business losses. Downtime can even lead to legal repercussions where it causes negative business impact or violates an SLA. With “Two Nine Five” there are about 1.83 DAYS(!) of downtime per year (43.8 hours), or 3.65 hours per month. Still, if this downtime is scheduled late at night in the region of peak market activity, it may not be as noticeable to consumers.
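The arithmetic behind these figures is easy to check. The short Python sketch below converts an availability percentage into a downtime allowance. The function name downtime_budget_seconds, the 365-day year, and the twelve equal months are assumptions made for illustration, not part of any SLA standard.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: a 365-day year


def downtime_budget_seconds(availability_percent, period_seconds=SECONDS_PER_YEAR):
    """Return the downtime (in seconds) that an availability target allows per period."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return period_seconds * (100 - availability_percent) / 100


for label, target in (("Five Nines", 99.999), ("Two Nine Five", 99.5)):
    per_year = downtime_budget_seconds(target)
    per_month = per_year / 12  # assumption: twelve equal months
    print(f"{label} ({target}%): {per_year / 60:.2f} min/year, {per_month:.1f} s/month")
```

Under these assumptions, 99.999% works out to roughly 315 seconds per year, while 99.5% works out to 157,680 seconds (43.8 hours) per year.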

Imagine the loss of business revenue if Amazon or eBay went offline for four hours at a time. The reputational harm would be even more severe for a hosting company like AWS. If AWS, Rackspace, or GoDaddy crashes, millions of websites can be taken offline at the same time. A 4-hour downtime period month after month would not be acceptable to customers.

IT pros in enterprise corporations must deal with the same requirements in daily operations. If their network operations fail, the entire business goes down, leaving thousands of people or industrial processes without software support. According to Werner Vogels, Amazon CTO, “Sometimes two nines of uptime is fine. If you want three, you can add multiple availability zones, and impose master-slave relationships for databases. Even five nines, only minutes of downtime a year, can be achieved by leveraging multiple regions, and services like the DynamoDB fully managed database. But it’s the business rules that need to decide which availability scenario you want. Higher availability comes at a higher price.”

While anyone in the industry knows that “Five Nines” uptime is very hard to achieve in any data center, software, or IT service, many SaaS product development companies have decided that they will not tolerate high levels of downtime and have promised to maintain higher standards in their SLAs. They could not make that promise if the technology to support it did not exist.

In the next section, we will discuss what you need to plan for to attain “Five Nines” and how multi-cloud solutions are one of the key factors that enable failover support in production operations.

Attaining “Five Nines” Uptime: Multi-Cloud Solutions

Given that the “Five Nines” standard allows only about five minutes of total downtime per year, there are many variables that can contribute to the failure of servers or data center services. Elastic cloud web server frameworks seek to provide automated failover support for web/mobile services in high-traffic environments by being designed in advance with over-capacity, regularly scheduled backups, and database synchronization. For example, if a single server or VM fails in an AWS EC2, CoreOS, or Kubernetes runtime, another instance is already available to take over its traffic while a new VM is spun up as a replacement. Depending on the hardware, some of the in-memory data may be lost, but the main point is that elastic web server frameworks are designed specifically to recover from failures gracefully, with data retention, while serving millions of users at a time.

However, most downtime does not come from system overload and web server crashes, and hardware failure in the data center is a relatively rare occurrence. Almost every company has already installed data backup and recovery solutions in its facilities, many of which include offsite backups. To attain “Five Nines” uptime standards, it is generally accepted that IT administrators must plan not only for single server failure, or failure of subsidiary equipment like hard drives, RAM, and power supplies, but also for complete data center failure. The advantage of cloud services like AWS, Google, and other hosting companies is that you can provision resources in multiple data centers to ensure continuity of operations against these failures.
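To see why multiple data centers matter, consider a minimal Python sketch of redundancy arithmetic. It assumes that the sites fail independently and that any single surviving site can carry the full load. Both are strong assumptions, since shared software bugs or configuration errors can take down several sites at once, so the result should be read as an optimistic upper bound. The function names and the 99.5% per-site figure are hypothetical.

```python
def combined_availability(site_availability, sites):
    """Availability of `sites` redundant data centers, any one of which can serve traffic.

    Assumption: sites fail independently, so the service is down only when
    every site is down at the same time.
    """
    if not 0 <= site_availability <= 1:
        raise ValueError("site_availability must be a fraction between 0 and 1")
    if sites < 1:
        raise ValueError("sites must be at least 1")
    return 1 - (1 - site_availability) ** sites


def sites_needed(site_availability, target):
    """Smallest number of independent sites whose combined availability meets `target`."""
    if not 0 < site_availability < 1 or not 0 < target < 1:
        raise ValueError("both arguments must be fractions strictly between 0 and 1")
    sites = 1
    while combined_availability(site_availability, sites) < target:
        sites += 1
    return sites


# Hypothetical example: each site alone offers "Two Nine Five" (99.5%).
for n in (1, 2, 3):
    print(f"{n} site(s): {combined_availability(0.995, n):.7%}")
print("sites needed for 99.999%:", sites_needed(0.995, 0.99999))
```

Under the independence assumption, two such sites already reach about 99.9975%, and a third pushes the combined figure past “Five Nines”.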

Consider the vectors of failure in a “Five Nines” High Availability agreement:

  • The software can malfunction in the application layer, causing instances of downtime.
  • The web server operating system may crash, due to overload or software problems.
  • Administrators need to install regular patches, updates, and new software versions, which can require restarts.
  • The web server hardware, or any of the subcomponents, can fail mechanically.
  • The data center power supply can fail, leading to backup power requirements in facilities.
  • The fiber optic providers of high speed internet connections can fail, leading to loss of services.
  • Hacking, viruses, and DDoS attacks can strike to shut down services unexpectedly.
  • Natural disasters or loss of public facilities in a region can shut down data center operations.
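Because a service needs every one of these layers to be working at once, their availabilities multiply. The Python sketch below illustrates the effect. The 99.99% figure for each layer is purely hypothetical, as is the assumption that the layers fail independently.

```python
from math import prod


def serial_availability(layer_availabilities):
    """Availability of a service that needs every listed layer to be up.

    Assumption: layers fail independently, so their availabilities multiply.
    """
    layers = list(layer_availabilities)
    if any(not 0 <= a <= 1 for a in layers):
        raise ValueError("each availability must be a fraction between 0 and 1")
    return prod(layers)


# Hypothetical figures: one layer per failure vector listed above, each at 99.99%.
layers = [0.9999] * 8
print(f"overall availability: {serial_availability(layers):.4%}")
```

With these made-up numbers, eight layers that each manage “four nines” combine to roughly 99.92%, which is well short of “Five Nines”. This is why redundancy is needed at every layer rather than in just one of them.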

To maintain “Five Nines” uptime agreements, administrators must plan for the loss of single servers, entire data centers, electric power, and telecom providers. This requires secondary hardware, backup power supplies, multiple fiber providers per data center facility, and the use of multiple data centers. Kubernetes, AWS EC2, and other elastic platforms make it possible to roll out security updates or feature launches to web servers in stages, so as to avoid any required downtime. AWS, Google, and most other cloud hosting companies offer multiple data center options so that business IT resources can be architected for “Five Nines” requirements. Multi-cloud enables the use of multiple vendors, so even if AWS or Google fails, business services are sustained on running hardware maintained in a different data center.
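The idea of a staged rollout can be sketched in a few lines of Python. This is a simplified simulation, not the API of Kubernetes, AWS, or any other platform. The instance names, function names, and the max_unavailable parameter are hypothetical, although Kubernetes Deployments expose a comparable maxUnavailable setting for rolling updates.

```python
def rolling_batches(instances, max_unavailable=1):
    """Split instances into update batches of at most `max_unavailable` each."""
    if max_unavailable < 1:
        raise ValueError("max_unavailable must be at least 1")
    instances = list(instances)
    return [
        instances[i:i + max_unavailable]
        for i in range(0, len(instances), max_unavailable)
    ]


def simulate_rollout(instances, max_unavailable=1):
    """Return the lowest number of in-service instances seen during the rollout."""
    instances = list(instances)
    in_service = set(instances)
    lowest = len(in_service)
    for batch in rolling_batches(instances, max_unavailable):
        in_service -= set(batch)  # take this batch out of rotation and update it
        lowest = min(lowest, len(in_service))
        in_service |= set(batch)  # assumption: the batch passes health checks and rejoins
    return lowest


# Hypothetical fleet of five web servers, updated two at a time.
fleet = ["web-1", "web-2", "web-3", "web-4", "web-5"]
print(rolling_batches(fleet, max_unavailable=2))
print("minimum instances serving traffic:", simulate_rollout(fleet, max_unavailable=2))
```

As long as the remaining instances have enough spare capacity to absorb the traffic of the batch being updated, users see no downtime during the change.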