What workloads you must move to the Cloud – Part 2 (for application resilience)
What is application resilience?
How do we define application resilience? Application resilience is the ability of an application to tackle problems in one or more of its components without any degradation in its quality of service.
A resilient application should be able to handle failures gracefully and recover to its original state in the minimum possible time. Resilience ensures that an application keeps running dependably; in other words, it makes the application reliable. Thus, reliability is the outcome, while resilience is the way to that outcome.
What are the fundamental elements that build up application resilience?
The key capabilities that help build application resilience are:
- Availability
- Disaster Recovery
- Automatic Failover
Let’s look at each of these capabilities and how they are connected in the context of an application as a whole:
1. Application Availability
According to Techopedia’s definition, application availability is the extent to which an application is operational, functional and usable for completing or fulfilling a user’s or business’s requirements.
An application needs all of its components to be available in order to respond to requests in a timely and expected manner. Availability that falls short of near-100% results in reduced application reliability and user satisfaction, which eventually impacts business results.
2. Disaster Recovery
A disaster recovery plan (DRP) is a documented plan of action with specific instructions on what should be done to respond to unplanned incidents. It is guidance for proactive risk management to minimize the impacts of potential disasters. This proactive risk management helps organizations recover quickly and resume normal operations of mission-critical functions when a disaster hits.
Recovery Point Objective (RPO) and Recovery Time Objective (RTO) are two of the most important parameters of a disaster recovery plan. These two objectives guide enterprises in adopting the right backup strategy for disaster recovery.
Below are definitions and examples of RPO and RTO from the Druva blog post.
Recovery Point Objective (RPO) describes the maximum interval of time that an outage or disaster may last without any business impact due to loss of data. In other words, if an outage or disaster persists beyond the time interval defined by the RPO, there will be unavoidable and unacceptable business impact due to loss of data.
If you have a disaster or system outage of any kind, with 5 hours set as the RPO in your DRP, this means that you can afford to lose up to 5 hours’ worth of data without business impact. If the system outage lasts longer than 5 hours, there will be potential impact to your business.
Recovery Time Objective (RTO) is the duration of time within which application systems must be restored and business operations resumed to avoid unacceptable business consequences or impacts due to the outage or disaster in the organization.
While RPO designates the variable amount of data that will be lost or will have to be re-entered during network downtime, RTO designates the amount of “real-time” that can pass before the disruption begins to seriously and unacceptably impede the flow of normal business operations.
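The RPO rule of thumb above can be sketched as a tiny check: the worst-case data loss equals the gap between backups, so a backup schedule satisfies the RPO only if that gap fits within it. This is a hypothetical illustration; the function name and numbers are examples, not taken from any real DRP.

```python
def meets_rpo(backup_interval_hours: float, rpo_hours: float) -> bool:
    """Worst-case data loss equals the gap between backups: data written
    just after the last backup is lost if disaster strikes just before
    the next one. The schedule meets the RPO only if that gap fits."""
    return backup_interval_hours <= rpo_hours

# With an RPO of 5 hours (as in the example above), hourly backups
# qualify, while daily backups do not.
print(meets_rpo(backup_interval_hours=1, rpo_hours=5))   # True
print(meets_rpo(backup_interval_hours=24, rpo_hours=5))  # False
```

The same shape of check applies to RTO: compare the measured time-to-restore of your recovery procedure against the RTO target.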
3. Automatic Failover
Automatic failover is the process in which an application automatically moves to its standby servers or components whenever any of its primary servers or components fails, thereby helping to eliminate application downtime.
Such ‘failover’ can be designed to provide either ‘high availability’ or ‘disaster recovery’, depending on the business-specific intent and the location of the standby servers or components.
Why is application resilience a challenge on-premise?
In today’s world, it is imperative for businesses to keep IT ecosystems up and running with nearly 24x7x365 availability, 100% fault tolerance, near-zero latency, and seamless business continuity. Maintaining high, enterprise-scale application resilience is therefore no longer a “nice to have” but a survival strategy for most businesses. Without it, systems can become unavailable over extended periods of time, which can be disastrous from a business standpoint.
According to Gartner, the average cost of IT downtime is $5,600 per minute.
What do we need to build high resilience, and what is the challenge?
We need to run multiple data centers, set up enterprise-grade DR solutions (with near-zero RPOs and RTOs), and design systems to automatically redirect workloads to standby servers in alternate data centers/availability zones or DR sites in case of application failures.
Getting all of this up and running on-premise is super-complex and super-expensive. It requires high technical know-how and heavy capital expenditures (CapEx) and/or expensive third-party software products and services. This is what makes it challenging to set up enterprise-scale reliability-infrastructure for applications on-premise.
Thus, most organizations with on-premise applications fail to design for failure. They stay under-protected and vulnerable to failures. Their inability to manage a disaster when it hits can cost them their businesses.
Cloud is the answer to this challenge.
How does the cloud help in configuring high resilience for enterprise applications?
Cloud platforms provide best-in-class availability, enterprise-grade disaster recovery, and intelligent failover management at a fraction of the on-premise cost to support enterprise-scale application resilience. Let’s get into a little more detail on each of these capabilities, looking at them through the AWS lens.
1. High Availability:
Resources in the cloud platforms are hosted globally in multiple locations across the world. In the AWS cloud platform, these locations are called Regions. Each AWS Region is an independent and separate geographic area, containing further multiple, isolated locations known as Availability Zones.
AWS resources are hosted in these Availability Zones that belong to its global set of Regions.
Here is the link to an interactive map from the AWS documentation that vividly illustrates AWS’s global infrastructure, including its Regions and Availability Zones.
Amazon operates state-of-the-art, highly available data centers in each of the Availability Zones belonging to each of its Regions. This wide geographic distribution of servers provides a high level of redundancy, enabling high availability for the applications that these servers run.
Let’s see the different AWS services that are responsible for the high availability of the different components of a modern web application on the cloud.
Content Delivery Network (AWS CloudFront)
Amazon CloudFront is a fast content delivery network (CDN) service that securely delivers data, videos, applications, and APIs to customers globally with low latency and high transfer speeds, all within a developer-friendly environment.
CloudFront can be set up with multiple origins grouped into an Origin Group, in which you designate a primary origin plus a secondary or standby origin. CloudFront automatically switches to the standby origin when the primary origin returns specific HTTP failure status codes, ensuring high availability of the CloudFront service.
Please refer to the detailed AWS documentation on optimizing CloudFront failover for high availability.
Below is a typical CloudFront distribution with multiple origins.
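To make the shape of such a setup concrete, here is a minimal sketch of the origin-group portion of a CloudFront distribution configuration, expressed as a plain Python dict. The origin and group IDs are hypothetical; the nested Quantity/Items structure mirrors the CloudFront DistributionConfig API.

```python
# One origin group with a primary and a standby member. CloudFront
# fails over to the standby when the primary returns one of the
# listed HTTP status codes.
origin_group = {
    "Id": "my-origin-group",  # hypothetical group id
    "FailoverCriteria": {
        # Responses from the primary origin that trigger failover
        "StatusCodes": {"Quantity": 3, "Items": [500, 502, 503]},
    },
    "Members": {
        "Quantity": 2,
        "Items": [
            {"OriginId": "primary-s3-origin"},  # served first
            {"OriginId": "standby-s3-origin"},  # used on failover
        ],
    },
}

# This dict would be embedded in the distribution's "OriginGroups"
# block when updating the distribution (e.g. via boto3's
# cloudfront.update_distribution()).
print(origin_group["Members"]["Quantity"])  # 2
```

The group's default cache behavior then targets the origin group's `Id` instead of a single origin.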
Object storage (AWS S3)
Amazon S3 provides a highly available, scalable, super-fast, and inexpensive data storage infrastructure that is part of Amazon’s global infrastructure. Amazon itself uses S3 to host its websites globally.
The S3 Standard storage class is designed for 99.99% availability, the S3 Standard-IA storage class is designed for 99.9% availability, and the S3 One Zone-IA storage class is designed for 99.5% availability. Here is the link to the AWS documentation that describes the different S3 storage classes.
All of these storage classes are backed by the Amazon S3 Service Level Agreement.
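As a minimal sketch, here is how choosing one of those storage classes looks at upload time, expressed as a plain dict of request parameters. The bucket and key names are hypothetical; the parameter names and storage-class values match the S3 PutObject API (e.g. boto3's `s3.put_object()`).

```python
# Request parameters for uploading an object into a specific S3
# storage class, trading availability/resilience against cost.
put_object_params = {
    "Bucket": "my-example-bucket",      # hypothetical bucket name
    "Key": "reports/summary.csv",       # hypothetical object key
    "Body": b"col1,col2\n1,2\n",
    # STANDARD    -> designed for 99.99% availability
    # STANDARD_IA -> designed for 99.9% availability
    # ONEZONE_IA  -> designed for 99.5% availability (single AZ)
    "StorageClass": "STANDARD_IA",
}

# With boto3 this would be:
#   boto3.client("s3").put_object(**put_object_params)
```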
Virtual Servers (AWS EC2 instances)
What if your virtual server, running a mission-critical application or a high-traffic website, fails or crashes for any reason, or is unable to meet the volume of user requests due to a sudden, large spike in incoming traffic or traffic that has grown gradually over time?
Please refer to “How to increase the Availability of Your Application on Amazon EC2“.
Below are the reference architecture and excerpts from AWS documentation that illustrate configurations with EC2 instances for high availability and auto-scaling.
You can launch multiple EC2 instances from your AMI and then use Elastic Load Balancing to distribute incoming traffic for your application across these EC2 instances. This increases the availability of your application. Placing your instances in multiple Availability Zones also improves the fault tolerance in your application. If one Availability Zone experiences an outage, traffic is routed to the other Availability Zone.
You can use Amazon EC2 Auto Scaling to maintain a minimum number of running instances for your application at all times. Amazon EC2 Auto Scaling can detect when your instance or application is unhealthy and replace it automatically to maintain the availability of your application. You can also use Amazon EC2 Auto Scaling to scale your Amazon EC2 capacity up or down automatically based on demand, using criteria that you specify.
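The multi-AZ, load-balanced setup described above can be sketched as the parameters for one Auto Scaling group. All names, subnet IDs, and the ARN below are hypothetical placeholders; the parameter names match the EC2 Auto Scaling CreateAutoScalingGroup API (e.g. boto3's `autoscaling.create_auto_scaling_group()`).

```python
# An Auto Scaling group spanning two Availability Zones behind a
# load balancer target group, with ELB-based health checks.
asg_params = {
    "AutoScalingGroupName": "web-app-asg",                 # hypothetical
    "LaunchTemplate": {"LaunchTemplateName": "web-app-lt"},  # hypothetical
    "MinSize": 2,          # always keep at least two instances running
    "MaxSize": 6,          # scale out up to six under load
    "DesiredCapacity": 2,
    # Subnets in two different AZs: if one AZ fails, the other keeps serving
    "VPCZoneIdentifier": "subnet-aaa111,subnet-bbb222",    # hypothetical
    "TargetGroupARNs": ["arn:aws:elasticloadbalancing:...:targetgroup/web"],
    # Replace instances that the load balancer marks unhealthy
    "HealthCheckType": "ELB",
}
```

With `MinSize` of 2 across two AZs, losing an instance (or a whole AZ) triggers replacement rather than an outage.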
Relational Database Services (RDS)
Below are excerpts from the AWS documentation on how RDS supports high availability and automatic failover – Amazon RDS High Availability.
Amazon Relational Database Service (Amazon RDS) supports two easy-to-use options for ensuring High Availability of your relational database.
AWS RDS
For your MySQL, MariaDB, PostgreSQL, Oracle, and SQL Server database (DB) instances, you can use Amazon RDS Multi-AZ deployments. When you provision a Multi-AZ DB instance, Amazon RDS automatically creates a primary DB instance and synchronously replicates the data to a standby instance in a different Availability Zone (AZ). In case of an infrastructure failure, Amazon RDS performs an automatic failover to the standby DB instance. Since the endpoint for your DB instance remains the same after a failover, your application can resume database operation without the need for manual administrative intervention. Learn more >>
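In practice, the Multi-AZ behavior described above is enabled with a single flag at provisioning time. Below is a minimal sketch of the request parameters; the identifier, class, and credentials are hypothetical placeholders, and the parameter names match the RDS CreateDBInstance API (e.g. boto3's `rds.create_db_instance()`).

```python
# Parameters for provisioning a Multi-AZ MySQL instance. RDS creates
# a synchronous standby in another AZ and fails over automatically.
create_db_params = {
    "DBInstanceIdentifier": "orders-db",   # hypothetical
    "Engine": "mysql",
    "DBInstanceClass": "db.m5.large",      # hypothetical sizing
    "AllocatedStorage": 100,               # GiB
    "MasterUsername": "admin",
    "MasterUserPassword": "<from-secrets-manager>",  # placeholder
    # The flag that matters here: provision a synchronous standby
    # replica in a different Availability Zone.
    "MultiAZ": True,
}
```

Because the DB endpoint is unchanged after failover, the application needs no reconfiguration when the standby takes over.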
Amazon Aurora
The Amazon Aurora PostgreSQL and Amazon Aurora MySQL engines include additional High Availability options. Even with a single database instance, Amazon Aurora increases availability by replicating your data six ways across three Availability Zones. This means that your DB cluster can tolerate a failure of an Availability Zone without any loss of data and only a brief interruption of service.
In addition, you can choose to run one or more Replicas in an Amazon Aurora DB cluster. If the primary instance in the DB cluster fails, RDS automatically promotes an existing Aurora Replica to be the new primary instance and updates the server endpoint so that your application can continue operation with no manual intervention. If no Replicas have been provisioned, RDS will automatically create a new replacement DB instance for you when a failure is detected. Learn more>>
In-memory caching service (AWS Elasticache)
Amazon ElastiCache is a fully managed in-memory data store and caching service by AWS. The service improves the performance of web applications by providing managed in-memory caching layer between the web interface and the data storage, thereby enabling retrieval of information from the managed in-memory cache instead of relying entirely on slower disk-based databases.
ElastiCache supports two open-source in-memory caching engines: Memcached and Redis (also called “ElastiCache for Redis”).
Amazon ElastiCache for Redis supports clustering for data replication, where multiple nodes can be grouped into the same cluster. One of those nodes acts as the primary Read/Write node, while the other nodes serve as read-only replicas of the data stored in the primary, called Read Replicas.
Example – if an ElastiCache for Redis cluster is composed of six nodes, then one node acts as the primary Read/Write node while the remaining five nodes act as Read Replicas.
Data is asynchronously replicated from the primary Read/Write node to the Replica nodes. This ensures that not all data is lost if the primary Read/Write node fails, as replication creates multiple copies of the data within the cluster. This improves the availability of data.
However, some data may still be lost due to replication latency when the primary node fails.
What’s more, ElastiCache for Redis supports Multi-AZ replication within a cluster. This means that the nodes (the primary Read/Write node and the Read Replicas) in a cluster with Multi-AZ enabled can be spread across multiple Availability Zones in a Region, increasing the availability of the service beyond a single Availability Zone.
Multi-AZ clusters also support automatic failovers which help to minimize downtime and improve fault tolerance.
Additionally, ElastiCache for Redis propagates DNS changes within the cluster when the primary node fails, which eliminates any change-management effort in the event of a failure.
Here is an AWS Blog post (by Jeff Barr) on Multi-AZ Support/Auto Failover for Amazon ElastiCache for Redis.
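Tying the ElastiCache points together, here is a minimal sketch of the parameters for the six-node, Multi-AZ, auto-failover replication group described above. The group name and node type are hypothetical; the parameter names match the ElastiCache CreateReplicationGroup API (e.g. boto3's `elasticache.create_replication_group()`).

```python
# A Redis replication group: one primary plus five Read Replicas,
# spread across AZs, with automatic failover enabled.
redis_params = {
    "ReplicationGroupId": "web-cache",                    # hypothetical
    "ReplicationGroupDescription": "Multi-AZ Redis cache",
    "Engine": "redis",
    "CacheNodeType": "cache.m5.large",                    # hypothetical sizing
    "NumCacheClusters": 6,   # 1 primary + 5 replicas, as in the example above
    "AutomaticFailoverEnabled": True,  # promote a replica if the primary fails
    "MultiAZEnabled": True,            # spread nodes across Availability Zones
}
```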
AWS serverless services
AWS provides a set of fully managed services that you can use to build and run serverless applications. You no longer need to worry about ensuring application fault tolerance and availability. Instead, AWS handles all of these capabilities for you.
2. Disaster Recovery:
Learn about disaster recovery in the AWS cloud.
AWS Cloud can be the perfect DR site for organizations. It provides the right set of tools for backup and recovery with no costs for infrastructure management and consumption-based pricing. Go through Using AWS for Disaster Recovery.
CloudEndure Disaster Recovery is an AWS service that makes it quick and easy to shift your disaster recovery strategy to the AWS cloud from existing physical or virtual data centers, private clouds, or other public clouds.
Refer to the AWS whitepaper Backup and Recovery Approaches Using AWS.
Know more about Affordable Enterprise-Grade Disaster Recovery Using AWS from CloudEndure.
3. Automatic Failover:
Cloud platforms provide automatic failover capabilities in highly available environments in different layers of a web application architecture. They allow failover configurations to be set up right from the DNS or domain service through content delivery networks, application servers, in-memory caching, and the database layers.
Let’s look into each of these layers from an automatic failover perspective and from the standpoint of the AWS cloud.
DNS Service (AWS Route 53)
The Route 53 DNS service offers out-of-the-box, health-check-based failover routing. When a website is hosted on multiple HTTP resources (or servers), Route 53 can be configured to perform health checks on these resources and respond to DNS queries only with the resources that are healthy.
For example, suppose your website, example.com, is hosted on six servers, two each in three data centers around the world. You can configure Route 53 to check the health of those servers and to respond to DNS queries for example.com using only the servers that are currently healthy.
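A simple two-record variant of this can be sketched as a Route 53 change batch with a PRIMARY/SECONDARY failover pair. The IP addresses and health-check IDs below are hypothetical; the record structure matches the Route 53 ChangeResourceRecordSets API.

```python
def failover_record(role: str, ip: str, health_check_id: str) -> dict:
    """Build one failover record set for example.com.
    role is "PRIMARY" or "SECONDARY"."""
    return {
        "Name": "example.com",
        "Type": "A",
        "SetIdentifier": role.lower(),  # must be unique per record set
        "Failover": role,
        "TTL": 60,
        "ResourceRecords": [{"Value": ip}],
        "HealthCheckId": health_check_id,  # hypothetical health-check id
    }

change_batch = {
    "Changes": [
        {"Action": "UPSERT",
         "ResourceRecordSet": failover_record("PRIMARY", "203.0.113.10", "hc-primary")},
        {"Action": "UPSERT",
         "ResourceRecordSet": failover_record("SECONDARY", "203.0.113.20", "hc-standby")},
    ]
}
# Route 53 answers with the PRIMARY record while its health check
# passes, and falls back to the SECONDARY record otherwise.
```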
Virtual/ application servers (AWS EC2 instances)
Automatic failover happens with the help of multiple availability zones.
Refer to making application failover seamless for a detailed account of how auto-failover happens for EC2 instances in the AWS cloud platform, leveraging Application Load Balancers, Elastic IPs, and domain name resolution.
Content Delivery Networks (AWS CloudFront)
As described in the “High Availability” section above, CloudFront can be set up with Origin Groups in which you designate a primary and a standby origin; CloudFront automatically switches to the standby origin when the primary returns specific HTTP failure status codes. Please refer to the detailed AWS documentation on optimizing CloudFront failover for high availability.
Relational Database Services (RDS)
Please refer to the AWS documentation on “how RDS supports high availability and automatic failover” in the “High Availability” section of this paper.
Summary
With so much on offer for application resilience, public cloud platforms are worth a try. Think about it. What’s in it for your organization? Do you have a case? Odds are high that you do.
Cloud adoption is no longer a privilege but an undeclared mandate in today’s world for acquiring a global digital footprint.
Feel free to share your thoughts.
Part 1 of this four-part blog series is available at What workloads you must move to the cloud – Part 1 (for application scalability)