Resilience Validation

Abstract

As application and infrastructure landscapes grow more complex, the risk of failure keeps increasing. Building a robust, highly available and fault-tolerant software system is both a challenging task and an absolute necessity in today's world. Even a minute of downtime can cost millions of dollars.

According to a study, 57 percent of about 1,200 major organizations experience one or more application failures per month, resulting in user inconvenience or business disruptions. Interestingly, larger organizations tend to have more failures on average because of the greater complexity of their environments. Many companies struggle to find the right strategy and methodology for building a resilient system. Identifying the key tenets and the right feature groups is the primary focus of validation aimed at improving business continuity and application resilience.

The traditional approach to resiliency is changing to meet the requirements of today’s IT. Current industry trends focus on proactive failure testing and on providing latency and fault tolerance in an automated way. This paper covers how to deploy the right strategy to build resilient systems, and how an enhanced Chaos framework helped clients uncover and resolve potential resilience issues.

Some of the key takeaways from the paper are:

1) Key tenets of a resilient application.

2) Different testing and engineering approaches toward ensuring a resilient system.

3) Industry standards and best practices.

4) Overview of the enhanced Chaos framework and the ensuing value delivered.

Problem Statements:

Is our application resilient enough to handle unforeseen or unplanned events? Being prepared to handle disruptive events has become mandatory, since outage costs keep rising as the user base and system complexity grow. IHS Inc. (NYSE: IHS) revealed that, in aggregate, information and communication technology downtime is costing North American organizations $700 billion per year. The cascading effects are even greater for customers and strategic partners. Outages affect revenue, employee productivity, reputation and quality. Penalties must be paid under the Service Level Agreement and for the damage caused by the outage.

Applications in the customer's application landscape suffered frequent production outages, mostly caused by issues in the following categories: downstream, network, self-inflicted, firewall, caching DNS (DDoS attack), hardware and JMS, among others. Production incident analyses were carried out on various applications to document, quantify and better understand the nature of these outages. The frequent outages had significant financial implications.

Production incident analysis revealed a few key areas that contributed to outages:

Most of the applications had a single point of failure.
There was limited tracing and auditing to support root cause analysis (RCA) when systems suffered failures or errors. This increased the MTTR and hampered effective and efficient RCA.
Applications had limited fault tolerance and were easily impacted when downstream systems or near neighbors failed.
The applications needed manual intervention to recover from failures and had little self-healing capability.
The design and technology adopted prevented linear scalability for most of the applications.

All of these contributed to production incidents, which resulted in significant downtime and frequent outages across the systems.
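
To make the MTTR finding above concrete, the following is a minimal sketch of how MTTR could be computed from incident records. It assumes MTTR means mean time to recovery (the paper does not expand the acronym), and the function name and timestamps are hypothetical, not client data.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Return the average outage duration.

    `incidents` is a list of (start, end) datetime pairs, one per outage.
    This record layout is an assumption made for illustration.
    """
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum((end - start for start, end in incidents), timedelta())
    return total / len(incidents)


# Hypothetical incidents: one 30-minute and one 90-minute outage.
incidents = [
    (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 30)),
    (datetime(2024, 1, 19, 22, 0), datetime(2024, 1, 19, 23, 30)),
]
mttr = mean_time_to_recovery(incidents)
```

Tracking this figure before and after adding tracing and auditing is one way to check whether root cause analysis is actually getting faster.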

In this paper, we discuss how Cognizant's performance engineering and testing strategy helped the client make their applications resilient.

Solution:

To address production outages, a decision was made to transform the application into a highly resilient and highly available system.

Based on the cost-benefit analysis, a goal was set to reduce the impact of outages by 40% within one year.
Commitments were made to move from an average of 5-6 incidents per month to 3 per month.
Improvements were sought across key architectural tenets such as simplicity, high availability, portability, operation at scale, performance, automation and telemetry.

A roadmap leading to a highly available and resilient system was established based on the following principles.

Embrace High Availability technologies.

QA transformation.

Process Improvement.

Production Application Performance Management.

How an application should be

Resiliency Implementation:

System review was a key activity in implementing resiliency features. Development, QA, Architecture and Operations teams collaborated to discuss areas impacting resiliency, such as historical incidents, single points of failure, Failure Mode and Effects Analysis (FMEA) and data dependencies.

• This review process helped organize the resiliency features for implementation into a prioritization matrix.
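
One common way to rank items from an FMEA review is the risk priority number (RPN), the product of severity, occurrence and detection scores, each on a 1 to 10 scale. The sketch below illustrates that general technique only; the paper does not state how the client's matrix was scored, and the failure modes and scores are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    severity: int    # 1 (negligible) to 10 (catastrophic)
    occurrence: int  # 1 (rare) to 10 (frequent)
    detection: int   # 1 (always detected) to 10 (practically undetectable)

    @property
    def rpn(self) -> int:
        """Risk priority number: higher means fix sooner."""
        return self.severity * self.occurrence * self.detection


def prioritize(modes):
    """Return failure modes ordered from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)


# Hypothetical review output, not data from the case study.
modes = [
    FailureMode("downstream timeout without fallback", 8, 7, 4),
    FailureMode("single database node", 9, 3, 2),
    FailureMode("missing request tracing", 5, 6, 8),
]
ranked = prioritize(modes)
```

The highest-RPN items become the first resiliency features to implement and the first scenarios to validate.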

Resiliency Validation:

Resilience validation for different applications and streams was broadly categorized across three planes.

Application Resilience: Resilience validation focusing on application architecture, design and coding, and business processes / flows within the application.

Platform Resilience: Validation of the resilience built into infrastructure and platform components (hardware, operating system, web / J2EE / .NET containers, databases, VMs, network, products etc.).

Operational Readiness: Review of the business processes, SLAs, monitoring and alerting mechanisms, and support structure (L1 / L2 support, application criticality mapping etc.).

Resiliency Feature Groups:

Various resiliency features are logically grouped under the following feature groups.

Test scenarios are designed to validate each feature group, which helps to ensure resiliency test coverage.

War Gaming:

War gaming is the process of simulating failures / production events and observing the system behavior.

• War gaming helps augment operational efficiency and stability through discovery, practice and teamwork.

• It ensures that customers experience the least possible impact during unanticipated production events.

• It validates the fault, load, latency and data tolerance levels of the application.
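
The kind of failure and latency simulation used in a war game can be sketched as a wrapper around a service call. This is a minimal illustration under stated assumptions, not the tooling used in the engagement: `with_chaos`, `InjectedFault` and `get_account` are hypothetical names.

```python
import random
import time


class InjectedFault(Exception):
    """Raised in place of a real failure during a war-gaming exercise."""


def with_chaos(func, *, failure_rate=0.0, latency_s=0.0, rng=None, sleep=time.sleep):
    """Wrap `func` so each call may be delayed and may fail on purpose.

    failure_rate: probability (0.0 to 1.0) that a call raises InjectedFault.
    latency_s: extra delay added before every call.
    rng and sleep are injectable so the behavior can be tested deterministically.
    """
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        if latency_s > 0:
            sleep(latency_s)
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)

    return wrapper


def get_account(account_id):
    # Stand-in for a real downstream call (hypothetical).
    return {"id": account_id, "status": "active"}


# Every call is delayed by 0.2 s and roughly one in four fails.
flaky_get_account = with_chaos(get_account, failure_rate=0.25, latency_s=0.2)
```

Running the application against such a wrapped dependency shows whether timeouts, retries and fallbacks behave as intended before a real production event does.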

Emerging industry standard frameworks and concepts

Netflix pioneered the field of resiliency validation. It has done a lot of R&D to enable resiliency features in applications and has provided solutions to validate the resilience of systems.

Hystrix:

Hystrix is a latency and fault tolerance library developed by Netflix to stop cascading failures and enable resilience in distributed systems.

Key Features

Helps control the interactions between distributed services by adding latency tolerance and fault tolerance logic.
Isolates the points of access between services, which helps stop cascading failures.
Provides fallback options.
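
Hystrix itself is a Java library, so the following Python sketch does not use its API. It only illustrates the circuit breaker with fallback pattern that underlies the features listed above, in simplified form; the class name, thresholds and clock parameter are assumptions made for illustration.

```python
import time


class CircuitBreaker:
    """Simplified circuit breaker: fail fast to a fallback after repeated errors."""

    def __init__(self, failure_threshold=3, reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock          # injectable for deterministic tests
        self._failures = 0
        self._opened_at = None       # None means the circuit is closed

    @property
    def is_open(self):
        if self._opened_at is None:
            return False
        # After the reset timeout, let one trial call through (half-open).
        return self._clock() - self._opened_at < self.reset_timeout_s

    def call(self, func, fallback):
        if self.is_open:
            return fallback()        # do not touch the failing dependency
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            return fallback()
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, callers get the fallback immediately instead of waiting on a failing downstream service, which is what keeps one slow dependency from cascading into its consumers.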

Simian Army:

The Simian Army is a suite of tools for keeping cloud systems in top form. These tools automate tasks such as bringing down instances, data centers and regions, running health checks, inducing latency, checking rules and finding unused resources.

Cognizant Home-Grown Chaos Framework:

Cognizant has developed a home-grown chaos framework inspired by the Netflix Simian Army tools. The Cognizant Chaos framework is a single platform / solution for orchestrating different failure and latency simulations on Windows, Linux and AWS platforms.

Benefits:

Easy to set up and runs from local PCs (personal computers).
Intuitive user interface and design.
Comes with built-in common scenarios.
Allows parallel job execution.
Provides a scheduler and executor in the GUI.
Easy to add new scenarios / bugs with a minimal amount of coding.
Comes with a template feature, which saves design and scheduling time.
Provides logs that track all events; these logs can be used for analyzing and troubleshooting issues.
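
The internals of the Cognizant framework are not described in this paper, so the following is only a hypothetical sketch of how a tool with these properties could be organized: scenarios registered with a few lines of code, jobs executed in parallel, and every event logged. All names are illustrative, and the scenario bodies only describe the action instead of performing it.

```python
import logging
from concurrent.futures import ThreadPoolExecutor

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("chaos")

SCENARIOS = {}  # scenario name -> callable taking a target host


def scenario(name):
    """Decorator that registers a new failure scenario under `name`."""
    def register(func):
        SCENARIOS[name] = func
        return func
    return register


@scenario("kill-process")
def kill_process(target):
    # Placeholder: a real scenario would stop a service on the target.
    return f"would stop the service process on {target}"


@scenario("add-latency")
def add_latency(target):
    # Placeholder: a real scenario would delay network traffic on the target.
    return f"would add network latency on {target}"


def run_jobs(jobs, max_workers=4):
    """Run (scenario_name, target) jobs in parallel and log each event."""
    def run_one(job):
        name, target = job
        log.info("start %s on %s", name, target)
        outcome = SCENARIOS[name](target)
        log.info("done %s on %s: %s", name, target, outcome)
        return name, target, outcome

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the submitted jobs.
        return list(pool.map(run_one, jobs))


results = run_jobs([("kill-process", "host-a"), ("add-latency", "host-b")])
```

In this layout, adding a scenario means writing one decorated function, and a saved list of jobs plays the role of a reusable template.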

Case Study:

The client is the largest cable company and home Internet service provider in the United States. Its services include cable television, broadband Internet, telephone service and, in some areas, home security (including burglar alarms, surveillance cameras, fire alarm systems and home automation) for both residential and commercial customers. Cognizant helped the client implement resiliency features using some of the newer solutions / platforms, for which the client was an early adopter. The Cognizant team had to innovate, improvise and develop testing, monitoring and debugging solutions to fill the gaps in industry-standard tools, which were still catching up.

Success Stories:

Conclusion:

Cognizant has a proven strategy in place that helps to ensure reliable and continuous business operations in the face of anticipated or unforeseen service disruptions and provides the agility to adapt to dynamic business needs.

References & Appendix

Cognizant internal reference links & materials.

Author Biography

Ramkumar Natarajan has over 13 years of experience, specializing in performance testing and engineering. He has expertise in initiating process improvements, proposing innovative ideas and building solutions for clients. Ram is part of the Cognizant QE&A NFT CoE and leads its resiliency testing initiatives. He holds a Master’s degree in Computer Applications from Madras University. He previously worked with WIPRO and Ramco Systems.
