High availability (HA) means a system keeps functioning without disruption for long periods, so that users can access it almost all the time. Customers expect services to be available 24x7, 365 days a year, especially when a system failure could paralyze services linked to life, health, or well-being. It's important to note that high availability does not mean a system will never fail. It means the system is designed to manage failures and minimize downtime (the period during which the service is unavailable), so that services can be restored with negligible impact on users. Uptime is the period during which a system is in operation.
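To make uptime and downtime concrete, availability is commonly expressed as the share of total time a system was up. The snippet below is a minimal sketch of that arithmetic; the function name and the sample figures are made up for illustration.

```python
def availability_percent(uptime_hours: float, downtime_hours: float) -> float:
    """Return availability as a percentage of total observed time."""
    total_hours = uptime_hours + downtime_hours
    if total_hours <= 0:
        raise ValueError("total observed time must be positive")
    return 100.0 * uptime_hours / total_hours


if __name__ == "__main__":
    # Illustrative numbers only: 999 hours up and 1 hour down.
    print(f"{availability_percent(999, 1):.1f}% available")
```

With these sample numbers, one hour of downtime in a thousand hours of operation works out to 99.9% availability.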
A system failure occurs when components in the system fail, resulting in downtime and unavailability of system resources, which is technically called an outage. A system is sometimes brought down for maintenance, but most outages are caused by system failures. The number of global cloud service users is on the rise, but even the best-resourced cloud platforms, such as Office365 messaging or Kindle Services, are prone to crashes caused by hardware or software failures, sudden spikes in traffic, bugs, lack of redundancy, and so on.
Frequent downtime leaves a negative impression on users and discourages them from using the service again, which exposes the provider to both revenue loss and customer frustration, especially during peak hours. High availability testing plays a significant role in ensuring that a service delivers top-notch performance and suffers minimal downtime despite system failures.
Why high availability testing?
High availability testing helps find bugs and issues that adversely affect the functioning of a system. It gives engineers insight into how a system behaves in failover situations. In simple terms, failover is the mechanism that avoids or reduces the negative impact of a primary system failure by automatically switching to a highly reliable backup.
High availability testing ensures that such a backup actually keeps the system accessible in the hour of need. Redundancy is added to a system to eliminate single points of failure, so that the failure of a single component does not mean failure of the entire system. For example, a mirrored RAID (redundant array of independent disks) setup stores the same data on multiple disks, so if one disk fails, the same data can still be read from the others. The most effective way to make sure redundancy works as intended is high availability testing.
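The failover idea described above can be sketched in a few lines. This is a simplified illustration, not a production pattern: the replicas are plain Python functions, and the names `call_with_failover`, `failed_primary` and `healthy_backup` are hypothetical stand-ins for real services.

```python
from typing import Callable, Sequence


class AllReplicasFailed(Exception):
    """Raised when neither the primary nor any backup could serve the request."""


def call_with_failover(replicas: Sequence[Callable[[], str]]) -> str:
    """Try each replica in order and return the first successful response."""
    failures = 0
    for replica in replicas:
        try:
            return replica()
        except ConnectionError:
            failures += 1  # this replica is down, fail over to the next one
    raise AllReplicasFailed(f"{failures} replica(s) failed")


def failed_primary() -> str:
    # Simulated fault: the primary is unreachable.
    raise ConnectionError("primary is down")


def healthy_backup() -> str:
    return "served by backup"


if __name__ == "__main__":
    print(call_with_failover([failed_primary, healthy_backup]))
```

A high availability test in this toy setup deliberately breaks the primary and checks that the caller still gets an answer from the backup.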
High availability testing also helps in determining MTBF (Mean Time Between Failures), the arithmetic average of the time a product or device operates between failures.
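MTBF is usually computed as total operating time divided by the number of failures observed in that time. Here is a small sketch of the calculation; the function name and the sample values are assumptions for illustration only.

```python
def mean_time_between_failures(operating_hours: float, failure_count: int) -> float:
    """Return MTBF in hours: total operating time divided by number of failures."""
    if failure_count <= 0:
        raise ValueError("MTBF is undefined without at least one observed failure")
    return operating_hours / failure_count


if __name__ == "__main__":
    # Illustrative numbers only: 1,000 operating hours with 4 failures.
    print(mean_time_between_failures(1000, 4), "hours")
```

In this made-up case the device runs for 250 hours on average between failures.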
Objectives behind high availability testing
Check whether the system design is capable of high availability
Increase fault tolerance
Prevent outright failures of online cloud services
Design a fault model
Fault tolerance is important for a highly available system because it avoids outages by responding quickly to unexpected failures of software or hardware components. A fault model is created to map all the kinds of faults that could occur in a system. Creating a fault model is a stepping stone to fault tolerance testing, which verifies that the system can deal with those faults without affecting the availability of resources to users.
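As a rough illustration of how a fault model feeds fault tolerance testing, the sketch below lists a few faults for a toy two-node service, injects each one into a fresh instance, and records whether the service stayed available. Everything here (`Fault`, `ToyService`, the node names) is hypothetical; a real fault model would cover many more fault types.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Fault:
    """One entry in the fault model: a named fault and the nodes it takes down."""
    name: str
    failed_nodes: Tuple[str, ...]


class ToyService:
    """Hypothetical service that stays available while any node is healthy."""

    def __init__(self) -> None:
        self.healthy = {"node-a": True, "node-b": True}

    def inject(self, fault: Fault) -> None:
        for node in fault.failed_nodes:
            self.healthy[node] = False

    def is_available(self) -> bool:
        return any(self.healthy.values())


# The fault model: every fault we want the system to be tested against.
FAULT_MODEL: List[Fault] = [
    Fault("node-a crash", ("node-a",)),
    Fault("node-b crash", ("node-b",)),
    Fault("both nodes crash", ("node-a", "node-b")),
]


def run_fault_tolerance_tests(fault_model: List[Fault]) -> Dict[str, bool]:
    """Inject each fault into a fresh service and record whether it stayed up."""
    results = {}
    for fault in fault_model:
        service = ToyService()
        service.inject(fault)
        results[fault.name] = service.is_available()
    return results


if __name__ == "__main__":
    for name, available in run_fault_tolerance_tests(FAULT_MODEL).items():
        print(f"{name}: {'tolerated' if available else 'OUTAGE'}")
```

The toy service tolerates either single-node crash but not the loss of both nodes, which is exactly the kind of gap a fault model is meant to surface.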
Importance of high availability testing in production
It's not sufficient to conduct high availability testing only in a pre-production environment, because infrastructure upgrades and maintenance factors are different in production. Faults related to software, hardware, and firmware are environment-specific, and production environments often expand to scale, with new datacenters, more database hardware, and so on. All these aspects make testing in the production environment important for the high availability of system resources.
With this, I am wrapping up this blog, hoping it has helped you understand the role of high availability testing in making a system operational for a long time.