Practical Site Reliability Engineering

Demystifying the Site Reliability Engineering Paradigm

To provide competitive and cognitive services to their customers and clients, businesses across the globe are strategizing to leverage the distinct capabilities of IT systems. There is widespread recognition that IT is the most crucial contributor to achieving the required business automation, augmentation, and acceleration. The advancements being made in the IT space directly enable the much-anticipated business productivity, agility, affordability, and adaptivity. In other words, businesses across the globe expect their offerings, outputs, and operations to be robust, reliable, and versatile. This demand has a direct and decisive impact on IT, and hence IT professionals are striving to put highly responsive, resilient, scalable, available, and secure systems in place to meet the varying needs and mandates of businesses. Thus, with the informed adoption of the noteworthy advancements in the IT space, business houses and behemoths aim to fulfil the elusive goal of customer satisfaction.

Recently, there has been a widespread insistence on IT reliability that, in turn, enables business dependability. There are refined processes, integrated platforms, enabling patterns, breakthrough products, best practices, optimized infrastructures, adaptive features, and architectures that work toward heightened IT reliability.

This chapter will explain the following topics:

The origin
The journey so far
The fresh opportunities and possibilities
The prospects and perspectives
The impending challenges and concerns
The future

Precisely speaking, the charter for any Site Reliability Engineering (SRE) team in any growing IT organization is twofold: one part is how to create highly reliable applications, and the other is how to plan, provision, and put up highly dependable, scalable, available, performing, and secure infrastructures to host and run those applications.

Setting the context for practical SRE

It is appropriate to give some background information for this new engineering discipline to enhance readability. SRE is a quickly emerging and evolving field of study and research. The market and mind shares of the SRE field are consistently climbing. Businesses, having decisively understood the strategic significance of SRE, are keen to formulate and firm up a workable strategy.

Characterizing the next-generation software systems

Software applications are increasingly complicated yet sophisticated. Highly integrated systems are the new norm these days. Enterprise-grade applications ought to be seamlessly integrated with several third-party software components running in distributed and disparate systems. Increasingly, software applications are composed of a number of interactive, transformative, and disruptive services in an ad hoc manner on an as-needed basis. Multi-channel, multimedia, multi-modal, multi-device, and multi-tenant applications are becoming pervasive and persuasive. There are also enterprise, cloud, mobile, Internet of Things (IoT), blockchain, cognitive, and embedded applications hosted in virtual and containerized environments. Then, there are industry-specific and vertical applications (energy, retail, government, telecommunication, supply chain, utility, healthcare, banking, insurance, automobiles, avionics, and robotics) being designed and delivered via cloud infrastructures.

There are software packages, homegrown software, turnkey solutions, scientific and technical computing services, and customizable and configurable software applications to meet distinct business requirements. In short, there are operational, transactional, and analytical applications running on private, public, and hybrid clouds. With the exponential growth of connected devices, smart sensors and actuators, fog gateways, smartphones, microcontrollers, and single board computers (SBCs), software-enabled data analytics is moving closer to edge devices to accomplish real-time data capture, processing, decision-making, and action.

We are destined to move towards real-time analytics and applications. Thus, it is clear that software is purposefully penetrative, participative, and productive. Largely, it is quite a software-intensive world.

Characterizing the next-generation hardware systems

Similar to the quickly growing software engineering field, hardware engineering is also on the fast track. These days, there are clusters, grids, and clouds of IT infrastructures. There are powerful appliances, cloud-in-a-box options, hyper-converged infrastructures, and commodity servers for hosting IT platforms and business applications. The physical machines are touted as bare metal servers. The virtual versions of the physical machines are the virtual machines and containers. We are heading toward the era of hardware infrastructure programming. That is, closed, inflexible, and difficult to manage and maintain bare-metal servers are being partitioned into a number of virtual machines and containers that are highly flexible, open, easily manageable, and replaceable, not to mention quickly provisionable, independently deployable, and horizontally scalable. The infrastructure partitioning and provisioning gets sped up with scores of automated tools to enable the rapid delivery of software applications. The rewarding aspects of continuous integration, deployment, and delivery are being facilitated through a combination of containers, microservices, configuration management solutions, DevOps tools, and Continuous Integration (CI) platforms.

Moving toward hybrid IT and distributed computing

Worldwide institutions, individuals, and innovators are keenly embracing cloud technology with all its clarity and confidence. With the faster maturity and stability of cloud environments, there is a distinct growth in building and delivering cloud-native applications, and there are viable articulations and approaches to readily make cloud native software. Traditional and legacy software applications are being meticulously modernized and moved to cloud environments to reap the originally envisaged benefits of the cloud idea. Cloud software engineering is one hot area, drawing the attention of many software engineers across the globe. There are public, private, and hybrid clouds. Recently, we have heard more about edge/fog clouds. Still, there are traditional IT environments that are being considered in the hybrid world.

There are development teams all over the world working in multiple time zones. Due to the diversity and multiplicity of IT systems and business applications, distributed applications are being touted as the way forward. That is, the various components of any software application are being distributed across multiple locations to enable redundancy-based high availability. Fault tolerance, lower latency, independent software development, and no vendor lock-in are being given as the reasons for the move to distributed applications. Accordingly, software programming models are being adroitly tweaked so that they deliver optimal performance in the era of distributed and decentralized applications. Such globally distributed development has become the new norm in this hybrid world of on-shore and off-shore development.

With the big-data era upon us, we need the most usable and uniquely distributed computing paradigm through the dynamic pool of commoditized servers and inexpensive computers. With the exponential growth of connected devices, the days of device clouds are not too far away. That is, distributed and decentralized devices are bound to be clubbed together in large numbers to form ad hoc and application-specific cloud environments for data capture, ingestion, pre-processing, and analytics. Thus, there is no doubt that the future belongs to distributed computing. The fully matured and stabilized centralized computing is unsustainable due to the need for web-scale applications. Also, the next-generation internet is the internet of digitized things, connected devices, and microservices.

Envisioning the digital era

There are a bunch of digitization and edge technologies bringing forth a number of business innovations and improvisations. As enterprises are embracing these technologies, the ensuing era is being touted as the digital transformation and intelligence era. This section describes what needs to change through the absorption of these pioneering and path-breaking technologies and tools.

The field of information and communication technology (ICT) is rapidly growing with the arrival of scores of pioneering technologies, and this trend is expediently and elegantly automating multiple business tasks. Then, the maturity and stability of orchestration technologies and tools is bound to club together multiple automated jobs and automate the aggregated ones. We will now discuss the latest trends and transitions happening in the ICT space.

Due to the heterogeneity and multiplicity of software technologies such as programming languages, development models, data formats, and protocols, software development and operational complexities are growing continuously. There are several breakthrough mechanisms to develop and run enterprise-grade software in an agile and adroit fashion. A number of complexity mitigation and rapid development techniques have emerged for producing production-grade software in a swift and smart manner. Techniques such as "divide and conquer" and "the separation of crosscutting concerns" are being consistently applied, and developers are being encouraged to develop risk-free and futuristic software services. The concepts of abstraction, encapsulation, virtualization, and other compartmentalization methods are being invoked to reduce the software production pain. In addition, there are performance engineering and enhancement aspects that are getting the utmost consideration from software architects. Thus, software development processes, best practices, design patterns, evaluation metrics, key guidelines, integrated platforms, enabling frameworks, simplifying templates, and programming models are gaining immense significance in this software-defined world.

Thus, there are several breakthrough technologies for digital innovations, disruptions, and transformations. Primarily, the IoT paradigm generates a lot of multi-structured digital data, and the famous artificial intelligence (AI) technologies, such as machine and deep learning, enable the extrication of actionable insights out of the digital data. Transitioning raw digital data into information, knowledge, and wisdom is the key differentiator for implementing digitally transformed and intelligent societies. Cloud IT is being positioned as the best-in-class IT environment for enabling and expediting the digital transformation.

With digitization and edge technologies, our everyday items become digitized to join in with mainstream computing. That is, we will be encountering trillions of digitized entities and elements in the years ahead. With the faster stability and maturity of the IoT, cyber physical systems (CPS), ambient intelligence (AmI), and pervasive computing technologies and tools, we are being bombarded with innumerable connected devices, instruments, machines, drones, robots, utilities, consumer electronics, wares, equipment, and appliances. Now, with the unprecedented interest and investment in AI (machine and deep learning, computer vision, and natural language processing) algorithms and approaches, IoT device data (from collaborations, coordination, correlation, and corroboration) is meticulously captured, cleansed, and crunched to extricate actionable insights/digital intelligence in time. There are several promising, potential, and proven digital technologies emerging and evolving quickly, in synchronization with a variety of data mining, processing, and analytics techniques. These innovations and disruptions eventually lead to digital transformation. Thus, digitization and edge technologies in association with digital intelligence algorithms and tools lead to the realization and sustenance of digitally transformed environments (smarter hotels, homes, hospitals, and so on). We can easily anticipate and articulate digitally transformed countries, counties, and cities in the years to come with pioneering and groundbreaking digital technologies and tools.

The cloud service paradigm

The cloud era is setting in and settling steadily. The aiding processes, platforms, policies, procedures, practices, and patterns are being framed and firmed up by IT professionals and professors, to tend toward the cloud. The following sections give the necessary details for our esteemed readers.

The cloud applications, platforms, and infrastructures are gaining immense popularity these days. Cloud applications are of two primary types:

Cloud-enabled: The currently running massive and monolithic applications get modernized and migrated to cloud environments to reap the distinct benefits of the cloud paradigm
Cloud-native: This is all about designing, developing, debugging, delivering, and deploying applications directly on cloud environments by intrinsically leveraging the non-functional capabilities of cloud environments

The current and conventional applications that are hosted and running on various IT environments are being meticulously modernized and migrated to standardized and multifaceted cloud environments to reap all the originally expressed benefits of the cloud paradigm. Besides enabling business-critical, legacy, and monolithic applications to be cloud-ready, there are endeavors for designing, developing, debugging, deploying, and delivering enterprise-class applications in cloud environments, harvesting all of the unique characteristics of cloud infrastructures and platforms. These applications natively absorb the various characteristics of cloud infrastructures and act adaptively. The microservices architecture (MSA) is the style for designing next-generation enterprise-class applications. MSA is being deftly leveraged to enable massive applications to be partitioned into a collection of decoupled, easily manageable, and fine-grained microservices.

With the decisive adoption of cloud technologies and tools, every component of enterprise IT is being readied to be delivered as a service. The cloud idea has really and rewardingly brought in a stream of innovations, disruptions, and transformations for the IT industry. The days of IT as a Service (ITaaS) will soon become a reality, due to a stream of noteworthy advancements and accomplishments in the cloud space.

The ubiquity of cloud platforms and infrastructures

The other key aspect is to have reliable, available, scalable, and secure IT environments (cloud and non-cloud). We talked about producing versatile software packages and libraries. We also talked about setting up and sustaining appropriate IT infrastructures for successfully running various kinds of IT and business applications. Increasingly, the traditional data centers and server farms are being modernized through the smart application of cloud-enablement technologies and tools. The cloud idea establishes and enforces IT rationalization, the heightened utilization of IT resources, and optimization. There is a growing number of massive public cloud environments (AWS, Microsoft Azure, Google Cloud, IBM Cloud, and Oracle Cloud) that encompass thousands of commodity and high-end server machines, storage appliance arrays, and networking components to accommodate and accomplish the varying IT needs of the whole world. Government organizations, business behemoths, various service providers, and institutions are transforming their own IT centers into private cloud environments. Then, on an as-needed basis, private clouds are beginning to match the various capabilities of public clouds to meet specific requirements. In short, cloud environments are being positioned as the one-stop IT solution for our professional, social, and personal IT requirements.

The cloud is becoming pervasive with the unique contributions of many players from the IT industry, worldwide academic institutions, and research labs. We have plenty of private, public, and hybrid cloud environments. The surging popularity of fog/edge computing leads to the formation of fog/edge device clouds, which are contributing immensely to produce people-centric and real-time applications. The fog or edge device computing is all about leveraging scores of connected and capable devices to form a kind of purpose-specific as well as agnostic device cloud to collect, cleanse and crunch sensor, actuator, device, machine, instrument, and equipment poly-structured and real-time data emanating from all sorts of physical, mechanical, and electrical systems on the ground. With the projected billions of connected devices, the future beckons and bats for device clusters and clouds. Definitely, the cloud movement has penetrated every industry and the IT phenomenon is redefined and resuscitated by the roaring success of the cloud. Soon, cloud applications, platforms, and infrastructures will be everywhere. IT is all set to become the fifth social utility. The pertinent and paramount challenge is how to bring forth deeper and decisive automation in the cloud IT space.

The need for deeply automated and adaptive cloud centers

With clouds emerging as the most flexible, futuristic, and fabulous IT environments to host and run IT and business workloads, there is a rush for bringing as much automation as possible to speed up the process of cloud migration, software deployment and delivery, cloud monitoring, measurement and management, cloud integration and orchestration, cloud governance and security, and so on. There are several trends and transitions happening simultaneously in the IT space to realize these goals.

The growing software penetration and participation

Marc Andreessen famously penned the article "Why Software Is Eating the World" several years ago. Today, we widely hear, read, and even sometimes experience buzzwords such as software-defined compute, storage, and networking. Software is everywhere and gets embedded in everything. Software has, unquestionably, been the principal business automation and acceleration enabler. Nowadays, on its memorable and mesmerizing journey, software is penetrating into every tangible thing (physical, mechanical, and electrical) in our everyday environments to transform them into connected entities, digitized things, smart objects, and sentient materials. For example, every advanced car today has been sagaciously stuffed with millions of lines of code to be elegantly adaptive in its operations, outputs, and offerings.

Precisely speaking, the ensuing era sets the stage for having knowledge-filled, situation-aware, event-driven, service-oriented, cloud-hosted, process-optimized, and people-centric applications. These applications need to exhibit a few extra capabilities. That is, the next-generation software systems innately have to be reliable, rewarding, and reactive ones. Also, we need to arrive at competent processes, platforms, patterns, procedures, and practices for creating and sustaining high-quality systems. There are widely available non-functional requirements (NFRs), quality of service (QoS), and quality of experience (QoE) attributes, such as availability, scalability, modifiability, sustainability, security, portability, and simplicity. The challenge for every IT professional lies in producing software that unambiguously and intrinsically guarantees all the NFRs.

Agile application design: We have come across a number of agile software development methodologies, such as extreme programming, pair programming, and scrum. However, for the agile design of enterprise-grade applications, it is the maturity and stability of MSA that activates and accelerates application design.

Accelerated software programming: As we all know, enterprise-scale and customer-facing software applications are being developed speedily nowadays, with the faster maturity of potential agile programming methods, processes, platforms, and frameworks. There are other initiatives and inventions enabling speedier software development. There are component-based software assemblies, and service-oriented software engineering is steadily growing. There are scores of state-of-the-art tools consistently assisting component and service-based application-building phenomena. On the other hand, the software engineering aspect gets simplified and streamlined through the configuration, customization, and composition-centric application generation methods.

Automated software deployment through DevOps: There are multiple reasons why software programs run well on the developer's machine but not so well in other environments, including production environments. There are different editions, versions, and releases of software packages, platforms, programming languages, and frameworks, so running software suites consistently across different environments is difficult. There is a big disconnect between developers and operations teams due to constant friction between development and operating environments.

Further on, with agile programming techniques and tips, software applications get constructed quickly, but their integration, testing, building, delivery, and deployment aspects are not automated. Therefore, concepts such as DevOps, NoOps, and AIOps have gained immense prominence, bringing in several automation capabilities that empower IT administrators. That is, these new arrivals have facilitated a seamless and spontaneous synchronization between software design, development, debugging, deployment, delivery, and decommissioning processes and people. The emergence of configuration management tools and cloud orchestration platforms enables IT infrastructure programming, and Infrastructure as Code (IaC) is facilitating the DevOps concept. The faster provisioning of infrastructure resources through configuration files, and the deployment of software on those infrastructure modules, is the core and central aspect of the flourishing concept of DevOps.
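
To make the idea of provisioning through configuration files a little more concrete, here is a minimal sketch in Python. It assumes a declared desired state and a reported current state, and derives the provisioning steps needed to reconcile the two. The resource names and the plan function are hypothetical and are not taken from any real IaC tool.

```python
from typing import Dict, List, Tuple

# Desired state, as it might be declared in a configuration file (hypothetical names).
desired: Dict[str, int] = {"web-server": 3, "db-server": 1}

# Current state, as reported by an imaginary infrastructure provider.
current: Dict[str, int] = {"web-server": 1, "cache-server": 2}


def plan(desired: Dict[str, int], current: Dict[str, int]) -> List[Tuple[str, str, int]]:
    """Return (action, resource, count) steps that move current toward desired."""
    steps: List[Tuple[str, str, int]] = []
    for name in sorted(set(desired) | set(current)):
        diff = desired.get(name, 0) - current.get(name, 0)
        if diff > 0:
            steps.append(("create", name, diff))
        elif diff < 0:
            steps.append(("destroy", name, -diff))
    return steps


for action, name, count in plan(desired, current):
    print(f"{action} {count} x {name}")
```

The point of the sketch is that the operator edits only the declared state; the steps are computed rather than scripted by hand, which is what makes the provisioning repeatable.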

This is the prime reason why the concept of DevOps has started flourishing these days. This is quite a new idea that's gaining a lot of momentum within enterprise and cloud IT teams. Companies embrace this new cultural change with the leverage of multiple toolsets for Continuous Integration (CI), Continuous Delivery (CD), and Continuous Deployment (CD). Precisely speaking, besides producing enterprise-grade software applications and platforms, realizing and sustaining virtualized/containerized infrastructures with the assistance of automated tools to ensure continuous and guaranteed delivery of software-enabled and IT-assisted business capabilities to mankind is the need of the hour.

Plunging into the SRE discipline

We have understood the requirements and the challenges. The following sections describe how the SRE field is used to bridge the gap between supply and demand. As explained previously, building software applications through configuration, customization, and composition (orchestration and choreography) is progressing quickly. Speedier programming of software applications using agile programming methods is another incredible aspect of software building. The various DevOps tools from product and tool vendors quietly ensure continuous software integration, delivery, and deployment.

The business landscape is continuously evolving, and consequently the IT domain has to respond precisely and perfectly to the changing equations and expectations of the business houses. Businesses have to be extremely agile, adaptive, and reliable in their operations, offerings, and outputs. Business automation, acceleration, and augmentation are being solely provided by the various noteworthy improvements and improvisations in the IT domain.

IT agility and reliability directly guarantee business agility and reliability. As seen previously, the goal of IT agility (software design, development, and deployment) is getting fulfilled through newer techniques. Nowadays, IT experts are looking out for ways and means of significantly enhancing IT reliability. Typically, IT reliability equals IT elasticity plus resiliency. Let us refer to the following bullets:

IT elasticity: When an IT system is suddenly under a heavy load, how does the IT system provision and use additional IT resources to take care of extra loads without affecting users? IT systems are supposed to be highly elastic to be right and relevant for the future of businesses. Furthermore, not only IT systems but also the business applications and the IT platforms (development, deployment, integration, orchestration, brokerage, and so on) have to be scalable. Thus, the combination of applications, platforms, and infrastructures has to be innately scalable (vertically, as well as horizontally).
IT resiliency: When an IT system is under attack from internal as well as external sources, the system has to have the wherewithal to wriggle out of that situation to continuously deliver its obligations to its subscribers without any slowdown and breakdown. IT systems have to be highly fault-tolerant to be useful for mission-critical businesses. IT systems have to come back to their original situation automatically, even if they are made to deviate from their prescribed path. Thus, error prediction, identification, isolation, and other capabilities have to be embedded into IT systems. Security and safety issues also have to be dexterously detected and contained to come out unscathed.

Thus, when IT systems are resilient and elastic, they are termed reliable systems. When IT is reliable, then the IT-enabled businesses can be reliable in their deals, deeds, and decisions that, in turn, enthuse and enlighten their customers, employees, partners, and end users.
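
The elasticity question raised above (how many resources to add under load) can be illustrated with a small Python sketch. It assumes a simple proportional rule: scale the number of replicas by the ratio of observed load to target load per replica, then clamp the result to fixed bounds. The function name, parameters, and default limits are assumptions made for illustration; real autoscalers add smoothing, cooldown periods, and tolerance bands.

```python
import math


def desired_replicas(current: int, observed_load: float, target_load: float,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Scale the replica count in proportion to observed load per replica."""
    if current < 1 or target_load <= 0:
        raise ValueError("current must be >= 1 and target_load must be positive")
    # Proportional rule: more load per replica than targeted means more replicas.
    wanted = math.ceil(current * observed_load / target_load)
    # Clamp so that the system never scales to zero or beyond its budget.
    return max(min_replicas, min(max_replicas, wanted))


# Two replicas at 90% average utilization with a 60% target suggests scaling out.
print(desired_replicas(2, 90.0, 60.0))
```

The same rule also scales in: when the observed load falls well below the target, the computed replica count drops, down to the configured minimum.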

The challenges ahead

The following are some challenges you may come across:

Bringing forth a bevy of complexity mitigation techniques. The formula is heterogeneity + multiplicity = complexity. IT (software and infrastructure) complexity is constantly increasing.
Producing software packages that are fully compliant with various NFRs and QoS/QoE attributes, such as scalability, availability, stability, reliability, extensibility, accessibility, simplicity, performance/throughput, and so on.
Performing automated IT infrastructure provisioning, scheduling, configuration, monitoring, measurement, management, and governance.
Providing VM and container placement, serverless computing/Function as a Service (FaaS), workload consolidation, energy efficiency, task and job scheduling, resource allocation and usage optimization, service composition for multi-container applications, horizontal scalability, and Resource as a Service (RaaS).
Establishing IT automation, integration, and orchestration for self-service, autonomous, and cognitive IT.
Accomplishing log, operational, and performance/scalability analytics using AI (machine and deep learning) algorithms for producing real-time, predictive, and prescriptive insights.
Building technology sponsored solutions for enabling NoOps, ChatOps, and AIOps. The challenge is to bring forth viable and versatile solutions in the form of automated tools for fulfilling their unique requirements.
Container clustering, orchestration, and management platform solutions for producing, deploying, and sustaining microservices-centric software applications.
Bringing forth versatile software solutions such as standards-compliant service mesh solutions, API gateways, and management suites for ensuring service resiliency. With more microservices and their instances across containers (service run time), the operational complexity is on the rise.
Building resilient and reliable software through pioneering programming techniques such as reactive programming and architectural styles such as event-driven architecture (EDA).

The idea is to clearly illustrate the serious differences between agile programming, DevOps, and the SRE movement. There are several crucial challenges ahead, as we have mentioned. And the role and responsibility of the SRE technologies, tools, and tips are going to be strategic and significant toward making IT reliable, robust, and rewarding.

The need for highly reliable platforms and infrastructures

We discussed cloud-enabled and cloud-native applications and how they are hosted on underlying cloud infrastructures to accomplish service delivery. Applications are significantly functional. However, the non-functional requirements, such as application scalability, availability, security, reliability, performance/throughput, modifiability, and so on, are being widely demanded. That is, producing high-quality applications is a real challenge for IT professionals. There are design, development, testing, and deployment techniques, tips, and patterns to incorporate the various NFRs into cloud applications. There are best practices and key guidelines to come out with highly scalable, available, and reliable applications.

The second challenge is to set up and sustain highly competent and cognitive cloud infrastructures that exhibit reliable behavior. The combination of highly resilient, robust, and versatile applications and infrastructures leads to the implementation of highly dependable IT that meets the business needs for productivity, affordability, and adaptivity.

Having understood the tactical and strategic significance and value, businesses are consciously embracing the pioneering cloud paradigm. That is, all kinds of traditional IT environments are becoming cloud-enabled to reap the originally expressed business, technical, and use benefits. However, the cloud formation alone is not going to solve every business and IT problem. Besides establishing purpose-specific and agnostic cloud centers, there are a lot more things to be done to attain business agility and reliability. The cloud center operation processes need to be refined, integrated, and orchestrated to arrive at optimized and organized processes. Each of the cloud center operations needs to be precisely defined and automated in order to fulfil the true meaning of IT agility. With agile and reliable cloud applications and environments, the business competency and value are bound to go up remarkably.

The need for reliable software

We know that the subject of software reliability is a crucial one for the continued success of software engineering in the ensuing digital era. However, it is not an easy thing to achieve. Because of the rising complexity of software suites, ensuring high reliability turns out to be a tough and time-consuming affair. Experts, evangelists, and exponents have come out with a few interesting and inspiring ideas for accomplishing reliable software systems. Primarily, there are two principal approaches; these are as follows:

Resilient microservices can lead to the realization of reliable software applications. Popular technologies include microservices, containers, Kubernetes, Terraform, API Gateway and Management Suite, Istio, and Spinnaker.
Reactive systems (resilient, responsive, message-driven, and elastic): this is based on the famous Reactive Manifesto. There are a few specific languages and platforms (http://vertx.io/, http://reactivex.io/, https://www.lightbend.com/products/reactive-platform, RxJava, Play Framework, and so on) for producing reactive systems. Akka is a toolkit for building highly concurrent, distributed, and resilient message-driven applications for Java and Scala.

Here are the other aspects being considered for producing reliable software packages:

Verification and validation of software reliability through various testing methods
Software reliability prediction algorithms and approaches
Static and dynamic code analysis methods
Patterns, processes, platforms, and practices for building reliable software packages

Let's discuss these in detail.

The emergence of microservices architecture

Mission-critical and versatile applications are to be built using the highly popular microservices architecture (MSA) pattern. Monolithic applications are being consciously dismantled using the MSA paradigm so that they stay right and relevant for their users and owners. Microservices are the new building blocks for constructing next-generation applications. Microservices are easily manageable, independently deployable, horizontally scalable, relatively simple services. Microservices are publicly discoverable, network accessible, interoperable, API-driven, composable, replaceable, and highly isolated.

Future software development will primarily be about finding appropriate microservices. Here are a few advantages of the MSA style:

Scalability: Any production-grade application can typically use three types of scaling. The x-axis scaling is for horizontal scalability. That is, the application has to be cloned to guarantee high availability. The second type is y-axis scaling. This is for splitting the application into its various functionalities. With microservices architecture, applications (legacy, monolithic, and massive) are partitioned into a collection of easily manageable microservices. Each unit fulfils one responsibility. The third is z-axis scaling, which is for partitioning or sharding the data. The database plays a vital role in shaping up dynamic applications. With NoSQL databases, the concept of sharding came into prominence.
Availability: Multiple instances of microservices are deployed in different containers (Docker) to guarantee high availability. Through this redundancy, service and application availability is ensured. With multiple instances of services being hosted and run in Docker containers, load-balancing across service instances is used to ensure the high availability of services. The widely used circuit breaker pattern is used to accomplish the much-needed fault tolerance. That is, the redundancy of services through instances ensures high availability, whereas the circuit breaker pattern guarantees the resiliency of services. Service registry, discovery, and configuration capabilities lead to the development and discovery of newer services to bring forth additional business (vertical) and IT (horizontal) services. With services forming dynamic and ad hoc service meshes, the days of service communication, collaboration, corroboration, and correlation are not too far away.
Continuous deployment: Microservices are independently deployable, horizontally scalable, and self-defined. Microservices are decoupled or lightly coupled and cohesive, fulfilling the elusive mandate of modularity. Dependency-imposed issues get nullified by embracing this architectural style. This allows any service to be deployed independently of the others for faster and more continuous deployment.
Loose coupling: As indicated previously, microservices are autonomous and independent, innately providing the much-needed loose coupling. Every microservice has its own layered architecture at the service level and its own database at the backend.
Polyglot microservices: Microservices can be implemented through a variety of programming languages. As such, there is no technology lock-in. Any technology can be used to realize microservices. Similarly, there is no compulsion to use certain databases. Microservices work with any file system, SQL databases, NoSQL and NewSQL databases, search engines, and so on.

Performance: There are performance engineering and enhancement techniques and tips in the microservices arena. For example, services that make many blocking calls are implemented on a single-threaded technology stack, whereas CPU-intensive services are implemented using multiple threads.
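
As a minimal sketch of the z-axis scaling described above, the following Python snippet routes each record to a shard by hashing its key. The shard names and the shard_for helper are hypothetical and serve only to illustrate the idea; production systems typically rely on consistent hashing or on the partitioner built into the database.

```python
import hashlib

# Hypothetical shard names; a real deployment would list database endpoints.
SHARDS = ["shard-0", "shard-1", "shard-2"]

def shard_for(key: str, shards=SHARDS) -> str:
    """Map a record key to a shard using a stable hash (z-axis scaling)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]

if __name__ == "__main__":
    for customer_id in ("alice", "bob", "carol"):
        print(customer_id, "->", shard_for(customer_id))
```

Because the hash is deterministic, the same key always lands on the same shard. Note that changing the number of shards remaps most keys under this simple modulo scheme, which is one reason consistent hashing is usually preferred.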

There are other benefits for business and IT teams by employing the fast-maturing and stabilizing microservices architecture. The tool ecosystem is on the climb, and hence implementing and involving microservices gets simplified and streamlined. Automated tools ease and speed up building and operationalizing microservices.

Docker enabled containerization

The Docker idea has shaken the software world. A bevy of hitherto-unknown advancements are being realized through containerization. The software portability requirement, which has been lingering for a long time, is being solved through the open source Docker platform. Docker containers hosting a variety of microservices can be scaled elastically in real time, and the resulting real-time scalability of business-critical software applications is touted as the key factor behind the surging popularity of containerization. The intersection of the microservices and Docker container domains has brought in paradigm shifts for software developers, as well as for system administrators. The lightweight nature of Docker containers, along with the standardized packaging format of the Docker platform, goes a long way in stabilizing and speeding up software deployment.

The container is a way to package software along with the configuration files, dependencies, and binaries required to run the software in any operating environment. There are a number of crucial advantages; they are as follows:

Environment consistency: Applications/processes/microservices running on containers behave consistently in different environments (development, testing, staging, replica, and production). This eliminates any kind of environmental inconsistencies and makes testing and debugging less cumbersome and less time-consuming.
Faster deployment: A container is lightweight and starts and stops in a few seconds, as it is not required to boot any OS image. This eventually helps to achieve faster creation, deployment, and high availability.

Isolation: Containers running on the same machine using the same resources are isolated from one another. When we start a container with the docker run command, the Docker platform does a few interesting things behind the scenes. That is, Docker creates a set of namespaces and control groups for the container. The namespaces and control groups (cgroups) are the kernel-level capabilities. The role of the namespaces feature is to provide the required isolation for the recently created container from other containers running in the host. Also, containers are clearly segregated from the Docker host. This separation does a lot of good for containers in the form of safety and security. Also, this unique separation ensures that any malware, virus, or any phishing attack on one container does not propagate to other running containers. In short, processes running within a container cannot see and affect processes running in another container or in the host system. Also, as we are moving toward a multi-container applications era, each container has to have its own network stack for container networking and communication. With this network separation, containers don't get any sort of privileged access to the sockets or interfaces of other containers in the same Docker host or across it. The network interface is the only way for containers to interact with one another as well as with the host. Furthermore, when we specify public ports for containers, the IP traffic is allowed between containers. They can ping one another, send and receive UDP packets, and establish TCP connections.
Portability: Containers can run everywhere. They can run on our laptops, enterprise servers, and cloud servers. That is, the long-standing goal of write once and run everywhere is being fulfilled through the containerization movement.

There are other important advantages of containerization. There are products and platforms that facilitate the cool convergence of containerization and virtualization to cater for emerging IT needs.

Containerized microservices

One paradigm shift in the IT space in the recent past is the emergence of containers for deftly hosting and running microservices. Because of the lightweight nature of containers, provisioning containers is done at lightning speed. Also, the horizontal scalability of microservices gets performed easily by their hosting environments (containers). Thus, this combination of microservices and containers brings a number of benefits for software development and IT operations. There can be hundreds of containers in a single physical machine.

This celebrated linkage makes it possible to have multiple instances of microservices on a machine. With containers talking to one another across Docker hosts, multiple microservice instances can find one another to compose bigger and better composite services that are business and process-aware. Thus, all the advancements in the containerization space have direct and indirect impacts on microservices engineering, management, governance, security, orchestration, and science.

The key technology drivers of containerized cloud environments are as follows:

The faster maturity and stability of containers (application and data).
New types of containers such as Kata Containers and HyperContainers.
MSA emerging as the most optimized architectural style for enterprise-scale applications.
There is a cool convergence between containers and microservices. Containers are the most optimized hosting and execution runtime for microservices.
Web/cloud, mobile, wearable and IoT applications, platforms, middleware, UI, operational, analytical, and transactional applications are modernized as cloud-enabled applications, and the greenfield applications are built as cloud-native applications.
The surging popularity of Kubernetes as the container clustering, orchestration, and management platform solution leads to the realization of containerized clouds.
The emergence of API gateways simplifies and streamlines the access and usage of microservices collectively.
The faster maturity and stability of service mesh solutions ensures the resiliency of microservices and the reliability of cloud-hosted applications.

The challenges of containerized cloud environments are as follows:

Moving from monoliths to microservices is not an easy transition.
There may be thousands of microservices and their instances (redundancy) in a cloud environment.
For crafting an application, the data and control flows ought to pass through different and distributed microservices spread across multiple cloud centers.
The best practice says that there is a one to one mapping between microservice instances and containers. That is, separate containers are being allocated for separate microservice instances.
Due to the resulting dense environments, the operational and management complexities of containerized clouds are bound to escalate.

Tracking and tracing service request messages and events among microservices turn out to be a complex affair.
Troubleshooting and doing root cause analyses in microservices environments become a tough assignment.
Container life cycle management functionalities have to be automated.
Client-to-microservice (north-to-south traffic) communication remains a challenge.
Service-to-service (east-to-west traffic) communication has to be made resilient and robust.

Kubernetes for container orchestration

An MSA requires creating and clubbing together several fine-grained and easily manageable services that are lightweight, independently deployable, horizontally scalable, extremely portable, and so on. Containers provide an ideal hosting and runtime environment for the accelerated building, packaging, shipping, deployment, and delivery of microservices. Other benefits include workload isolation and automated life cycle management. With a greater number of containers (microservices and their instances) being stuffed into every physical machine, the operational and management complexities of containerized cloud environments are on the higher side. Also, the number of multi-container applications is increasing quickly. Thus, we need a standardized orchestration platform along with container cluster management capability. Kubernetes is the popular container cluster manager, and it consists of several architectural components, including pods, labels, replication controllers, and services. Let's take a look at them:

As mentioned elsewhere, there are several important ingredients in the Kubernetes architecture. Pods are the most visible, viable, and ephemeral units, and each comprises one or more tightly coupled containers. That means containers within a pod sail and sink together. Kubernetes does not schedule or scale the individual containers within a pod. In other words, pods are the base unit of operation for Kubernetes. Kubernetes does not operate at the level of containers. There can be multiple pods in a single server node, and data sharing easily happens between pods. Kubernetes automatically provisions and allocates pods for various services. Each pod has its own IP address, and the containers inside it share localhost and volumes. When faults and failures occur, additional pods can be quickly provisioned and scheduled to ensure the continuity of services. Similarly, under heightened loads, Kubernetes adds additional resources in the form of pods to ensure system and service performance. Depending on the traffic, resources can be added and removed to fulfil the goal of elasticity.

Labels are typically the metadata that is attached to objects, including pods.
Replication controllers, as articulated previously, have the capability to create new pods leveraging a pod template. That is, as per the configuration, Kubernetes is able to run the specified number of pods at any point in time. Replication controllers accomplish this by continuously polling the container cluster. If any pod goes down, this controller software immediately jumps into action to incorporate an additional pod to ensure that the specified number of pods with a given set of labels are running within the container cluster.
Services are another capability that is embedded into the Kubernetes architecture. This functionality and facility offers a low-overhead way to route all kinds of service requests to a set of pods to accomplish the requests. Labels are the way forward for selecting the most appropriate pods. Services provide methods to externalize legacy components, such as databases, with a cluster. They also provide stable endpoints as clusters shrink and grow and become configured and reconfigured across new nodes within the cluster manager. Their job is to remove the pain of keeping track of application components that exist within a cluster instance.
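
The interplay of labels, replication controllers, and services can be sketched with a toy in-memory model. This is not the Kubernetes API; the Pod and ToyCluster classes and their method names are hypothetical and exist only to show how a label selector picks pods and how a reconcile step restores the specified replica count after a failure.

```python
from dataclasses import dataclass, field
from itertools import count

@dataclass
class Pod:
    name: str
    labels: dict
    healthy: bool = True

@dataclass
class ToyCluster:
    pods: list = field(default_factory=list)
    _ids: count = field(default_factory=count)

    def select(self, selector: dict) -> list:
        """Return healthy pods whose labels contain every selector pair."""
        return [p for p in self.pods
                if p.healthy and all(p.labels.get(k) == v for k, v in selector.items())]

    def reconcile(self, selector: dict, replicas: int) -> None:
        """Add or remove pods until `replicas` healthy pods match the selector."""
        self.pods = [p for p in self.pods if p.healthy]   # drop failed pods
        matching = self.select(selector)
        for _ in range(replicas - len(matching)):         # too few: create from template
            self.pods.append(Pod(f"pod-{next(self._ids)}", dict(selector)))
        for extra in matching[replicas:]:                 # too many: remove the surplus
            self.pods.remove(extra)

if __name__ == "__main__":
    cluster = ToyCluster()
    cluster.reconcile({"app": "web"}, replicas=3)
    cluster.pods[0].healthy = False                       # simulate a pod failure
    cluster.reconcile({"app": "web"}, replicas=3)
    print([p.name for p in cluster.select({"app": "web"})])
```

A real controller runs this comparison of desired and observed state continuously, which is why a failed pod is replaced without any operator involvement.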

The fast proliferation of application and data containers in producing composite services is facilitated through the leveraging of Kubernetes, and it is hastening the era of containerization. Both traditional and modern IT environments are embracing this compartmentalization technology to surmount some of the crucial challenges and concerns of the virtualization technology.

API Gateways and management suite: This is another platform for bringing in reliable client and service interactions. The various features and functionalities of API tools include the following:

It acts as a router. It is the only entry point to our collection of microservices. This way, microservices no longer need to be public and can sit behind an internal network. An API Gateway is responsible for routing each request to the appropriate service (service discovery).
It acts as a data aggregator. The API Gateway fetches data from several services and aggregates it to return a single rich response. Depending on the API consumer, the data representation may change according to the consumer's needs, and this is where the backend for frontend (BFF) pattern comes into play.
It is a protocol abstraction layer. The API Gateway can be exposed as a REST API, a GraphQL API, or any other interface, no matter what protocol or technology is being used internally to communicate with the microservices.

Error management is centralized. When a service is unavailable, is getting too slow, and so on, the API Gateway can provide data from the cache, return default responses, or make smart decisions to avoid bottlenecks or the propagation of fatal errors. This is the circuit breaker pattern at work, and it makes the system more resilient and reliable.
The granularity of APIs provided by microservices is often different than what a client needs. Microservices typically provide fine-grained APIs, which means that clients need to interact with multiple services. The API Gateway can combine these multiple fine-grained services into a single combined API that clients can use, thereby simplifying the client application and improving performance.
Network performance is different for different types of clients. The API Gateway can define device-specific APIs that reduce the number of calls required to be made over slower WAN or mobile networks. The API Gateway being a server-side application makes it more efficient to make multiple calls to backend services over LAN.
The number of service instances and their locations (host and port) change dynamically. The API Gateway can absorb these backend changes without requiring changes to frontend client applications, because it determines backend service locations itself.
Different clients may need different levels of security. For example, external applications may need a higher level of security to access the same APIs that internal applications may access without the additional security layer.
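
The centralized error management described in the list above can be illustrated with a toy circuit breaker. This is a minimal sketch and not the implementation of any particular gateway product; the CircuitBreaker class, its parameters, and the injected clock are assumptions made for the example.

```python
import time

class CircuitBreaker:
    """Toy circuit breaker: opens after `max_failures` consecutive errors and
    serves a fallback until `reset_after` seconds have passed."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None              # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()          # open: skip the backend entirely
            self.opened_at = None          # half-open: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()              # serve cached or default data
        self.failures = 0                  # success resets the failure count
        return result
```

A gateway would keep one breaker per backend service and pass a cached or default response as the fallback, so that a slow or failing service stops receiving traffic until it has had time to recover.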

Service mesh solutions for microservice resiliency: Distributed computing is the way forward for running web-scale applications and big-data analytics. Because the various application modules (microservices) of customer-facing applications are horizontally scalable and have individual life cycle management, the distributed deployment of IT resources (highly programmable and configurable bare metal servers, virtual machines, and containers) is being insisted upon. That is, the goal of the centralized management of distributed IT resources and applications has to be fulfilled. Such monitoring, measurement, and management are required for ensuring proactive, preemptive, and prompt failure anticipation and correction across all participating and contributing constituents. In other words, accomplishing the resiliency target is given much importance in the era of distributed computing. Policy establishment and enforcement is a proven way of bringing in a few specific automations. There are programming language-specific frameworks that add code and configuration to application code for implementing highly available and fault-tolerant applications.

It is therefore paramount to have a programming-agnostic resiliency and fault-tolerance framework in the microservices world. Service mesh is the appropriate way forward for creating and sustaining resilient microservices. Istio, an industry-strength open source framework, provides an easy way to create this service mesh. The following diagram conveys the difference between the traditional ESB tool-based and service-oriented application integration and the lightweight and elastic microservices-based application interactions:

A service mesh is a software solution for establishing a mesh out of all kinds of participating and contributing services. This mesh software enables the setting up and sustaining of inter-service communication. The service mesh is a kind of infrastructure solution. Consider the following:

A given microservice does not directly communicate with the other microservices.
Instead, all service-to-service communications take place through the service mesh software, which runs as a kind of sidecar proxy alongside each service. Sidecar is a famous software integration pattern.
Service mesh provides the built-in support for some of the critical network functions such as microservice resiliency and discovery.

That is, the core and common network services are being identified, abstracted, and delivered through the service mesh solution. This enables service developers to focus on business capabilities alone. That is, business-specific features stay with the services, whereas all the horizontal (technical, network communication, security, enrichment, intermediation, routing, and filtering) services are implemented in the service mesh software. For instance, the circuit-breaking pattern has traditionally been implemented and inscribed in the service code. With a service mesh, this pattern is accomplished by the mesh instead.
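
The separation of business logic from horizontal concerns can be sketched in a few lines of Python. In a real mesh such as Istio, the retry policy lives in a sidecar proxy and is configured declaratively rather than coded; the with_retries wrapper below is a hypothetical stand-in that only illustrates how resiliency can be added around a service call without touching the service code.

```python
import time

def with_retries(call, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Wrap `call` so transient failures are retried with exponential backoff."""
    def proxied(*args, **kwargs):
        for attempt in range(attempts):
            try:
                return call(*args, **kwargs)
            except ConnectionError:
                if attempt == attempts - 1:
                    raise                        # out of retries: surface the error
                sleep(base_delay * 2 ** attempt)
    return proxied

def get_price(item: str) -> float:
    """Business logic only: no retry or timeout code in the service itself."""
    return {"book": 12.5}.get(item, 0.0)

if __name__ == "__main__":
    resilient_get_price = with_retries(get_price)
    print(resilient_get_price("book"))
```

The service function stays unaware of the policy, which mirrors how a mesh lets the same retry and timeout rules apply to services written in any language.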

The service mesh software works across multiple languages. That is, services can be coded using any programming and scripting languages. Also, there are several text and binary data transmission protocols. Microservices, to talk to other microservices, have to interact with the service mesh to initiate service communication. This service-to-service mesh communication can happen over all the standard protocols, such as HTTP/1.x, HTTP/2, gRPC, and so on. We can write microservices using any technology, and they still work with the service mesh. The following diagram illustrates the contributions of the service mesh in making microservices resilient:

Finally, when resilient services get composed, we can produce reliable applications. Thus, the resiliency of all participating microservices leads to applications that are highly dependable.

Resilient microservices and reliable applications

Progressively, the world is becoming connected and software-enabled. We often hear about, read about, and experience software-defined computing, storage, and networking capabilities. Physical, mechanical, electrical, and electronic systems in our everyday environments are being meticulously stuffed with software to be adroit, aware, adaptive, and articulate in their actions and reactions. Software is destined to play a strategic and significant role in producing and sustaining digitally impacted and transformed societies. One stand-out trait of new-generation software-enabled systems is that they are responsive all the time in one way or another. That is, they have to come out with a correct response. If a system is not responding, then another system has to respond correctly and quickly. That is, if a system is failing, an alternative system has to respond.

This is typically called system resiliency. If the system is under extreme stress due to heavy user and data loads, then additional systems have to be provisioned to respond to users' requests without any slowdown or breakdown. That is, auto-scaling is an important property for today's software systems to be right and relevant for businesses and users. This is generally called system elasticity. To make systems resilient and elastic, producing message-driven systems is the key decision. Message-driven systems are called reactive systems. Let's digress a bit here and explain the concepts behind system resiliency and elasticity.

A scalable application can scale automatically and accordingly to continuously function. Suddenly, there can be a greater number of users accessing the application. Still, the application has to continuously transact and can gracefully handle traffic peaks and dips. By adding and removing virtual machines and containers only when needed, scalable applications do their assigned tasks without any slowdown or breakdown. By dynamically provisioning additional resources, the utilization rate of scalable applications is optimal. Scalable applications support on-demand computing. There can be many users demanding the services of the application, or there can be more data getting pushed into the application. Containers and virtual machines are the primary resource and runtime environment for application components.

Reactive systems

We have seen how reliable systems are being realized through the service mesh concept. Reactive systems are another approach for bringing forth reliable software systems. A reactive system is a new concept based on the widely circulated Reactive Manifesto. There are reactive programming models and techniques to build viable reactive systems. As described previously, any software system comprises multiple modules. Also, multiple components and applications need to interact with each other reliably to accomplish certain complex business functionality. In a reactive system, the individual systems are intelligent. However, the key differentiator is the interaction between the individual parts. That is, the ability to operate individually yet act in concert to achieve the intended outcome clearly differentiates reactive systems from others. A reactive system architecture allows multiple individual applications to co-exist and coalesce as a single unit and react to its surroundings adaptively. This means that they are able to scale up or down based on user and data loads, balance load, and act intelligently to be extremely sensitive and responsive.

It is possible to write an application in a reactive style using the proven reactive programming processes, patterns, and platforms. However, for working together to achieve evolving business needs quickly, it needs a lot more. In short, it is not that easy making a system reactive. Reactive systems are generally designed and built according to the tenets of the highly popular Reactive Manifesto. This manifesto document clearly prescribes and promotes the architecture that is responsive, resilient, elastic, and message driven. Increasingly, microservices and message-based service interactions become the widely used standard for having flexible, elastic, resilient, and loosely coupled systems. These characteristics, without an iota of doubt, are the central and core concepts of reactive systems.

Reactive programming is a subset of asynchronous programming. This is an emerging paradigm where the availability of new information (events and messages) drives the processing logic forward. Traditionally, some action gets activated and accomplished using threads of execution based on control and data flows.

This unique programming style intrinsically supports decomposing the problem into multiple discrete steps, and each step can be executed in an asynchronous and non-blocking fashion. Then, those steps can be composed to produce a composite workflow that is possibly unbounded in its inputs or outputs. Asynchronous processing means that the processing of incoming messages or events happens sometime in the future. The event creators and message senders need not wait for the processing and the execution to finish before proceeding with their responsibilities. This is generally called non-blocking execution. The threads of execution need not compete for a shared resource to get things done immediately. If the resource is not available immediately, then the threads need not wait for the unavailable resource and can instead continue with other tasks at hand, using their respective resources. The point is that they can do their work without any stoppage while waiting for appropriate resources for a particular task at a particular point in time. In other words, a pending operation does not prevent the thread of execution from performing other useful work while the resource is occupied.
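
A minimal sketch of asynchronous, non-blocking composition, using Python's standard asyncio library. The fetch coroutine and the service names are hypothetical stand-ins for slow I/O calls.

```python
import asyncio

async def fetch(name: str, delay: float) -> str:
    """Stand-in for a slow I/O call; awaiting yields control to other tasks."""
    await asyncio.sleep(delay)
    return f"{name} done"

async def main() -> list:
    # Both steps run concurrently; neither blocks the thread while waiting.
    return await asyncio.gather(fetch("inventory", 0.2), fetch("pricing", 0.1))

if __name__ == "__main__":
    print(asyncio.run(main()))
```

The two calls overlap, so the total wait is governed by the slower call rather than by the sum of both, and asyncio.gather returns the results in the order in which the calls were passed.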

In the future, software applications have to be sensitive and responsive. Futuristic and people-centric applications, therefore, have to be capable of receiving events in order to be adaptive. Event capturing, storing, and processing are becoming important for enterprise, embedded, and cloud applications. Reactive programming is emerging as an important concept for producing event-driven software applications. There are simple as well as complex events. Events are primarily streamed continuously, and hence the event-processing feature is known as streaming analytics these days. There are several streaming analytics platforms, such as Spark Streaming, Kafka Streams, Apache Flink, Storm, and so on, for extracting actionable insights out of streams.

In the increasingly event-driven world, event-driven architectures (EDAs) and programming models are acquiring more market and mind share. Thus, reactive programming is a grand initiative to provide a standard solution for asynchronous stream processing with non-blocking back pressure. The key benefits of reactive programming include the increased utilization of computing resources on multi-core and multi-processor hardware. There are several competent event-driven programming libraries, middleware solutions, enabling frameworks, and architectures to carefully capture, cleanse, and crunch millions of events per second. The popular libraries for facilitating event-driven programming include Akka Streams, Reactor, RxJava, and Vert.x.
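
Non-blocking back pressure can be illustrated with a bounded queue: when the consumer falls behind, the producer is suspended rather than allowed to flood memory. This is a minimal sketch using Python's asyncio.Queue, not the API of any of the libraries named above; the function names and the capacity value are assumptions made for the example.

```python
import asyncio

async def producer(queue: asyncio.Queue, items: int) -> None:
    for i in range(items):
        await queue.put(i)       # suspends when the queue is full (back pressure)
    await queue.put(None)        # sentinel: no more items

async def consumer(queue: asyncio.Queue, seen: list) -> None:
    while (item := await queue.get()) is not None:
        await asyncio.sleep(0)   # simulate work; yield to the event loop
        seen.append(item)

async def run_pipeline(items: int = 10, capacity: int = 2) -> list:
    queue = asyncio.Queue(maxsize=capacity)
    seen: list = []
    await asyncio.gather(producer(queue, items), consumer(queue, seen))
    return seen

if __name__ == "__main__":
    print(asyncio.run(run_pipeline()))
```

The producer never holds more than `capacity` unprocessed items in the queue, yet no thread is blocked while it waits, since the suspended coroutine simply hands control back to the event loop.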

Reactive programming versus reactive systems: There is a huge difference between reactive programming and reactive systems. As indicated previously, reactive programming is primarily event-driven. Reactive systems, on the other hand, are message-driven and focus on creating resilient and elastic software systems. Messages are the prime form of communication and collaboration. Distributed systems coordinate by sending, receiving, and processing messages. Messages are inherently directed, whereas events are not. Messages have a clear direction and destination. Events are facts for others to observe and act upon with confidence and clarity. Messaging is typically asynchronous, with the sender and the reader decoupled. In a message-driven system, addressable recipients wait for messages to arrive. In an event-driven system, consumers are integrated with sources of events and event stores.

In a reactive system, especially one that uses reactive programming, both events and messages will be present. Messages are a great tool for communication, whereas events are the best bet for unambiguously representing facts. Messages ought to be transmitted across the network and form the basis for communication in distributed systems. Messaging is being used to bridge event-driven systems across the network. Event-driven programming is therefore a simple model in a distributed computing environment. That is not the case with messaging in distributed computing environments. Messaging has to do a lot of things because there are several constraints and challenges in distributed computing. That is, messaging has to tackle things such as partial failures, failure detection, dropped/duplicated/reordered messages, eventual consistency, and managing multiple concurrent realities. These differences in semantics and applicability have intense implications in the application design, including things such as resilience, elasticity, mobility, location transparency, and management complexities of distributed systems.
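
The distinction between directed messages and observed events can be made concrete with a toy in-process bus. The Bus class and its method names are hypothetical; a real distributed system would place a broker or a network transport behind the same two operations.

```python
from collections import defaultdict

class Bus:
    """Toy in-process bus contrasting directed messages with broadcast events."""

    def __init__(self):
        self.mailboxes = defaultdict(list)     # address -> queued messages
        self.subscribers = defaultdict(list)   # event type -> callbacks

    def send(self, address: str, message: dict) -> None:
        """Message: delivered to exactly one addressable recipient."""
        self.mailboxes[address].append(message)

    def subscribe(self, event_type: str, callback) -> None:
        self.subscribers[event_type].append(callback)

    def publish(self, event_type: str, fact: dict) -> None:
        """Event: a fact announced to whoever happens to be observing."""
        for callback in self.subscribers[event_type]:
            callback(fact)

if __name__ == "__main__":
    bus = Bus()
    audit_log = []
    bus.subscribe("order_placed", audit_log.append)
    bus.publish("order_placed", {"id": 1})              # any observer may react
    bus.send("shipping", {"cmd": "ship", "id": 1})      # one named recipient
    print(audit_log, dict(bus.mailboxes))
```

The publisher does not know who observes the event, whereas the sender of a message names its destination explicitly.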

Reactive systems are highly reliable

Reactive systems fully comply with the Reactive Manifesto (resilient, responsive, elastic, and message-driven), which was contemplated and released by a group of IT product vendors. A variety of architectural design and decision principles are being formulated and firmed up for building modern and cognitive systems that are innately capable of fulfilling today's complicated yet sophisticated requirements. Messages are the most optimal unit of information exchange for reactive systems to function and facilitate. These messages create a kind of temporal boundary between application components. Messages enable application components to be decoupled in time (this allows for concurrency) and in space (this allows for distribution and mobility). This decoupling capability facilitates the much-needed isolation among various application services. Such decoupling ultimately ensures the much-needed resiliency and elasticity, which are the most sought-after needs for producing reliable systems.

Resilience is the capability to stay responsive even under failure, and it is an inherent functional property of the system. Resilience goes beyond fault tolerance, which is all about graceful degradation; resilience is about fully recovering from any failure. It means empowering systems to self-diagnose and self-heal. This property requires component isolation and containment of failures so that they do not spread to neighboring components. If errors and failures are allowed to cascade into other components, then the whole system is bound to fail.

So, the key to designing, developing, and deploying resilient and self-healing systems is to allow any type of failure to be proactively found and contained, encoded as messages, and sent to supervisor components. These can be monitored, measured, and managed from a safe distance. Here, being message-driven is the greatest enabler. Moving away from tightly coupled systems to loosely and lightly coupled systems is the way forward. With less dependency, the affected component can be singled out, and the spread of errors can be nipped in the bud.
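The idea of containing a failure, encoding it as a message, and sending it to a supervisor can be sketched in a few lines. This is only a minimal illustration, not a production supervision framework: the names FailureMessage, Supervisor, and run_isolated are hypothetical, and the supervisor here merely counts failures where a real one would restart or replace the affected component.

```python
import queue
from dataclasses import dataclass


@dataclass
class FailureMessage:
    """A failure encoded as a plain message (hypothetical structure)."""
    component: str
    error: str


class Supervisor:
    """Watches components from a safe distance through its mailbox."""

    def __init__(self):
        self.mailbox = queue.Queue()
        self.failure_counts = {}

    def handle_pending(self):
        # A real supervisor would restart or replace the component here;
        # this sketch only records how often each component has failed.
        while not self.mailbox.empty():
            msg = self.mailbox.get()
            count = self.failure_counts.get(msg.component, 0)
            self.failure_counts[msg.component] = count + 1


def run_isolated(name, task, supervisor):
    """Run one component; contain any failure and report it as a message."""
    try:
        return task()
    except Exception as exc:
        supervisor.mailbox.put(FailureMessage(name, repr(exc)))
        return None


def demo():
    supervisor = Supervisor()
    results = {
        # The failing component does not take its neighbor down with it.
        "pricing": run_isolated("pricing", lambda: 1 / 0, supervisor),
        "catalog": run_isolated("catalog", lambda: ["book", "pen"], supervisor),
    }
    supervisor.handle_pending()
    return results, supervisor.failure_counts
```

Because the failing component only emits a message, the healthy component keeps working and the supervisor decides separately what to do about the failure.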

The elasticity of reactive systems

Elasticity is the capability to stay responsive under load. A system may suddenly be used by many users, or a lot of data may be pumped into it by hundreds of thousands of sensors and devices. To tackle this unplanned rush of users and data, systems have to automatically scale up or out by adding resources (bare metal servers, virtual machines, and containers). Cloud environments are innately capable of auto-scaling based on varying resource needs. This capability lets systems use their expensive resources in an optimized manner. When resource utilization goes up, the capital and operational costs of systems come down sharply.

Systems need to be adaptive enough to perform auto-scaling, replication of state and behavior, load balancing, fail-over, and upgrades without any manual intervention, instruction, or interpretation. In short, designing, developing, and deploying reactive systems through messaging is the need of the hour.
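To make the auto-scaling decision concrete, here is a minimal sketch of a proportional scaling rule: the number of instances is adjusted so that the load per instance moves back toward a target. The function name, the parameters, and the default bounds are assumptions made for illustration; real cloud auto-scalers add cool-down periods, smoothing, and other safeguards.

```python
import math


def desired_replicas(current_replicas, current_load, target_load,
                     min_replicas=1, max_replicas=10):
    """Return how many instances are needed to bring the per-instance
    load back to target_load (hypothetical proportional rule).

    current_load and target_load are per-instance values in the same
    unit, for example CPU percent or requests per second.
    """
    if current_replicas < 1 or target_load <= 0:
        raise ValueError("need at least one replica and a positive target")
    raw = math.ceil(current_replicas * current_load / target_load)
    # Clamp so that a load spike cannot scale out without bound and a
    # quiet period cannot scale in to zero instances.
    return max(min_replicas, min(max_replicas, raw))
```

For example, four instances running at 90% against a 60% target would be scaled out to six, while the same four instances at 15% would be scaled in to one.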

Highly reliable IT infrastructures

So far in this chapter, we have concentrated on the application side of IT reliability. But the underlying IT infrastructure has a crucial role to play too. We have clusters, grids, and clouds to achieve high availability and scalability at the infrastructure level. Clouds are being touted as the best-in-class IT infrastructure for digital innovation, disruption, and transformation. To simplify, streamline, and speed up cloud setup and sustenance, a number of enabling tools have emerged for automating repetitive and routine cloud scheduling, software deployment, cloud administration, and management. Automation and orchestration are being pronounced as the way forward for the ensuing cloud era. Most of the manual activities in running cloud centers are being precisely defined and automated through scripts and other methods. As a result, the number of system, database, middleware, network, and storage professionals manning cloud environments has come down drastically. The total cost of ownership (TCO) of cloud IT is declining, whereas the return on investment (RoI) is increasing. The cost benefits of the cloud-enablement of conventional IT environments are clearly noticeable. Cloud computing is typically proclaimed as the marriage between mainframe and modern computing and is known for attaining high performance while giving cost advantages. Customer-facing applications with varying loads are an excellent fit for public cloud environments. Multi-cloud strategies are being worked out by enterprises worldwide to embrace this unique technology without any vendor lock-in.

However, there are miles to traverse to attain the much-needed reliability. The way forward includes automated tools, policies, and knowledge bases; AI-inspired log and operational analytics; preventive, predictive, and prescriptive maintenance capabilities through machine and deep learning algorithms and models; scores of reusable resiliency patterns; and preemptive monitoring.

A resilient application is typically highly available, even in the midst of failures and faults. If there is any internal or external attack on a single application component/microservice, the application still functions and delivers its assigned functionality without any delay or stoppage. The failure can be identified and contained within the component so that the other components of the application aren't affected. Typically, multiple instances of each application component are run in different and distributed containers or VMs so that the loss of one instance does not matter much to the application. Also, application state and behavior information is stored in separate systems. Any resilient application has to be designed and developed to survive in any kind of situation. Not only applications but also the underlying IT or cloud infrastructure modules have to be chosen and configured intelligently to support the resiliency goals of software applications. The first and foremost thing is to fully leverage the distributed computing model. The deployment topology and architecture have to purposefully use the various resiliency design, integration, and deployment patterns. In addition, infrastructure scientists, experts, architects, and specialists recommend the following tips and techniques:

Use the various network and security solutions, such as firewalls, load balancers, and application delivery controllers (ADCs). Network access control lists (NACLs) also contribute to security. These help in intelligently inspecting every request message to filter out any ambiguous or malevolent message at the source. Furthermore, load balancers continuously probe and monitor servers and distribute traffic to servers that are not fully loaded. They can also choose the best server to handle certain requests.
Employ multiple servers at different and distributed cloud centers. That is, disaster recovery capability needs to be part of any IT solution.
Attach a robust and resilient storage solution for data recovery and stateful applications.
Utilize software infrastructure solutions, such as API gateways and management suites, service mesh solutions, and additional abstraction layers, to ensure the system's resiliency.
Leverage compartmentalization (virtualization and containerization) to arrive at virtualized and containerized cloud environments, which intrinsically support the much-needed flexibility, extensibility, elasticity, infrastructure operations automation, maneuverability, and versatility. Software-defined environments are more conducive and constructive for application and infrastructure resiliency.
Focus on log, operational, performance, and scalability analytics to proactively and preemptively monitor, measure, and manage the various infrastructure components (software as well as hardware).
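The load balancer behavior described in the first tip (probe servers, skip unhealthy or fully loaded ones, and pick the best remaining server) can be sketched as follows. This is an illustrative model under simple assumptions: the Server fields, the capacity value, and the least-loaded selection rule are hypothetical and do not describe any particular load balancer product.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    healthy: bool = True          # result of the latest health probe
    active_connections: int = 0
    capacity: int = 100           # assumed maximum concurrent connections


def pick_server(servers):
    """Return the healthy server with the lowest relative load, or None."""
    candidates = [
        s for s in servers
        if s.healthy and s.active_connections < s.capacity
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda s: s.active_connections / s.capacity)


def route(servers):
    """Assign one incoming request and return the chosen server's name."""
    server = pick_server(servers)
    if server is None:
        raise RuntimeError("no healthy server with spare capacity")
    server.active_connections += 1
    return server.name
```

Selecting by relative load rather than by raw connection count lets servers of different sizes share one pool, which is one possible reading of choosing the best server for a request.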

Thus, reliable software applications and infrastructure combine well in rolling out reliable systems that, in turn, guarantee reliable business operations.

In summary, we can say the following:

Reliability = resiliency + elasticity
Automation and orchestration are the key requirements for a reliable IT infrastructure
IT reliability fulfilment: resiliency is to survive under attacks, failures, and faults, whereas elasticity is to auto-scale (vertical and horizontal scalability) under a load
IT infrastructure operational, log, and performance/scalability analytics through AI-inspired analytics platforms
Patterns, processes, platforms, and practices for having reliable IT Infrastructure
System and application monitoring are also significant

The emergence of serverless computing

Serverless computing allows for the building and running of applications and services without thinking about server machines, storage appliances and arrays, or networking solutions. Developers of serverless applications do not have to provision, scale, or manage any IT resources. It is possible to build a serverless application for nearly any type of application or backend service. The scalability of serverless applications is taken care of by cloud service and resource providers. Any spike in load is closely monitored and acted upon quickly. The developer doesn't need to worry about the infrastructure portions. That is, developers just focus on the business logic, and the IT capability is delegated to the cloud teams. This hugely reduced overhead empowers designers and developers to reclaim the time and energy otherwise spent on IT infrastructure plumbing. Developers can then focus on other important requirements, such as resiliency and reliability.

The surging popularity of containers comes in handy when automating the relevant aspects of running scores of serverless applications. That is, a function is developed and deployed quickly without worrying about provisioning, scheduling, configuration, monitoring, and management. Therefore, the new buzzword, function as a service (FaaS), is gaining a lot of momentum these days. We are moving toward NoOps. That is, most cloud operations are automated through a host of technology solutions and tools, and this transition helps institutions, individuals, and innovators to deploy and deliver their software applications quickly.
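To show what focusing only on business logic looks like, here is a minimal sketch of a function in the FaaS style: it receives an event, applies the business rule, and returns a response. The handler name, the event layout, and the response shape are hypothetical and do not reproduce any specific provider's API; provisioning, scaling, and invocation are assumed to be handled by the platform.

```python
import json


def handler(event, context=None):
    """Hypothetical function entry point: one event in, one response out.

    The platform (not this code) is assumed to decide when and where the
    function runs and how many copies run in parallel.
    """
    try:
        items = event["items"]
        total = sum(item["price"] * item["quantity"] for item in items)
    except (KeyError, TypeError):
        # Malformed input is answered with an error response, not a crash.
        return {"statusCode": 400,
                "body": json.dumps({"error": "invalid order"})}
    return {"statusCode": 200, "body": json.dumps({"total": total})}
```

Nothing in the function refers to servers, containers, or capacity, which is the point of the model: the unit of deployment is the function itself.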

On the cost front, users pay only for the capacity they use. Through the automated and dynamic provisioning of resources, resource utilization goes up significantly. Also, the cost efficiency is fully realized and passed on to the cloud users and subscribers.

Precisely speaking, serverless computing is one more abstraction layer on the way to automated computing and analytics.

The vitality of the SRE domain

As discussed previously, the software engineering field is going through a number of disruptions and transformations to cope with the growth being achieved in hardware engineering. There are agile, aspect, agent, composition, service-oriented, polyglot, and adaptive programming styles. At the time of writing this book, building reactive and cognitive applications by leveraging competent development frameworks is being stepped up. On the infrastructure side, we have powerful cloud environments as the one-stop IT solution for hosting and running business workloads. Still, there are a number of crucial challenges in achieving the much-wanted cloud operations with less intervention, interpretation, and involvement from human administrators. Already, several tasks are being automated via breakthrough algorithms and tools. These well-known and widely used tasks include dynamic and automated capacity planning and management, cloud infrastructure provisioning and resource allocation, software deployment and configuration, patching, infrastructure and software monitoring, measurement and management, and so on. Still, there are gaps to be filled with technologically powerful solutions. Furthermore, these days, software packages are frequently updated, patched, and released to production environments to meet the emerging and evolving demands of clients, customers, and consumers. Also, the number of application components (microservices) is growing rapidly. In short, true IT agility has to be ensured through a whole bunch of automated tools. The operations team, with the undivided support of SREs, has to envision and safeguard highly optimized and organized IT infrastructures to successfully and sagaciously host and run next-generation software applications. Precisely speaking, the brewing challenge is to automate and orchestrate cloud operations. To become autonomic, the cloud has to be self-servicing, self-configuring, self-healing, self-diagnosing, self-defending, and self-governing.

The new and emerging SRE domain is being prescribed as the viable way forward. A new breed of software engineers, who have a special liking for system engineering, are being touted as the best fit to be categorized as SREs. These specially skilled engineers are going to train software developers and system administrators to astutely realize highly competent and dependable software solutions, scripts, and automated tools to speedily set up and sustain highly dependable, dynamic, responsive, and programmable IT infrastructures. An SRE team literally cares about anything that makes complex software systems work in production in a risk-free and continuous manner. In short, a site reliability engineer is a hybrid software and system engineer. Due to the ubiquity and usability of cloud centers for meeting the world's IT needs, the word site represents cloud environments.

Site Reliability Engineers usually care about infrastructure orchestration, automated software deployment, proper monitoring and alerting, scalability and capacity estimation, release procedures, disaster preparedness, fail-over and fail-back capabilities, performance engineering and enhancement (PE2), garbage collector tuning, release automation, capacity uplifts, and so on. They will usually also take an interest in good test coverage. SREs are software engineers who specialize in reliability. SREs are expected to apply the proven and promising principles of computer science and engineering to the design and development of enterprise-class, modular, and web-scale software applications.

The importance of SREs

An SRE is responsible for ensuring the system availability, performance monitoring, and incident response of cloud IT platforms and services. SREs must make sure that all software applications entering production environments fully comply with a set of important requirements, such as diagrams, network topology illustrations, service dependency details, monitoring and logging plans, backups, and so on. A software application may fully comply with all of the functional requirements, but there are other sources of disruption and interruption. Hardware degradation, networking problems, high resource usage, or slow responses from applications and services could happen at any time. SREs always need to be extremely sensitive and responsive. An SRE's effectiveness may be measured as a function of mean time to recover (MTTR) and mean time to failure (MTTF). In other words, the availability of system functions in the midst of failures and faults has to be guaranteed. Similarly, when the system load varies sharply, the system has to have the inherent potential to scale up and out.
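One common way to combine MTTF and MTTR into a single figure is the steady-state availability ratio, MTTF / (MTTF + MTTR). The short sketch below computes it from observed uptime and repair durations; the function names and the sample figures in the usage note are illustrative assumptions, not measurements from any real system.

```python
def mean(values):
    """Arithmetic mean of a non-empty collection of durations."""
    values = list(values)
    if not values:
        raise ValueError("at least one observation is required")
    return sum(values) / len(values)


def availability(mttf_hours, mttr_hours):
    """Steady-state availability as MTTF / (MTTF + MTTR)."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)
```

For instance, if a service ran for 990 and 1,008 hours between failures and took 0.5 and 1.5 hours to recover, then MTTF is 999 hours, MTTR is 1 hour, and the availability is 999 / 1,000, or 99.9%. The ratio also shows that availability can be raised either by failing less often or by recovering faster.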

Software developers typically develop the business functionality of the application and do the necessary unit tests for the functionality they created from scratch or composed out of different, distributed, and decentralized services. But they don't always focus on creating and incorporating the code for achieving scalability, availability, reliability, and so on. System administrators, on the other hand, do everything to design, build, and maintain an organization's IT infrastructure (computing, storage, networking, and security). System administrators try to achieve these quality of service (QoS) attributes through infrastructure sizing and by provisioning additional infrastructural modules (bare metal (BM) servers, virtual machines (VMs), and containers) to tackle any sudden rush of users and bigger payloads. As described previously, the central goal of DevOps is to build a healthy working relationship between the operations and the development teams. Any gaps and friction between developers and operators ought to be identified and eliminated at the earliest by SREs so that any application can run on any machine or cluster without many twists and tweaks. The most critical challenge is how to ensure the non-functional requirements (NFRs), that is, the QoS attributes.

SREs solve a very basic yet important problem that administrators and DevOps professionals do not. The infrastructure's resiliency and elasticity have to be ensured in order to safeguard application scalability and reliability. Business continuity and productivity, along with other delights for customers, have to be guaranteed through minute monitoring of business applications and IT services. Meeting the identified NFRs through infrastructure optimization alone is neither viable nor sustainable. Rather, NFRs have to be realized by skillfully etching all the relevant code snippets and segments into the application source code itself. In short, the source code of any application has to be made aware of, and capable of easily absorbing, the capacity and capability of the underlying infrastructure. That is, we are heading toward the era of infrastructure-aware applications and, on the other side, toward application-aware infrastructures.
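As one small illustration of realizing an NFR inside the application code itself, the sketch below wraps a remote call in a retry with exponential backoff so that transient network failures are absorbed by the application instead of being left to the infrastructure. The function name, the default values, and the choice of ConnectionError as the transient failure type are assumptions for illustration; retrying is only one of many resiliency techniques.

```python
import time


def call_with_retries(operation, attempts=3, base_delay=0.1,
                      sleep=time.sleep):
    """Call operation(), retrying on ConnectionError with backoff.

    The delay doubles after each failed attempt (base_delay, then
    2 * base_delay, and so on). The sleep function is injectable so that
    the behavior can be exercised without actually waiting.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            sleep(base_delay * 2 ** attempt)
```

A caller would pass the remote operation as a function, for example call_with_retries(fetch_inventory), where fetch_inventory is a hypothetical function that talks to another service.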

This is where SREs pitch in. These specially empowered professionals, with all their education, experience, and expertise, assist both developers and system administrators in developing, deploying, and delivering highly reliable software systems via software-defined cloud environments. SREs spend half of their time with developers and the other half with the operations team to ensure the much-needed reliability. SREs set clear and mathematically modeled service-level agreements (SLAs) that set thresholds for the stability and reliability of software applications.
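A mathematically modeled threshold can be as simple as an availability target translated into the amount of downtime it permits over a period. The sketch below does that arithmetic; the function names, the 30-day period, and the 99.9% target in the usage note are illustrative assumptions rather than values prescribed by this chapter.

```python
def allowed_downtime_minutes(target, period_days=30):
    """Downtime permitted by an availability target over a period.

    target is a fraction, for example 0.999 for 99.9% availability.
    """
    if not 0 < target <= 1:
        raise ValueError("target must be in the range (0, 1]")
    return (1 - target) * period_days * 24 * 60


def within_target(observed_downtime_minutes, target, period_days=30):
    """True if the observed downtime stays within the agreed threshold."""
    allowed = allowed_downtime_minutes(target, period_days)
    return observed_downtime_minutes <= allowed
```

With a 99.9% target over 30 days, the permitted downtime is about 43.2 minutes, so 30 minutes of observed downtime would meet the threshold while 60 minutes would breach it.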

SREs have many skills:

They have a deep knowledge of complex software systems
They are experts in data structures
They are excellent at designing and analyzing computer algorithms
They have a broad understanding of emerging technologies, tools, and techniques
They are passionate when it comes to coding, debugging, and problem-solving
They have strong analytical skills and intuition
They learn quickly from mistakes and eliminate them in the subsequent assignments
They are team players, willing to share the knowledge they have gained and gathered
They like the adrenaline rush of fast-paced work
They are good at reading technical books, blogs, and publications
They produce and publish technology papers, patents, and best practices

Furthermore, SREs learn and position themselves to be a single point of contact (SPOC) in the following areas:

They have a good understanding of code design, analysis, debugging, and optimization.
They have a wide understanding about various IT systems, ranging from applications to appliances (servers, storage, network components (switches, routers, firewalls, load balancers, intrusion detection and prevention systems, and so on)).
They are competent in emerging technologies:
- Software-defined clouds for highly optimized and organized IT infrastructures
- Data analytics for extracting actionable insights in time.
- IoT for people-centric application design and delivery
- Containerization-sponsored DevOps
- FaaS for simplified IT operations
- Enterprise mobility
- Blockchain for IoT data and device security
- AI (machine and deep-learning algorithms) for predictive and prescriptive insights
- Cognitive computing for realizing smarter applications
- Digital twin for performance increment, failure detection, product productivity, and resilient infrastructures

They are conversant with a variety of automated tools
They are familiar with reliability engineering concepts
They are well-versed in key terms and buzzwords such as scalability, availability, maneuverability, extensibility, and dependability
They are good at IT systems operations, application performance management, and cyber security attacks and solution approaches
They practice insights-driven IT operations, administration, maintenance, and enhancement

Toolsets that SREs typically use

For SREs, ensuring the stability and the highest uptime of software applications is the top priority. However, they should also be able to take responsibility and code their own way out of hazards, hurdles, and hitches. They cannot add to the to-do lists of the development teams. SREs are typically software engineers with a passion for system, network, storage, and security administration. They need the combined strengths of development and operations, and they are highly comfortable with a bevy of scripting languages, automation tools, and other software solutions to speedily automate the various aspects of IT operations, monitoring, and management, especially application performance management and IT infrastructure orchestration, automation, and optimization. Though automation is the key competency of SREs, they ought to educate themselves and gain experience and expertise in the following technologies and tools:

Object-oriented, functional, and scripting languages
Digital technologies (cloud, mobility, IoT, data analytics, and security)
Server, storage, network, and security technologies
System, database, middleware, and platform administration
Compartmentalization (virtualization and containerization) paradigms, DevOps tools
The microservices architecture (MSA) pattern
Design, integration, performance, scalability, and resiliency patterns
Cluster, grid, utility, and cloud computing models
Troubleshooting software and hardware systems
Dynamic capacity planning, task and resource scheduling, workload optimization, VM and container placement, distributed computing, and serverless computing