You're reading from Multi-Cloud Strategy for Cloud Architects - Second Edition

Product type Book

Published in Apr 2023

Publisher Packt

ISBN-13 9781804616734

Pages 470 pages

Edition 2nd Edition

Concepts

Cloud Computing

Author (1):

Jeroen Mulder

Table of Contents (23) Chapters

Preface

1. Introduction to Multi-Cloud

2. Collecting Business Requirements

3. Starting the Multi-Cloud Journey

4. Service Designs for Multi-Cloud

5. Managing the Enterprise Cloud Architecture

6. Controlling the Foundation Using Well-Architected Frameworks

7. Designing Applications for Multi-Cloud

8. Creating a Foundation for Data Platforms

9. Creating a Foundation for IoT

10. Managing Costs with FinOps

11. Maturing FinOps

12. Cost Modeling in the Cloud

13. Implementing DevSecOps

14. Defining Security Policies

15. Implementing Identity and Access Management

16. Defining Security Policies for Data

17. Implementing and Integrating Security Monitoring

18. Developing for Multi-Cloud with DevOps and DevSecOps

19. Introducing AIOps and GreenOps in Multi-Cloud

20. Conclusion: The Future of Multi-Cloud

21. Other Books You May Enjoy

22. Index

Conclusion: The Future of Multi-Cloud

This book has dealt with designing, implementing, and controlling a multi-cloud platform. We talked about five major clouds (Azure, AWS, GCP, Oracle Cloud, and Alibaba Cloud) and discussed strategies to get the best out of these clouds for our businesses. We discovered that building and managing in the cloud can be complex. Yet, the cloud will definitely grow. We will look at the future of the cloud in this final chapter.

The cloud will grow and multi-cloud will grow. The biggest challenge is how organizations can stay in control of their applications in a multi-cloud setting since the cloud can become very complex. Maybe Google has the answer: Site Reliability Engineering (SRE). SRE incorporates aspects of software engineering and applies them to infrastructure and operations problems. We will also use this chapter to introduce the concept of SRE and its main principles.

In this chapter, we’re going to cover the...

The growth and adoption of multi-cloud

In recent years, multi-cloud has emerged as a popular approach for businesses to manage their cloud infrastructure. Let’s recap the definition of multi-cloud one more time: we speak about multi-cloud when we use two or more cloud service providers to host and run applications and services. As we look toward the near future, we can expect to see continued developments in multi-cloud as businesses seek to take advantage of its benefits while managing its risks. We’ll talk about managing risks later in this chapter when we explore the concept of SRE.

One of the primary reasons that businesses are looking more into multi-cloud is the need for flexibility and agility. Multi-cloud allows businesses to avoid vendor lock-in and take advantage of the unique features and capabilities offered by different cloud providers. This allows them to optimize their applications and services for specific use cases, such as high-performance computing...

Understanding the concept of SRE

Originally, SRE was meant for mission-critical systems, but overall, it can be used to drive the DevOps process in a more efficient way. The goal is to enable developers to deploy infrastructure quickly and without errors. To achieve this, the deployment is fully automated. In this way of working, operators will not be swamped with requests to constantly onboard and manage more systems.

The original description of SRE as invented by Google is well over 400 pages long. In the Further reading section, a good book is listed to give you a real deep dive into SRE. This chapter is merely an introduction.

Key terms in SRE are service-level indicators (SLIs), service-level objectives (SLOs), and the error budget, that is, the amount of failure or unavailability that a system may accumulate before it no longer meets its SLO. The terms are explained in more detail in the next paragraphs.

SLI and SLO differ from SLA, the service-level agreement. The SLA is an agreement between the supplier of a service and the end user...
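The relationship between the SLO, the measured SLI, and the error budget can be illustrated with a little arithmetic. The following is a minimal sketch in Python, not a reference implementation: the function names, the 30-day window, and the event-based way of counting are assumptions chosen for illustration.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over a window.

    slo is a fraction, e.g. 0.999 for "three nines" (assumed convention).
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent, based on measured SLI counts.

    A negative result means the budget is overspent and the SLO is breached.
    """
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed_bad = (1.0 - slo) * total_events  # failures the SLO tolerates
    actual_bad = total_events - good_events   # failures actually observed
    if allowed_bad == 0:
        return 1.0 if actual_bad == 0 else 0.0
    return 1.0 - actual_bad / allowed_bad


if __name__ == "__main__":
    # A 99.9% SLO over 30 days leaves roughly 43 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))
    # 500 failed requests out of 1,000,000 spends about half of the budget.
    print(round(budget_remaining(0.999, 999_500, 1_000_000), 3))
```

A team could use a value like the remaining budget to decide whether to keep releasing changes or to prioritize reliability work, but the thresholds for that decision are a matter of policy and are not shown here.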

Working with risk analysis in SRE

The basis of SRE is that reliability is something that you can design into the architecture of applications and systems, and something that you can measure. According to SRE, reliability is a measurable quality that can be influenced by design decisions. Engineers can take measures to decrease the detection, response, and repair time, and they can develop systems in such a way that changes can be executed safely without causing any downtime. Architects can design fault-tolerant systems; engineers can develop them.

The major issue is that it all comes at a cost, and whether systems really need to be fault-tolerant is a business decision, based on a business case. We already learned in Chapter 1, Introduction to Multi-Cloud, that business cases are driven by risks. Let’s go over risk management one more time.

The basic rule is that risk = probability × impact. Enterprises use risk...
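The rule risk = probability × impact can be applied directly to rank risks against each other. Below is a minimal sketch in Python; the `Risk` class, the example risks, and the chosen units (yearly probability, minutes of downtime) are hypothetical and only serve to illustrate the arithmetic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    name: str
    probability: float  # chance of occurring per year, 0..1 (assumed scale)
    impact: float       # cost if it occurs, here minutes of downtime (assumed unit)

    @property
    def score(self) -> float:
        """Risk score following risk = probability x impact."""
        return self.probability * self.impact


def rank_risks(risks):
    """Return the risks ordered from highest to lowest score."""
    return sorted(risks, key=lambda r: r.score, reverse=True)


if __name__ == "__main__":
    # Hypothetical figures, not taken from any real system.
    risks = [
        Risk("region outage", probability=0.05, impact=480),
        Risk("bad deployment", probability=0.60, impact=30),
        Risk("expired certificate", probability=0.25, impact=120),
    ]
    for risk in rank_risks(risks):
        print(f"{risk.name}: {risk.score:.1f} expected minutes per year")
```

In this made-up example, a frequent but small failure ranks below a rarer but more damaging one, which is exactly the kind of trade-off the formula is meant to expose.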

Applying monitoring principles in SRE

Reliability is a measurable quality. To measure the quality and reliability of systems, teams need real-time information on the status of these systems. As mentioned in the previous section, the time to detect (TTD) is a crucial driver in calculating risk and, subsequently, determining the SLO. Observability is therefore critical in SRE. However, SRE holds to the principle that monitoring needs to be as simple as possible. It uses the four golden signals:

Latency: The time that a system needs to return a response.
Traffic: The amount of demand that is placed on the system, for example, the number of requests per second.
Errors: The number of requests placed on a system that fail completely or partially.
Saturation: The utilization of the maximum load that a system can handle.

Based on these signals, monitoring rules are defined. As the starting point in SRE is avoiding toil, that is, too much manual work for operations, the monitoring rules follow the same philosophy...
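To make the four golden signals concrete, here is a minimal Python sketch that derives them from a batch of request samples. The `Request` record, the `golden_signals` function, and the definition of saturation as observed traffic divided by an assumed maximum capacity are illustrative assumptions; real monitoring tools define and collect these signals in their own ways.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Request:
    latency_ms: float  # time taken to return a response
    ok: bool           # False if the request failed completely or partially


def golden_signals(requests, window_seconds: float, capacity_rps: float) -> dict:
    """Summarize latency, traffic, errors, and saturation for one time window.

    capacity_rps is the assumed maximum load the system can handle.
    """
    if not requests:
        raise ValueError("at least one request sample is required")
    if window_seconds <= 0 or capacity_rps <= 0:
        raise ValueError("window_seconds and capacity_rps must be positive")

    traffic_rps = len(requests) / window_seconds
    failed = sum(1 for r in requests if not r.ok)
    return {
        "latency_median_ms": median(r.latency_ms for r in requests),
        "traffic_rps": traffic_rps,
        "error_rate": failed / len(requests),
        "saturation": traffic_rps / capacity_rps,
    }


if __name__ == "__main__":
    samples = [
        Request(100, True),
        Request(200, True),
        Request(300, False),
        Request(400, True),
    ]
    print(golden_signals(samples, window_seconds=2, capacity_rps=10))
```

Keeping the summary to four numbers per window reflects the idea that monitoring should stay simple; alerting thresholds on top of these values would be a separate design decision.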

Applying principles of SRE to multi-cloud—building and operating distributed systems

This book exists because a majority of enterprises are moving systems to, or developing them in, cloud environments. Today’s enterprises are in a constant transformation mode. This also means a big change in operations. To put it simply, operations have to keep up with the speed of change. Traditional operations can’t handle this. We need SRE in the future of multi-cloud. SRE teams create reliable systems in cloud environments.

There are a couple of important rules for SRE to enable this:

Automate everything: Automation leads to consistency, but automation also enables scaling. This requires a very well-thought-out architecture. Automation enables issues to be fixed faster since they only have to be fixed in one place: the code. Automation makes sure that the proper code is distributed over all systems involved. With large distributed systems spanning various cloud platforms, this...

Summary

Systems are getting more complex for many reasons. Customers constantly demand more functionality in applications, and at the same time, systems need to be available 24/7 without interruption. Cloud platforms are very suitable to facilitate development at high speed, and thus we foresee cloud providers growing fast. In other words, the cloud will definitely grow. This comes with challenges for a lot of businesses. Throughout this book, we discovered that building and managing cloud environments can be complex.

The cloud will grow, and likely the complexity of the cloud will grow too. To ensure reliability, especially with systems that are truly multi-cloud and distributed across different platforms, we should adopt the principles of SRE. The most important principles of SRE have been discussed in this chapter. You should have an understanding of the methodology, based on determining the SLO, measuring the SLI, and working with error budgets.

We’ve learned that...

Questions

Risk analysis is important in SRE. What are the five risk strategies, often referred to as PRACT?
SRE mentions four golden signals in applying monitoring rules. Latency and traffic are two of them. Name the remaining two.
SRE has a specific term for manual work that is often repetitive and should be avoided. What’s that term?
Postmortem analysis is a key principle in SRE. True or false: Postmortem analysis is about finding the root cause and finding out who’s to blame for the error.