Our industry is in the midst of the single most significant transformation of its history. We are reaching the tipping point, where every company starts migrating its workloads into the cloud. With this migration comes the realization that systems must be re-architected to fully unlock the potential of the cloud.
In this book, we will journey into the world of cloud-native to discover the patterns and best practices that embody this new architecture. In this chapter, we will define the core concepts and answer the fundamental question: What is cloud-native? You will learn that cloud-native:
- Is more than optimizing for the cloud
- Enables companies to continuously deliver innovation with confidence
- Empowers everyday teams to build massive scale systems
- Is an entirely different way of thinking and reasoning about software architecture
We could dive right into a definition. But if you ask a handful of software engineers to define cloud-native, you will most likely get more than a handful of definitions. Is there no unified definition? Is it not mature enough for a concrete definition? Or maybe everyone has their own perspective, their own context. When we talk about a particular topic without consensus on the context, it is unlikely that we will reach consensus on the topic. So first we need to define the context for our definition of cloud-native. It should come as no surprise that in a patterns book, we will start by defining the context.
What is the right context for our definition of cloud-native? Well, of course, the right context is your context. You live in the real world, with real-world problems that you are working to solve. If cloud-native is going to be of any use to you then it needs to help you solve your real-world problems. How shall we define your context? We will start by defining what your context is not.
Your context is not Netflix's context. Certainly, we all aim to operate at that scale and volume, but you need an architecture that will grow with you and not weigh you down now. Netflix did things the way they did because they had to. They were an early cloud adopter and they had to help invent that wheel. And they had the capital and the business case to do so. Unfortunately, I have seen some systems attempt to mimic their architecture, only to virtually collapse under the sheer weight of all the infrastructure components. I still cringe every time I think about their cloud invoices. You don't have to do all the heavy lifting yourself to be cloud-native.
Your context is not the context of the platform vendors. What's past is prologue. Many of us remember all too well the colossal catastrophe that is the Enterprise Service Bus (ESB). I published an article in the Java Developer's Journal back in 2005 (http://java.sys-con.com/node/84658) in which I was squarely in the "ESB is an architecture, not a product" camp and in the "ESB is event-driven, not request-driven" camp. But alas, virtually every vendor slapped an ESB label on its product, and today ESB is a four-letter word. Those of you who lived through this know the big ball of mud I am talking about. For the rest, suffice it to say that we want to learn from our mistakes. Cloud-native is an opportunity to right that ship and go to eleven.
Your context is your business and your customers and what is right for both. You have a lot of great experience, but you know there is a sea change with great potential for your company and you want to know more. In essence, your context is in the majority. You don't have unlimited capital, nor an army of engineers, yet you do have market pressure to deliver innovation yesterday, not tomorrow, much less next month or beyond. You need to cut through the hype. You need to do more with less and do it fast and safe and be ready to scale.
I will venture to say that my context is your context and your context is my context. This is because, in my day job, I work for you; I work alongside you; I do what you do. As a consultant, I work with my customers to help them adopt cloud-native architecture. Over the course of my cloud journey, I have been through all the stages of cloud maturity, I have wrestled with the unique characteristics of the cloud, and I have learned from all the classic mistakes, such as lift and shift. My journey and my context have led me to the cloud-native definition that I share with you here.
In order to round out our context, we need to answer two preliminary questions: Why are we talking about cloud-native in the first place? and Why is cloud-native important?
We are talking about cloud-native because running your applications in the cloud is different from running them in traditional data centers. It is a different dynamic, and we will cover this extensively in this chapter and throughout this book. But for now, know that a system that was architected and optimized to run in a traditional data center cannot take advantage of all the benefits of the cloud, such as elasticity. Thus we need a modern architecture to reap the benefits of the cloud. But just saying that cloud-native systems are architected and optimized to take advantage of the cloud is not enough, because it is not a definition that we can work with.
Fortunately, everyone seems to be in relative agreement on why cloud-native is important. The promise of cloud-native is speed, safety, and scalability. Cloud-native helps companies rapidly deliver innovation to market. Start-ups rely on cloud-native to propel their value proposition into markets and market leaders will need to keep pace. But this high rate of change has the potential to destabilize systems. It most certainly would destabilize legacy architectures. Cloud-native concepts, patterns, and best practices allow companies to continuously deliver with confidence with zero downtime. Cloud-native also enables companies to experiment with product alternatives and adapt to feedback. Cloud-native systems are architected for elastic scalability from the outset to meet the performance expectations of today's consumers. Cloud-native enables even the smallest companies to do large things.
We will use these three promises (speed, safety, and scale) to guide and evaluate our definition of cloud-native within your context.
First and foremost, cloud-native is an entirely different way of thinking and reasoning about software architecture. We literally need to rewire our software engineering brains, not just our systems, to take full advantage of the benefits of the cloud. This, in and of itself, is not easy. So many things that we have done for so long a certain way no longer apply. Other things that we abandoned forever ago are applicable again.
This in many ways relates to the cultural changes that we will discuss later. But what I am talking about here is at the individual level. You have to convince yourself that the paradigm shift of cloud-native is right for you. Many times you will say to yourself "we can't do that", "that's not right", "that won't work", "that's not how we have always done things", followed sooner or later by "wait a minute, maybe we can do that", "what, really, wow", "how can we do more of this". If you ever get the chance, ask my colleague Nate Oster about the "H*LY SH*T!!" moment on a project at a brand name customer.
I finally convinced myself in the summer of 2015, after several years of fighting the cloud and trying to fit it into my long-held notions of software architecture and methodology. I can honestly say that since then I have had more fun building systems and more peace of mind about those systems than ever before. So be open to looking at problems and solutions from a different angle. I'll bet you will find it refreshing as well, once you finally convince yourself.
If you skipped right to this point and you didn't read the preceding sections, then I suggest that you go ahead and take the time to read them now. You are going to have to read them anyway to really understand the context of the definition that follows. If what follows surprises you in any way then keep in mind that cloud-native is a different way of thinking and reasoning about software systems. I will support this definition in the pages that follow, but you will have to convince yourself.
Cloud-native embodies the following concepts:
- Powered by disposable infrastructure
- Composed of bounded, isolated components
- Scales globally
- Embraces disposable architecture
- Leverages value-added cloud services
- Welcomes polyglot cloud
- Empowers self-sufficient, full-stack teams
- Drives cultural change
Of course you are asking, "Where are the containers?" and "What about microservices?". They are in there, but those are implementation details. We will get to those implementation details in the next chapter and beyond. But implementation details have a tendency to evolve and change over time. For example, my gut tells me that in a year or so we won't be talking much about container schedulers anymore, because they will have become virtually transparent.
This definition of cloud-native should still stand regardless of the implementation details. It should stand until it has driven cultural and organizational change in our industry to the point where we no longer need the definition because, it too, has become virtually transparent.
Let's discuss each of these concepts with regard to how they each help deliver on the promises of cloud-native: speed, safety, and scale.
I think I will remember forever a very specific lunch back in 2013 at the local burrito shop, because it is the exact point at which my cloud-native journey began. My colleague, Tim Nee, was making the case that we were not doing cloud correctly. We were treating it like a data center and not taking advantage of its dynamic nature. We were making the classic mistake called lift and shift. We didn't call it that because I don't think that term was in the mainstream yet. We certainly did not use the phrase disposable infrastructure, because it was not in our vernacular yet. But that is absolutely what the conversation was about. And that conversation has forever changed how we think and reason about software systems.
We had handcrafted AMIs and beautiful snowflake EC2 instances that were named, as I recall, after Star Trek characters or something along those lines. These instances ran 24/7 at probably around 10% utilization, which is very typical. We could create new instances somewhat on demand because we had those handcrafted AMIs. But God forbid we terminate one of those instances because there were still lots of manual steps involved in hooking a new instance up to all the other resources, such as load balancers, elastic block storage, the database, and more. Oh, and what would happen to all the data stored on the now terminated instance?
This brings us to two key points. First, disposing of cloud resources is hard, because it takes a great deal of forethought. When we hear about the cloud we hear about how easy it is to create resources, but we don't hear about how easy it is to dispose of resources. We don't hear about it because it is not easy to dispose of resources. Traditional data center applications are designed to run on snowflake machines that are rarely, if ever, retired. They take up permanent residency on those machines and make massive assumptions about what is configured on those machines and what they can store on those machines. If a machine goes away then you basically have to start over from scratch. Sure, bits and pieces are automated, but since disposability is not a first-class requirement, many steps are left to operations staff to perform manually. When we lift and shift these applications into the cloud, all those assumptions and practices (aka baggage) come along with them.
Second, the machine images and the containers that we hear about are just the tips of the iceberg. There are so many more pieces of infrastructure, such as load balancers, databases, DNS, CDN, block storage, blob storage, certificates, virtual private cloud, routing tables, NAT instances, jump hosts, internet gateways, and so on. All of these resources must be created, managed, monitored, understood as dependencies, and, to varying degrees, disposable. Do not assume that you will only need to automate the AMIs and containers.
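To make the idea of treating every resource as a disposable dependency a little more concrete, here is a minimal sketch in Python. It is only an in-memory model: the resource names, the `DEPENDS_ON` map, and the `disposable_stack` helper are hypothetical and do not call any real cloud provider API. It assumes the dependency map has no cycles. The point it illustrates is that resources are created in dependency order and always destroyed in the reverse order, so the whole stack can be thrown away and recreated at will.

```python
from contextlib import contextmanager

# Hypothetical resource names and dependencies, for illustration only.
DEPENDS_ON = {
    "vpc": [],
    "load_balancer": ["vpc"],
    "database": ["vpc"],
    "app_instances": ["load_balancer", "database"],
}


def creation_order(depends_on):
    """Return resource names ordered so that dependencies come first.

    Assumes the dependency map contains no cycles.
    """
    ordered, seen = [], set()

    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for dep in depends_on[name]:
            visit(dep)
        ordered.append(name)

    for name in depends_on:
        visit(name)
    return ordered


@contextmanager
def disposable_stack(depends_on, log):
    """Create every resource, then always destroy them in reverse order."""
    created = []
    try:
        for name in creation_order(depends_on):
            log.append(f"create {name}")  # a real tool would provision here
            created.append(name)
        yield list(created)
    finally:
        # Tear down dependents before the things they depend on.
        for name in reversed(created):
            log.append(f"destroy {name}")


log = []
for run in range(2):  # the same stack is created and disposed twice
    with disposable_stack(DEPENDS_ON, log) as stack:
        log.append(f"run tests against {len(stack)} resources")
```

Because the teardown lives in the `finally` clause, the stack is disposed of even when the work inside the `with` block fails, which is the property that handcrafted snowflake instances lack.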
The bottom line is: if we can create a resource on demand, we should be able to destroy it on demand as well, and then rinse and repeat. This was a new way of thinking. This notion of disposable infrastructure is the fundamental concept that powers cloud-native. Without disposable infrastructure, the promises of speed, safety, and scale cannot even taxi to the runway, much less take flight, so to speak. To capitalize on disposable infrastructure, everything must be automated, every last drop. We will discuss cloud-native automation in Chapter 6, Deployment. But how do disposable infrastructure and automation help deliver on the promise of speed, safety, and scale?
There is no doubt that our first step on our cloud-native journey increased Dante's velocity. Prior to this step, we regularly delivered new functionality to production every 3 weeks. And every 3 weeks it was quite an event. It was not unusual for the largely manual deployment of the whole monolithic system to take upwards of 3 days before everyone was confident that we could switch traffic from the blue environment to the green environment. And it was typically an all-hands event, with pretty much every member of every team getting sucked in to assist with some issue along the way. This was completely unsustainable. We had to automate.
Once we automated the entire deployment process and once the teams settled into a rhythm with the new approach, we could literally complete an entire deployment in under 3 hours with just a few team members performing any unautomated smoke tests before we switched traffic over to the new stack. Having embraced disposable infrastructure and automation, we could deliver new functionality on any given day. We could deliver patches even faster. Now I admit that automating a monolith is a daunting endeavor. It is an all or nothing effort because it is an all or nothing monolith. Fortunately, the divide and conquer nature of cloud-native systems completely changes the dynamic of automation, as we will discuss in Chapter 6, Deployment.
But the benefits of disposable infrastructure encompass more than just speed. We were able to increase our velocity, not just because we had automated everything, but also because automating everything increased the quality of the system. We call it infrastructure as code for a reason. We develop the automation code using the exact same agile methodologies that we use to develop the rest of the system. Every automation code change is driven by a story, all the code is versioned in the same repository, and the code is continuously tested as part of the CI/CD pipeline, as test environments are created and destroyed with every test run.
The infrastructure becomes immutable because there is no longer a need to make manual changes. As a result, we can be confident that the infrastructure conforms to the requirements spelled out in the stories. This, in turn, leads to more secure systems, because we can assert that the infrastructure is in compliance with regulations, such as PCI and HIPAA. Thus, increased quality makes us more confident that we can safely deploy changes while controlling risk to the system as a whole.
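As a rough sketch of what asserting conformance can look like once infrastructure is expressed as code, consider the following Python fragment. Everything in it is an assumption made for illustration: the `STACK` description, its keys, and the two rules are invented here and are not actual PCI or HIPAA requirements, nor any provider's schema. The idea is simply that a declared stack can be checked automatically on every pipeline run.

```python
# Hypothetical declarative description of a stack. The keys are made up
# for this sketch and do not correspond to any real provider's schema.
STACK = {
    "orders-db": {"type": "database", "encrypted_at_rest": True, "public": False},
    "orders-bucket": {"type": "blob_storage", "encrypted_at_rest": False, "public": False},
    "web-lb": {"type": "load_balancer", "public": True},
}


def violations(stack):
    """Return human-readable policy violations for a declared stack."""
    found = []
    for name, spec in sorted(stack.items()):
        stores_data = spec["type"] in {"database", "blob_storage"}
        # Illustrative rule 1: anything that stores data must be encrypted.
        if stores_data and not spec.get("encrypted_at_rest", False):
            found.append(f"{name}: data store must be encrypted at rest")
        # Illustrative rule 2: data stores must not be publicly reachable.
        if stores_data and spec.get("public", False):
            found.append(f"{name}: data store must not be publicly reachable")
    return found


problems = violations(STACK)
for problem in problems:
    print(problem)
```

A check like this could fail the build whenever `problems` is non-empty, so a non-conforming change never reaches an environment in the first place.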
Disposable infrastructure facilitates team scale and efficiency. Team members no longer spend a significant amount of time on deployments and fighting deployment-related fires. As a result, teams are more likely to stay on schedule, which increases team morale, which in turn increases the likelihood that teams can increase their velocity and deliver more value. Yet, disposable infrastructure alone does not provide for scalability in terms of system elasticity. It lays the groundwork for scalability and elasticity, but to fully achieve this a system must be architected as a composition of bounded and isolated components. Our soon-to-be legacy system was still a monolith at this stage in our cloud maturity journey. It had been optimized a bit, here and there, out of necessity, but it was still a monolith and we were only going to get vertical scale out of it until we broke it apart by strangling the monolith.
Here are two scenarios I bet we all can relate to. You arrive at work in the morning only to find a firestorm. An important customer encountered a critical bug and it has to be fixed forthwith. The system as a whole is fine, but this specific scenario is a showstopper for this one client. So your team puts everything else on hold, knuckles down, and gets to work on resolving the issue. It turns out to be a one-line code change and a dozen or more lines of test code. By the end of the day, you are confident that you have properly resolved the problem and report to management that you are ready to do a patch release.
However, management understands that this means redeploying the whole monolith, which requires involvement from every team and inevitably something completely unrelated will break as a result of the deployment. So the decision is made to wait a week or so and batch up multiple critical bugs until the logistics of the deployment can be worked out. Meanwhile, your team has fallen one more day behind schedule.
That scenario is bad enough, but I'm sure we have all experienced worse. For example, a bug that leads to a runaway memory leak, which cripples the monolith for every customer. The system is unusable until a patch is deployed. You have to work faster than you want to and hope you don't miss something important. Management is forced to organize an emergency deployment. The system is stabilized and everyone hopes there weren't any unintended side effects.
The first scenario shows how a monolithic system itself can become the bottleneck to its own advancement, while the second scenario shows how the system can be its own Achilles heel. In cloud-native systems, we avoid problems such as these by decomposing the system into bounded isolated components. Bounded components are focused. They follow the single responsibility principle. As a result, these components are easier for teams to reason about. In the first scenario, the team and everyone else could be confident that the fix to the problem did not cause a side effect to another unrelated piece of code in the deployment unit because there is no unrelated code in the deployment unit. This confidence, in turn, eliminates the system as its own bottleneck. Teams can quickly and continuously deploy patches and innovations. This enables teams to perform experiments with small changes because they know they can quickly roll forward with another patch. This ability to experiment and gain insights further enables teams to rapidly deliver innovation.
So long as humans build systems, there will be human error. Automation and disposable infrastructure help minimize the potential for these errors and they allow us to rapidly recover from such errors, but they cannot eliminate these errors. Thus, cloud-native systems must be resilient to human error. To be resilient, we need to isolate the components from each other to avoid the second scenario, where a problem in one piece affects the whole. Isolation allows errors to be contained within a single component and not ripple across components. Other components can operate unabated while the broken component is quickly repaired.
Isolation further instills confidence to innovate, because the blast radius of any unforeseen error is controlled. Bounded and isolated components achieve resilience through data replication. This, in turn, facilitates responsiveness, because components do not need to rely on synchronous inter-component communication. Instead, requests are serviced from local materialized views. Replication also facilitates scale, as load is spread across many independent data sources. In Chapter 2, The Anatomy of Cloud Native Systems, we will dive into these topics of bounded contexts, isolation and bulkheads, reactive architecture, and turning the database inside out.
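The replication idea described above can be sketched in a few lines of Python. This is a deliberately simplified, single-process model: `EventStream`, `CustomerComponent`, and `OrderComponent` are hypothetical names, and events are delivered by a direct function call, whereas a real system would deliver them asynchronously through a streaming service. What the sketch shows is that the downstream component answers queries from its own local copy of the data and never calls the upstream component.

```python
class EventStream:
    """In-memory stand-in for an event stream (hypothetical, single process)."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, handler):
        self.subscribers.append(handler)

    def publish(self, event):
        for handler in self.subscribers:
            handler(event)


class CustomerComponent:
    """Upstream component: owns customer data and publishes changes."""

    def __init__(self, stream):
        self.stream = stream

    def rename(self, customer_id, name):
        self.stream.publish(
            {"type": "customer-renamed", "id": customer_id, "name": name}
        )


class OrderComponent:
    """Downstream component: answers queries from its own replicated copy."""

    def __init__(self, stream):
        self.customer_names = {}  # local materialized view
        stream.subscribe(self.on_event)

    def on_event(self, event):
        if event["type"] == "customer-renamed":
            self.customer_names[event["id"]] = event["name"]

    def order_summary(self, order_id, customer_id):
        # No synchronous call to CustomerComponent is made here.
        name = self.customer_names.get(customer_id, "unknown")
        return f"order {order_id} for {name}"


stream = EventStream()
customers = CustomerComponent(stream)
orders = OrderComponent(stream)

customers.rename("c1", "Ada")
# Even if CustomerComponent were unavailable from this point on,
# the query below would still succeed from the local view.
summary = orders.order_summary("o42", "c1")
```

The trade-off in this design is that the local view is only as fresh as the last event received, which is the price paid for removing the synchronous dependency.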
So when a young buck colleague named Mike Donovan suggested that we really needed to look at the then new thing called Angular, my initial reaction was “oh no, not again”. However, I strive to stay in touch with my inner Bruce Pujanauski. Bruce was a seasoned mainframe guru back when I was a young buck C++ programmer. We were working on a large project to port a mainframe-based ERP system to an N-tier architecture. Bruce pointed out that we were re-inventing all the same wheels that they had already perfected on the mainframe, but as far as he could tell we were on the right track. Bruce understood that the context of the industry was changing and a new generation of engineers was going to be playing a major role and he was ready to embrace the change. That moment made a lasting impression on me. So much so, that I don't think of the UI pendulum as swinging back and forth. Instead, I see it and software architecture in general as zigzagging through time, constantly adjusting to the current context.
Running the presentation tier on the edge of the cloud like this was a game changer. It enabled virtually limitless global scale, for that tier, at virtually no cost. What followed was a true “how can we do more of this” moment. How can we achieve this at the business layer and the data layer? How can we run more at the edge of the cloud? How can we easily, efficiently, and cost-effectively support multi-regional, active-active deployments? Our journey through this book will show us how. We will push the API Gateway to the edge, enforce security at the edge, cache responses at the edge, store users' personal data on devices, replicate data between components and across regions, and more. For now, suffice it to say that scalability, even global scalability, no longer keeps me awake at night.
Not enough emphasis is placed on the Big R in conversations about cloud-native. Independent DURS ultimately comes up in every discussion on cloud-native concepts; to independently Deploy, Update, Replace, and Scale. The focus is inevitably placed on the first and the last, Deploy and Scale, respectively. Of course, Update is really just another word for Deploy, so it doesn't need much additional attention. But Replace is treated like a redheaded stepchild and only given a passing glance.
I think this is because the Big R is a crucial, higher-order concept in cloud-native, but many discussions on cloud-native are down in the technical weeds. There is no doubt, it is essential that we leverage disposable infrastructure to independently deploy and scale bounded isolated components. But this is just the beginning of the possibilities. In turn, disposable architecture builds on this foundation, takes the idea of disposability and replacement to the next level, and drives business value further. At this higher level, we are driving a wedge in monolithic thinking at the business level.
The monolith is etched on our brains and permeates our way of thinking. It leads us to architectural and business decisions that may be optimal in the context of the monolith, but not in the context of cloud-native. Monolithic thinking is an all or nothing mindset. When something has to be all or nothing, it frequently leads us to avoid risk, even when the payoff could be significant if we could only approach it in smaller steps. It just as frequently drives us to take extreme risk, when it is perceived that we have no choice because the end game is believed to be a necessity.
Disposable architecture (aka the Big R) is the antithesis of monolithic thinking. We have decomposed the cloud-native system into bounded isolated components and disposable infrastructure accelerates our ability to deploy and scale these components. One rule of thumb, regarding the appropriate size of a component, is that its initial development should be scoped to about 2 weeks. At this low level of investment per component, we are at liberty to experiment with alternatives to find an optimal solution. To put this in business terms, each experiment is the cost of information. With a monolith, we are more likely to live with a suboptimal solution. The usual argument is that the cost of replacement outweighs the ongoing cost of the suboptimal solution. But in reality, the budget was simply blown building the wrong solution.
In his book, Domain-Driven Design: Tackling Complexity in the Heart of Software (http://dddcommunity.org/book/evans_2003/), Eric Evans discusses the idea of the breakthrough. Teams continuously and iteratively refactor towards deeper insight with the objective of reaching a model that properly reflects the domain. Such a model should be easier to relate to when communicating with domain experts and thus make it safer and easier to reason about fixes and enhancements. This refactoring typically proceeds at a linear pace, until there is a breakthrough. A breakthrough is when the team realizes that there is a deep design flaw in the model that must be corrected. But breakthroughs typically require a high degree of refactoring.
Breakthroughs are the objective of disposable architecture. No one likes to make important decisions based on incomplete and/or inaccurate information. With disposable architecture, we can make small incremental investments to garner the knowledge necessary to glean the optimal solution. These breakthroughs may require completely reworking a component, but that initial work was just the cost of acquiring the information and knowledge that led to the breakthrough. In essence, disposable architecture allows us to minimize waste. We safely and wisely expend our development resources on controlled experiments and in the end get more value for that investment. We will discuss the topic of lean experiments and the related topic of decoupling deployment from release in Chapter 6, Deployment. Yet, to embrace disposable architecture, we need more than just disposable infrastructure and lean methods; we need to leverage value-added cloud services.
This is perhaps one of the most intuitive, yet simultaneously the most alienated concepts of cloud-native. When we started to dismantle our monolith, I made a conscious decision to fully leverage the value-added services of our cloud provider. Our monolith just leveraged the cloud for its infrastructure-as-a-service. That was a big improvement, as we have already discussed. Disposable infrastructure allowed us to move fast, but we wanted to move faster. Even when there were open source alternatives available, we chose to use the cloud-provided (that is, cloud-native) service.
What could be more cloud-native than using the native services of the cloud providers? It did not matter that there were already containers defined for the open source alternatives. As I have mentioned previously and will repeat many times, the containers are only the tip of the iceberg. I will repeat this many times throughout the book because it is good to repeat important points. A great deal of forethought, effort, and care are required for any and every service that you will be running on your own. It is the rest of the iceberg that keeps me up at night. How long does it take to really understand the ins and outs of these open source services before you can really run them in production with confidence? How many of these services can your team realistically build expertise in, all at the same time? How many "gotchas" will you run into at the least opportune time?
Many of these open source services are data focused, such as databases and messaging. For all my customers, and I'll assert for most companies, data is the value proposition. How much risk are you willing to assume with regard to the data that is your bread and butter? Are you certain that you will not lose any of that data? Do you have sufficient redundancy? Do you have a comprehensive backup and restore process? Do you have monitoring in place, so that you will know in advance and have ample time to grow your storage space? Have you hardened your operating system and locked down every last back door?
The bulk of the patterns in this book revolve around data. Cloud-native is about more than just scaling components. It is ultimately about scaling your data. Gone are the days of the monolithic database. Each component will have multiple dedicated databases of various types. This is an approach called polyglot persistence that we will discuss shortly. It will require your teams to own and operate many different types of persistence services. Is this where you want to place your time and effort? Or do you want to focus your efforts on your value proposition?
By leveraging the value-added services of our cloud provider, we cut months, if not more, off our ramp-up time and minimized our operational risk. Leveraging value-added cloud services gave us confidence that the services were operated properly. We could be certain that the services would scale and grow with us, as we needed them to. In some cases, the cloud services only have a single dial that you turn. We simply needed to hook up our third-party monitoring service to observe the metrics provided by these value-added services and focus on the alerts that were important to our components. We will discuss alerting and observability in Chapter 8, Monitoring.
This concept is also the most alienated, because of the fear of vendor lock-in. But vendor lock-in is monolithic thinking. In cloud-native systems, we make decisions on a component-by-component basis. We embrace disposable architecture and leverage value-added cloud services to increase our velocity, increase our knowledge, and minimize our risk. By leveraging the value-added services of our cloud provider, we gave ourselves time to learn more about all the new services and techniques. In many cases, we did not know if a specific type of service was going to meet our needs. Using the cloud-provided services was just the cost of acquiring the information and knowledge we needed to make informed long-term decisions. We were willing to outgrow a cloud-provided service, instead of growing into a service we ran ourselves. Maybe we would never outgrow the cloud-provided service. Maybe we would never grow into a service we ran ourselves.
It is important to have an exit strategy, in case you do outgrow a value-added cloud service. Fortunately, with bounded isolated components, we can exit one component at a time and not necessarily for all components. For example, a specific cloud provider service may be perfect for all but a small few of your components. In Chapter 3, Foundation Patterns, we will discuss the Event Sourcing and Data Lake patterns that are the foundation for any such exit strategy and conversion.
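To give a feel for why retained events make a per-component exit feasible, here is a minimal Python sketch. It is an assumption-laden illustration rather than the pattern as presented in Chapter 3: the `EVENT_LOG` list stands in for wherever a component's events are durably kept, and `DictStore` stands in for the replacement persistence service. The event types and fields are invented for the example.

```python
# Hypothetical event log that a component has retained. In this sketch it
# is a plain list, standing in for whatever durable store holds the events.
EVENT_LOG = [
    {"type": "item-added", "sku": "A1", "qty": 2},
    {"type": "item-added", "sku": "B7", "qty": 1},
    {"type": "item-removed", "sku": "A1", "qty": 1},
    {"type": "item-added", "sku": "A1", "qty": 5},
]


class DictStore:
    """Stand-in for the replacement persistence service."""

    def __init__(self):
        self.rows = {}

    def get(self, key):
        return self.rows.get(key, 0)

    def put(self, key, value):
        self.rows[key] = value


def replay(events, store):
    """Rebuild current state in a new store from the retained events."""
    for event in events:
        if event["type"] == "item-added":
            delta = event["qty"]
        elif event["type"] == "item-removed":
            delta = -event["qty"]
        else:
            continue  # ignore event types this component does not use
        store.put(event["sku"], store.get(event["sku"]) + delta)
    return store


new_store = replay(EVENT_LOG, DictStore())
```

Since the new store is populated purely from the events, the old service is never queried during the conversion, and only the one component that outgrew it needs to move.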
The willingness to welcome polyglot cloud is a true measure of cloud-native maturity. Let's start with something more familiar: polyglot programming. Polyglot programming is the idea that on a component-by-component basis, we will choose to use the programming language that best suits the requirements of the specific component. An organization, team, or individual will typically have a favorite or go-to language but will use another language for a specific component when it is more appropriate.
Moving on, polyglot persistence is a topic we will cover in depth in this book. This too is the idea that on a component-by-component basis, we will choose to use the storage mechanism that best suits the requirements of the specific component. One unique characteristic of polyglot persistence is that we will often use multiple storage mechanisms within a single component in an effort to get the absolute best performance and scalability for the specific workload. Optimal persistence is crucial for global scale, cloud-native systems. The advancements in the persistence layer are in many ways far more significant than any other cloud-native topic. We will be discussing cloud-native database concepts throughout this book.
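As a small illustration of using more than one storage mechanism inside a single component, consider the following Python sketch. Both stores are in-memory stand-ins and the `ProductComponent` class is hypothetical: a dictionary plays the role of a key-value store for lookups by ID, and an inverted index plays the role of a search service. Each access path is served by the structure best suited to it.

```python
from collections import defaultdict


class ProductComponent:
    """One component, two storage mechanisms, each suited to one access path."""

    def __init__(self):
        self.by_id = {}                  # key-value store stand-in
        self.by_word = defaultdict(set)  # search index stand-in

    def save(self, product_id, title):
        # Remove stale index entries before re-indexing an updated title.
        old = self.by_id.get(product_id)
        if old is not None:
            for word in old.lower().split():
                self.by_word[word].discard(product_id)
        self.by_id[product_id] = title
        for word in title.lower().split():
            self.by_word[word].add(product_id)

    def get(self, product_id):
        """Lookup by ID is served from the key-value store."""
        return self.by_id[product_id]

    def search(self, word):
        """Keyword search is served from the index."""
        return sorted(self.by_word.get(word.lower(), set()))


catalog = ProductComponent()
catalog.save("p1", "Red Kettle")
catalog.save("p2", "Red Toaster")
catalog.save("p1", "Blue Kettle")  # update: the index must follow the change
```

In this sketch both stores are updated in the same method call. With separate persistence services, keeping them in step becomes a design concern of its own.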
Polyglot cloud is the next, logical, evolutionary step. In a mature cloud-native system, you will begin to choose the cloud provider that is best on a component-by-component basis. You will have a go-to cloud provider for the majority of your components. However, you will find that competition is driving cloud providers to innovate at an ever-faster pace, and many of these innovations will help give your system its own competitive advantage. Alas, your go-to cloud provider's offering in a specific case may not be sufficient.
Does this mean that you will need to lift and shift your cloud-native system from one cloud provider to another? No, absolutely not. It will take less time and effort and incur less risk to perform a quick, potentially disposable, experiment with the additional provider. After that, there will be a compelling reason to try a feature offered by another provider, and another, and so on. Ultimately, it will not be possible to run your cloud-native system on just one provider. Thus, a mature cloud-native system will be characterized by polyglot cloud. It is inevitable, which is why I welcome it in our cloud-native definition.
Today, does your company use both Office 365 and AWS? The answer to this question, or one of its many permutations, is very often, yes. If so, then you are already polyglot cloud and you should find ways to integrate and leverage both. Has your company acquired another firm that uses another cloud provider? Is it possible that your company might acquire such a company? Can you think of any other potential scenarios that will inevitably lead your company to be polyglot cloud? I suspect most companies are already polyglot cloud without realizing it.
It is important to make a distinction between polyglot cloud and another common term, multi-cloud. Polyglot cloud and multi-cloud are different. They come from completely different mindsets. Multi-cloud is the idea that you can write your cloud-native system once and run it on multiple cloud providers, either concurrently for redundancy or in an effort to be provider-agnostic. We have seen this before in the Java application server market and it didn't pan out as promised. It did not pan out because the providers offered compelling value-added features that locked in your monolith. With cloud-native, we do not have a monolith. Thus, vendor lock-in is only on a component-by-component basis. There will always be compelling reasons to leverage vendor-specific offerings, as we have already discussed. A least common denominator approach does not provide for this flexibility. Thus we welcome polyglot cloud.
Furthermore, moving an entire system from one provider to another is a high-risk proposition. Big-bang migrations have always been and will always be problematic. Migrating a cloud-native system to another cloud provider is no different. This holds true even when a cloud abstraction platform is employed. This is because your system on top of the cloud abstraction platform is just the tip of the iceberg. All the real risk lies below the surface. A cloud abstraction layer adds significant project overhead to your system in its infancy when it is needed the least, requires more operational effort throughout the product life cycle, and does not mitigate the significant risks of migration should you ever actually need to do so.
Essentially, multi-cloud is characteristic of the monolithic, all or nothing thinking of the past. Polyglot cloud focuses instead on the promise of cloud-native. Welcoming polyglot cloud up front keeps your system lean, avoids unneeded layers and overhead, and frees your team to choose the right provider on a component-by-component basis.
Developers often ask me how they can learn to write infrastructure as code when they do not know enough about the infrastructure resources they are deploying. I typically respond with several probing questions, such as:
- When you were first hired straight out of school as a programmer, were you hired because you were an expert in that company's domain? No. The domain experts gave you the requirements and you wrote the software.
- Did you eventually become knowledgeable in that domain? Yes, absolutely.
- Have you ever been given just the documentation of an external system's API and then asked to go figure out how to integrate your system with that system? Sure. You read the documentation, you asked follow up questions, and you made it happen.
Then I just reiterate what they have already begun to realize. An infrastructure resource is just another functional domain that happens to be a technical domain. In the cloud, these resources are API-driven. Ultimately, we are just integrating our build pipelines with these APIs. It is just code. We work our agile development methodology, write the code, ask questions, and iterate. More and more we can do these deployments declaratively. In Chapter 6, Deployment, we will dive into more detail about how to do deployments and integrate these tasks into the development methodology.
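As a rough illustration of what declarative means here, the following sketch compares a desired set of resources with what currently exists and derives the actions needed to reconcile the two. It is a toy model under stated assumptions, not how any particular cloud provider or tool works: the `plan` function, the resource names, and the resource specifications are all hypothetical, and a real pipeline would hand a declaration like this to the provider's own deployment service.

```python
def plan(desired, actual):
    """Return the actions that turn the actual resources into the desired ones."""
    actions = []
    for name, spec in sorted(desired.items()):
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))
    for name in sorted(actual):
        if name not in desired:
            actions.append(("delete", name))
    return actions


# Hypothetical declaration of what the component needs.
desired = {
    "orders-table": {"type": "table", "capacity": 5},
    "orders-stream": {"type": "stream", "shards": 1},
}

# Hypothetical snapshot of what is currently provisioned.
actual = {
    "orders-table": {"type": "table", "capacity": 1},
    "legacy-queue": {"type": "queue"},
}

steps = plan(desired, actual)
```

With these inputs, `steps` contains a create for `orders-stream`, an update for `orders-table`, and a delete for `legacy-queue`. The point is that the team declares the end state and lets tooling work out the API calls.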
Sooner rather than later, we all become domain experts of the cloud-native resources we leverage. We start simple. Then as our components mature, they put more demands on these resources and we learn to leverage them as we go. As we discussed previously, this is why value-added cloud services are so valuable. They give us a strong foundation when we know the least and grow with us as we gain expertise. This is one of the ways that being cloud-native empowers teams. The cloud works on a shared responsibility model. Everything below a certain line is the responsibility of the cloud provider. Everything above this line is the responsibility of your team. As you use more and more value-added cloud services, such as database-as-a-service and function-as-a-service, that line is drawn higher and higher. Your team becomes self-sufficient by leveraging the power of value-added cloud services, which allows you to focus on the value proposition of your components. Your team controls the full stack because you can provision your required resources at will.
You may have heard of Conway's Law:
"organizations are constrained to produce application designs which are copies of their communication structures"
Let's put this in the context of an example. Many companies are organized around the architectures of N-tiered monoliths and data centers. Teams are organized by skillset: user interface, backend services, database administrators, testers, operations, and so forth. Conway's Law essentially says that organizations like this will not successfully implement cloud-native systems.
Cloud-native systems are composed of bounded isolated components. These components own all their resources. As such, self-sufficient, full-stack teams must own the components and their resources. Otherwise, the communication and coordination overhead across horizontal teams will marginalize the benefits of bounded isolated components and ultimately tear them apart. Instead, companies need to re-organize to reflect the desired system architecture.
Self-sufficient, full-stack teams own one or more components for the entirety of each component's full life cycle. This is often referred to as the you build it, you run it mentality. The result is that teams tend to build in quality up front because they will be directly on the hook when the component is broken. The patterns in this book for creating bounded isolated components help teams control their own destiny by controlling the upstream and downstream dependencies of their components. In Chapter 6, Deployment, Chapter 7, Testing, and Chapter 8, Monitoring, we will discuss the techniques that teams can leverage to help ensure their confidence in their components. Self-sufficient, full-stack teams are at liberty to continuously deliver innovation at their own pace, they are on the hook to deliver safely, and companies can scale by adding more teams.
It was a big event when we completed our first fully automated deployment of our monolith using disposable infrastructure. There was a celebration. It was a big deal. Then we did another deployment in a couple of days and the naysayers nodded their heads. Then we did another deployment the next day and congratulatory emails went out. Before long we could do a deployment, even multiple deployments, on any given day, without even a whisper or a pat on the back. Deployments had become non-events. Successful, uneventful deployments had become an expectation. The culture had begun to change.
We still have celebrations. We have them when we complete a feature release. But a release is now just a marketing event; it is not a development event. Releases are milestones that we work towards. We have completely decoupled deployment from release. Components are deployed to production with the completion of every task. A release is made up of a large number of small, task-scoped deployments. The last task deployment of a release could very well have happened weeks before we flipped the feature on for general availability. In Chapter 6, Deployment, we will discuss the mechanics of decoupling deployment from release. In Chapter 7, Testing, and Chapter 8, Monitoring, we will discuss the techniques that help ensure that deployments can safely happen at this pace. Many of these techniques were considered heresy until their value was demonstrated.
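One common way to flip a feature on after its code has already been deployed is a feature flag. The sketch below is a minimal illustration of that idea and is not necessarily the mechanism covered in Chapter 6. The `FeatureFlags` class, the `loyalty-discount` flag name, and the `checkout_total_cents` function are all hypothetical.

```python
class FeatureFlags:
    """In-memory flag store; a real system would read flags from shared configuration."""

    def __init__(self):
        self._enabled = set()

    def enable(self, name):
        self._enabled.add(name)

    def is_enabled(self, name):
        return name in self._enabled


def checkout_total_cents(subtotal_cents, flags):
    """Apply the new 10% discount only once the feature has been released."""
    if flags.is_enabled("loyalty-discount"):
        return subtotal_cents * 90 // 100
    return subtotal_cents


flags = FeatureFlags()

# The discount code is deployed but dark: behavior is unchanged.
before_release = checkout_total_cents(10000, flags)

# The release is a flag flip, not another deployment.
flags.enable("loyalty-discount")
after_release = checkout_total_cents(10000, flags)
```

Here the same deployed code returns 10000 before the flag is enabled and 9000 afterwards, which is what allows the final task deployment to precede general availability by weeks.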
Ultimately, cultural change comes down to trust. Trust is earned. We must incrementally demonstrate that we can execute and deliver on the promise of cloud-native. The monolith has been the cultural norm forever, in software terms. It is a mindset; it is a way of thinking. Many software-related business practices and policies revolve around the expectations set by the realities of monolithic systems. This is why cloud-native must drive cultural change. When we show that the downstream practices truly can deliver, only then can the upstream practices and policies really begin to change and embrace lean thinking. With cloud-native driven lean thinking, the pace of innovation really accelerates through experimentation. The business can quickly and safely adjust course based on market feedback and scale by minimizing the effort required to perform the business experimentation, all the while knowing that the solutions can scale when the experiments prove fruitful.
In this chapter, we covered the fundamental concepts of cloud-native systems. Our definition of cloud-native is focused on your context. You need an architecture that will grow with you and not weigh you down now. Cloud-native is more than architecting and optimizing to take advantage of the cloud. It is an entirely different way of thinking and reasoning about software architecture and development practices. Cloud-native breaks free of monolithic thinking to empower self-sufficient teams that continuously deliver innovation with confidence. This confidence is derived from the knowledge that cloud-native systems are powered by disposable infrastructure, composed of bounded isolated components, and scale globally, so that they remain responsive in the face of failures. Cloud-native teams embrace disposable architecture, leverage value-added cloud services, and welcome polyglot cloud to provide the strong foundation that enables them to take control of the full stack, focus on the value proposition, and drive cultural change from the bottom up by earning trust through successful execution.