The advent of the cloud has led to a new paradigm in designing, implementing, and ongoing maintenance of computer systems. While there are many different names for this new paradigm, the one most commonly used is cloud native architectures. In this book, we will explore what exactly cloud native architectures are, why they are new and different, and how they are being implemented across a wide range of global companies. As the name suggests, it's all about the cloud and using cloud vendor services to design these architectures to solve business problems in new, robust, and secure ways. The purpose of this chapter is to explain and define what cloud native architectures are, and provide some insights into the pros, cons, and myths of cloud native architectures. We will explore what it means to be cloud native, understand the spectrum and components that are required for this type of architecture, and appreciate the journey that a company would need to undertake to move up in maturity on the model.
If you asked 100 people what the definition of cloud native was, you just might get 100 different answers. Why are there so many different answers? To start with, cloud computing itself is still evolving every day, so the definitions offered a few years ago are quite possibly not fully up to date with the current state of the cloud. Secondly, cloud native architectures are a completely new paradigm that use new methods to solve business problems that can typically only be achieved at the scale of cloud computing. Finally, depending on the role of the person being asked, the definition is very different, whether they be an architect, developer, administrator, or decision maker. So, what exactly is the definition of cloud native?
"Cloud computing is the on-demand delivery of compute power, database storage, applications, and other IT resources through a cloud services platform via the internet with pay-as-you-go pricing."
Therefore, at its most basic form, cloud native means to embrace cloud computing services to design the solution; however, that only covers part of what is required to become cloud native. There is a lot more than just using the underlying cloud infrastructure, even if it's the most mature service available.
Automation and application design play significant roles in this process as well. The cloud, with its API-driven design, allows for extreme automation at scale to not only create instances or specific systems, but to also completely roll out an entire corporate landscape with no human interaction. Finally, a critical component in creating a cloud native architecture is the approach used to design a specific application. Systems designed with the best cloud services, and deployed with extreme automation, can still fail to achieve desired results if the logic of the application does not take into consideration the new scale at which it can operate.
There is no one right answer to what a cloud native architecture is; many types of architectures could fall into the cloud native category. Using the three design principles or axes—cloud native services, application centric design, and automation—most systems can be evaluated for their level of cloud native maturity. In addition, since these principles are ever expanding as new technologies, techniques or design patterns are developed, and so the maturity of cloud native architectures will continue to mature. We, the authors of this book, believe that cloud native architectures are formed by evolution and fall into a maturity model. For the remainder of this book, cloud native architectures will be described using the Cloud Native Maturity Model (CNMM), following the design principles outlined, so that architecture patterns can be mapped to their point of evolution:
To understand where a system will fall on the CNMM, it's important to understand what the components of cloud native architecture are. By definition, being cloud native requires the adoption of cloud services. Each cloud vendor will have its own set of services, with the most mature having the richest set of features. The incorporation of these services, from basic building blocks to the most advanced, cutting-edge technologies, will define how sophisticated a cloud native architecture is on the cloud services axis:
Amazon Web Services (AWS) is often cited as the most advanced cloud platform (at the time of writing). The following diagram shows all the services that AWS has to offer, from basic building blocks, to managed service offerings, to advanced platform services:
Regardless of the level of maturity of the cloud vendor, they will have the building blocks of infrastructure, which include compute, storage, networking, and monitoring. Depending on the cloud maturity level of an organization and the team designing a system, it might be common to begin the cloud native journey by leveraging these baseline infrastructure building blocks. Virtual server instances, block disk storage, object storage, fiber lines and VPNs, load balancers, cloud API monitoring, and instance monitoring are all types of building blocks that a customer would use to start consuming the cloud. Similar to what would be available in an existing on-premises data center, these services would allow for a familiar look and feel for design teams to start creating applications in the cloud. The adoption of these services would be considered the bare minimum required to develop a cloud native architecture, and would result in a relatively low level on the cloud native services axis.
Often, a company will choose to migrate an existing application to the cloud and perform the migration in a lift-and-shift model. This approach would literally move the application stack and surrounding components to the cloud with no changes to the design, technology, or component architecture. Therefore, these migrations only use the basic building blocks that the cloud offers, since that is what is also in place at the customer-on-premises locations. While this is a low level of maturity, it allows for something critical to happen: gaining experience with how the cloud works. Even when using the cloud services building blocks, the design team will quickly add their own guard rails, policies, and naming conventions to learn more efficient techniques to deal with security, deployments, networking, and other core requirements for an early-stage cloud native system.
One of the key outcomes that a company will gain from this maturity stage is the basic premise of the cloud and how that impacts their design patterns: for example, horizontal scaling versus vertical scaling, and the price implications of these designs and how to implement them efficiently. In addition, learning how the chosen cloud vendor operates and groups their services in specific locations and the interactions between these groupings to design high availability and disaster recovery through architecture. Finally, learning the cloud approach to storage and the ability to offload processing to cloud services that scale efficiently and natively on the platform is a critical approach to designing architectures. Even though the adoption of cloud services building blocks is a relatively low level of maturity, it is critical for companies that are beginning their cloud journey.
Undifferentiated heavy lifting is often used to describe when time, effort, resources, or money are deployed to perform tasks that will not add to the bottom line of a company. Undifferentiated simply means that there is nothing to distinguish the activity from the way others do it. Heavy lifting refers to the difficult task of technology innovation and operations which, if done correctly, nobody ever recognizes, and if done wrong, can cause catastrophic consequences to business operations. These phrases combined mean that when a company does difficult tasks—that if done wrong will cause business impact, but that company doesn't have a core competency to distinguish itself in doing these tasks—it not only doesn't add business value, but can easily detract from the business.
Unfortunately, this describes a large majority of the IT operations for enterprise companies, and is a critical selling point to using the cloud. Cloud vendors do have a core competency in technology innovation and operations at a scale that most companies could never dream of. Therefore, it only makes sense that cloud vendors have matured their services to include managed offerings, where they own the management of all aspects of the service and the consumer only needs to develop the business logic or data being deployed to the service. This will allow the undifferentiated heavy lifting to be shifted from the company to the cloud vendor, and allow that company to dedicate significantly more resources to creating business value (their differentiators).
As we have seen, there are lots of combinations of cloud services that can be used to design cloud native architectures using only the basic building blocks and patterns; however, as a design team grows in their understanding of the chosen cloud vendor's services and becomes more mature in their approach, they will undoubtedly want to use more advanced cloud services. Mature cloud vendors will have managed service offerings that are often able to replace components that require undifferentiated heavy lifting. Managed service offerings from cloud vendors would include the following:
- Directory services
- Load balancers
- Caching systems
- Data warehouses
- Code repositories
- Automation tools
- Elastic searching tools
Another area of importance for these services is the agility they bring to a solution. If a system were designed to use these tools but managed by the company operations team, often the process to provision the virtual instance, configure, tune, and secure the package will significantly slow the progress being made by the design team. Using cloud vendor managed service offerings in place of those self-managed components will allow the teams to implement the architecture quickly, and begin the process of testing the applications that will run in that environment.
Using managed service offerings from a cloud vendor doesn't necessarily lead to more advanced architecture patterns; however, it does lead to the ability to think bigger and not be constrained by undifferentiated heavy lifting. This concept of not being constrained by limitations that are normally found on-premises, like finite physical resources, is a critical design attribute when creating cloud native architectures, which will enable systems to reach a scale hard to achieve elsewhere. For example, some areas where using a cloud vendor managed service offering would allow for a more scalable and native cloud architecture are as follows:
- Using managed load balancers to decouple components in an architecture
- Leveraging managed data warehouse systems to provision only the storage required and letting it scale automatically as more data is imported
- Using managed RDBMS database engines to enable quick and efficient transactional processing with durability and high availability built in
Cloud vendors are on their journey and continue to mature service offerings to ever more sophisticated levels. The current wave of innovation among the top cloud vendors suggests a continued move into more and more services that are managed by the vendor at scale and security, and reduces the cost, complexity, and risk to the customer. One way they are achieving this is by re-engineering existing technology platforms to not only deal with a specific technical problem, but by also doing it with the cloud value proposition in mind. For example, AWS has a managed database service, Amazon Aurora, built on a fully distributed and self-healing storage service deigned to keep data safe and distributed across multiple availability zones. This service increases the usefulness of the managed service offerings specific to databases, as described in the previous section, by also allowing for a storage array that grows on demand and is tuned for the cloud to provide performance of up to five times better than a similar database engine.
Not all advanced managed services are re-engineered ideas of existing technology; with the introduction of serverless computing, cloud vendors are removing undifferentiated heavy lifting away from not only operations, but also the development cycle. With the virtually limitless resources the cloud can offer, decoupling large applications into individual functions is the next big wave of distributed systems design, and leads directly into the formation of cloud native architectures.
According to AWS:
"Serverless computing allows you to build and run applications and services without thinking about servers. Serverless applications don't require you to provision, scale, and manage any servers. You can build them for virtually any type of application or backend service, and everything required to run and scale your application with high availability is handled for you."
There are many types of serverless services, including compute, API proxies, storage, databases, message processing, orchestration, analytics, and developer tools. One key attribute to define whether a cloud service is serverless or only a managed offering is the way that licensing and usage are priced. Serverless leaves behind the old-world model of core-based pricing, which would imply it is tied directly to a server instance, and relies more on consumption-based pricing. The length of time a function runs, the amount of transactions per second required, or a combination of these, are common metrics that are used for consumption-based pricing with serverless offerings.
Using these existing advanced cloud native managed services, and continuing to adopt the new ones as they are released, represents a high level of maturity on the CNMM, and will enable companies to develop some of the most advanced cloud native architectures. With these services and an experienced design team, a company will be able to push the boundaries of what is possible when old-world constraints are removed and truly limitless capacity and sophistication are leveraged. For example, instead of a traditional three-tier distributed computing stack consisting of a front end web, an application, or middleware tier and an OLTP database for storage, a new approach to this design pattern would be an API gateway that uses an event-driven computing container as an endpoint, and a managed and scalable NoSQL database for persistence. All these components could fall into a serverless model, therefore allowing the design team to focus on the business logic, and not how to achieve the scale required.
Beyond serverless and other advanced managed services lies the future. Currently, the cutting edge of cloud computing is the offerings being released in the artificial intelligence, machine learning, and deep learning space. Mature cloud vendors have services that fall into these categories with continued innovation happening on a hyperscale. It is still very early in the cycle for artificial intelligence, and design teams should expect more to come.
The Cloud native services axis section described the components that could make up a cloud native architecture, and showed ever more mature approaches to creating applications using them. As with all of the design principles on the CNMM, the journey will begin with a baseline understanding of the principle and mature as the design team becomes more and more knowledgeable of how to implement at scale; however, cloud computing components are only one part of the design principles that are required to make up a mature cloud native architecture. These are used in conjunction with the other two principles, automation and application centricity, to create systems that can take advantage of the cloud in a secure and robust way.
The second cloud native principle is about how the application itself will be designed and architected. This section will focus on the actual application design process, and will identify architecture patterns that are mature and lead to cloud native architectures. Similar to the other design principles of the CNMM, developing and architecting cloud native applications is an evolution with different patterns that are typically followed along the way. Ultimately, used in conjunction with the other CNMM principles, the outcome will be a mature, sophisticated, and robust cloud native architecture that is ready to continue evolving as the world of cloud computing expands:
The twelve-factor app is a methodology for building software-as-a-service applications (https://12factor.net/). This methodology was written in late 2011, and is often cited as the base building blocks for designing scalable and robust cloud native applications. Its principles apply to applications written in any programming language, and which use any combination of backing services (database, queue, memory cache, and so on), and is increasingly useful on any cloud vendor platform. The idea behind the twelve-factor app is that there are twelve important factors to consider when designing applications that minimize the time and cost of adding new developers, cleanly interoperate with the environment, can be deployed to cloud vendors, minimize divergence between environment, and allow for scaling of the application. The twelve factors (https://12factor.net/) are as follows:
One code base tracked in revision control, many deploys.
Explicitly declare and isolate dependencies.
Store config in the environment.
Treat backing services as attached resources.
Build, release, run
Strictly separate build and run stages.
Execute the app as one (or more) stateless process(es).
Export services through port binding.
Scale-out through the process model.
Maximize robustness with fast startup and graceful shutdown.
Keep development, staging, and production as similar as possible.
Treat logs as event streams.
Run admin/management tasks as one-off processes.
Previous sections of this chapter have already discussed how the CNMM takes into consideration multiple factors from this methodology. For example, factor 1 is all about keeping the code base in a repository, which is standard best practice. Factors 3, 10, and 12 are all about keeping your environments separate, but making sure they do not drift apart from each other from a code and configuration perspective. Factor 5 ensures that you have a clean and repeatable CICD pipeline with a separation of functions. And factor 11 is about treating logs as event streams so they can be analyzed and acted upon in near real time. The remaining factors align well to cloud native design for the simple reason that they focus on being self-contained (factor 2), treat everything as a service (factors 4 and 7), allow efficient scale-out (factors 6 and 8), and handle faults gracefully (factor 9). Designing an application using the 12 factor methodology is not the only way to develop cloud native architectures; however, it does offer a standardized set of guidelines that, if followed, will allow an application to be more mature on the CNMM.
Architecture design patterns are always evolving to take advantage of the newest technology innovations. For a long time, monolithic architectures were popular, often due to the cost of physical resources and the slow velocity in which applications were developed and deployed. These patterns fit well with the workhorse of computing and mainframes, and even today there are plenty of legacy applications running as a monolithic architecture. As IT operations and business requirements became more complex, and speed to market was gaining importance, additional monolithic applications were deployed to support these requirements. Eventually, these monolithic applications needed to communicate with each other to share data or execute functions that the other systems contained. This intercommunication was the precursor to service-oriented architectures (SOA), which allowed design teams to create smaller application components (as compared to monolithic), implement middleware components to mediate the communication, and isolate the ability to access the components, except through specific endpoints. SOA designs increasingly gained popularity during the virtualization boom, since deploying services became easier and less expensive on virtualized hardware.
Service-oriented architectures consist of two or more components which provide their services to other services through specific communication protocols. The communication protocols are often referred to as web services, and consist of a few different common ones: WSDL, SOAP, and RESTful HTTP, in addition to messaging protocols like JMS. As the complexity of these different protocols and services grew, using an enterprise service bus (ESB) became increasingly common as a mediation layer between services. This allowed for services to abstract their endpoints, and the ESB could take care of the message translations from various sources to get a correctly formatted call to the desired system. While this ESB approach reduced the complexity of communicating between services, it also introduced new complexity in the middleware logic required to translate service calls and handle workflows. This often resulted in very complex SOA applications where application code for each of the components needed to be deployed at the same time, resulting in a big bang and risky major deployment across the composite application. The SOA approach had a positive impact on the blast radius issues that monolithic architectures inherently had by separating core components into their own discrete applications; however, it also introduced a new challenge in the complexity of deployment. This complexity manifested itself in a way that caused so many interdependencies that a single large deployment across all SOA applications was often required. As a result of these risky big bang deployments, they were often only undertaken a few times a year, and drastically reduced velocity slowed the pace of the business requirements. As cloud computing became more common and the constraints of the on-premises environments began to fade a way, a new architecture pattern evolved: microservices. With the cloud, application teams no longer needed to wait months to have compute capacity to test their code, nor were they constrained by a limited number of physical resources, either.
The microservices architecture style takes the distributed nature of SOA and breaks those services up into even more discrete and loosely coupled application functions. Microservices not only reduce blast radius by even further isolating functions, but they also dramatically increase the velocity of application deployments by treating each microservice function as its own component. Using a small DevOps team accountable for a specific microservice will allow for the continuous integration and continuous delivery of the code in small chunks, which increases velocity and also allow for quick rollbacks in the event of unintended issues being introduced to the service.
Microservices and cloud computing fit well together, and often microservices are considered the most mature type of cloud native architecture at this point in time. The reason why they fit so well together is due to the way cloud vendors develop their services, often as individual building blocks that can be used in many ways to achieve a business result. This building block approach gives the application design teams the creativity to mix and match services to solve their problems, rather then being forced into using a specific type of data store or programming language. This has led to increased innovation and design patterns that take advantage of the cloud, like serverless computing services to further obfuscate the management of resources from the development teams, allowing them to focus on business logic.
Regardless of the methodology used or the final cloud native design pattern implemented, there are specific design considerations that all cloud native architectures should attempt to implement. While not all of these are required to be considered a cloud native architecture, as these considerations are implemented, the maturity of the system increases, and will fall on a higher level of the CNMM. These considerations include instrumentation, security, parallelization, resiliency, event-driven, and future-proofed:
- Instrumentation: Including application instrumentation is about more than just log stream analysis; it requires the ability to monitor and measure the performance of the application in real time. Adding instrumentation will directly lead to the ability of the application to be self-aware of latency conditions, component failures due to system faults, and other characteristics that are important to a specific business application. Instrumentation is critical to many of the other design considerations, so including it as a first-class citizen in the application will enable long-term benefits.
- Security: All applications need security built in; however, designing for a cloud native security architecture is critical to ensure the application takes advantage of cloud vendor security services, third-party security services, and design-level security in layers, all of which will harden the posture of the application and reduce the blast radius in the event of an attack or breach.
- Parallelization: Designing an application that can execute distinct processes in parallel with other parts of the application will directly impact its ability to have the performance required as it scales up. This includes allowing the same set of functions to execute many times in parallel, or having many distinct functions in the application execute in parallel.
- Resiliency: Considering how the application will handle faults and still perform at scale is important. Using cloud vendor innovations, like deployment across multiple physical data centers, using multiple decoupled tiers of the application, and automating the startup, shutdown, and migration of application components between cloud vendor locations are all ways to ensure resiliency for the application.
- Event-driven: Applications that are event-driven are able to employ techniques that analyze the events to perform actions, whether those be business logic, resiliency modification, security assessments, or auto scaling of application components. All events are logged and analyzed by advanced machine learning techniques to enable additional automation to be employed as more events are identified.
- Future-proofed: Thinking about the future is a critical way to ensure that an application will continue to evolve along the CNMM as time and innovation moves on. Implementing these considerations will help with future-proofing; however, all applications must be optimized through automation and code enhancements constantly to always be able to deliver the business results required.
There are many different methodologies that can be employed to create a cloud native application, including microservices, twelve-factor app design patterns, and cloud native design considerations. There is no one correct path for designing cloud native applications, and as with all parts of the CNMM, maturity will increase as more robust considerations are applied. The application will reach the peak maturity for this axis once most of these designs are implemented.
The third and final cloud native principle is about automation. Throughout this chapter, the other CNMM principles have been discussed in detail and explained, particularly why using cloud native services and application centric design enable cloud native architectures to achieve scale. However, these alone do not allow a system to really take advantage of the cloud. If a system were designed using the most advanced services available, but the operational aspects of the application were done manually, it would have a hard time realizing its intended purpose. This type of operational automation is often referred to as Infrastructure as Code, and there is an evolution in maturity to achieve a sophisticated cloud native architecture. Cloud vendors typically develop all of their services to be API endpoints, which allows for programmatic calls to create, modify, or destroy services. This approach is the driver behind Infrastructure as Code, where previously an operations team would be responsible for the physical setup and deployment of components in addition to the infrastructure design and configuration.
With Infrastructure as Code automation, operations teams can now focus on the application-specific design and rely on the cloud vendor to handle the undifferentiated heavy lifting of resource deployment. This Infrastructure as Code is then treated like any other deployment artifact for the application, and is stored in source code repositories, versioned and maintained for long-term consistency of environment buildouts. The degree of automation still evolves on a spectrum with the early phases being focused on environment buildout, resource configuration, and application deployments. As a solution matures, the automation will evolve to include more advanced monitoring, scaling, and performance activities, and ultimately include auditing, compliance, governance, and optimization of the full solution. From there, automation of the most advanced designs use artificial intelligence and machine and deep learning techniques to self-heal and self-direct the system to change its structure based on the current conditions.
Automation is the key to achieving the scale and security required by cloud native architectures:
Designing, deploying, and managing an application in the cloud is complicated, but all systems need to be set up and configured. In the cloud, this process can become more streamlined and consistent by developing the environment and configuration process with code. There are potentially lots of cloud services and resources involved that go well beyond traditional servers, subnets, and physical equipment being managed on-premises. This phase of the automation axis focuses on API-driven environment provisioning, system configuration, and application deployments, which allows customers to use code to handle these repeatable tasks.
Whether the solution is a large and complex implementation for an enterprise company or a relatively straightforward system deployment, the use of automation to manage consistency is critical to enabling a cloud native architecture. For large and complex solutions, or where regulatory requirements demand the separation of duties, companies can use Infrastructure as Code to isolate the different operations teams to focus only on their area—for example, core infrastructure, networking, security, and monitoring. In other cases, all components could be handled by a single team, possibly even the development team if a DevOps model is being used. Regardless of how the development of the Infrastructure as Code happens, it is important to ensure that agility and consistency are a constant in the process, allowing systems to be deployed often, and following the design requirements exactly.
There are multiple schools of thought on how to handle the operations of a system using Infrastructure as Code. In some cases, every time an environment change occurs, the full suite of Infrastructure as Code automation is executed to replace the existing environment. This is referred to as immutable infrastructure, since the system components are never updated, but replaced with the new version or configuration every time. This allows a company to reduce environment or configuration drift, and also ensures a strict way to prove governance and reduce security issues that could be manually implemented.
While the immutable infrastructure approach has its advantages, it might not be feasible to replace the full environment every time, and so changes must be made at a more specific component level. Automation is still critical with this approach to ensure everything is implemented with consistency; however, it would make the cloud resources mutable, or give them the ability to change over time. There are numerous vendors that have products to achieve instance-level automation, and most cloud vendors have managed offerings to perform this type of automation. These tools will allow for code or scripts to be run in the environment to make the changes. These scripts would be a part of the Infrastructure as Code deployment artifacts, and would be developed and maintained in the same way as the set of immutable scripts.
Environment management and configuration is not the only way automation is required at this baseline level. Code deployments and elasticity are also very important components to ensuring a fully automated cloud native architecture. There are numerous tools on the market that allow for the full deployment pipeline being automated, often referred to as continuous integration, continuous deployment (CICD). A code deployment pipeline often includes all aspects of the process, from code check-in, automated compiling with code analysis, packaging, and deployment, to specific environments with different hooks or approval stops to ensure a clean deployment. Used in conjunction with Infrastructure as Code for environment and operations management, CICD pipelines allow for extreme agility and consistency for a cloud native architecture.
Cloud native architectures that use complex services and span across multiple geographic locations require the ability to change often based on usage patterns and other factors. Using automation to monitor the full span of the solution, ensure compliance with company or regulatory standards, and continuously optimize resource usage shows an increasing level of maturity. As with any evolution, building on the previous stages of maturity enables the use of advanced techniques that allow for increased scale to be introduced to a solution.
One of the most important data points that can be collected is the monitoring data that cloud vendors will have built into their offerings. Mature cloud vendors have monitoring services natively integrated to their other services that can capture metrics, events, and logs of those services that would otherwise be unobtainable. Using these monitoring services to trigger basic events that are happening is a type of automation that will ensure overall system health. For example, a system using a fleet of compute virtual machines as the logic tier of a services component normally expects a certain amount of requests, but at periodic times a spike in requests causes the CPU and network traffic on these instances to quickly increase. If properly configured, the cloud monitoring service will detect this increase and launch additional instances to even out the load to more acceptable levels and ensure proper system performance. The process of launching additional resources is a design configuration of the system that requires Infrastructure as Code automation, to ensure that the new instances are deployed using the same exact configuration and code as all the others in the cluster. This type of activity is often called auto scaling, and it also works in reverse, removing instances once the spike in requests has subsided.
Automated compliance of environment and system configurations is becoming increasingly critical for large enterprise customers. Incorporating automation to perform constant compliance audit checks across system components shows a high level of maturity on the automation axis. These configuration snapshot services allow a complete picture in time of the full environmental makeup, and are stored as text for long-term analysis. Using automation, these snapshots can be compared against previous views of the environment to ensure that configuration drift has not happened. In addition to previous views of the environment, the current snapshot can be compared against desired and compliant configurations that will support audit requirements in regulated industries.
Optimization of cloud resources is an area that can easily be overlooked. Before the cloud, a system was designed and an estimation was used to determine the capacity required for that system to run at peak requirements. This led to the procurement of expensive and complex hardware and software before the system had even been created. Due to that, it was common for a significant amount of over-provisioned capacity to sit idle and waiting for an increase in requests to happen. With the cloud, those challenges all but disappear; however, system designers still run into situations where they don't know what capacity is needed. To resolve this, automated optimization can be used to constantly check all system components and, using historical trends, understand if those resources are over-or under-utilized. Auto scaling is a way to achieve this; however, there are much more sophisticated ways that will provide additional optimization if implemented correctly; for example, using an automated process to check running instances across all environments for under-used capacity and turning them off, or performing a similar check to shut down all development environments on nights and weekends, could save lots of money for a company.
One of the key ways to achieve maturity for a cloud native architecture with regards to monitoring, compliance, and optimization is to leverage an extensive logging framework. Loading data into this framework and analyzing that data to make decisions is a complex task, and requires the design team to fully understand the various components and make sure that all required data is being captured at all times. These types of frameworks help to remove the thinking that logs are files to be gathered and stored, and instead focus on logs as streams of events that should be analyzed in real time for anomalies of any kind. For example, a relatively quick way to implement a logging framework would be to use ElastiCache, Logstash, and Kibana, often referred to as an ELK stack, to capture all types of the system log events, cloud vendor services log events, and other third-party log events.
As a system evolves and moves further up the automation maturity model, it will rely more and more on the data it generates to analyze and act upon. Similar to the monitoring, compliance, and optimization design from the previous part of the axis, a mature cloud native architecture will constantly be analyzing log event streams to detect anomalies and inefficiencies; however, the most advanced maturity is demonstrated by using artificial intelligence (AI) and machine learning (ML) to predict how events could impact the system and make proactive adjustments before they cause performance, security, or other business degradation. The longer the event data collected is stored and the amount of disparate sources the data comes from will allow these techniques to have ever-increasing data points to take action upon.
Data is king when it comes to predictive analytics and machine learning. The never-ending process of teaching a system how to categorize events takes time, data, and automation. Being able to correlate seemingly unrelated data events to each other to form a hypothesis is the basis of AI and ML techniques. These hypotheses will have a set of actions that can be taken if they occur, which, in the past, has resulted in anomaly correction. Automated responses to an event that matches an anomaly hypothesis and taking corrective action is an example of using predictive analytics based on ML to resolve an issue before it becomes business-impacting. In addition, there will always be situations where a new event is captured and historical data cannot accurately correlate that to a previously known anomaly. Even still, this lack of correlation is actually an indicator in itself and will enable the cross-connection of data events, anomalies, and responses to gain more intelligence.
There are many examples of how using ML on datasets will show correlation that could not be seen by a human reviewing the same datasets—like how often a failed user login resulted in a lockout versus a retry over millions of different attempts, and if those lockouts were the result of a harmless user forgetting a password, or a brute-force attack to gain system entry. Because the algorithm can search all required datasets and correlate the results, it will be able to identify patterns of when an event is harmless or malicious. Using the output from these patterns, predictive actions can be taken to prevent potential security issues by quickly isolating frontend resources or blocking requests from users deemed to be malicious due to where they come from (IP or country specific), the type of traffic being transmitted (Distributed Denial of Service), or another scenario.
This type of automation, if implemented correctly across a system, will result in some of the most advanced architectures that are possible today. With the current state of the cloud services available, using predictive analytics, artificial intelligence, and machine learning is the cutting edge of how a mature cloud native architecture can be designed; however, as the services become more mature, additional techniques will be available and innovative people will continue to use these in ever-increasing maturity to ensure their systems are resilient to business damage.
Automation unlocks significant value when implemented for a cloud native architecture. The maturity level of automation will evolve from simply setting up environments and configuring components to performing advanced monitoring, compliance, and optimization across the solution. Combined with increased innovation of cloud vendor services, the maturity of automation and using artificial intelligence and machine learning will allow for predictive actions to be taken to resolve common, known, and increasingly unknown anomalies in a system. This combination of cloud vendor services adoption and automation form two of the three critical design principles for the CNMM, with the application design and architecture principle being the final requirement.
Companies large and small, new and mature, are seeing the benefits of cloud computing. There are many paths to get to the cloud, and that often depends on the maturity of the organization and the willingness of senior management to enact the change required. Regardless of the type of organization, the shift to cloud computing is a journey that will take time, effort, and persistence to be successful. It is easy for a company to say they want to be cloud native; however, for most companies, getting there is a complex and difficult prospect. For organizations that are mature and have lots of legacy workloads and manage data centers, they will have to not only identify a roadmap and plan for migration, but also manage the people and process aspect of the journey. For companies that are newer and don't have a lot of technical debt in the form of traditional workloads, their journey will accelerate with the cloud being the place of early experimentation; however, maturing to a cloud native enterprise will still take time.
Cloud computing is here to stay. Years ago there were many discussions on whether a company should declare a cloud-first model, or not chase the latest and greatest technologies; however, at this point in time, just about every company has taken the first step towards cloud computing, and many have made the decision to be a cloud-first organization. At its most basic level, making this decision simply means that all new workloads will be deployed to the chosen cloud vendor unless it is proven that this will not be sufficient for the business requirements. Sometimes this happens due to information security (that is, government-classified or regulatory conditions), and sometimes it's because of a specific technical issue or limitation that the cloud vendor has, which is difficult to overcome in a short time. Regardless, the vast majority of new projects will end up in the cloud with various stages of maturity, as described in the CNMM earlier.
Even though this decision is common in today's IT environment, there are still challenges that need to be addressed for it to be successful. IT and business leaders need to ensure that their people and processes are aligned to a cloud-first model. In addition, developing a DevOps and agile methodology will help the organization overcome the slow and rigid nature of waterfall projects with siloed development and operations teams.
Organizations with large IT departments or long-term outsourced contracts will inherently have a workforce that is skilled at the technologies that have been running in the company up until that point. Making the shift to any new technology, especially cloud computing, will require a significant amount of retooling, personnel shifting, and a change in the thought pattern of the workforce. Organizations can overcome these people challenges by splitting their IT workforce into two distinct sections: those who maintain the legacy workloads and keep the original methodologies, and those who work on the cloud-first model and adopt the new technologies and process to be successful. This approach can work for a while; however, over time, and as workloads are moved to the target cloud platform, more and more people will shift to the new operating model. The benefits of this approach allow a select few people who are passionate and excited to learn new technologies and techniques to be the trail-blazers, while the rest of the workforce can retool their skills at a more methodical pace.
One specific area that can often be difficult for experienced IT professionals to overcome, especially if they have gained their experience with data center deployments and lots of large legacy workloads, is the concept of unlimited resources. Since most cloud vendors have effectively unlimited resources to be consumed, removing that constraint on application design will open up a lot of unique and innovative ways to solve problems that were impossible when designing applications before. For example, being bound to a specific set of CPU processors to complete a batch job will cause developers to design less parallelization, whereas with unlimited CPUs, the entire job could be designed to be run in parallel, potentially faster and cheaper than with lots of serial executions. Those people who can think big and remove constraints should be considered for the trail-blazers team.
Processes are also a big hurdle for being a cloud-first organization. Lots of companies that are transitioning to the cloud phase are also in transitioning from the SOA to microservices phase. Therefore, it would be common for the processes in place to be supportive of SOA architectures and deployments, which are most likely there to slow things down and ensure that the big bang deployments to the composite application are done correctly and with significant testing. Being cloud-first and using microservices, the goal is to deploy as fast as possible and as many times as possible, to support quickly changing business requirements. Therefore, modifying processes to support this agility is critical. For example, if an organization is strictly following ITIL, they might require a strict approval chain with checks and balances before any modification or code deployment can be made to production. This process is probably in place because of the complex interconnected nature of the composite applications, and one minor change could impact the entire set of systems; however, with microservices architectures, since they are fully self-contained and only publish an API (usually), as long as the API is not changing, the code itself would not impact other services. Changing processes to allow for lots of smaller deployments or rollbacks will ensure speed and business agility.
The cloud is not a magic place where problems go away. It is a place where some of the traditional challenges go away; however, new challenges will come up. Legacy enterprise companies have been making the switch from waterfall project management to agile for a while now. That is good news for a company intending to be cloud native, since iteration, failing fast, and innovation are critical to long-term success, and agile projects allow for that type of project delivery. A large part of the reason this methodology is popular with cloud native companies is the fast pace of innovation that cloud vendors are going through. For example, AWS launched 1,430 new services and features in 2017, which is almost four per day, and it is set to eclipse that again in 2018. With this level of innovation happening, cloud services are changing, and using an agile methodology to manage cloud native projects will enable companies to take advantage of these as they come out.
DevOps (or the merging of development teams and operations teams) is a new IT operating model that helps bridge the gap between how code is developed and how it is operated once deployed to production. Making a single team accountable for the code logic, testing, deployment artifacts, and the operations of the system will ensure that nothing is lost in the life cycle of a code development process. This model fits well with the cloud and microservices, since it enables a small team to own the full service, write in whatever code they are most suited to, deploy to the cloud platform chosen by the company, and then operate that application and be in the best position to resolve any issues the application might have once it's in production.
Together, agile methodologies and DevOps are a critical change needed by companies that are considering the move to becoming a cloud native organization.
The journey to the cloud will take time and lots of trial and error. Typically, a company will have identified a primary cloud vendor for their requirements and, in some cases, they will have a second cloud vendor for specific requirements. In addition, almost all companies begin with a hybrid architecture approach, which allows them to leverage their existing investments and applications while moving workloads into their chosen cloud. Often, the cloud native journey begins with a single workload being either migrated or designed for the cloud, which gives critical experience to the design team and helps create the operating foundation the organization will use for the cloud.
The cloud is a vast set of resources that can be used to solve all kinds of business problems; however, it is also a complex set of technologies that requires not only skillful people to use it, but also a strict operating foundation to ensure it is being done securely and with cost and scale in mind. Even before a single workload is deployed to the cloud, it is important for a company to fully identify their expected foundational design. This would include everything from account structures, virtual network design, regional/geographic requirements, security structure in terms of areas such as identity and access management and compliance, or governance considerations with regards to specific services to be used for different types of workloads. Understanding how to leverage Infrastructure as Code, as pointed out in the axis automation earlier, is also a critical element that should be identified early.
Once all of the decisions are made and the cloud operating foundation is in place, that is the time for the initial projects to begin. Between the decision-making process and the first few projects being deployed, the DevOps teams will gain lots of experience with both the agile pace of working, the target cloud vendor platform, and the company's set of guidelines and approaches to their cloud native environment.
In addition to the foundation of the cloud platform, a company must decide how to leverage its existing assets. While the value proposition of cloud computing is not debated much anymore, the pace of migration and how fast to deprecate existing assets is. Using a hybrid cloud approach for the beginning of the cloud native journey is very common, and lets the company easily operate with its two existing groups (the legacy group and the cloud-first group). This approach will also enable a cheaper pathway to success, since it doesn't require a 'big bang' migration from existing data centers to the cloud, but allows for individual projects, business units, or other segregated areas to move faster than others.
All cloud vendors have a hybrid architecture option that can be leveraged when a company wants to keep some workloads in their data centers and have others in the cloud. This hybrid architecture approach typically involves setting up some type of network connectivity from one or more data center(s) to one or more cloud vendor geographical region(s). This network connectivity can take place in the form of a VPN over public internet paths, or various dedicated fiber options. Regardless of the type of connectivity, the outcome should be a single network that makes all company resources and workloads visible to each other (within security and governance constraints). Typical patterns for a hybrid cloud architecture are:
- Legacy workloads on-premises and new projects in the cloud
- Production workloads on-premises and non-production in the cloud
- Disaster recovery environment in the cloud
- Storage or archival in the cloud
- On-premises workloads bursting into the cloud for additional capacity
Over time, as more workloads are migrated into the cloud or retired from the on-premises environments, the center of gravity will shift to the cloud and an organization will have more resources in the cloud than on-premises. This progress is natural, and will signify the tipping point of a company that is well into its cloud native journey. Eventually, a cloud native company would expect to have all of its workloads in the cloud and remove just about all hybrid connectivity options since they are no longer in use. At that point, the organization would be considered a mature cloud native company.
Enterprise companies need to ensure their risk is spread out so that they reduce the blast radius in the event of an issue, whether this be a natural disaster, security event, or just making sure that they are covering their customers in all of the locations they operate in. Therefore, the allure of a multi-cloud environment is strong, and some larger organizations are starting to go down this path for their cloud journey. In the right circumstances, this approach does make sense and gives the additional assurance that their business can withstand specific types of challenges; however, for most companies this type of architecture is going to add significant complexity and possibly slow down adoption of the cloud.
The myth of multi-cloud deployments and architectures is often spread around by system integrators that thrive on complexity and change management. They want to promote the most complex and design-heavy architecture possible, so that a company feels compelled to leverage them more to ensure their IT operations are running smoothly. Multi-cloud is the most recent way of going about this, since taking this route will require twice the amount of cloud-specific knowledge and twice the amount of hybrid or intercloud connectivity. Often, there is a promise of a cloud broker, where a single platform can manage resources in multiple clouds and on-premises to make the cloud operations easier. The challenge with this school of thought is that these cloud brokers are really just exposing the lowest common denominator of the cloud vendors, typically instances, storage, load balancers, and so on, and do not have the ability to allow use of the most innovative services from the chosen cloud vendors. This will stifle cloud native architecture innovation and force the company into a similar operating model as they used before the cloud, often paying another company to manage the environments for them and not gaining much from their cloud journey.
Another common approach to multi-cloud is the use of containers for moving workloads between clouds. In theory, this approach works and solves a lot of the challenges that multi-cloud poses. There is currently a lot of innovation going on with this approach and the ability to be successful with moving containers between clouds is still in its infancy. As additional frameworks, tools, and maturity level appear, this is an area that could promise a new way to create cutting edge cloud native architectures.
Companies that are in their cloud native journey and are considering a multi-cloud approach should ask themselves the reasons why this is being considered. The authors of this book would argue that organizations would gain more speed and efficiency in the early and middle parts of their journey if they choose a single cloud vendor and focus all of their re-tooling, efforts, and people on that, versus trying to add a second cloud into the design. Ultimately, choose the path that will best serve the needs of the business and that will fit culturally into the organization.
Companies will start off the journey with the decision to be a cloud-first organization and the creation of a DevOps team, and will then continue with choosing a cloud vendor and setting up the target cloud-operating foundation. Soon after these activities are complete, the time to scale-out and ramp up migrations begins. A cloud native company will have the goal of reducing their self-managed data centers and workloads and shifting those as much as possible to the cloud. There are three main paths this can present:
- Lift-and-shift migration of legacy workloads to the cloud
- Re-engineering of legacy workloads to optimize in the cloud
- Greenfield cloud native development
For most large enterprise companies, all three of these options will take place with different parts of the legacy workloads. For smaller companies, any mix of the three could be employed, depending on the outcomes being sought.
Lift-and-shift migration is the act of moving existing workloads, as is, to the target cloud-operating foundation already implemented. This type of exercise is usually done against a grouping of applications, by business unit, technology stack, or complexity level of some other type of metric. A lift-and-shift migration in its purest form is literally bit-by-bit copies of existing instances, databases, storage, and so on, and is actually rare, since the cost benefits of doing this to the cloud would be negligible. For example, moving 100 instances from an on-premises data center to the cloud, with no changes to size or taking into consideration scaling options, would most likely result in a higher cost to the company.
The more common derivative of a lift-and-shift is a lift-tinker-shift migration, where the majority of the workloads are moved; however, specific components are upgraded or swapped out for cloud services. For example, moving 100 instances from an on-premises data center to the cloud, but standardizing on a specific operating system (for example, Red Hat Enterprise Edition), moving all databases into a cloud vendor managed service (for example, Amazon Relational Database Service), and storing backup or archive files in a cloud storage blog storage (for example, Amazon Simple Storage Service) would constitute a lift-tinker-shift migration. This type of migration would most likely save the company a lot of money for the business case, take advantage of some of the most mature services in the cloud, and allow for significant long-term advantages with future deployments.
Companies that are truly moving to be a cloud native organization will most likely choose to re-engineer most of their legacy workloads so that they can take advantage of the scale and innovation that the cloud has to offer. Workloads that are chosen to be migrated to the cloud but re-engineered in the process might take longer to move, but once completed they will fall on some part of the CNMM and be considered cloud native. These types of migrations are not quite greenfield development projects, but are also not lift-and-shift migrations either; they are designed to have significant portions of the application workloads rewritten or replatformed, so they fit the cloud native standards. For example, a composite application contains 100 instances using a traditional SOA architecture, containing five different distinct workloads with an ESB to mediate traffic. To re-engineer this composite application, the company would decide to remove the ESB, break the distinct workloads into more function-based microservices, remove as many instances as possible by leveraging serverless cloud services, and reformat the database to be NoSQL instead of relational.
Migrating workloads using a re-engineering approach is a good way for the trail blazers of a company's DevOps team to create a significant project, dive deep into the designing of the architecture, and employ all new skills and techniques for their cloud native journey. We believe that, over time, the majority of migration projects will be re-engineering existing workloads to take advantage of cloud computing.
While technically not a migration, cloud native companies that are creating new applications will choose to go through the entire development cycle with a cloud native architecture in mind. Even workloads that are re-engineered might not be able to fully change their underlying technologies for whatever reason. When a company chooses to go full cloud native development, all legacy approaches to development, scale constraints, slow deployments, and process and legacy-skilled workers are removed, and only the latest and greatest cloud services, architectures, and techniques are employed. Companies that have gotten to this phase of the journey are truly cloud native, and are set up for long-term success with how they develop and deploy business applications.
Netflix is often the first company that is brought up when people talk about a visionary cloud native company, but why? This section will break up the journey that Netflix has undertaken to get to the point it is at today. Using the CNMM, each of the axis will be discussed and key points taken into account to demonstrate their maturity along the cloud native journey.
As with all major migrations to the cloud, Netflix's journey was not something that happened overnight. As early as May 2010, Netflix had been publicly touting AWS as its chosen cloud computing partner. The following quote has been extracted from the press release that both companies published at that time (http://phx.corporate-ir.net/phoenix.zhtml?c=176060&p=irol-newsArticle&ID=1423977):
"Amazon Web Services today announced that Netflix, Inc., has chosen AWS to run a variety of mission-critical, customer-facing and backend applications. While letting Amazon Web Services do the worrying about the technology infrastructure, Netflix continues to improve its members' overall experience of instantly watching TV episodes and movies on the TV and computer, and receiving DVDs by mail."
That press release goes on to say that Netflix had actually been using AWS for the experimentation of workload development for over a year, meaning that since 2009, Netflix has been on their cloud native journey. Since AWS released its first service in 2006, it is evident that Netflix saw the benefits from the very beginning and aggressively moved to take advantage of the new style of computing.
They phased the migration of components over time to reduce risk, gain experience, and leverage the newest innovations that AWS was delivering. Here's a quick timeline of their migration [2009 - 2010] http://www.sfisaca.org/images/FC12Presentations/D1_2.pdf, [2011 - 2013] https://www.slideshare.net/AmazonWebServices/ent209-netflix-cloud-migration-devops-and-distributed-systems-aws-reinvent-2014 (Slide 11), and  https://medium.com/netflix-techblog/netflix-billing-migration-to-aws-451fba085a4:
- 2009: Migrating video master content system logs into AWS S3
- 2010: DRM, CDN Routing, Web Signup, Search, Moving Choosing, Metadata, Device Management, and more were migrated into AWS
- 2011: Customer Service, International Lookup, Call Logs, and Customer Service analytics
- 2012: Search Pages, E-C, and Your Account
- 2013: Big Data and Analytics
- 2016: Billing and Payments
You can read more about this at https://media.netflix.com/en/company-blog/completing-the-netflix-cloud-migration. This seven-year journey enabled Netflix to completely shut down their own data centers in January 2016, and so they are now a completely cloud native company. Admittedly, this journey for Netflix was not easy, and a lot of tough decisions and trade-offs had to be made along the way, which will be true for any cloud native journey; however, the long-term benefits of re-engineering a system with a cloud native architecture, instead of just moving the current state to the cloud, means that all of the technical debt and other limitations are left behind. Therefore, in the words of Yury Izrailevsky (Vice President, Cloud and Platform Engineering at Netflix):
"We chose the cloud native approach, rebuilding virtually all of our technology and fundamentally changing the way we operate the company. Architecturally, we migrated from a monolithic app to hundreds of micro-services, and denormalized and our data model, using NoSQL databases. Budget approvals, centralized release coordination and multi-week hardware provisioning cycles made way to continuous delivery, engineering teams making independent decisions using self-service tools in a loosely coupled DevOps environment, helping accelerate innovation."
This amazing journey for Netflix continues to this day. Since the cloud native maturity model doesn't have an ending point, as cloud native architectures mature, so too will the CNMM and those companies that are pushing the boundaries of how to develop these architectures.
The journey to becoming a cloud native company at Netflix was impressive, and continues to yield benefits for Netflix and its customers. The growth Netflix was enjoying around 2010 and beyond made it difficult for them to logistically keep up with the demand for additional hardware and the capacity to run and scale their systems. They quickly realized that they were an entertainment creation and distribution company, and not a data center operations company. Knowing that managing an ever-growing number of data centers around the world would continue to cause huge capital outflows and require a focus that was not core to their customers, they made their cloud-first decision.
Elasticity of the cloud is possibly the key benefit for Netflix, as it allows them to add thousands of instances and petabytes of storage, on demand, as their customer base grows and usage increases. This reliance on the cloud's ability to provide resources as required also includes their big data and analytics processing engines, video transcoding, billing, payments, and many other services that make their business run. In addition to the scale and elasticity that the cloud brings, Netflix also cites the cloud as away to significantly increase their services availability. They were able to use the cloud to distribute their workloads across zones and geographies that use fundamentally unreliable but redundant components to achieve their desired 99.99% availability of services.
Finally, while cost was not a key driver for their decision to move to the cloud, their costs per streaming start ended up being a fraction of what it was when they managed their own data centers. This was a very beneficial side effect of the scale they were able to achieve, and the benefit was only possible due to the elasticity of the cloud. Specifically, this enabled them to continuously optimize instance type mix and to grow and shrink our footprint near-instantaneously without the need to maintain large capacity buffers. We can also benefit from the economies of scale that are only possible in a large cloud ecosystem. These benefits have enabled Netflix to have a laser focus on their customer and business requirements, and not spend resources on areas that do not directly impact that business mission.
Now that we understand what the Netflix journey was about and how they benefited from that journey, this section will use the CNMM to evaluate how that journey unfolded and where they stand on the maturity model. Since they have been most vocal about the work they did to migrate their billing and payment system to AWS, that is the workload that will be used for this evaluation. That system consisted of batch jobs, billing APIs, and integrations, with other services in their composite application stack, including an on-premises data center at the time. Full details about that migration can be found at their blog, https://medium.com/netflix-techblog/netflix-billing-migration-to-aws-451fba085a4, on this topic.
The focus of the cloud native services adoption spectrum is to demonstrate the amount of cloud vendor services that are in use for the architecture. While the full extent of the services that Netflix uses is unknown, they have publicly disclosed numerous AWS services that help them achieve their architecture. Referring to the mature cloud vendor services diagram in the beginning of this chapter, they certainly use most of the foundational services that fall into the infrastructure: networking, compute, storage, and database tiers. They also use most of the services from the security and application services tier. Finally, they have discussed their usage of lots of services in the management tools, analytics, dev tools, and artificial intelligence tiers. This amount of services usage would classify Netflix as a very mature user of cloud native services, and therefore they have a high maturity on the cloud native services axis.
It is also important to note that Netflix also uses services that are not in the cloud. They are very vocal that their usage of content delivery networks (CDNs) are considered a core competency for their business to be successful, and therefore they set up and manage their own global content network. This point is made in a blog post at https://media.netflix.com/en/company-blog/how-netflix-works-with-isps-around-the-globe-to-deliver-a-great-viewing-experience by the company in 2016, where they articulated their usage of AWS and CDNs and why they made their decisions:
"Essentially everything before you hit "play" happens in AWS, including all of the logic of the application interface, the content discovery and selection experience, recommendation algorithms, transcoding, etc.; we use AWS for these applications because the need for this type of computing is not unique to Netflix and we can take advantage of the ease of use and growing commoditization of the "cloud" market. Everything after you hit "play" is unique to Netflix, and our growing need for scale in this area presented the opportunity to create greater efficiency for our content delivery and for the internet in general."
In addition, there are cases where they choose to use open source tools running on cloud building blocks, like Cassandra for their NoSQL database, or Kafka for their event streams. These architecture decisions are the trade-offs they made to ensure that they are using the best tools for their individual needs, not just what a cloud vendor offers.
Designing an application for the cloud is arguably the most complicated part of the journey, and having a high level of maturity on the application centric design axis will require specific approaches. Netflix faced some big challenges during its billing and payment system migration to the cloud; specifically, they wanted near-zero downtime, massive scalability, SOX compliance, and global rollout. At the point of time where they begin this project, they already had many other systems running in the cloud as decoupled services. Therefore, they used the same decoupling approach by designing microservices for their billing and payment systems.
To quote their blog on this topic:
"We started chipping away existing code into smaller, efficient modules and first moved some critical dependencies to run from the Cloud. We moved our tax solution to the Cloud first. Next, we retired serving member billing history from giant tables that were part of many different code paths. We built a new application to capture billing events, migrated only necessary data into our new Cassandra data store and started serving billing history, globally, from the Cloud. We spent a good amount of time writing a data migration tool that would transform member billing attributes spread across many tables in Oracle into a much simpler Cassandra data structure. We worked with our DVD engineering counterparts to further simplify our integration and got rid of obsolete code."
The other major redesign during this process was the move from an Oracle database's heavy relational design to a more flexible and scalable NoSQL data structure for subscription processing, and a regionally distributed MySQL relational database for user-transactional processing. These changes required other Netflix services to modify their design to take advantage of the decoupling of data storage and retry the ability for data input to their NoSQL database solution. This enabled Netflix to migrate millions of rows from their on-premises Oracle database to Cassandra in AWS without any obvious user impact.
During this billing and payment system migration to the cloud, Netflix made many significant decisions that would impact its architecture. These decisions were made with long-term impact in mind, which caused a longer migration time, but ensured a future-proofed architecture that could scale as they grew internationally. The cleaning up of code to remove technical debt is a prime example of this, and allowed them to ensure the new code base was designed using microservices, and had other cloud native design principles included. Netflix has demonstrated a high level of maturity on the application-centric design axis.
The automation axis demonstrates a company's ability to manage, operate, optimize, secure, and predict how their systems are behaving to ensure a positive customer experience. Netflix understood early on in their cloud journey that they had to develop new ways to verify that their systems were operating at the highest level of performance, and almost more importantly, that their systems were resilient to service faults of all kinds. They created a suite of tools called the Simian Army (https://medium.com/netflix-techblog/the-netflix-simian-army-16e57fbab116), which includes all kinds of automation that is used to identify bottlenecks, break points, and many other types of issues that would disrupt their operations for customers. One of the original tools and the inspiration for the entire Simian Army suite is their Chaos Monkey; in their words:
"...our philosophy when we built Chaos Monkey, a tool that randomly disables our production instances to make sure we can survive this common type of failure without any customer impact. The name comes from the idea of unleashing a wild monkey with a weapon in your data center (or cloud region) to randomly shoot down instances and chew through cables — all the while we continue serving our customers without interruption. By running Chaos Monkey in the middle of a business day, in a carefully monitored environment with engineers standing by to address any problems, we can still learn the lessons about the weaknesses of our system, and build automatic recovery mechanisms to deal with them. So next time an instance fails at 3 am on a Sunday, we won't even notice."
Having systems that can survive randomly shutting down critical services is the definition of high levels of automation. This means that the entire landscape of systems must follow strict automated processes, including environment management, configuration, deployments, monitoring, compliance, optimization, and even predictive analytics. Chaos Monkey went on to inspire many other tools that all fall into the Simian Army toolset. The full suite of the Simian Army is:
- Latency Monkey: Induces artificial delays into RESTful calls to similar service degradation
- Conformity Monkey: Finds instances that do not adhere to predefined best practices and shuts them down
- Doctor Monkey: Taps into health checks that run on an instance to monitor external signs of health
- Janitor Monkey: Searches for unused resources and disposes of them
- Security Monkey: Finds security violations and vulnerabilities and terminates offending instances
- 10-18 Monkey: Detects configuration and runtime problems in specific geographic regions
- Chaos Gorilla: Similar to Chaos Monkey, but simulates an entire outage of AWS available zones
However, they didn't stop there. They also created a cloud wide telemetry and monitoring platform known as Atlas (https://medium.com/netflix-techblog/introducing-atlas-netflixs-primary-telemetry-platform-bd31f4d8ed9a), which is responsible of capturing all-time series data. The primary goal for Atlas is to support queries over dimensional time series data so they can drill down into problems as quickly as possible. This tool satisfies the logging aspect of the twelve-factor app design, and allows them to have enormous amounts of data and events to analyze and take action on before they become customer-impacting. In addition to Atlas, in 2015 Netflix released a tool called Spinnaker (https://www.spinnaker.io/), which is an open source, multi-cloud, continuous delivery platform for releasing software changes with high velocity and confidence. Netflix is constantly updating and releasing additional automation tools that help them manage, deploy, and monitor all their services, using globally distributed AWS regions and, in some cases, using other cloud vendor services.
Netflix has been automating everything in their environment for almost as long as they have been migrating workloads to the cloud. Today, they rely on those tools to ensure their global network is functioning properly and serving their customers. Therefore, they would fall on the highly mature level of the automation axis.
In this chapter, we defined exactly what a cloud native is and what areas of focus are required to develop a mature cloud native architecture. Using a CNMM, we identified that all architectures will have three design principles: cloud services adoption, degree of automation, and application-centric design. These principles are used to gauge the maturity of the components of the architecture, as it relates to them and where they fall on their own spectrum. Finally, we broke down what a cloud native journey is for a company, how they make the cloud-first decision, how they change their people, process, and technology, how they create a cloud-operating environment, and finally, how a company would migrate or redesign their workloads to be in the cloud-first world they have created.
In the next chapter, we will start out with a deep dive into the cloud adoption framework, and understand the cloud journey that a company undertakes in more detail by looking into the seven pillars of the framework. We will understand migrations and the greenfield development of the journey, and we will finish with the security and risk that come along with the adoption of the cloud.