Kubernetes in Production Best Practices

By Aly Saleh , Murat Karslioglu

About this book

Although out-of-the-box solutions can help you to get a cluster up and running quickly, running a Kubernetes cluster that is optimized for production workloads is a challenge, especially for users with basic or intermediate knowledge. With detailed coverage of cloud industry standards and best practices for achieving scalability, availability, operational excellence, and cost optimization, this Kubernetes book is a blueprint for managing applications and services in production.

You'll discover the most common way to deploy and operate Kubernetes clusters, which is to use a public cloud-managed service from AWS, Azure, or Google Cloud Platform (GCP). This book explores Amazon Elastic Kubernetes Service (Amazon EKS), the AWS-managed version of Kubernetes, for working through practical exercises. As you get to grips with implementation details specific to AWS and EKS, you'll understand the design concepts, implementation best practices, and configuration applicable to other cloud-managed services. Throughout the book, you’ll also discover standard and cloud-agnostic tools, such as Terraform and Ansible, for provisioning and configuring infrastructure.

By the end of this book, you’ll be able to leverage Kubernetes to operate and manage your production environments confidently.

Publication date:
March 2021


Chapter 2: Architecting Production-Grade Kubernetes Infrastructure

In the previous chapter, you learned about the core components of Kubernetes and the basics of its infrastructure, and why putting Kubernetes in production is a challenging journey. We introduced the production-readiness characteristics for the Kubernetes clusters, along with our recommended checklist for the services and configurations that ensure the production-readiness of your clusters.

We also introduced a group of infrastructure design principles that we learned through building production-grade cloud environments. We use them as our guideline through this book whenever we make architectural and design decisions, and we highly recommend that cloud infrastructure teams consider these when it comes to architecting new infrastructure for Kubernetes and cloud platforms in general.

In this chapter, you will learn about the important architectural decisions that you will need to tackle while designing your Kubernetes infrastructure. We will explore the alternatives and the choices that you have for each of these decisions, along with the possible benefits and drawbacks. In addition to that, you will learn about the cloud architecture considerations, such as scaling, availability, security, and cost. We do not intend to make final decisions for you, but rather to provide guidance, because every organization has different needs and use cases. Our role is to explore them and guide you through the decision-making process. When possible, we will state our preferred choices, which we will follow through this book for the practical exercises.

In this chapter, we will cover the following topics:

  • Understanding Kubernetes infrastructure design considerations
  • Exploring Kubernetes deployment strategy alternatives
  • Designing an Amazon EKS infrastructure

Understanding Kubernetes infrastructure design considerations

When it comes to Kubernetes infrastructure design, there are a few, albeit important, considerations to take into account. Almost every cloud infrastructure architecture shares the same set of considerations; however, we will discuss these considerations from a Kubernetes perspective, and shed some light on them.

Scaling and elasticity

Public cloud infrastructure, such as AWS, Azure, and GCP, introduced scaling and elasticity capabilities at unprecedented levels. Kubernetes and containerization technologies arrived to build upon these capabilities and extend them further.

When you design a Kubernetes cluster infrastructure, you should ensure that your architecture covers the following two areas:

  • Scalable Kubernetes infrastructure
  • Scalable workloads deployed to the Kubernetes clusters

To achieve the first requirement, there are parts that depend on the underlying infrastructure, either public cloud or on-premises, and other parts that depend on the Kubernetes cluster itself.

The first part is usually solved when you choose to use a managed Kubernetes service such as EKS, AKS, or GKE, as the cluster's control plane and worker nodes will be scalable and supported by other layers of scalable infrastructure.

However, in some use cases, you may need to deploy a self-managed Kubernetes cluster, either on-premises or in the cloud, and in this case, you need to consider how to support scaling and elasticity to enable your Kubernetes clusters to operate at their full capacity.

In all public cloud infrastructure, there is the concept of compute auto scaling groups, and Kubernetes clusters are built on them. However, because of the nature of the workloads running on Kubernetes, scaling needs should be synchronized with the cluster scheduling actions. This is where Kubernetes cluster autoscaler comes to our aid.

Cluster autoscaler (CAS) is a Kubernetes cluster add-on that you optionally deploy to your cluster, and it automatically scales the number of worker nodes up and down based on the set of conditions and configurations that you specify in the CAS. Basically, it triggers cluster upscaling when there is a pod that cannot be scheduled due to insufficient compute resources, and it triggers cluster downscaling when there are underutilized nodes whose pods can be rescheduled and placed on other nodes. You should take into consideration the time a cloud provider takes to launch a new node, as this could be a problem for time-sensitive apps; in this case, you may consider a CAS configuration that enables node overprovisioning.

For more information about CAS, refer to the following link: https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler.
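
The upscaling and downscaling triggers described above can be sketched in a few lines of Python. This is a simplified illustration and not the actual CAS implementation: it only considers CPU requests, the `Node` class, the `decide` function, and the 0.5 utilization threshold are assumptions made for this example, and the downscaling check compares against the total free capacity of the other nodes rather than simulating the scheduler.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    cpu_capacity: float   # allocatable CPU cores on the node
    cpu_requested: float  # sum of CPU requests of the pods placed on it


def decide(nodes, pending_pod_cpu, scale_down_threshold=0.5):
    """Return ("scale_up", None), ("scale_down", node_name) or ("none", None).

    pending_pod_cpu is a list with the CPU request of each pending pod.
    The threshold is a hypothetical value chosen for this sketch.
    """
    # Upscale: at least one pending pod does not fit on any existing node.
    for cpu in pending_pod_cpu:
        fits_somewhere = any(n.cpu_capacity - n.cpu_requested >= cpu for n in nodes)
        if not fits_somewhere:
            return ("scale_up", None)

    # Downscale: a node is underutilized and its requests fit in the
    # free capacity of the remaining nodes (a coarse approximation).
    for n in nodes:
        utilization = n.cpu_requested / n.cpu_capacity
        if utilization < scale_down_threshold:
            free_elsewhere = sum(
                o.cpu_capacity - o.cpu_requested for o in nodes if o is not n
            )
            if free_elsewhere >= n.cpu_requested:
                return ("scale_down", n.name)

    return ("none", None)


if __name__ == "__main__":
    busy = [Node("a", 4, 3.5), Node("b", 4, 3.0)]
    print(decide(busy, pending_pod_cpu=[2.0]))
    idle = [Node("a", 4, 1.0), Node("b", 4, 1.0)]
    print(decide(idle, pending_pod_cpu=[]))
```

The sketch checks for unschedulable pods first, so a cluster with pending work is never shrunk; the real add-on exposes many more options, which are described at the link above.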

To achieve the second scaling requirement, Kubernetes provides two solutions to achieve autoscaling of the pods:

  • Horizontal Pod Autoscaler (HPA): This works similarly to cloud autoscaling groups, but at a pod deployment level. Think of the pod as the VM instance. HPA scales the number of pods based on a specific metrics threshold. This can be CPU or memory utilization metrics, or you can define a custom metric. To understand how HPA works, you can continue reading about it here: https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/.
  • Vertical Pod Autoscaler (VPA): This scales the pod vertically by adjusting its CPU and memory requests and limits according to the pod usage metrics. Think of VPA as upscaling/downscaling the VM instance by changing its type in the public cloud. VPA can affect CAS and trigger upscaling events, so you should revise the CAS and VPA configurations to get them aligned and avoid any unpredictable scaling behavior. To understand how VPA works, you can continue reading about it here: https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler.
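
To illustrate how HPA turns a metrics threshold into a pod count, the following sketch applies the scaling formula given in the Kubernetes HPA documentation linked above: the desired replica count is the current count multiplied by the ratio of the observed metric to the target metric, rounded up. The function name is an assumption for this example, and the sketch ignores details such as minimum/maximum replica bounds, stabilization windows, and pods that are not yet ready.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric, tolerance=0.1):
    """Replica count that brings the average metric back to the target.

    tolerance mirrors the idea that small deviations from the target
    do not trigger any scaling action.
    """
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: keep the current size.
        return current_replicas
    return math.ceil(current_replicas * ratio)


if __name__ == "__main__":
    # Four pods averaging 90% CPU against a 60% target.
    print(desired_replicas(4, current_metric=90, target_metric=60))
    # Ten pods averaging 30% CPU against a 60% target.
    print(desired_replicas(10, current_metric=30, target_metric=60))
```

Because the result is rounded up, the autoscaler errs on the side of extra capacity rather than leaving the workload above its target.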

We highly recommend using HPA and VPA for your production deployments (they are not essential for non-production environments). We will give examples of how to use both of them in deploying production-grade apps and services in Chapter 8, Deploying Seamless and Reliable Applications.

High availability and reliability

Uptime means reliability and is usually the top metric that the infrastructure teams measure and target for enhancement. Uptime drives the service-level objectives (SLOs) for services, and the service level agreements (SLAs) with customers, and it also indicates how stable and reliable your systems and Software as a Service (SaaS) products are. High availability is the key for increasing uptime, and when it comes to Kubernetes clusters' infrastructure, the same rules still apply. This is why designing a highly available cluster and workload is an essential requirement for a production-grade Kubernetes cluster.
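
To make the relationship between uptime targets and SLOs more concrete, the following minimal sketch converts an availability target into the amount of downtime it allows over a period. The function name and the 30-day default period are assumptions chosen for illustration; your own SLOs may be measured over a different window.

```python
def allowed_downtime_minutes(slo_percent, period_days=30):
    """Minutes of downtime permitted by an availability target over a period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% over 30 days allows about {minutes:.1f} minutes of downtime")
```

Each additional "nine" cuts the allowed downtime by a factor of ten, which is why the availability architecture choices discussed next have such a direct effect on the SLOs you can commit to.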

You can architect a highly available Kubernetes infrastructure on different levels of availability as follows:

  • A cluster in a single public cloud zone (single data center): This is considered the easiest architecture among the others, but it brings the highest risk. We do not recommend this solution.
  • A cluster in multiple zones (multiple data centers) but in a single cloud region: This is still easy to implement, it provides a higher level of availability, and it is a common architecture for Kubernetes clusters. However, when your cloud provider has a full region outage, your cluster will be entirely unavailable. Such full region outages rarely happen, but you still need to be prepared for such a scenario.
  • Across multi-region clusters, but within the same cloud provider: In this architecture, you usually run multiple federated Kubernetes clusters to serve your production workloads. This is usually the preferred solution for high availability, but it comes at a cost that makes it hard to implement and operate, especially because of possible poor network performance and the need for shared storage for stateful applications. We do not recommend this architecture since, for the majority of SaaS products, it is enough to deploy Kubernetes in a single region and multiple zones. However, if you have multi-region as a requirement for a reason other than high availability, you may consider multi-region Kubernetes federated clusters as a solution.
  • Multiple clusters across multi-cloud deployment: This architecture is still unpopular due to the incompatibility limitations across cloud providers, inter-cluster network complexity, and the higher cost associated with network traffic across providers, along with implementation and operations. However, it is worth mentioning the increase in the number of multi-cloud management solutions that are endeavoring to tackle and solve these challenges, and you may wish to consider a multi-cluster management solution such as Anthos from Google. You can learn more about it here: https://cloud.google.com/anthos.

As you may notice, Kubernetes has different architectural flavors when it comes to high availability setup, and I can say that having different choices makes Kubernetes more powerful for different use cases. The second choice is the most common one as of now, as it strikes a balance between ease of implementation and operation and the level of high availability. We are optimistically looking forward to a time when we can reach the fourth level, where we can easily deploy Kubernetes clusters across cloud providers and gain all the high availability benefits without the burden of tough operations and increased costs.

As for the cluster availability itself, I believe it goes without saying that Kubernetes components should run in a highly available mode, that is, having three or more nodes for a control plane, or preferably letting the cloud manage the control plane for you, as in EKS, AKS, or GKE. As for workers, you have to run one or more autoscaling groups or node groups/pools, and this ensures high availability.
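
The recommendation of three or more control plane nodes follows from how etcd, the cluster's data store, reaches consensus: a majority of members (a quorum) must be available for the cluster state to remain writable. The following sketch shows the arithmetic; the function names are assumptions for this example.

```python
def quorum(members):
    """Smallest majority of an etcd cluster with the given member count."""
    return members // 2 + 1


def tolerated_failures(members):
    """How many members can be lost while a quorum still remains."""
    return members - quorum(members)


if __name__ == "__main__":
    for size in (1, 2, 3, 4, 5):
        print(f"{size} member(s): quorum={quorum(size)}, "
              f"tolerated failures={tolerated_failures(size)}")
```

Running the loop shows that an even member count tolerates no more failures than the odd count just below it, which is why control planes are typically sized at three or five members.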

The other area where you need to consider achieving high availability is for the pods and workloads that you will deploy to your cluster. Although this is beyond the scope of this book, it is still worthwhile mentioning that developing new applications and services, or modernizing your existing ones so that they can run in a high availability mode, is the only way to make use of the raft of capabilities provided by the powerful Kubernetes infrastructure underneath it. Otherwise, you will end up with a very powerful cluster but with monolithic apps that can only run as a single instance!

Security and compliance

Kubernetes infrastructure security is rooted at all levels of your cluster, starting from the network layer, going through the OS level, up to cluster services and workloads. Luckily, Kubernetes has strong support for security, encryption, authentication, and authorization. We will learn about security in Chapter 6, Securing Kubernetes Effectively, of this book. However, during the design of the cluster infrastructure, you should give attention to important decisions relating to security, such as securing the Kubernetes API server endpoint, as well as the cluster network design, security groups, firewalls, and network policies between the control plane components, worker nodes, and the public internet.

You will also need to plan ahead in terms of the infrastructure components or integrations between your cluster and identity management providers. This usually depends on your organization's security policies, which you need to align with your IT and security teams.

Another aspect to consider is the auditing and compliance of your cluster. Most organizations have cloud governance policies and compliance requirements, which you need to be aware of before you proceed with deploying your production on Kubernetes.

If you decide to use a multi-tenant cluster, the security requirements could be more challenging, and setting clear boundaries among the cluster tenants, as well as cluster users from different internal teams, may result in decisions such as deploying a service mesh, hardening cluster network policies, and implementing a tougher Role-Based Access Control (RBAC) mechanism. All of this will impact your decisions while architecting the infrastructure of your first production cluster.
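
As one example of what hardening cluster network policies can look like for a tenant, the following sketch builds a default-deny NetworkPolicy manifest for a namespace and prints it as JSON, which Kubernetes accepts alongside YAML. The namespace name `team-a`, the policy name, and the helper function are hypothetical, and the policy only takes effect if the cluster's network plugin enforces network policies.

```python
import json


def default_deny_policy(namespace):
    """NetworkPolicy that selects every pod in the namespace and allows no traffic.

    An empty podSelector matches all pods; listing both policy types
    without any rules denies all ingress and egress by default.
    """
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {
            "podSelector": {},
            "policyTypes": ["Ingress", "Egress"],
        },
    }


if __name__ == "__main__":
    # "team-a" is a hypothetical tenant namespace used for illustration.
    print(json.dumps(default_deny_policy("team-a"), indent=2))
```

Starting each tenant namespace from a deny-all baseline and then adding explicit allow rules is one way to set the clear boundaries mentioned above; RBAC and a service mesh address other parts of the same problem.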

The Kubernetes community is keen on compliance and quality, and for that there are multiple tools and tests to ensure that your cluster achieves an acceptable level of security and compliance. We will learn about these tools and tests in Chapter 6, Securing Kubernetes Effectively.

Cost management and optimization

Cloud cost management is an important factor for all organizations adopting cloud technology, both for those just starting and those who are already in the cloud. Adding Kubernetes to your cloud infrastructure is expected to bring cost savings, as containerization enables you to utilize your compute resources on a scale that was not possible with VMs. Some organizations achieved cost savings of up to 90% after moving to containers and Kubernetes.

However, without proper cost control, costs can rise again, and you end up with a lot of wasted infrastructure cost with uncontrolled Kubernetes clusters. There are many tools and best practices to consider in relation to cost management, but we mainly want to focus on the actions and the technical decisions that you need to consider during infrastructure design.

We believe that there are two important aspects that require decisions, and these decisions will definitely affect your cluster infrastructure architecture:

  • Running a single, but multi-tenant, cluster versus multi clusters (that is, a single cluster per tenant)
  • The cluster capacity: whether to run a few large worker nodes, a lot of small worker nodes, or a mix of the two

There are no definitive correct decisions, but we will try to explore the choices in the next section, and how we can reach a decision.

These are other considerations to be made regarding cost optimization where an early decision can be made:

  • Using spot/preemptible instances: This has proven to achieve huge cost savings; however, it comes at a price! There is the threat of losing your workloads at any time, which affects your product uptime and reliability. Options are available for overcoming this, such as using spot instances for non-production workloads, such as development environments or CI/CD pipelines, or any production workloads that can survive a disruption, such as data batch processing.

    We highly recommend using spot instances for worker nodes, and you can run them in their node group/pool and assign to them the types of workloads where you are not concerned with them being disrupted.

  • Kubernetes cost observability: Most cloud platforms provide cost visibility and analytics for all cloud resources. However, having cost visibility at the deployment/service level of the cluster is essential, and this needs to be planned ahead: isolate workloads, teams, users, and environments by using namespaces, and assign resource quotas to them. By doing that, you will ensure that a cost reporting tool can provide you with reports relating usage to the service or cluster operations. This is essential for further decision making regarding cost reductions.
  • Kubernetes cluster management: When you run a single-tenant cluster, or one cluster per environment for development, you usually end up with tons of clusters sprawled across your account, which could lead to increased cloud cost. The solution to this situation is to set up a cluster management solution from day one. This solution could be as simple as a cluster autoscaler script that reduces the worker nodes during periods of inactivity, or it can be full automation with dashboards and a master cluster to manage the rest of the clusters.
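
The namespace and resource quota planning mentioned under cost observability can be sketched as follows. The example generates a Namespace and a matching ResourceQuota manifest as JSON; the namespace name, the `team` and `environment` labels, the quota values, and the helper functions are all hypothetical and only illustrate the shape of the objects.

```python
import json


def namespace(name, team, environment):
    """Namespace labeled so that cost reports can be grouped by team and environment."""
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {"name": name, "labels": {"team": team, "environment": environment}},
    }


def resource_quota(ns, cpu, memory, pods):
    """ResourceQuota capping the total requests and pod count in a namespace."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": f"{ns}-quota", "namespace": ns},
        "spec": {
            "hard": {
                "requests.cpu": cpu,
                "requests.memory": memory,
                "pods": str(pods),
            }
        },
    }


if __name__ == "__main__":
    # Hypothetical tenant: the payments team's staging environment.
    manifests = [
        namespace("payments-staging", team="payments", environment="staging"),
        resource_quota("payments-staging", cpu="8", memory="16Gi", pods=40),
    ]
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": manifests}, indent=2))
```

Generating these objects from one place, rather than creating namespaces by hand, keeps the labels consistent, which is what makes per-team and per-environment cost reports meaningful later on.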

In Chapter 9, Monitoring, Logging, and Observability, and Chapter 10, Operating and Maintaining Efficient Kubernetes Clusters, we will learn about cost observability and cluster operations.

Manageability and operational efficiency

Usually, when an organization starts building a Kubernetes infrastructure, they invest most of their time, effort, and focus in urgent and critical demands for infrastructure design and deployment, which we usually call Day 0 and Day 1. It is unlikely that an organization will devote its attention to operational and manageability concerns that we will face in the future (Day 2).

This is usually explained by a lack of experience with Kubernetes and its operational challenges, or by a focus on the benefits of Kubernetes that mainly relate to development, such as increasing developer productivity and agility, and automating releases and deployments.

All of this leads to organizations and teams being less prepared for Day 2. In this book, we try to maintain a balance between design, implementation, and operations, and shed some light on the important aspects of the operation and learn how to plan for it from Day 0, especially in relation to reliability, availability, security, and observability.

Operational challenges with Kubernetes

These are the common operational and manageability challenges that most teams face after deploying Kubernetes in production. This is where you need to rethink and consider solutions beforehand in order to handle these challenges properly:

  • Reliability and scaling: When your infrastructure scales up, you could end up with tens or hundreds of clusters, or clusters with hundreds or thousands of nodes, and tons of configurations for different environment types. This makes it harder to manage the SLAs/SLOs of your applications, as well as the uptime goals, and even diagnosing a cluster issue could be very problematic. Teams need to develop their Kubernetes knowledge and troubleshooting skills.
  • Observability: No doubt Kubernetes is complex, and this makes monitoring and logging must-have services once your cluster is serving production; otherwise, you will have a very tough time identifying issues and problems. Deploying monitoring and logging tools, in addition to defining the basic observability metrics and thresholds, is what you need to take care of in this regard.
  • Updateability and cluster management: Updating Kubernetes components, such as the API server, kubelet, etcd, kube-proxy, Docker images, and configuration for the cluster add-ons, becomes challenging to manage during the cluster life cycle. This requires the correct tools to be in place from the outset. Automation and IaC tools, such as Terraform, Ansible, and Helm, are commonly used to help in this regard.
  • Disaster recovery: What happens when you have a partial or complete cluster failure? What is the recovery plan? How do you mitigate this risk and decrease the mean time to recover your clusters and workloads? This requires deployment of the correct tools, and writing the playbooks for backups, recovery, and crisis management.
  • Security and governance: You need to ensure that security best practices and governance policies are applied and enforced in relation to production clusters and workloads. This becomes challenging due to the complex nature of Kubernetes and its soft isolation techniques, its agility, and the rapid pace it brings to the development and release life cycles.

There are other operational challenges. However, we found that most of these can be mitigated if we stick to the following infrastructure best practices and standards:

  • Infrastructure as Code (IaC): This is the default practice for modern infrastructure and DevOps teams. It is also a recommended approach to use declarative IaC tools and technologies over their imperative counterparts.
  • Automation: We live in the age of software automation, as we tend to automate everything; it is more efficient and easier to manage and scale, but we need to take automation with Kubernetes to another level. Kubernetes comes with the ability to automate the life cycle of containers, and it also comes with advanced automation concepts, such as operators and GitOps, which are efficient and can literally automate automations.
  • Standardization: Having a set of standards helps to reduce teams' struggles with aligning and working together, eases the scaling of the processes, improves the overall quality, and increases productivity. This becomes essential for companies and teams that are planning to use Kubernetes in production, as this involves integrating with different infrastructure parts, migrating services from on-premises to the cloud, and many further complexities.

    Defining your set of standards covers processes for operation runbooks and playbooks, as well as technology standardization: using Docker, Kubernetes, and standard tools across teams. These tools should have specific characteristics: open source but battle-tested in production, the ability to support the other principles, such as IaC, immutability, and being cloud-agnostic, and being simple to use and deploy with a minimum of infrastructure.

  • Single source of truth: Having a source of truth is a cornerstone and enabler to modern infrastructure management and configuration. Source code control systems such as Git are becoming the standard choice to store and version infrastructure code, where having a single and dedicated source code repository for infrastructure is the recommended practice to follow.

Managing Kubernetes infrastructure is about managing complexity. Hence, having a solid infrastructure design, applying best practices and standards, and increasing the team's Kubernetes-specific skills and expertise will all result in a smooth operational and manageability journey.


Exploring Kubernetes deployment strategy alternatives

Kubernetes and its ecosystem come with vast choices for everything you can do related to deploying, orchestrating, and operating your workloads. This flexibility is a huge advantage, and enables Kubernetes to suit different use cases, from regular applications on-premises and in the cloud to IoT and edge computing. However, choices come with responsibility, and in this section, we learn about the technical decisions that you need to evaluate and make regarding your cluster deployment architecture.

One of the important decisions to make is where to deploy your clusters, and how many of them you may need in order to run your containerized workloads. The answer is usually driven by both business and technical factors, such as the existing infrastructure, cloud transformation plan, cloud budget, team size, and business growth target. All of these aspects could affect the decision, and this is why the owner of the Kubernetes initiative has to collaborate with organization teams and executives to reach a common understanding of the decision drivers, and agree on the right direction for their business.

We are going to explore some of the common Kubernetes deployment architecture alternatives, with their use cases, benefits, and drawbacks:

  • Multi-availability-zones clusters: This is the mainstream architecture for deploying a high availability (HA) cluster in a public cloud. Running clusters in multiple availability zones is usually supported by all public cloud providers and, at the same time, achieves an acceptable level of HA, which drives the majority of new users of Kubernetes to opt for this choice. However, if you have essential requirements to run your workloads in different regions, this option will not be helpful.
  • Multi-region clusters: Unless you have a requirement to run your clusters in multiple regions, there is little motivation to opt for this. It is rare for a public cloud provider to lose an entire region, but if you have the budget to do a proper design and overcome the operational challenges, then you can opt for a multi-region setup. It will definitely provide you with enhanced HA and reliability levels.
  • Hybrid cloud clusters: A hybrid cloud is common practice for an organization that is migrating from on-premises to the public cloud and going through a transitional period where it has workloads or data split between its old infrastructure and the new cloud infrastructure. Hybrid could also be a permanent setup, where an organization wants to keep part of its infrastructure on-premises either for security reasons (think about sensitive data), or due to the impossibility of migrating to the cloud. Kubernetes is an enabler of the hybrid cloud model, especially with managed cluster management solutions such as Google Anthos. This nevertheless entails higher costs in terms of provision and operation.
  • Multi-cloud clusters: Unlike hybrid cloud clusters, I find multi-cloud clusters to be an uncommon pattern, as they usually lack strong drivers. You can run multiple different systems in multi-cloud clusters for a variety of reasons, but deploying a single system across two or more clouds over Kubernetes is not common, and you should be cautious before moving in this direction. However, I can understand the motivating factors behind some organizations doing this, such as avoiding cloud lock-in with a particular provider, leveraging pricing models with different providers for cost optimization, minimizing latency, or even achieving ultimate reliability for the workloads.
  • On-premises clusters: If an organization decides not to move to the cloud, Kubernetes can still manage its infrastructure on-premises. In fact, Kubernetes is a reasonable choice for managing on-prem workloads in a modern fashion; however, solid on-prem managed Kubernetes solutions are still very few.
  • Edge clusters: Kubernetes is gaining traction in edge computing and the IoT world. It provides an abstraction over the underlying hardware, it is ideal for distributed computing needs, and the massive Kubernetes ecosystem has produced multiple open source and third-party projects that fit the nature of edge computing, such as KubeEdge and K3s.
  • Local clusters: You can run Kubernetes on your local machine using tools such as Minikube or Kind (Kubernetes in Docker). The purpose of using a local cluster is for trials, learning, and for use by developers.

We have discussed the various cluster deployment architectures and models available and their use cases. In the next section, we will work on designing the Kubernetes infrastructure that we will use in this book, and the technical decisions around it.


Designing an Amazon EKS infrastructure

In this chapter, we have discussed and explored various aspects of Kubernetes cluster design, and the different architectural considerations that you need to take into account. Now, we need to put things together for the design that we will follow during this book. The decisions that we will make here are not the only right ones, but this is the preferred design that we will follow in order to have minimally acceptable production clusters for this book's practical exercises. You can definitely use the same design, but with modifications, such as cluster sizing.

In the following sections, we will explore our choices regarding the cloud provider, provisioning and configuration tools, and the overall infrastructure architecture, and in the chapters to follow, we will build upon these choices and use them to provision production-like clusters as well as deploy the configuration and services above the cluster.

Choosing the infrastructure provider

As we learned in the previous sections, there are different ways in which to deploy Kubernetes. You can deploy it locally, on-premises, or in a public cloud, private cloud, hybrid, multi-cloud, or an edge location. Each of these infrastructure types has its use cases, benefits, and drawbacks. However, the most common one is the public cloud, followed by the hybrid model. The remaining choices are still limited to specific use cases.

In a single book like ours, we cannot discuss each of these infrastructure platforms, so we decided to go with the common choice for deploying Kubernetes, by using one of the public clouds (AWS, Azure, or GCP). You still can use another cloud provider, a private cloud, or even an on-premises setup, and most of the concepts and best practices discussed in this book are still applicable.

When it comes to choosing one of the public clouds, we do not advocate one over the others, and we definitely recommend using the cloud provider that you already use for your existing infrastructure, but if you are just embarking on your cloud journey, we advise you to perform a deeper benchmarking analysis between the public clouds to see which one is better for your business.

In the practical exercises in this book, we will use AWS and the Elastic Kubernetes Service (EKS). As we explained in the previous chapter when discussing the infrastructure design principles, we always prefer a managed service over its self-managed counterpart, and this applies here when it comes to choosing between EKS and building our own self-managed clusters on AWS.

Choosing the cluster and node size

When you plan for your cluster, you need to decide both the cluster and node sizes. This decision should be based on the estimated utilization of your workloads, which you may know beforehand based on your old infrastructure, or it can be calculated approximately and then adjusted after going live in production. In either case, you will need to decide on the initial cluster and node sizes, and then keep adjusting them until you reach the correct utilization level to achieve a balance between cost and reliability. You can target a utilization level of between 70 and 80% unless you have a solid justification for using a different level.
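
As a rough starting point for the initial sizing described above, the following sketch estimates how many worker nodes are needed to host a given total of CPU requests at a target utilization level. The function name and the sample numbers are assumptions for illustration; a real estimate would also account for memory, system overhead on each node, and headroom for failures.

```python
import math


def nodes_needed(total_cpu_request, node_cpu, target_utilization=0.75):
    """Number of nodes so that the requested CPU fills each node up to the target level."""
    usable_cpu_per_node = node_cpu * target_utilization
    return math.ceil(total_cpu_request / usable_cpu_per_node)


if __name__ == "__main__":
    # Hypothetical workload: 100 CPU cores requested in total.
    for node_cpu in (4, 16, 64):
        count = nodes_needed(100, node_cpu)
        print(f"{node_cpu}-core nodes at 75% target utilization: {count}")
```

The result is only the initial estimate; as the text notes, you then keep adjusting the node count and size against the utilization you actually observe in production.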

These are the common cluster and node size choices that you can consider either individually or in a combination:

  • Few large clusters: In this setup, you deploy a few large clusters. These can be production and non-production clusters. A cluster could be large in terms of node size, node numbers, or both. Large clusters are usually easier to manage because they are few in number. They are cost efficient because you achieve higher utilization per node and cluster (assuming you are running the correct amount of workloads), and this improved utilization comes from saving the resources required for system management. On the downside, large clusters lack hard isolation for multi-tenancy, as you only use namespaces for soft isolation between tenants. They also introduce a single point of failure to your production (especially when you run a single cluster). There is another limitation: Kubernetes supports clusters of up to 5,000 nodes, and with a single cluster, you can hit this upper limit if you are running a large number of pods.
  • Many small clusters: In this setup, you deploy a lot of small clusters. These could be small in terms of node size, node numbers, or both. Small clusters are good when it comes to security as they provide hard isolation between resources and tenants and also provide strong access control for organizations with multiple teams and departments. They also reduce the blast radius of failures and avoid having a single point of failure. On the downside, small clusters come with an operational overhead, as you need to manage a fleet of clusters. They are also inefficient in terms of resource usage, as you cannot achieve the utilization levels that you can achieve with large clusters, in addition to increasing costs, as they require more control plane resources to manage a fleet of small clusters that manage the same total number of worker nodes in a large cluster.
  • Large nodes: This is about the size of the nodes in a cluster. When you deploy large nodes in your cluster, you will have better and higher utilization of the node (assuming you deploy workloads that utilize 70-80% of the node). This is because a large node can handle application spikes, and it can handle applications with high CPU/memory requirements. In addition to that, a well utilized large node usually entails cost savings as it reduces the overall cluster resources required for system management and you can purchase such nodes at discounted prices from your cloud provider. On the downside, large nodes can introduce a high blast radius of failures, thereby affecting the reliability of both the cluster and apps. Also, adding a new large node to the cluster during an upscaling event will add a lot of cost that you may not need, so if your cluster is hit by variable scaling events over a short period, large nodes will be the wrong choice. Added to this is the fact that Kubernetes has an upper limit in terms of the number of pods that can run on a single node regardless of its type and size, and for a large node, this limitation could lead to underutilization.
  • Small nodes: This is about the size of the nodes per single cluster. When you deploy small nodes in your cluster, you can reduce the blast radius during failures, and also reduce costs during upscaling events. On the downside, small nodes tend to be underutilized and cannot handle applications with high resource requirements. The total amount of system resources required to manage these nodes (kubelet, etcd, kube-proxy, and so on) is also higher than for the same compute power delivered by larger nodes, and small nodes have a lower limit for pods per node.
  • Centralized versus decentralized clusters: Organizations usually use one of these approaches in managing their Kubernetes clusters.

    In a decentralized approach, the teams or individuals within an organization are allowed to create and manage their own Kubernetes clusters. This approach provides flexibility for the teams to get the best out of their clusters, and customize them to fit their use cases; on the other hand, it increases operational overhead and cloud cost, and makes it difficult to enforce standardization, security, best practices, and tools across the clusters. This approach is more appropriate for organizations that are highly decentralized, or when they are going through cloud transformation, product life cycle transitional periods, or exploring and innovating new technologies and solutions.

    In a centralized approach, the teams or individuals share a single cluster or small group of identical clusters that use a similar set of standards, configurations, and services. This approach reduces the drawbacks of the decentralized model; however, it can be inflexible, slow down cloud transformations, and decrease teams' agility. This approach is more suitable for organizations working towards maturity, platform stability, reducing cloud cost, enforcing and promoting standards and best practices, and focusing on products rather than the underlying platform.

Some organizations run a hybrid model that combines the aforementioned alternatives, such as having large, medium, and small nodes to get the best of each type according to their apps' needs. However, we recommend that you run experiments to decide which model suits your workload's performance and meets your cloud cost reduction goal.
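
The node-size trade-offs above can be explored with a small back-of-the-envelope model. The following Python sketch is only an illustration: the node types, the reserved cores per node, the per-node pod caps, and the workload (600 pods at 0.25 cores each) are all hypothetical values chosen for the example, not figures from a cloud provider or from Kubernetes itself.

```python
import math
from dataclasses import dataclass


@dataclass
class NodeType:
    name: str
    cpu_cores: float
    reserved_cores: float  # assumed per-node overhead for system components
    max_pods: int          # assumed per-node pod cap


def plan(node: NodeType, pod_count: int, cores_per_pod: float) -> dict:
    """Estimate node count, overhead, and blast radius for one node type."""
    allocatable = node.cpu_cores - node.reserved_cores
    pods_by_cpu = math.floor(allocatable / cores_per_pod)
    # A node holds whichever is smaller: what CPU allows or the pod cap.
    pods_per_node = min(pods_by_cpu, node.max_pods)
    if pods_per_node < 1:
        raise ValueError(f"{node.name} cannot fit a single pod")
    nodes = math.ceil(pod_count / pods_per_node)
    return {
        "nodes": nodes,
        "pods_per_node": pods_per_node,
        "system_overhead_cores": nodes * node.reserved_cores,
        # Share of allocatable CPU actually used on a full node.
        "cpu_utilization": pods_per_node * cores_per_pod / allocatable,
        # Share of all pods lost if one fully packed node fails.
        "blast_radius": pods_per_node / pod_count,
    }


# Hypothetical node types and workload.
small = NodeType("small", cpu_cores=4, reserved_cores=0.5, max_pods=30)
large = NodeType("large", cpu_cores=64, reserved_cores=1.0, max_pods=110)

small_plan = plan(small, pod_count=600, cores_per_pod=0.25)
large_plan = plan(large, pod_count=600, cores_per_pod=0.25)
for name, result in (("small", small_plan), ("large", large_plan)):
    print(name, result)
```

With these assumed inputs, the small nodes spend more total CPU on system overhead, while the large nodes hit the pod cap before they run out of CPU and lose a larger share of pods when a single node fails. Changing the inputs to match your own measurements is the kind of experiment recommended above.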

Choosing tools for cluster deployment and management

In the early days of Kubernetes, we used to deploy it from scratch, which was commonly called Kubernetes the Hard Way. Since then, the Kubernetes community has grown and a lot of tools have emerged to automate the deployment. These tools range from simple automation to complete one-click deployment.

In this book, we are not going to explain each of the tools on the market (there are a lot), nor compare and benchmark them. However, we will propose our choices, with brief reasoning behind them.

Infrastructure provisioning

When you deploy Kubernetes for the first time, you will most likely use a command-line tool with a single command to provision the cluster, or you may use a cloud provider web console to do that. Either way, this approach is suitable for experimental and learning purposes, but when it comes to real implementation across production and development environments, a provisioning tool becomes a must.

The majority of organizations that consider deploying Kubernetes already have an existing cloud infrastructure or are going through a cloud migration process. This means that Kubernetes is not the only piece of the cloud infrastructure that they will use. This is why we prefer a provisioning tool that achieves the following:

  • It can be used to provision Kubernetes as well as other pieces of infrastructure (databases, file stores, API gateways, serverless, monitoring, logging, and so on).
  • It fulfills and empowers the IaC principles.
  • It is a cloud-agnostic tool.
  • It has been battle-tested in production by other companies and teams.
  • It has community support and active development.

We can find these characteristics in Terraform, and this is why we chose to use it in the production clusters that we managed, as well as in the practical exercises in this book. We highly recommend Terraform for you as well, but if you prefer another provisioning tool, you can skip this chapter and then continue reading this book and apply the same concepts and best practices.

Configuration management

Kubernetes configuration is declarative by nature, and after deploying a cluster, we need to manage its configuration along with that of the add-ons deployed on it. These add-ons provide services for various areas of functionality, including networking, security, monitoring, and logging. This is why a solid and versatile configuration management tool is required in your toolset.

The following are solid choices:

  • Regular configuration management tools, such as Ansible, Chef, and Puppet
  • Kubernetes-specific tools, such as Helm and Kustomize
  • Terraform

Our preferred order of suitable tools is as follows:

  1. Ansible
  2. Helm
  3. Terraform

We can debate this order, and we believe that any of these tools can fulfill the configuration management needs for Kubernetes clusters. However, we prefer to use Ansible for its versatility and flexibility, as it can be used for Kubernetes and also for the other configuration management needs of your environment, which makes it preferable over Helm. We also prefer Ansible over Terraform because Terraform is a provisioning tool at heart, and while it can handle configuration management, it is not the best tool for that.

In the hands-on exercises in this book, we decided to use Ansible with its Kubernetes module and Jinja2 templates.
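
To give a feel for what template-driven configuration means, here is a minimal Python sketch that renders one Kubernetes manifest per environment from a single template. It deliberately uses the standard library's string.Template as a stand-in, not Jinja2 or Ansible, so the placeholder syntax differs from the templates used in the exercises. The namespace name and the environment labels are hypothetical.

```python
from string import Template

# A parameterized Namespace manifest; $name and $environment are placeholders.
NAMESPACE_TEMPLATE = Template("""\
apiVersion: v1
kind: Namespace
metadata:
  name: $name
  labels:
    environment: $environment
""")


def render_namespace(name: str, environment: str) -> str:
    """Fill the template; substitute() raises KeyError if a value is missing."""
    return NAMESPACE_TEMPLATE.substitute(name=name, environment=environment)


# Hypothetical example: the same template rendered for two environments.
for env in ("production", "non-production"):
    print(render_namespace("monitoring", env))
```

The design point is the same as with Jinja2 templates: the manifest structure is written once, and only the values that differ between clusters are supplied at render time.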

Deciding the cluster architecture

Each organization has its own way of managing cloud accounts. However, we recommend having at least two AWS accounts, one for production and another for non-production. The production Kubernetes cluster resides in the production account, and the non-production Kubernetes cluster resides in the non-production account. This structure is preferred for security, reliability, and operational efficiency.

Based on the technical decisions and choices that we made in the previous sections, we propose the following AWS architecture for the Kubernetes clusters that we will use in this book, which you can also use to deploy your own production and non-production clusters:

Figure 2.1 – Cluster architecture diagram

In the previous architecture diagram, we decided to do the following:

  • Create a separate VPC for the cluster network; we chose a Classless Inter-Domain Routing (CIDR) range that has sufficient IPv4 addressing capacity for future scaling. Each Kubernetes node, pod, and service will have its own IP address, and we should keep in mind that the number of services will increase.
  • Create public and private subnets. The publicly accessible resources, such as load balancers and bastions, are placed in the public subnets, and the privately accessible resources, such as Kubernetes nodes, databases, and caches, are placed in the private subnets.
  • For high availability, we create the resources in three different availability zones. We place one private and one public subnet in each availability zone.
  • For scaling, we run multiple EKS node groups.
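
The subnet layout described in this list can be sketched with Python's standard ipaddress module. The VPC range 10.0.0.0/16, the /19 subnet size, and the zone labels below are assumptions made purely for illustration; they are not the values used in this book's architecture.

```python
import ipaddress

VPC_CIDR = "10.0.0.0/16"        # hypothetical VPC range
ZONES = ["az-a", "az-b", "az-c"]  # placeholder availability zone labels


def plan_subnets(vpc_cidr: str, zones: list, new_prefix: int) -> list:
    """Carve one private and one public subnet per zone out of the VPC range."""
    vpc = ipaddress.ip_network(vpc_cidr)
    available = 2 ** (new_prefix - vpc.prefixlen)
    if 2 * len(zones) > available:
        raise ValueError("VPC range is too small for the requested subnets")
    blocks = vpc.subnets(new_prefix=new_prefix)  # non-overlapping blocks, in order
    plan = []
    for tier in ("private", "public"):
        for zone in zones:
            plan.append({"tier": tier, "zone": zone, "cidr": next(blocks)})
    return plan


subnet_plan = plan_subnets(VPC_CIDR, ZONES, new_prefix=19)
for subnet in subnet_plan:
    # num_addresses is the raw block size, before any provider-reserved addresses.
    print(subnet["tier"], subnet["zone"], subnet["cidr"], subnet["cidr"].num_addresses)
```

Because nodes, pods, and services all consume addresses, running a calculation like this against your expected pod count is a quick way to check whether a candidate range leaves room for growth. In practice, private subnets are often sized larger than public ones, which this equal-sized sketch does not model.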

We will discuss the details of these design specs in the next chapters, in addition to the remainder of the technical aspects of the cluster's architecture.

Summary

Provisioning a Kubernetes cluster can be a task that takes 5 minutes with modern tools and managed cloud services; however, this is far from a production-grade Kubernetes infrastructure and it is only sufficient for education and trials. Building a production-grade Kubernetes cluster requires hard work in designing and architecting the underlying infrastructure, the cluster, and the core services running above it.

By now, you have learned about the different aspects and challenges you have to consider while designing, building, and operating your Kubernetes clusters. We explored the different architecture alternatives to deploy Kubernetes clusters, and the important technical decisions associated with this process. Then, we discussed the proposed cluster design, which we will use throughout the book for the practical exercises, and we highlighted our selection of infrastructure platform, tools, and architecture.

In the next chapter, we will see how to put everything together and use the design concepts we discussed in this chapter to write IaC and follow industry best practices with Terraform to provision our first Kubernetes cluster.


About the Authors

  • Aly Saleh

    Aly Saleh is a technology entrepreneur, cloud transformation leader, and architect. He has worked for the past 2 decades on building large-scale software solutions and cloud-based platforms and services that are used by millions of users. He is a co-founder of MAVS Cloud, a start-up that empowers organizations to leverage the power of the cloud. He also played various technical roles at Oracle, Vodafone, FreshBooks, Aurea Software, and Ceros. Aly holds degrees in computer science, and he has gained multiple credentials in AWS, GCP, and Kubernetes, with a focus on building cloud platforms, app modernization, containerization, and architecting distributed systems. He is an advocate for cloud best practices, remote work, and globally distributed teams.
  • Murat Karslioglu

    Murat Karslioglu is a distinguished technologist with years of experience using infrastructure tools and technologies. Murat is currently the VP of products at MayaData, a start-up that builds a data agility platform for stateful applications, and a maintainer of open source projects, namely OpenEBS and Litmus. In his free time, Murat is busy writing practical articles about DevOps best practices, CI/CD, Kubernetes, and running stateful applications on popular Kubernetes platforms on his blog, Containerized Me. Murat also runs a cloud-native news curator site, The Containerized Today, where he regularly publishes updates on the Kubernetes ecosystem.