In the past few years, the software world has evolved at a very high pace. One of my favorite examples of this evolution is FinTech, a new field whose name comes from the fusion of finance and technology. In this field, companies tend to build financial products in a disruptive way, to the point that they are threatening the big traditional banks and putting them in jeopardy.
This happens mainly because big companies lose the ability to be cost-effective in their IT systems, and banks are fairly big companies. It is not unusual for banks to still run their systems on an IBM mainframe and be reluctant to move to the cloud, nor is it unusual for the core components of a bank to still be COBOL applications that haven't been renewed since the 90s. This wouldn't be a problem were it not for the fact that a small number of talented engineers with an AWS or Google Cloud Platform account can build a service that could virtually replace some bank products, such as currency exchange or even a broker.
This has become the norm in the last few years, and the success of small companies in FinTech is due partly to DevOps and partly to their scale. Usually, big companies commoditize their IT systems over time, outsourcing them to third parties that compete on price, pushing quality aside. This is a very effective cost-cutting measure, but it has a downside: you lose the ability to deliver value quickly.
In this chapter, we are going to put DevOps into perspective and see how it can help us create cost-effective work units that can deliver a lot of value in a very short period of time.
There is a famous quote attributed to Henry Ford, the founder of Ford (the popular car-maker brand):
“If I had asked people what they wanted, they would have said faster horses.”
This is what happened with the traditional system administrator role: people were trying to solve the wrong problem.
By this, I mean that the underlying problem was the lack of proper tools to automate interventions in production systems. Without them, there is plenty of room for human error (which is more common than you may think), and the continuity of communication across the processes of your company breaks down.
Initially, DevOps was the intersection of development and operations as well as QA. The DevOps engineer is supposed to do everything and be totally involved in the SDLC (software development life cycle), solving the communication problems that are present in the traditional release management. This is ideal and, in my opinion, is what a full stack engineer should do: end-to-end software development, from requirement capture to deployments and maintenance.
Nowadays, this definition has been bent to the point where a DevOps engineer is basically a systems engineer using a set of tools to automate the infrastructure of a company. There is nothing wrong with this definition of DevOps, but keep in mind that we are losing a very competitive advantage: the end-to-end view of the system. In general, I would not call this actor a DevOps engineer but a site reliability engineer (SRE). This term was introduced by Google a few years back, as sometimes (particularly in big companies) it is not possible to provide a single engineer with the level of access required to execute DevOps. We will talk more about this role in the next section, SRE model.
In my opinion, DevOps is a philosophy more than a set of tools or a procedure: having your engineers exposed to the full life cycle of your product requires a lot of discipline but gives you an enormous amount of control over what is being built. If the engineers understand the problem, they will solve it; it is what they are good at.
In the last few years, we have gone through a revolution in IT: it has spread from pure IT companies to all sectors: retail, banking, finance, and so on. This has led to a number of small companies called start-ups, which are basically a number of individuals who had an idea, executed it, and went to market in order to sell the product or service (usually to a global market). Companies such as Amazon or Alibaba, not to mention Google, Apple, Stripe or even Spotify, have gone from the garage of one of the owners to big companies employing thousands of people.
One thing that the initial spark of these companies has always had in common is corporate inefficiency: the bigger the company, the longer it takes to complete simple tasks.
Example of corporate inefficiency graph
This phenomenon creates a market of its own, with a demand that cannot be satisfied with traditional products. In order to provide a more agile service, these start-ups need to be cost-effective. It is okay for a big bank to spend millions on its currency exchange platform, but if you are a small company making your way through, your only possibility against a big bank is to cut costs through automation and better processes. This is a big driver for small companies to adopt better ways of doing things, as every day that passes is one day closer to running out of cash, but there is a bigger driver for adopting DevOps tools: failure.
Failure is a natural factor for the development of any system. No matter how much effort we put in, failure is always there, and at some point, it is going to happen.
Usually, companies are quite focused on removing failure, but there is an unwritten rule that keeps them from succeeding, the 80-20 rule:
- It takes 20% of time to achieve 80% of your goals. The remaining 20% will take 80% of your time.
Spending a huge amount of time on avoiding failure is bound to fail, but luckily, there is another solution: quick recovery.
Up until now, in my work experience, I have only seen one company asking "what can we do if this fails at 4 A.M.?" instead of "what else can we do to prevent this system from failing?", and believe me, it is a lot easier (especially with modern tools) to create a recovery system than to make sure that our systems won't go down.
All these events (automation and failure management) led to the development of modern automation tools that enabled our engineers to:
- Automate infrastructure and software
- Recover from errors quickly
DevOps fits perfectly into the small company world (start-ups): some individuals that can access everything and execute the commands that they need to make the changes in the system quickly. Within these ecosystems is where DevOps shines.
This level of access in traditional development models in big companies is a no-go. It can be an impediment even at a legal level if your system is dealing with highly confidential data, where you need to get your employees security clearance from the government in order to grant them access to the data.
It can also be convenient for the company to keep a traditional development team that delivers products to a group of engineers that runs it but works closely with the developers so that the communication is not an issue.
SREs also use DevOps tools, but they usually focus more on building and running a middleware cluster (Kubernetes, Docker Swarm, and so on) that provides uniformity and a common language, so that the developers are abstracted from the infrastructure. The developers don't even need to know on which hardware the cluster is deployed; they just need to create the descriptors for the applications that they will deploy in the cluster, in an access-controlled and automated manner that ensures the security policies are followed.
SRE is a discipline on its own, and Google has published a free ebook about it, which can be found at https://landing.google.com/sre/book.html.
I would recommend that you read it as it is a fairly interesting point of view.
Through the years, companies have pushed the development of their IT systems out of their core business processes: the business of a retail shop was retail, not software. But reality has kicked in very quickly with companies such as Amazon or Alibaba, which can partially attribute their success to keeping their IT systems at the core of the business.
A few years ago, companies used to outsource their entire IT systems, trying to push the complexity away from the main business in the same way that companies outsource the maintenance of their offices. This was successful for quite a long time, as release cycles were long enough (a couple of times a year) to allow a complex chain of change management to be articulated: a release was a big bang style event where everything was measured to the millimeter, with little to no tolerance for failure.
Usually, the life cycle for such projects is very similar to what is shown in the following diagram:
This model is traditionally known as waterfall (you can see its shape), and it is borrowed from traditional industrial pipelines where things happen in very well-defined order and stages. In the very beginning of the software industry, engineers tried to retrofit the practices from the traditional industry to software, which, while a good idea, has some drawbacks:
- Old problems are brought to a new field
- The advantages of software being intangible are negated
With waterfall, we have a big problem: nothing moves quickly. No matter how much effort is put into the process, it is designed for enormous software components that are released a few times a year or even once a year. If you try to apply this model to smaller software components, it is going to fail due to the number of actors involved in it. It is more than likely that the person who captures the requirements won't be involved in the development of the application and, for sure, won't know anything about the deployment.
I remember that when I was a kid, we used to play a game called the crazy phone. Someone would make up a story with plenty of details and write it down on paper. This person read the story to another person, who had to capture as much as possible and do the same to the next person, up until we reached the end of the number of people playing this game. After four people, it was almost guaranteed that the story wouldn't look anywhere close to the initial one, but there was a more worrying detail: after the first person, the story would never be the same. Details would be removed and invented, but things would surely be different.
This exact game is what we are replicating in the waterfall model: the people working on the requirements create a story that is told to the developers, who create another story that is told to QA, so that QA can test that the software product delivered matches a story that has passed through at least two pairs of hands before reaching them.
As you can see, this is bound to be a disaster. But hold on, what can we do to fix it? If we look at traditional industry, we'll see that they never get their designs wrong or, at least, the error rate is very small. The reason for that (in my opinion) is that they are building tangible things, such as a car or a nuclear reactor, which can easily be inspected and, believe it or not, are usually simpler than a software project. If you drive a car, after a few minutes, you will be able to spot problems with the engine, but if you start using a new version of some software, it might take a few years to spot security problems or even functional problems.
In software, we tried to ease this problem by creating very concise and complex diagrams using Unified Modeling Language (UML) so that we capture the single source of truth and we can always go back to it to solve problems or validate our artifacts. Even though this is a better approach, it is not exempt from problems:
- Some details are hard to capture in diagrams
- Business stakeholders do not understand UML
- Creating diagrams requires time
In particular, the fact that the business stakeholders do not understand UML is the big problem here. After the requirements have been captured, changing them or even raising questions from lower levels (development, operations, and so on) requires involving several people, and at least one of them (the business stakeholder) does not understand the language in which the requirements were captured. This wouldn't be a problem if the project requirements were spot on from the first iteration, but in how many projects have you been involved where the requirements were static? The answer is none.
Once we have made it clear that we have a communication problem, bugs are expected to arise during our process. Either a misalignment with the requirements or even the requirements being wrong usually leads to a defect that could prevent us from deploying the application to production and delay everything.
In waterfall, fixing a bug becomes increasingly expensive with every step we take. For example, fixing a bug in the requirements phase is very straightforward: just update the diagrams/documentation, and we are done. If the same bug is captured by a QA engineer in the verification phase, we need to:
- Update the documents/diagrams
- Create a new version of the application
- Deploy the new version to the QA environment
If the bug is caught in production, you can imagine how many steps are involved in fixing it, not to mention the stress, particularly if the bug compromises the revenue of your company.
A few years ago, I used to work in a company where the production rollout steps were written in a Microsoft Word document, command by command, along with an explanation:
- Copy this file there:
cp a.tar b.tar
- Restart the server xyz with the command:
sudo service my-server restart
This was in addition to a long list of actions to take to release a new version. This happened because it was a fairly big company that had commoditized its IT department, and even though their business was based on an IT product, they did not embed IT in the core of their business.
As you can see, this is a very risky situation. Even though the developer who created the version and the deployment document was there, someone was deploying a new WAR (a Java web application packed in a file) in a production machine, following the instructions blindly. I remember asking one day: if this guy is executing the commands without questioning them, why don’t we just write a script that we run in production? It was too risky, they said.
They were right about it: risk is something that we want to reduce when deploying a new version of the software that is being used by some hundred thousand people on a single day. In fairness, risk is what pushed us to do the deployment at 4 A.M. instead of doing it during business hours.
The problem I see with this is that the way we mitigated the risk (deploying at 4 A.M., when no one is buying our product) creates what we call, in IT, a single point of failure: the deployment is an all-or-nothing event that is massively constrained by time, as at 8 A.M. the traffic in the app usually went from two visits per hour to thousands per minute, with the period around 9 A.M. being the busiest of the day.
That said, there were two possible outcomes from the rollout: either the new software gets deployed or not. This causes stress to the people involved, and the last thing you want to have is stressed people playing with the systems of a multi-million business.
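To make the idea of scripting the rollout more concrete, here is a minimal sketch in Python of what such a script could look like. It only encodes the two commands quoted from the Word document; the names `ROLLOUT_STEPS` and `run_rollout` and the dry-run flag are assumptions made for illustration, not what that company ended up using.

```python
import shlex
import subprocess

# The two commands quoted from the rollout document, in order.
ROLLOUT_STEPS = [
    ("Copy the file", "cp a.tar b.tar"),
    ("Restart the server", "sudo service my-server restart"),
]


def run_rollout(steps, dry_run=True):
    """Run each step in order and stop at the first failure.

    With dry_run=True (the default) nothing is executed; the commands
    are only printed and collected, so the script can be reviewed safely.
    Returns the list of commands that were executed (or would be).
    """
    executed = []
    for description, command in steps:
        print(f"{description}: {command}")
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit
            # code, so later steps never run after a failed one.
            subprocess.run(shlex.split(command), check=True)
        executed.append(command)
    return executed


if __name__ == "__main__":
    run_rollout(ROLLOUT_STEPS, dry_run=True)
```

The point of the sketch is that the steps live in one reviewable place and the script refuses to continue after a failed step, which is exactly what a person following a document at 4 A.M. cannot guarantee.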
Let’s take a look at the maths behind a manual deployment, such as the one from earlier:
- Remove the old version of the app (the WAR file)
- Copy the new version of the app (the WAR file)
- Update properties in configuration files
This describes the steps involved in releasing a new version of the software in a single machine. The full company system had a few machines, so the process would have to be repeated a number of times, but let's keep it simple; assume that we are only rolling out to a single server.
Now a simple question: what is the overall failure rate in the process?
We naturally tend to think that the probability of a failure in a chained process such as the preceding list of instructions is that of the riskiest step in the chain: 5%. That is not true. In fact, it is a very dangerous cognitive bias: we usually make very risky decisions due to a false perception of low risk.
Let's use the math to calculate the real probability of success and, from it, the probability of failure:
The preceding list is a chain of steps: we cannot execute step number 6 if step 4 failed, so the rollout succeeds only if every single step succeeds. Assuming that the steps fail independently of one another, the probability of overall success, P(T), is the product of the probabilities of success of the individual steps, P(Ai):
P(T) = P(A1) * P(A2) * … * P(An)
This leads to the following calculation:
P(T) = (99.5/100) * (99.5/100) * (98/100) * (98/100) * (95/100) * (95/100) * (99.5/100) = 0.8538
We are going to be successful only 85.38% of the time. Translated to deployments, this means that we are going to have problems roughly 1 out of 7 times that we wake up at 4 A.M. to release a new version of our application, but there is a bigger problem: what if we have a bug that no one noticed during the production testing that happened just after the release? The answer to this question is simple and painful: the company would need to take down the full system to roll back to a previous version, which could lead to loss of revenue and customers.
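The calculation can be reproduced with a few lines of Python. This is only a sketch: the success rates are the ones used in the calculation, and, like the formula, it assumes that the steps fail independently of each other. The function name `overall_success` is made up for illustration.

```python
import math

# Per-step success rates taken from the calculation (in percent).
STEP_SUCCESS_RATES = [99.5, 99.5, 98, 98, 95, 95, 99.5]


def overall_success(rates_percent):
    """Probability that every step succeeds, assuming independent steps."""
    return math.prod(rate / 100 for rate in rates_percent)


p_success = overall_success(STEP_SUCCESS_RATES)
print(f"Success: {p_success:.2%}")      # Success: 85.38%
print(f"Failure: {1 - p_success:.2%}")  # Failure: 14.62%
```

Note that the riskiest single step fails only 5% of the time, yet the chain as a whole fails almost three times as often.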
A few years ago, when I was in the middle of a manual deployment at 4 A.M., I remember asking myself "there has to be a better way". Tools were not mature enough, and the majority of the companies did not consider IT the core of their business. Then, a change happened: DevOps tools started to do well in the open source community and companies started to create continuous delivery pipelines. Some of them were successful, but a big majority of them failed for two reasons:
- Release management process
- Failure in the organizational alignment
We will talk about organizational alignment later on in this chapter. For now, we are going to focus on the release management process as it needs to be completely different from the traditional release management in order to facilitate the software life cycle.
In the preceding section, we talked about different phases:
We also explained how it works well with gigantic software where we group features into big releases that get executed in a big bang style with all or nothing deployments.
The first attempt to fit this process into smaller software components is what everyone calls agile, although no one really knew what it was.
In traditional release management, one of the big problems was communication: chains of people passing on messages and information, as we've seen, never end well.
Agile encourages shorter communication chains: the stakeholders are supposed to be involved in the software development process, from the definition of requirements to the verification (testing) of the same software. This has an enormous advantage: teams never build features that are not required. If deadlines need to be met, the engineering team sizes down the final product, sacrificing functionality but not quality.
Deliver early and deliver often is the mantra of agile, which basically means defining a Minimum Viable Product (MVP) and delivering it as soon as it is ready in order to deliver value to the customers of your application, and then delivering new features as required. With this method, we are delivering value from the first release and getting feedback very early on in the product life.
In order to articulate this way of working, a new concept was introduced: the sprint. A sprint is a period of time (usually 2 weeks) with a set of functionalities that are supposed to be delivered at the end of it into production so that we achieve different effects:
- Customers are able to get value very often
- Feedback reaches the development team every 2 weeks so that corrective actions can be carried out
- The team becomes predictable and savvy with estimates
This last point is very important: if our estimates are off by 10% in a quarterly release (around 13 weeks), we are off by more than a week, whereas in a two-week sprint, we are off by only 1 day. Over time, with the knowledge gained sprint after sprint, the team is able to adjust its estimates, as it builds a database of features and the time spent on them against which new features can be compared.
These features aren't called features. They are called stories. A story is, by definition, a well-defined functionality with all the info for the development team captured before the sprint starts, so once we start the development of the sprint, developers can focus on technical activities instead of focusing on resolving unknowns in these features.
Not all the stories have the same size, so we need a measurement unit: the story point. Usually, story points do not relate to a time frame but to the complexity of the story. This allows the team to calculate how many story points can be delivered at the end of the sprint, so with time, they get better at the estimates and everybody gets their expectations satisfied.
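As a rough illustration of how a team can turn its history into a forecast, the following sketch averages the story points completed in recent sprints (a figure often called velocity in Scrum) and uses it to estimate how many sprints a backlog needs. The numbers and function names are hypothetical and chosen only for the example.

```python
import math

# Hypothetical story points completed in the last five sprints.
completed_points = [21, 18, 24, 20, 22]


def velocity(history, window=3):
    """Average story points delivered over the most recent sprints."""
    recent = history[-window:]
    return sum(recent) / len(recent)


def sprints_needed(backlog_points, history, window=3):
    """Estimate how many whole sprints a backlog will take."""
    # A partially used sprint still counts as a full sprint.
    return math.ceil(backlog_points / velocity(history, window))


print(velocity(completed_points))             # 22.0
print(sprints_needed(100, completed_points))  # 5
```

Using only the most recent sprints is a design choice: it lets the forecast follow the team as it gets better (or as its composition changes) instead of being anchored to old data.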
At the end of every sprint, the team is supposed to release the features developed, tested, and integrated into production in order to move to the next sprint.
The content of the sprints is selected from a backlog that the team is also maintaining and preparing as they go.
The main goal is to meet everyone's expectations by keeping communication open and being able to predict what is being delivered, when, and what is needed for it.
There are several ways of implementing the agile methodologies in our software product. The one explained earlier is called Scrum, but if you look into other development methodologies, you'll see that they all focus on the same concept: improving the communication across different actors of the same team.
If you are interested in Scrum, there is more info at https://en.wikipedia.org/wiki/Scrum_(software_development).
As explained earlier, if we follow the Scrum methodology, we are supposed to deliver a new version every 2 weeks (the duration of a sprint in the majority of the cases), which has a dramatic impact on the resources consumed. Let's do the maths: quarter versus bi-weekly releases:
- In quarter releases, we release only four times a year in addition to emergency releases to fix problems found in production.
- In bi-weekly releases, we release once every 2 weeks in addition to emergency releases. This means 26 releases a year (52 weeks roughly) in addition to emergency releases.
For the sake of simplicity, let's ignore the emergency releases and focus on business as usual in our application. Let's assume that it takes us 10 hours to prepare and release our software:
- Quarter releases: 10 x 4 = 40 hours a year
- Bi-weekly releases: 10 x 26 = 260 hours a year
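The same arithmetic can be written as a small Python sketch, which makes it easy to try other release frequencies. The 10 hours per release is the assumption stated in the text; the weekly figure is an extrapolation added here for comparison, and the function name is made up.

```python
HOURS_PER_RELEASE = 10  # assumption stated in the text
WEEKS_PER_YEAR = 52


def yearly_release_hours(releases_per_year, hours_per_release=HOURS_PER_RELEASE):
    """Hours spent per year preparing and performing releases."""
    return releases_per_year * hours_per_release


quarterly = yearly_release_hours(4)
biweekly = yearly_release_hours(WEEKS_PER_YEAR // 2)
weekly = yearly_release_hours(WEEKS_PER_YEAR)  # extrapolation, not from the text

print(quarterly, biweekly, weekly)  # 40 260 520
```

As long as the cost per release stays constant, the yearly cost grows linearly with the release frequency.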
As of now, releasing software is always the same activity, no matter whether we do it every quarter or every day. The implication is the same (roughly), so we have a big problem: our bi-weekly release is consuming a lot of time and it gets worse if we need to release fixes for problems that have been overlooked in QA.
There is only one solution for this: automation. As mentioned earlier, up until 2 years ago (around 2015) the tools to orchestrate automatic deployments weren't mature enough. Bash scripts were common but weren't ideal as bash is not designed to alter the state of production servers.
The first few tools to automate deployments were frameworks to manage the state of servers: Capistrano and Fabric wrapped SSH access and state management in a set of commands in Ruby and Python, respectively, which allowed developers to create scripts that, depending on the state of the servers, executed different steps to achieve a goal: deploying a new version.
These frameworks were a good step forward, but there was a bigger problem with them: different companies usually solve the same problem in different ways, which implies that DevOps (developers + ops) engineers need to learn how it is handled in every single company.
The real change came with Docker and orchestration platforms, such as Kubernetes or Docker Swarm. In this book, we will look at how to use them, particularly Kubernetes, to reduce the deployment time from 10 hours (or hours in general) to a simple click, so our 260 hours a year become a few minutes for every release.
This also has a side-effect, which is related to what we explained earlier in this chapter: from a very risky release (remember, 85.38% of success) with a lot of stress, we are moving toward a release that can be patched in minutes, so releasing a bug, even though it is bad, has a reduced impact due to the fact that we can fix it within minutes or even roll back within seconds. We will look at how to do this in Chapter 8, Release Management – Continuous Delivery.
Once we are aligned with these practices, we can even release individual items to production: once a feature is ready, if the deployment is automated and it gets reduced to a single click, why not just roll out the stories as they are completed?
Microservices are a big trend nowadays: small software components that allow companies to manage their systems on vertical slices of functionality, deploying features individually instead of bundling them in a big application, which can be problematic in big teams as the interaction across functionalities often leads to collisions and bugs being released into production without anyone noticing.
An example of quite a successful company using microservices is Spotify. Not only at the technical level but at the business level, they have organized things to be able to orchestrate a large number of services to provide a top class music streaming service that pretty much never fails, and if it does, it is a partial failure:
- Playlists are managed by a microservice; therefore, if it goes down, only playlists are unavailable.
- If the recommendations are not working, the users usually don't even notice it.
This comes at a huge cost: operational overhead. Splitting an application into many requires a proportional amount of operations to keep it running, which can be exponential if it is not handled well. Let's look at an example:
- Our system is composed of five applications: A, B, C, D, and E.
- Each of them is a microservice that is deployed individually and requires around 5 hours a month of operations (deployments, capacity planning, maintenance, and so on)
If we bundle all five applications together into a single big application, our maintenance cost goes down drastically to pretty much the same as any of the previously mentioned microservices. The numbers are clear:
- 25 hours a month for a microservices-based system
- 5 hours a month for a monolithic application
This leads to a problem: if our system grows to hundreds (yes, hundreds) of microservices, the situation becomes hard to manage, as it consumes all our time.
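A quick sketch in Python shows how the manual operations cost scales with the number of services. It uses the 5 hours a month per service from the example; the function name and the 100-service case are illustrative assumptions.

```python
HOURS_PER_SERVICE = 5  # monthly operations per microservice, from the example


def monthly_ops_hours(services, hours_per_service=HOURS_PER_SERVICE):
    """Monthly operations hours when every service is operated by hand."""
    return services * hours_per_service


for count in (1, 5, 100):
    print(count, monthly_ops_hours(count))
# 1 5
# 5 25
# 100 500
```

Assuming about 160 working hours per person per month, 500 hours is roughly the full capacity of three engineers spent on operations alone.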
The only solution to this is automation. There will always be an operational overhead, but with automation, instead of adding 5 hours a month per service, the cost of each new service shrinks over time: once we have automated our interventions, pretty much no time is consumed by new services, as everything happens as a chain of events.
In Chapter 8, Release Management – Continuous Delivery, we are going to set up a continuous delivery pipeline to demonstrate how this is possible, and even though we will have some manual steps for sanity, it is possible to fully automate the operations on a microservices environment running in a cluster such as Kubernetes.
In general, I would not advise any company to start a project based on microservices without proper automation in place. More specifically, if you are convinced that the system will grow over time, Kubernetes would be a very interesting option: it gives you the building blocks that other platforms lack, such as load balancers, routing, ingress, and more. We will dive deep into Kubernetes in the upcoming chapters.
All these activities are supposed to be part of the DevOps engineer's day-to-day work (among many others), but first, there is a problem that we need to solve: how to align our company resources to be able to get the most from the DevOps engineer figure.
Up until now, we have looked at how the modern and traditional release life cycles work. We have also defined what a DevOps engineer is and how they can help with microservices, which, as explained, are not viable without the right level of automation.
Apart from technicalities, there is something that is extremely important for the DevOps culture to succeed: organizational alignment.
The traditional software development used to divide teams into different roles:
- Business analysts
- Developers
- System administrators
- QA engineers
This is what we call horizontal slices: a team of system administrators has a few contact points with the developers so that they get enough information to deploy and maintain software.
In the modern release life cycle, this simply does not work. Instead of horizontal slices of our company, we need to get vertical slices: a team should be composed of at least one member of every horizontal team. This means having developers, business analysts, system administrators, and QA engineers together...well, not 100%.
With the DevOps philosophy, some of these roles become irrelevant or need to evolve. The idea is that a single team is able to build, deploy, and run an application on its own without anything external: this is called a cross-functional autonomous team.
In my professional experience, cross-functional teams are the best organization for delivering high-quality, reliable products. The product is run by the people who build it; therefore, they know it inside out. A combination of analysts (depending on the nature of the business), developers, and DevOps engineers is all you need to deliver high-quality software into production. Some teams might also include a QA engineer, but in general, automated testing created by DevOps engineers and developers should be the holy grail: it is impossible to release software in a continuous delivery manner without having good code coverage. I am a big fan of the analyst being the one who tests the software, as he/she is the person who knows the requirements best and is, therefore, the best placed to validate them.
The DevOps engineer plays a cross-cutting role: they need to know how the application is built (and possibly be part of its development), but their focus is related to the operation of the app: security, operational readiness, infrastructure, and testing should be their day-to-day job.
I have also seen teams made up entirely of DevOps engineers and analysts, without any pure developers or QAs. In this variant, the DevOps engineers are responsible for the infrastructure part as well as the application development, which can be very challenging depending on the complexity of the system. In general, every case needs to be studied in isolation, as DevOps is not a one-size-fits-all product.
Now that we have introduced DevOps, it is time to specify what we are going to learn in this book. It will be mainly focused on the Google Cloud Platform and the DevOps tools around it. There are several reasons for this:
- The trial period of GCP is more than enough to go through the entire book
- It is a very mature product
- Kubernetes is a big part of GCP
You will learn the fundamentals of the DevOps tools and practices, in enough detail to allow you to search for extra information when needed and to the point where you can apply what you have learned straight away in your company.
The book will be strongly focused on the ops part of DevOps, as there is enough literature on application development, and that hasn't changed in the DevOps world. Needless to say, we are not going to show how to write tests for your application, even though it is a fundamental activity to ensure the stability of our systems: DevOps does not work without good code coverage and automated testing.
In general, the examples are simple enough to be followed by people at the entry level of DevOps, but if you want to go deeper into some aspects of GCP, there is a good collection of tutorials available at https://cloud.google.com/docs/tutorials.
The book is structured in an incremental way: first comes a walkthrough of the different cloud providers, then the Docker fundamentals, and after that we go deep into configuration management tools (specifically, Ansible) and container orchestration platforms (mainly Kubernetes).
We will end up setting up a continuous delivery pipeline for a system that manages timezoned timestamps called Chronos, which I use for talks for several reasons:
- It has pretty much no business logic
- It is based on microservices
- It pretty much covers all the required infrastructure
You can find the code for Chronos in the following GitHub repository: https://github.com/dgonzalez/chronos.
The majority of the examples can be repeated in your local machine using a virtualization provider such as VirtualBox and MiniKube for the Kubernetes examples, but I'd encourage you to sign up for the trial on Google Cloud Platform as it provides you (at the time of writing this) with $300 or 1 year of resources to spend freely.
In this chapter, we have seen how we should align our resources (engineers) to deliver low-cost and high-impact IT systems. We have seen how poor communication can lead to a defective release process, deadlocking our rollouts and making the system quite inefficient from the business point of view. Through the rest of the book, we are going to look at tools that can help us not only to improve this communication but also to enable our engineers to deliver more top-quality functionality at lower cost.
The first of these sets of tools is described in the next chapter: the cloud data centers. These data centers allow us to create resources (VMs, networks, load balancers...) from their pool of resources in order to satisfy our needs for specific hardware, at a very reasonable price and with great flexibility. These types of cloud data centers are being adopted more and more by modern (and not so modern) IT companies, which is leading to the creation of a set of tools to automate pretty much everything around the infrastructure.