Any company that builds and maintains applications is automatically concerned with repeatability, reliability, and scalability. In fact, some of the main metrics that are monitored on an application are directly related to these operational concerns. Understanding the basics and history of the industry when attempting to accomplish the ultimate trifecta of software administration is paramount to learning from the issues of the past.
In this chapter, and throughout this book, you will embark on a journey with a DevOps team as they attempt to conquer the deployment and delivery world. By experiencing the pains, bottlenecks, and setbacks with the DevOps team, you will understand how the industry has evolved, and what needs to be accomplished in order to succeed.
In this chapter, we're going to cover the following main topics:
- How did we get here?
- What is a deployment process?
- What is a delivery process?
- What makes any practice continuous?
How did we get here?
It's 8 a.m. on a Saturday and the release party's post-mortem has finally been completed. Throughout the release, every encountered issue resulted in a Root Cause Analysis (RCA) process. Once each of the RCAs was done, the release team would then create and assign tickets as needed, resulting in action items for the different teams in the Engineering organization. With the post-mortem completed, the release team can hand off the monitoring of the production application to the weekend support team and head home.
The final production servers were upgraded with the new application release at around 3 a.m. that morning, with all of the application health checks successfully passing by 3:30 a.m. And yet, even with the early morning finishing time, this was a significant improvement compared to the release parties of a few years ago. Previously, the applications were released every 6 to 12 months, rather than the quarterly release cadence that the company is currently on.
Their company had hired a consulting agency to advise them on how to improve their application's mean-time-to-market and reduce their production outages in order to meet business initiatives and demands. The outcome suggested by this consulting agency was to release the application more frequently than once or twice a year. As a result, the releases have been quicker and less prone to error, which the business has taken notice of. The release parties still require pulling an all-nighter, but the previous release parties were more like all-weekenders or longer.
The on-call engineering team still has to be brought in for every release, but at least they aren't required to be a part of the release party for the entire time. And the most recent release only required a conference bridge for about 4 hours to solve issues with the underlying code or provide quick fixes. Overall, the operations team, infrastructure team, network team, and security team were able to solve most of the issues that showed up, which accounted for significantly more confidence in the newer release cadence.
The different teams should be able to work through the backlog of issues before the next release. But the team with the largest issue backlog was the systems administrators, who build, integrate, administer, and troubleshoot the many different tools used during the releases.
After 12 straight hours with over 15 members across a host of different teams, the release party was complete. When considering the time associated with the attempt to improve the process throughout the quarter, as well as the actual release itself, it is not difficult to run the mental math on the associated costs. The teams need to figure out a way to make the releases more reliable, repeatable, and scalable.
This analogy is all too familiar for many who have been involved in the engineering side of a business during the Waterfall software development life cycle days. When applications were first made available as a SaaS (Software as a Service) solution, the common release cadence was an annual release. Throughout the year, a company would deploy small releases, often called patches, which mainly consisted of hardware, software, or security updates.
Since the yearly update was essentially releasing a brand-new product, the release process required significant involvement from every team across the entire engineering organization. The release was a major event, often taking an entire weekend, or longer, from every team available. Many in the industry had dubbed this event a release party. Each release party included significant amounts of caffeine and food, which accompanied a host of people hunched over their laptops as they watched the output of the release on a massive projector screen. Yet the worst part of this whole scenario was that this was the expected release style for every company at the time.
The quarterly release cadence was a novel idea that revolutionized how companies would develop and test their code. The code changes were smaller in nature and the teams evolved their thinking from a new product every year to a new subversion every quarter. Some user experience changes may be introduced, but most of the user experience in the application would remain the same from release to release. Another major benefit to the increased release frequency was the significant reduction in lead time, which is the time it takes to go from a feature being requested to being available in production.
Alongside the release parties were two very important processes when issues would arise during the release:
- Root-cause analysis (RCA)
An RCA would occur anytime there was a significant issue in production that would halt or severely affect the functionality or availability of the application. Often, the RCA process would start with the teams analyzing what was wrong, fixing the issue, validating that the fix worked, and then documenting how the issue arose and what the root cause was. Every release party would result in at least one RCA taking place, and the number of RCAs would grow quickly with the total number of production servers involved in the release party.
- Post-mortem
The post-mortem was a retrospective process after the release was completed and the teams were confident in production operating as expected. The release captain would gather any and all information related to RCAs, bugs, errors, and so on, and create the required documentation and tickets. At the end of the post-mortem, a weekend support team would be briefed on the release party outcome and any items needing to be monitored.
The desire to automate the release of the application had been a central focus of every engineering organization for years. Automation was seen as the best way to enforce reliability and repeatability in the release process, and most of the common tools in use today were created with the intention and purpose of release automation. These tools, and really the underlying processes they address, intend to solve two major concepts in the software development life cycle: deliveries and deployments.
What is a deployment process?
10 p.m. on Friday was when everything started falling apart. The Q2 release party started a few hours ago with the entire operations team and a few members of the infrastructure and network team in attendance. Routing customer traffic away from the initial test server to allow for the upgrade went as expected. This process was recently automated through some network management scripts that the systems administration and network team worked on. The idea was that all new traffic should be routed away from the initial server while allowing the customer sessions that were currently using the server to continue until they disconnected. After all the user sessions were completed, the server was removed from the load balancer and the release process could start.
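The drain-then-remove flow that the network management scripts automated can be sketched as follows. This is a minimal illustration, not the team's actual scripts: the load-balancer client and its method names are hypothetical assumptions.

```python
import time

def drain_and_remove(lb, server, poll_interval=30):
    """Drain a server before an upgrade: stop routing new traffic to it,
    wait for the existing user sessions to finish, then remove it from
    the load balancer pool.

    `lb` is a hypothetical load-balancer client assumed to expose
    `stop_new_traffic`, `active_sessions`, and `remove_from_pool`.
    """
    lb.stop_new_traffic(server)            # new requests go to other servers
    while lb.active_sessions(server) > 0:  # let current users disconnect
        time.sleep(poll_interval)
    lb.remove_from_pool(server)            # safe to start the release now
```

Once `drain_and_remove` returns, the server carries no traffic and the bootstrap and deployment steps can begin without affecting users.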
The infrastructure team had a bootstrap script already built out to automatically configure the server. Sometimes this process involved tearing down the whole server and rebuilding it, while other times the release required some simple software updates to be completed before the hardware was ready for the application release. The new release wouldn't require an entire rebuild of the server this time. However, since the last release was 3 months ago, they did have to patch the server, add a new application stack version, and make sure that other configuration requirements were set accordingly. The entirety of the infrastructure process took about an hour for the first server, which would then be repeated for the other servers so that the bootstrapping time would be reduced for the rest of the fleet. As more customers were acquired, the total number of servers in production had grown. To avoid downtime for the production environment, these servers were grouped together into pools, which could then be individually targeted for stopping, upgrading, and restarting as needed.
After the initial server was bootstrapped by the infrastructure team and validated through some basic quality and security tests on the server, the operations team would then start the application release process. It was just after 7 p.m. when the operations team started the release process, also known as a deployment, by copying the ZIP file from the production network share to the server. The file was then expanded into a mess of files and folders, which contained system services, application files, and a rather daunting INSTALL_README.txt file. This README file detailed all of the required install steps and validation checks that the engineering team documented for the operations team to execute.
With the install instruction file open on one screen and the terminal open on another, the install process could start. That is when everything went wrong.
Although the deployment testing in the staging environment had some issues because of missing requirements, those were documented and added to the install process. But what the operations team didn't know was that the server bootstrapping script had reset all of the network configuration files and all of the application traffic heading out of the server was being redirected back to itself. As the deployment went through, the application ZIP file was able to get pushed to the server, the filesystem was set up as needed, and the required system services began running. The script used to test the health of the application showed all successful log messages. However, when the script to test the interaction between the application and the database was run, the terminal output showed only connection errors. It took the team over an hour to get everything copied over, stood up, and tested before the network errors were discovered. The release party had come to a grinding halt.
The operations team was in full-on panic mode and the first RCA process had started. If they could not figure out why the server was not able to talk to external machines within the next hour, they would need to tear down the whole server and start over again. While one person from the operations team collaborated with the network and infrastructure team, another operations team member would retrace every action taken since the infrastructure team had finished their tasks. After 30 minutes of the network team analyzing all traffic related to the new server and the desired databases, they could not find any reason as to why the server could not reach the database. The infrastructure team was checking to see if the server had been properly added to the domain and that no other machines were using the same hostname or IP address. The operations team had engaged the on-call engineering team and started a troubleshooting conference bridge for the data center support team to join.
It wasn't until a few minutes after midnight that the network team found the networking loopback issue on the server. The outcome of the RCA process found that the server bootstrapping script was the culprit, which was then altered to avoid the issue in the future. The server was now passing all health checks and the operations team could move on to the next server in the pool. Within an hour, the rest of the server pool had been fully upgraded without an issue. Almost two hours later, all server pools were upgraded and reporting healthy. The post-mortem process could begin now that the new application version was out in production and operating as expected.
A release party would always start with an initial test release into production, known as a deployment. At a high level, a deployment process is solely concerned with copying an artifact from a designated location to some endpoint or host. In the case of the quarterly release party, a deployment consisted of pushing or pulling the artifact to a designated test server in the production fleet. This was a common method used to avoid production downtime by preventing unknown production-specific nuances from negatively affecting an environment-wide deployment. The log output and application metrics for the test deployment were heavily scrutinized in an attempt to catch any hint of an issue.
The deployment on the test production server would typically require some bootstrapping process to enforce a common starting point for all future deployments. If all deployments started with the same server configuration starting point, then theoretically, nothing should be different as the deployments moved from one server to the next.
Once the deployment was complete, before traffic would be allowed on the application, a set of tests would be executed. These health checks would validate many different requirements, such as external service connectivity and business-critical functionality. After that initial production server was completed and the validation tests had passed, the rest of the initial server pool would be put through the same process. After the initial server pool was upgraded, it would be added to a load balancer with a set of load and smoke tests validating that the application, servers, and networking were operational.
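The post-deployment health checks described above can be sketched in Python. This is an illustrative sketch only: the hostnames, ports, and check names are assumptions, not the book's actual scripts.

```python
import socket

def check_port(host, port, timeout=5):
    """Return True if a TCP connection to host:port succeeds --
    a basic external-service connectivity check."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def run_health_checks(app_host):
    """Run post-deployment checks before traffic is allowed on a server.
    Hostnames and ports here are illustrative assumptions."""
    checks = {
        "app responds": check_port(app_host, 8080),
        "db reachable": check_port("db.internal.example", 5432),
    }
    # An empty failure list means the server can rejoin the load balancer
    return [name for name, ok in checks.items() if not ok]
```

A check like the database connectivity probe is exactly what would have surfaced the loopback misconfiguration in the story much earlier, before an hour of manual troubleshooting.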
Finally, after that initial server pool was completed, production traffic would then be routed appropriately. The next server pool in the queue would then follow the same process to test for deployment consistency. Once the release team had confidence in the process, they could upgrade multiple server pools in parallel.
Even with a release party as intense as the one described happening on a quarterly basis, it was not the only release party for the engineering teams. Two other major deployment events would take place throughout the quarter: one for deployments of patches and hotfixes across the different architecture layers, and the other for all non-production environments.
The first deployment that would happen throughout the quarter was when the engineering teams would need to test their code. The testing of the code started by packaging the code into an artifact and uploading it to a network share. Once the artifact was on the network share, the engineering team could then deploy their artifact to a designated server in the development environment. Usually, the engineer would have to run a bootstrapping script to reset the server to a desired state.
The process of deploying an artifact to a development environment had to be relatively repeatable, since the deployment frequency was either daily or weekly. When the engineers believed that they had an artifact ready to be released, they would then deploy it to a test server in the Quality Assurance (QA) environment for further evaluation.
Along with all of the QA processes and tests needing to be run, the team would need to start building out the INSTALL_README.txt file for the production release. The QA team would send testing feedback to the developers for any required fixes or improvements. After a few rounds of feedback between the teams, the most recent artifact version would be promoted to a release candidate. The teams would then focus on the deployment process for the next release party. The handoff of the artifact from the developers to the operations team would happen about a month before the release party. Often described as "throwing it over the wall", the developers would have little to no interaction with the artifact once it was passed on to the operations team. The operations team would then spend the next month practicing the deployment for the release party.
The other major deployment events taking place throughout the quarter were the patching and hotfix releases.
Similar to the development deployment process, the patching deployment process would be executed against lower-level environments for both testing and repeatability. The major difference was that these deployments would take place outside of typical maintenance windows.
The initial set of releases would start with the development environment, allowing for significant testing to take place. This would prevent regressions from affecting higher-level environments, such as QA or production. Once the deployments to the lower-level environments were repeatable, the teams would designate an evening or weekend for the deployment to take place. Similar to the release party, one server would be removed, patched, restarted, checked, and then made available for users. Assuming everything behaved as expected with the patched application, the rest of the servers would be put through the same set of tasks.
A deployment is an essential process focused on getting an artifact into an environment. The more a deployment process can be automated, the more repeatable, reliable, and scalable the deployment process becomes.
What is a delivery process?
The time between the release parties seemed to get shorter with each quarter, especially when the requirements continued to grow. The different engineering groups were becoming smaller as engineers were burning out and leaving. The system administrator team needed to quickly figure out what had to be fixed and what could be automated. But after every release party, the backlog of work items continued to pile up, resulting in a mountain of technical debt.
The recent move from semi-annual releases to quarterly releases required a massive operations team overhaul. Some of the more senior infrastructure and engineering team members moved to a new group focused on automating the release process. But before they could automate anything, they needed to understand what a release actually consisted of. Since no team or person knew all of the different release requirements, the next quarter was spent on research and documentation.
What they found was that every step related to the lifecycle of the artifact, from development to production release, needed to be automated. This lifecycle, which included all pre-deployment, deployment, and post-deployment requirements, was considered a delivery. They were surprised to find that a single execution of the delivery process took 3 months to complete. To make matters worse, none of the steps in the delivery process had ever been documented or defined. Every member of every team involved in the delivery process had a different reason for why each of their steps was required.
After the team had completed their initial documentation process and also after experiencing the recent release party, they were ready to start automating. But what they were unsure of was whether or not they should build out the process or if they should look for a tool to do it for them.
At first, they researched the available tools in the market that might give them a foundation to build the new process on top of. One issue they found was that the more common tools were related specifically to either Windows or Linux, but not both. The other tools that they had found were scalable across the different systems, but they came with significant ownership and hosting requirements. Considering the short timeline and technology requirements, any tool that supported multiple systems and could be highly customized and extended through scripting would be best.
The system administrator team decided that it would be best to split up and tackle different requirements. Some of the team focused their attention on running tooling proof of concepts. The rest of the team would focus on building scripts to support the rest of the engineering organization. The initial iteration of the new deployment process would be focused mainly on the ability to build and execute the automation scripts. Once that was built out, the next iteration improvement would focus on turning the scripts into templates for scalability.
The first piece to automate was the deployment of the release candidate artifact to the test server. This required bootstrapping the server (resetting it to an optimal state, adding the desired environment variables, adding any required software or patches, and so on). Then they would pull down the artifact from a designated network share, expand it, and upgrade the application on the server. After that was completed, the automation process could then email the QA team for them to start their testing requirements.
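The bootstrap, pull, expand, and upgrade sequence described above can be sketched in Python. This is a minimal sketch under stated assumptions: the paths, the bootstrap script, and the QA notification step are all invented for illustration.

```python
import shutil
import subprocess
import zipfile
from pathlib import Path

def deploy_release(artifact: Path, app_dir: Path, bootstrap_script: Path):
    """Sketch of the first automated deployment step: reset the server,
    pull and expand the artifact, swap the new version into place, and
    notify QA. All names here are illustrative assumptions."""
    # 1. Bootstrap: reset the server to a known, common starting state
    subprocess.run([str(bootstrap_script)], check=True)

    # 2. Expand the artifact (pulled from the network share) into a
    #    staging directory so a failed extract can't corrupt the app
    staged = app_dir.with_suffix(".staged")
    if staged.exists():
        shutil.rmtree(staged)
    with zipfile.ZipFile(artifact) as zf:
        zf.extractall(staged)

    # 3. Swap the new version into place
    if app_dir.exists():
        shutil.rmtree(app_dir)
    staged.rename(app_dir)

    # 4. Notify QA that the environment is ready (stubbed here)
    print(f"Deployed {artifact.name} to {app_dir}; notifying QA")
```

Encoding the steps as a single function is what makes the process repeatable: every development deployment runs the same bootstrap, extract, and swap logic instead of a person following a README.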
With the development deployment completed, the automation process would be directed at a staging environment. This environment contained multiple production-like servers, allowing the operations team to practice the deployment process. Automating this requirement meant that the deployment scripts had to affect the network configuration, as well as the application server. After the server was bootstrapped and reset, the testing process could then be run to validate the application health. But, to make the automation more scalable, the team would need to have the scripts run remotely, pushing down the artifact instead. The remote execution behavior would allow for a larger deployment set to be run in parallel.
The last part that the team wanted to automate was the post-deployment verification step. This step could be run remotely and would pass data to the application. This would allow for both a network connectivity check and a desired functionality check.
The team would need to test out the automation process in production.
One of the biggest issues that any engineering organization must deal with is technical debt. Technical debt is the cost of any rework that is caused by pursuing easy or limited solutions. And what makes technical debt grow is when engineering organizations work as a set of disparate units. This causes compound interest, since no central team will be able to maintain the cross-team technical debt.
Eventually, the creditor comes to collect the technical debt and, depending on the communication styles of the teams and how long the debt has been avoided, the hole is almost too deep to climb out of. Technical debt will often go unnoticed for a few years until the creditor comes to collect. With regard to an engineering organization's technical debt, the creditor is often business or market requirements. Most recently, smaller and more agile start-ups have disrupted the market, taking market share and causing panic for the bigger players.
With the potential of technical debt resulting in a form of technical bankruptcy, many companies make radical decisions. Sometimes they decide to create new teams focused on a new company direction. Other times they will replace the management team for a "fresh perspective". But in any case, the goal is to repay the technical debt as quickly as possible.
A common place to find technical debt is in engineering supporting processes. For the system administration team in the analogy, most of the technical debt was associated with their release practice. Although they had a relatively automated deployment, they found that most of the manual steps occurred before and after the deployment itself. As a result, the team realized that the biggest source of technical debt was their delivery process.
Any desire to automate a process must first start with a requirements gathering process. With the requirements gathered, a team can then pursue a minimum viable product (MVP). Part of the requirements gathering process is being able to define what the immediate needs are and which capabilities can be added at a later time. A minimum viable product is exactly what it sounds like: a product that meets the minimal requirements to be viable. In the analogy, the items that would be required in the MVP were server bootstrapping, artifact deployment, and network management. These functionalities would have the highest level of impact on the technical debt and also on the main problem areas throughout the current delivery and deployment process. Features such as running and evaluating tests, approval steps, dynamic environment creation, and traffic manipulation would be brought in over time.
Building, testing, and iterating are the common development cycles that any engineering team will need to go through. But the moment that any process is automated, the team responsible for the automation will need to consider scalability. Once another automation use case is discovered, the automation tooling must be scaled to accommodate it. The term often associated with scaling across use cases is onboarding. And a requirement to onboard other use cases or teams immediately creates a need for a central management team. That team will have the goal of supporting, improving, troubleshooting, and onboarding the solution for the foreseeable future. Eventually, the automation process becomes a core support tool that must be reliable, scalable, and offer repeatable outcomes.
What makes any practice continuous?
It's been about 2 years since the quarterly releases were mandated, and one year since the systems administration team had first been formed. The most recent release party resulted in a very low impact release. The engineering organization was able to significantly reduce the years of ignored technical debt. The company's product roadmap, promising a new product every year, recently went into effect. As a result, the automated delivery pipeline was now supporting two separate products in a repeatable, reliable, and scalable way.
A design choice that the systems administration team had pursued was building out an imperative configuration method for the automation solution. Looking back, the team realized that they would have had an easier time scaling through declarative methods instead. The need to quickly move away from the previous high-touch process drove the team to make some rushed decisions. Although variables could be added to the execution process, every new product required the team to clone and change the automation scripts to be successful. The new automation solution could only scale through this cloning process, resulting in heavy administration and storage costs.
The system administration team realized that for a tool to support their scaling requirements, the tool would need to support declarative configurations and executions. This requirement became abundantly clear as the architecture support requirements grew and changed. For now, the team could convert their scripts into more declarative templates to implement a short-term scalable solution. The system administrator team needed to get a new process in place, and fast. The previous process was error prone and highly manual, resulting in a mountain of technical debt. But to develop the MVP in time, the team took a risk of piling up technical debt of their own. They assumed that they would have enough time to fix and refactor the solution to pay off the technical debt. But this time, the creditor came back much sooner than expected.
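The difference between cloning an imperative script per product and driving one template from declarative data can be illustrated with a small Python sketch. The products, fields, and steps here are invented for illustration; the point is that onboarding a new product means adding data, not copying and editing code.

```python
# Declarative product descriptions: each product is just data.
# These entries are illustrative assumptions, not real products.
PRODUCTS = {
    "storefront": {"pool": "web",  "artifact": "storefront.zip", "port": 8080},
    "reporting":  {"pool": "jobs", "artifact": "reporting.zip",  "port": 9090},
}

def render_deploy_plan(product: str) -> list:
    """Turn a declarative product entry into the ordered steps that the
    automation executes. One template serves every product, so there is
    no per-product script to clone and maintain."""
    cfg = PRODUCTS[product]
    return [
        f"drain pool {cfg['pool']}",
        f"deploy {cfg['artifact']}",
        f"health-check port {cfg['port']}",
        f"restore pool {cfg['pool']}",
    ]
```

With the imperative approach, supporting a second product meant a second copy of every script; with this shape, it means one more entry in `PRODUCTS`, which is why declarative configuration scales with far lower administration cost.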
Business executives were impressed with the lack of issues relating to the recent product launches. The reliability of their main product was better than ever. In fact, these recent changes resulted in the company gaining significant market share. The success of the two products in the market meant that the company could double their efforts. And to ensure that the recent success would continue, the company executives decided to hire a new Chief Digital Officer. This new CDO was well known in the industry for implementing significant change in engineering. One of the major changes that the CDO was bringing to the company was adopting DevOps practices.
The following month saw a host of changes across all of engineering. Every team was now required to attend DevOps training and enablement. Anyone in engineering leadership was required to read a lengthy handbook on DevOps. The different teams were now also being tasked with documenting their current process, as well as anything that they were working on. Each team would also be incentivized to learn about Git, containerization, and different continuous practices.
The infrastructure, network, and security teams were tasked with learning about containers, container orchestration, and cloud infrastructure. The operations team became the DevOps team, and the system administration team became site reliability engineers.
These changes required the teams to migrate from their current processes to DevOps practices. The development team was granted more time for their migration requirements. But the DevOps and SRE teams were required to rapidly migrate current platforms over to cloud native ones.
The significant shift in direction towards DevOps and cloud native technology resulted in a major staff change. Some of the more senior engineers left the company, while a host of new hires brought in fresh perspectives. The goal was to get the company out of the Waterfall software development life cycle method and into the continuous world of DevOps. The CDO wanted an integration, deployment, and delivery process that was executed at least once a day.
Many companies that existed at the time when the Waterfall software development life cycle was the industry best practice have been confronted with this need to change. The shift from Waterfall methods to Agile, and now to DevOps, has rocked the engineering industry. The ability to execute a delivery or deployment process at any time seemed too risky. The perspective was that only the largest companies with the most money and the largest engineering staff could achieve these extreme capabilities.
The DevOps and SRE teams realized that the best way to support the new DevOps requirements was to rebuild their automation solution. They would need to set up a best practice process for the developers to use. This included the tools, solutions, platforms, and steps needed to enable continuous delivery and deployments.
Different members of the DevOps and SRE teams would still need to maintain the old process in a weekly rotation schedule. Others on the team would work on setting up the new platform in the desired cloud provider and learning the steps to get an artifact there. After choosing a container orchestration platform, the DevOps and SRE teams needed to work on automation. Configuring and scaling the new platform required the teams to learn how to use Infrastructure as Code. A declarative method of delivery and deployment is one of the most scalable options for a DevOps practice. Most major platforms today natively support declarative configuration practices. This support allows for teams to easily adopt the platform and scale it out to meet their needs.
After the platform was set up and an artifact was deployed to it, the teams started looking at different templating options to make the administration requirements light. This led to a bake-off across different declarative deployment styles, some natively built into the platform and others leveraging an underlying templating engine. As the teams got closer to the desired end state, they had to find a new tool that would enable and assist them in the cloud-native world. The two main questions they needed to answer now were the following:
- Do they need to support the old applications as well as the new?
- Should they look for a continuous deployment tool or a continuous delivery tool?
This chapter presented a foundational understanding of what a delivery requires, what a deployment is, and how the industry arrived at the state it currently operates in.
A common misunderstanding in the industry is that if the company is using a continuous integration and/or continuous delivery tool, or if their application is releasing at least once a week, they are operating in a continuous manner. This chapter laid out that a continuous process is defined as one that executes at least once a day, preferably more.
In the next chapter, the DevOps team will explore common industry trends and tools related to software delivery and deployment, as well as how to test their automation process.