Software technology has evolved much like life on Earth. In the beginning, websites were static and programming languages were primitive, much like the simple organisms in the ancient oceans. In those days, software solutions were intended only for a few large organizations. Then, in the early 90s, the popularity of the Internet led to rapid growth in new programming languages and web technologies. All of a sudden, there was a Cambrian-like explosion in the domain of information technology that brought diversity to software technologies and tools. The growth of the Internet, powered by dynamic websites built on HTML and scripting languages, changed the way information was displayed and retrieved.
This continues to date. In recent years, there has been immense demand for software solutions in organizations big and small. Every business wants to take its product online, either through websites or apps. This huge need for economical software solutions has led to the growth of new software development methodologies that make software development and distribution quick and easy. An example is Extreme Programming (XP), which attempts to simplify many areas of software development.
Large software systems in the past relied heavily on documented methodologies, such as the waterfall model. Even today, many organizations across the world continue to do so. However, as software engineering continues to evolve, there is a shift in the way software solutions are being developed and the world is going agile.
Understanding the concepts of Continuous Integration is our prime focus in the current chapter. However, to understand Continuous Integration, it is important to first understand the prevailing software engineering practices that gave birth to it. Therefore, we will begin with an overview of various software development processes, their concepts, and their implications. To start with, we will glance through the agile software development process. Under this topic, we will learn about a popular software development process, the waterfall model, and its advantages and disadvantages compared to the agile model. Then, we will jump to the Scrum framework of software development. This will help us answer how Continuous Integration came into existence and why it is needed. Next, we will move on to the concepts and best practices of Continuous Integration and see how they help projects become agile. Lastly, we will talk about the methods that help us realize the concepts and best practices of Continuous Integration.
The name agile rightly suggests quick and easy. Agile is a collection of software development methodologies in which software is developed through collaboration among self-organized teams. Agile software development promotes adaptive planning. The principles behind agile are incremental, quick, and flexible software development.
For most of us who are not familiar with the software development process itself, let's first understand what the software development process or software development life cycle is.
Software development process, software development methodology, and software development life cycle have the same meaning.
The software development life cycle, often abbreviated as SDLC, is the process of planning, developing, testing, and deploying software. Teams follow a sequence of phases, and each phase uses the outcome of the previous phase, as shown in the following diagram:
First, there is a requirement analysis phase: here, the business teams, mostly comprising business analysts, perform a requirement analysis of the business needs. The requirements can be internal to the organization or external from a customer. This analysis includes finding the nature and scope of the problem. With the gathered information, there is a proposal either to improve the system or to create a new one. The project cost is also decided and benefits are laid out. Then, the project goals are defined.
The second phase is the design phase. Here, the system architects and the system designers formulate the desired features of the software solution and create a project plan. This may include process diagrams, overall interfaces, layout designs, and a huge set of documentation.
The third phase is the implementation phase. Here, the project manager creates and assigns tasks to the developers. The developers develop the code depending on the tasks and goals defined in the design phase. This phase may last from a few months to a year, depending on the project.
The fourth phase is the testing phase. Once all the decided features are developed, the testing team takes over. For the next few months, all the features are thoroughly tested. Every module of the software is brought together and tested. Defects are raised if any bugs or errors surface during testing. In the event of failures, the development team quickly acts on them. The thoroughly tested code is then deployed into the production environment.
One of the most famous and widely used software development processes is the waterfall model. The waterfall model is a sequential software development process derived from the manufacturing industry. One can see a highly structured flow of processes that run in one direction. At the time, there were no software development methodologies, and the only model developers could draw on was the production line process, which was simple to adapt for software development. The following diagram illustrates the sequential steps in the waterfall model:
The waterfall approach is simple to understand. The steps involved are similar to the ones discussed for the software development life cycle.
First, there is the requirement analysis phase, followed by the design phase. Considerable time is spent on analysis and design. Once they are over, there are no further additions or deletions; in short, once development begins, no modification to the design is allowed.
Then comes the implementation phase, where the actual development takes place. The development cycle can range from 3 to 6 months. During this time, the testing team is usually free. Once the development cycle is complete, a whole week is planned for integrating the source code and releasing it in the testing environment. During this time, many integration issues pop up and are fixed at the earliest opportunity.
Once the testing starts, it goes on for another three months or more, depending on the software solution. After the testing completes successfully, the source code is deployed in the production environment. For this, again a day or two is planned to carry out the deployment. There is a possibility that some deployment issues may pop up.
After this, the software solution goes live. The teams get feedback and may also anticipate issues. The last phase is the maintenance phase. In this phase, the development team works on the development, testing, and release of software updates and patches, depending on the feedback and bugs raised by the customers.
There is no doubt that the waterfall model worked remarkably well for decades. Flaws did exist, but they were simply ignored for a long time because, back then, software projects had ample time and resources to get the job done.
However, looking at the way software technologies have changed over the past few years, we can easily say that this model won't suit the requirements of the current world.
Working software is produced only at the end of the software development life cycle, which lasts for a year or so in most projects.
There is a huge amount of uncertainty.
This model is not suitable for projects based on object-oriented programming languages, such as Java or .NET.
This model is not suitable for projects where changes in the requirements are frequent. For example, e-commerce websites.
Integration is performed only after the complete development phase is over. As a result, teams come to know about integration issues at a very late stage.
It's difficult to measure progress within stages.
The requirements are well-documented and fixed.
There is enough funding available to maintain a management team, testing team, development team, build and release team, deployment team, and so on.
The technology is fixed and not dynamic.
There are no ambiguous requirements. And most importantly, they don't pop up during any other phase apart from the requirement analysis phase.
The agile software development process is an alternative to the traditional software development processes, as discussed earlier. The following are the 12 principles on which the agile model is based:
Customer satisfaction by early and continuous delivery of useful software
Welcome changing requirements, even late in development
Working software is delivered frequently (in weeks rather than months)
Close daily cooperation between business people and developers
Projects are built around motivated individuals, who should be trusted
Face-to-face conversation is the best form of communication (co-location)
Working software is the principal measure of progress
Sustainable development that is able to maintain a constant pace
Continuous attention to technical excellence and good design
Simplicity—the art of maximizing the amount of work not done—is essential
Regular adaptation to changing circumstances
The 12 agile principles are taken from http://www.agilemanifesto.org.
In the agile software development process, the whole software is broken into many features or modules. These features or modules are delivered in iterations. Each iteration lasts for 3 weeks and involves cross-functional teams that work simultaneously in various areas, such as planning, requirement analysis, design, coding, unit testing, and acceptance testing. As a result, no one sits idle at any given point of time, whereas in the waterfall model, while the development team is busy developing the software, the testing team, the production support team, and everyone else are either idle or underutilized.
You can see, in the preceding diagram, that there is no time spent on the requirement analysis or design. Instead, a very high-level plan is prepared, just enough to outline the scope of the project.
The team then goes through a series of iterations. Iterations are time-boxed, each lasting for a month, or even a week in some mature projects. In this duration, the project team develops and tests features. The goal is to develop, test, and release a feature in a single iteration. At the end of the iteration, the feature goes for a demo. If the clients like it, the feature goes live. If it gets rejected, the feature returns to the backlog, is reprioritized, and is worked on again in a subsequent iteration.
There is also a possibility for parallel development and testing. In a single iteration, you can develop and test more than one feature in parallel.
Functionality can be developed and demonstrated rapidly: In an agile process, the software project is divided on the basis of features, and each feature is treated as a backlog item. The idea is to take a single feature, or a set of features, from conceptualization to deployment in a week or a month. This puts at least a feature or two on the customer's plate, which they can start using.
Lower resource requirements: In agile, there is no separate development team and testing team, nor a separate build, release, or deployment team. A single project team contains around eight members, and each individual in the team is capable of doing everything. There is no distinction among the team members.
Promotes teamwork and cross-training: As mentioned earlier, since the team is small, about eight members, the team members take turns switching roles and learn from each other's experience.
Suitable for projects where requirements change frequently: In the agile model of software development, the complete software is divided into features and each feature is developed and delivered in a short span of time. Hence, changing the feature, or even completely discarding it, doesn't affect the whole project.
Minimalistic documentation: This methodology primarily focuses on delivering working software quickly rather than creating huge documents. Documentation exists, but it's limited to the overall functionality.
Little or no planning required: Since features are developed one after the other in a short duration of time, there is no need for extensive planning.
One of the most widely used agile software development methodologies is the Scrum framework. Scrum is a framework for developing and sustaining complex products, based on the agile software development process. It is more than a process; it is a framework with certain roles, tasks, and teams. Scrum was created by Ken Schwaber and Jeff Sutherland, who together wrote the Scrum Guide.
In a Scrum framework, the development team decides on how a feature needs to be developed. This is because the team knows best how to solve the problem they are presented with. I assume that most of the readers are happy after reading this line.
Scrum relies on a self-organizing and cross-functional team. The Scrum team is self-organizing; hence, there is no team leader who decides which person will do which task or how a problem will be solved. In Scrum, a team is cross-functional, which means everyone takes a feature from an idea to implementation.
Sprint: A sprint is a time box during which a usable and potentially releasable product increment is created. A new sprint starts immediately after the conclusion of the previous sprint. A sprint may last from 2 weeks to 1 month, depending on the project's command of Scrum.
Product backlog: The product backlog is a list of all the features required in a software solution. This list is dynamic; every now and then, customers or team members add items to it or remove items from it.
The development team: The development team does the work of delivering a releasable set of features, called an increment, at the end of each sprint. Only members of the development team create the increment. Development teams are empowered by the organization to organize and manage their own work. The resulting synergy optimizes the development team's overall efficiency and effectiveness.
The product owner: The product owner is a mediator between the Scrum team and everyone else. He is the face of the Scrum team and interacts with customers, infrastructure teams, admin teams, and everyone else involved in the Scrum.
The product owner, the Scrum master, and the Scrum team together follow a set of stringent procedures to quickly deliver the software features. The following diagram explains the Scrum development process:
Sprint planning is an opportunity for the Scrum team to plan the features for the current sprint cycle. The plan is mainly created by the developers. Once the plan is created, it is explained to the Scrum master and the product owner. Sprint planning is a time-boxed activity, usually around 8 hours in total for a 1-month sprint cycle. It is the responsibility of the Scrum master to ensure that everyone participates in the sprint planning activity, and he is also the one who keeps it within the time box.
The number of product backlog items to be worked on (both new ones and those carried over from the last sprint)
The team's performance in the last sprint
Projected capacity of the development team
During the sprint cycle, the developers simply work on completing the backlogs decided in the sprint planning. The duration of a sprint may last from two weeks to one month, depending on the number of backlogs.
This activity happens on a daily basis. During the Scrum meeting, the development team discusses what was accomplished yesterday and what will be accomplished today. They also discuss the things that are stopping them from achieving their goal. The development team does not attend any other meetings or discussions apart from the Scrum meeting.
The daily scrum is a good opportunity for a team to measure the progress of the project. The team can track the total work that is remaining, and using it, they can estimate the likelihood of achieving the sprint goal.
The sprint review is like a demo to the customers of what the team accomplished and what it was unable to accomplish. The development team demonstrates the features that have been completed and answers questions about the increment. The product owner updates the status of the product backlog to date. The product backlog list may also be updated, depending on the product's performance or usage in the market. The sprint review is a four-hour activity in total for a one-month sprint.
In this meeting, the team discusses the things that went well and the things that need improvement. The team then decides the points on which it has to improve to perform better in the upcoming sprint. This meeting usually occurs after the sprint review and before the sprint planning.
Integration is the act of submitting your personal work (modified code) to the common work area (the potential software solution). This is technically done by merging your personal work (personal branch) with the common work area (Integration branch). Continuous Integration is necessary to bring out issues that are encountered during the integration as early as possible.
This can be understood from the following diagram, which depicts various issues encountered during a software development life cycle. I have considered a practical scenario wherein I have chosen the Scrum development model, and for the sake of simplicity, all the meeting phases are excluded. Out of all the issues depicted in the following diagram, the following ones are detected early when Continuous Integration is in place:
Build failure (the one before integration)
Build failure (the one after integration)
In the event of the preceding issues, the developer has to modify the code in order to fix it. A build failure can occur either due to faulty code or due to a human error while performing the build (assuming the tasks are done manually). An integration issue can occur if developers do not rebase their local copy of the code frequently with the code on the Integration branch.
In the preceding diagram, I have considered only a single testing environment for simplicity. However, in reality, there can be as many as three to four testing environments.
In any development team, there are a number of developers working on a set of files at any given point of time. Imagine that the software code is placed at a centralized location using a version control system. And developer "A" creates a branch for himself to work on a code file that prints some lines. Let's say the code when compiled and executed, prints "Hello, World".
# Print a message.
print "Hello, World\n";
After creating a branch, developer "A" checks out the file and modifies the code as follows:
# Print a message.
print "Hello, Readers\n";
He then checks in the file, and after check-in, he performs a build. The code is compiled, and the unit testing results show positive.
Nevertheless, if the unit tests were to fail, the developer would have returned to the code, checked for errors, modified the code, and rebuilt it, over and over, until the compilation and unit tests showed positive. The following diagram depicts the scenario that we discussed so far.
Assume that our developer "A" gets busy with some other task and simply forgets to deliver his code to the Integration branch or he plans to do it later. While the developer is busy working in isolation, he is completely unaware of the various changes happening to the same code file on the Integration branch. There is a possibility that some other developer, say developer "B," has also created a private branch for himself and is working on the same file.
# Print a message.
print "Hello, World!\n";
print "Good Morning!\n";
After the modification, developer "B" compiles and unit tests the code, and then, he integrates the code on the Integration branch, thus creating a new version "2".
Now, after a week, at the end of the sprint, developer "A" realizes that he has not integrated his code into the Integration branch. He quickly attempts to do so, but to his surprise, he finds merge issues (in most cases, the merge is successful, but the code on the Integration branch fails to compile due to an integration issue).
What do we make out of this? If developer "A" had immediately rebased and integrated his changes with the changes on the Integration branch (Continuous Integration), then he would have known about the merge issues far in advance and not at the end of the sprint. Therefore, developers should integrate their code frequently with the code on the Integration branch.
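The scenario above can be reproduced in a few commands. Git is an assumption here; the chapter does not mandate a particular version control tool, and the branch names are illustrative:

```shell
#!/bin/sh
# Reproduce the developer A / developer B scenario with Git.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"
trunk=$(git symbolic-ref --short HEAD)   # the Integration branch

# Version 1 on the Integration branch.
printf '# Print a message.\nprint "Hello, World\\n";\n' > hello.pl
git add hello.pl
git commit -qm "version 1"

# Developer A branches off, changes the message, but does not integrate.
git checkout -qb developer-a
printf '# Print a message.\nprint "Hello, Readers\\n";\n' > hello.pl
git commit -qam "developer A's change"

# Meanwhile, developer B changes the same lines and integrates first (version 2).
git checkout -q "$trunk"
printf '# Print a message.\nprint "Hello, World!\\n";\nprint "Good Morning!\\n";\n' > hello.pl
git commit -qam "developer B's change"

# A week later, developer A finally tries to integrate -- and hits a conflict.
git checkout -q developer-a
if git rebase "$trunk" >/dev/null 2>&1; then
  result="clean merge"
else
  result="merge conflict"
  git rebase --abort
fi
echo "$result"
```

Because both developers touched the same lines of `hello.pl`, the late integration surfaces a conflict; had developer A rebased daily, the conflict would have appeared a week earlier, when it was one small change instead of many.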
Since you're integrating frequently, there is significantly less back-tracking to discover where things went wrong.
If you don't follow a continuous approach, you'll have longer periods between integrations. This makes it exponentially more difficult to find and fix problems. Such integration problems can easily knock a project off schedule or can even cause it to fail altogether.
The agile software development process mainly focuses on faster delivery, and Continuous Integration helps it in achieving that speed. Yet, how does Continuous Integration do it? Let's understand this using a simple case.
Developing a feature may involve a lot of code changes, and between every code change, there can be a number of tasks, such as checking in the code, polling the version control system for changes, building the code, unit testing, integration, building on integrated code, packaging, and deployment. In a Continuous Integration environment, all these steps are made fast and error-free using automation. Adding notifications to it makes things even faster. The sooner the team members are aware of a build, integration, or deployment failure, the quicker they can act upon it. The following diagram clearly depicts all the steps involved in code changes:
The amount of code written for the embedded systems inside a car is more than that inside a fighter jet. In today's world, embedded software is inside every product, modern or traditional. Cars, TVs, refrigerators, wrist watches, and bikes all have at least a few software-dependent features. Consumer products are becoming smarter day by day. Nowadays, we can see a product being marketed more on its smart and intelligent features than on its hardware capability. For example, an air conditioner is marketed by its wireless control features, and TVs are marketed by their smart features, such as embedded web browsers.
The need to market new products has increased the complexity of those products. This increase in software complexity has brought agile software development and Continuous Integration methodologies into the limelight. There was a time when agile software development was used only by small teams of not more than 30-40 people working on simple projects. Today, almost all types of projects benefit from Continuous Integration: mostly web-based projects, for example, e-commerce websites and mobile phone apps.
Continuous Integration, automation, and agile are mostly thought of as being used in projects based on Java, .NET, and Ruby on Rails. The only place where you will see them not being used is legacy systems. However, even those are going agile. Projects based on SAS, Mainframe, and Perl are all now using Continuous Integration in some way.
A tool such as Jenkins works in collaboration with many other tools to achieve Continuous Integration. Let's take a look at some of the best practices of Continuous Integration.
In a Continuous Integration world, working in a private work area is always advisable. The reason is simple: isolation. One can do anything on their private branch, or, simply put, with their private copy of the code. Once branched, the private copy remains isolated from the changes happening on the mainline branch. In this way, developers get the freedom to experiment with their code and try new stuff.
If the code on developer A's branch fails due to some reason, it will never affect the code present on the branches belonging to the other developers. Working in a private workspace either through branching or through cloning repos is a great way to organize your code.
For example, let's assume that a bug fix requires changes to be made to the A.java, B.java, and C.java files. So, a developer takes the latest version of the files and starts working on them. The files after modification are, let's say, version 56 of the A.java file, version 20 of the B.java file, and version 98 of the C.java file. The developer then creates a package out of these latest files, performs a build, and then tests it. The build and testing run successfully and everything is good.
Now consider a situation where after several months, another bug requires the same changes. The developer will usually search for the respective files with particular versions that contain the code fix. However, these files with the respective versions might have been lost in the huge oceans of versions by now.
Instead, it would have been better if the file changes had been made on a separate branch back then (with the branch name reflecting the defect number). That way, the fix could easily be reproduced later by using the defect number to track down the code containing it.
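The defect-branch habit is cheap to adopt. A minimal sketch, assuming Git and a hypothetical defect number 1234:

```shell
#!/bin/sh
# Keep each fix on a branch named after the defect, so it can be found later.
set -e
work=$(mktemp -d)
cd "$work"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"
echo "original code" > A.java
git add A.java
git commit -qm "baseline"

# The fix lives on its own branch, named after the defect number.
git checkout -qb defect-1234
echo "code containing the fix" > A.java
git commit -qam "fix for defect 1234"

# Months later, the exact fix is one checkout away:
git checkout -q defect-1234
cat A.java
```

The branch name is the index: nobody has to remember that the fix was version 56 of A.java; the defect number alone recovers the complete, consistent set of file versions.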
We all know about the time dilation phenomenon (relativity). It is explained with a beautiful example called the twin paradox, which is easy to understand but hard to digest. I have modified the example a little to suit our current topic. The example goes like this: imagine three developers, A, B, and C. Each developer is sent into space in his own spacecraft that travels at the speed of light. All are given atomic clocks that show exactly the same time. Developer B is supposed to travel to planet Mars to sync the date and time on a computer that is on Mars. Developer C is supposed to travel to Pluto for a server installation and to sync the clock there with that of Earth.
Developer A has to stay on Earth to monitor the communication between the server that is present on Earth with the servers on Mars and Pluto. So, all start at morning 6 AM one fine day.
After a while, developers B and C finish their jobs and return to Earth. On meeting each other, to their surprise, they find their clocks measuring different times (and, of course, they find that they have aged differently). They are totally confused as to how this happened. Then, developer A confirms that the three servers on Earth, Mars, and Pluto, respectively, are not in sync.
Then, developer A recalls that while all three atomic clocks were in sync back on Earth, they forgot to consider the time dilation factor. Had they accounted for it, keeping in mind the speed and distance of travel, the out-of-sync issue could have been avoided.
This is the same situation with developers who clone the Integration branch and work on their private branches, each one absorbed in their own assignment and working at their own speed. At the time of merging, each one will definitely find their code different from the others' and from the Integration branch, and they will end up in something called merge hell. The question is: how do we fix it? The answer is frequent rebase.
In the previous example (developers with the task of syncing clocks on computers located across the solar system), the cause of the issue was neglecting the time dilation factor. In the latter example (developers working on their individual branches), the cause of the issue was neglecting frequent rebases. A rebase is nothing but updating your private branch with the latest version of the Integration branch.
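In Git terms, a rebase replays your private commits on top of the latest Integration code. A small sketch (Git assumed; any tool with a rebase-like operation works the same way):

```shell
#!/bin/sh
# A frequent rebase: bring the latest Integration code into a private branch.
set -e
work=$(mktemp -d)
cd "$work"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"
trunk=$(git symbolic-ref --short HEAD)   # the Integration branch
echo "version 1" > app.txt
git add app.txt
git commit -qm "version 1"

# The developer branches off to work privately.
git checkout -qb private
echo "my feature" > feature.txt
git add feature.txt
git commit -qm "feature work"

# Meanwhile, the Integration branch moves ahead.
git checkout -q "$trunk"
echo "version 2" > app.txt
git commit -qam "version 2"

# The rebase: replay the private work on top of the latest Integration code.
git checkout -q private
git rebase -q "$trunk"
cat app.txt
```

After the rebase, the private branch carries both the teammate's change (`app.txt` is at version 2) and the private feature work, so the eventual merge back to the Integration branch is trivial.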
While working on a private repository or a private branch surely has its advantages, it also has the potential to cause many merge issues. In a software development project with 10 to 20 developers, each developer working on a private clone of the main repository drifts, over time, further and further from the way the main repository looks.
In an environment where code is frequently merged and frequently rebased, such situations are rare. This is the advantage of using continuous integration. We integrate continuously and frequently.
The other situations where rebasing frequently helps are:
You branched out from a wrong version of the integration branch, and now you have realized that it should have been version 55 and not 66.
You might want to see what merge issues would occur if you included code developed on another developer's branch.
Also, too much merging messes up the history. So rather than frequently merging, it's better to rebase frequently and merge less. This trick also works in avoiding merge issues.
Frequent rebases mean less frequent merges on the Integration branch, which, in turn, means fewer versions on the Integration branch and more on the private branches. This has an advantage: it makes the integration history clear and easy to follow.
While rebases should be frequent, so should check-ins: at least once a day on one's working branch. Checking in once a week or less often is dangerous. A whole week of code that is not checked in runs the risk of merge issues, and these can be tedious to resolve. By committing or merging once a day, conflicts are quickly discovered and can be resolved instantly.
Continuous Integration tools need to make sure that every commit or merge is built so that the impact of the change on the system can be seen. This can be achieved by constantly polling the Integration branch for changes and, if changes are found, building and testing them. Afterwards, the results are quickly shared with the team. Builds can also run nightly. The idea is to give developers instant feedback on the changes they have made.
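The polling step itself is simple. Below is a toy version of one polling cycle, assuming Git; a real CI tool such as Jenkins adds scheduling, workspaces, and notifications on top of this:

```shell
#!/bin/sh
# One polling cycle: compare the branch head against the last built revision.
set -e
work=$(mktemp -d)
cd "$work"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"
echo "hello" > src.txt
git add src.txt
git commit -qm "initial"

last_built=$(git rev-parse HEAD)   # the last revision we built

# ...a new commit lands on the Integration branch...
echo "world" >> src.txt
git commit -qam "new change"

# The poll: has the branch moved since the last build?
head=$(git rev-parse HEAD)
if [ "$head" != "$last_built" ]; then
  status="build triggered"
  last_built=$head                 # a real tool would now build and test
else
  status="nothing to build"
fi
echo "$status"
```

Run on a schedule (or replaced by a push hook), this is all "polling the Integration branch" means: remember the last revision built, and trigger a build whenever the head moves past it.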
While a continuous build can give instant feedback on build failures, continuous testing can help in quickly deciding whether the build is ready to go to production. We should try to include as many test cases as we can, but this again increases the complexity of the Continuous Integration system. The tests that are difficult to automate are the ones that reflect real-world scenarios most closely; they involve a huge amount of scripting, so the cost of maintaining them rises. However, the more automated testing we have, the better and sooner we get to know the results.
How can we make sure that broken code never gets checked in? The answer is simple: before checking in your code, perform a build on your local machine, and if the build breaks, do not proceed with the check-in operation. There is another way of doing it: the version control system can be programmed to immediately trigger a build using the Continuous Integration tool, and only if the tool returns positive results is the code checked in. Version control tools such as TFS have a built-in feature, called a gated check-in mechanism, that does exactly this.
There are other steps that can be added to the gated check-in mechanism. For example, you can add a step to perform static code analysis on the code. This can be achieved by integrating the version control system with the Continuous Integration tool, which in turn is integrated with a static code analysis tool. In the upcoming chapters, we will see how this can be achieved using Jenkins in collaboration with SonarQube.
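A homemade gated check-in can be sketched with a Git pre-commit hook (Git and the hook approach are assumptions here; TFS's gated check-in and a Jenkins-verified commit are more elaborate versions of the same idea). The "build" gate below is just a shell syntax check standing in for a real compilation step:

```shell
#!/bin/sh
# A pre-commit hook that rejects the commit if the "build" (a syntax check) fails.
set -e
work=$(mktemp -d)
cd "$work"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"

cat > .git/hooks/pre-commit <<'HOOK'
#!/bin/sh
# The gate: refuse the commit unless every shell script passes a syntax check.
for f in *.sh; do
  sh -n "$f" || exit 1
done
HOOK
chmod +x .git/hooks/pre-commit

echo 'echo "building..."' > build.sh
git add build.sh
git commit -qm "good code" >/dev/null 2>&1 && first="accepted" || first="rejected"

echo 'fi' > build.sh               # deliberately broken shell code
git add build.sh
git commit -qm "broken code" >/dev/null 2>&1 && second="accepted" || second="rejected"

echo "first commit: $first, second commit: $second"
```

The healthy commit passes the gate; the broken one is rejected before it ever reaches the branch, which is exactly the behavior a gated check-in mechanism enforces on the Integration branch.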
In many organizations, there is a separate team to perform deployments. The process is as follows: once a developer has successfully created a build, he raises a ticket or composes a mail asking for a deployment in the respective testing environment. The deployment team then checks with the testing team whether the environment is free; in other words, can the testing work be halted for a few hours to accommodate a deployment? After a brief discussion, a time slot is decided and the package is deployed.
The deployment is mostly manual, and there are many manual checks that take time. Therefore, for a small piece of code to go to the testing environment, the developer has to wait a whole day. And if for some reason the manual deployment fails, due to a human error or some technical issue, it takes, in some cases, another whole day for the code to get into the testing area.
This is painful for a developer. Nevertheless, it can be avoided by carefully automating the deployment process. The moment a developer tries to check in the code, it goes through an automated compilation check, then through automated code analysis, and then it is checked in to the Integration branch. Here, the code is picked up along with the latest code on the Integration branch and built again. After a successful build, the code is automatically packaged and deployed in the testing environment.
In my experience, some of the best practices of Continuous Integration are the same as those of software configuration management, for example, labels and baselines. While the two are technically similar, they are not the same from the usage perspective. Labeling is the task of applying a tag to a particular version of a file or a set of files. We can take the same concept a little further: what if I apply a label to particular versions of all the files? That would describe a state of the whole system, a version of the whole collective system. This is called a baseline. And why is it important?
Labels or baselines have many advantages. Imagine that a particular version of your private code fixed a production issue, say "defect number 1234". You can label that version on your private code as the defect number for later use. Labels can also be used to mark sprints, releases, and hotfixes.
A numbering convention that is used widely is shown in the following image:
Here, the first two digits are the release number. For example, 00 can be beta, 01 can be alpha, and 02 can represent the commercial release. The next two digits are the bug fix number. Let's say release 02.00.00 is in production and a few bugs or improvements arise; the developers working on fixing those issues can then name their branch, or label their code, as 02.01.00.
Similarly, consider another scenario, where the release version in production is 03.02.00, and all of a sudden something fails and the issue needs to be fixed immediately. Then, the release containing the fix can be labeled as 03.02.01, which says that this was a hotfix on 03.02.00.
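As an illustrative sketch, the hotfix numbering described above can be computed mechanically. The RR.BB.HH layout here is only the convention from this section, not a standard:

```shell
# Sketch: given a release label in the RR.BB.HH convention described above,
# compute the label of the next hotfix (e.g. 03.02.00 becomes 03.02.01).
next_hotfix() {
    release="$1"                    # e.g. 03.02.00
    rr=${release%%.*}               # release digits, e.g. 03
    hh=${release##*.}               # hotfix digits, e.g. 00
    bb=${release#*.}; bb=${bb%.*}   # bug fix digits, e.g. 02
    hh=${hh#0}                      # drop a leading zero before arithmetic
    printf '%s.%s.%02d\n' "$rr" "$bb" "$((hh + 1))"
}

next_hotfix 03.02.00   # prints 03.02.01
```

In practice, such a label would be applied with the version control tool's labeling or tagging feature rather than computed by hand.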
They say communication is incomplete without feedback. Imagine a Continuous Integration system that has an automated build and deployment solution, a state-of-the-art automated testing platform, a good branching strategy, and everything else, but no notification system that automatically emails or messages the status of a build. What if a nightly build fails and the developers are unaware of it?
What if you check in code and leave early, without waiting for the automated build and deployment to complete? And the next day you find that the build failed due to a simple issue, which occurred just 10 minutes after you left the office?
Therefore, instant notifications are important. All Continuous Integration tools provide them, including Jenkins. It is good to have notifications of build failures, deployment failures, and testing results. We will see in the upcoming chapters how this can be achieved using Jenkins and the various options Jenkins provides to make life easy.
Implementing Continuous Integration involves using various DevOps tools. Ideally, a DevOps engineer is responsible for implementing Continuous Integration. But, who is a DevOps engineer? And what is DevOps?
Build and release management
Version control system administration
Software configuration management
All sorts of automation
Implementing continuous integration
Implementing continuous testing
Implementing continuous delivery
Implementing continuous deployment
Cloud management and virtualization
I assume that the preceding tasks need no explanation. A DevOps engineer accomplishes the previously mentioned tasks using a set of tools; these tools are loosely called DevOps tools (Continuous Integration tools, agile tools, team collaboration tools, defect tracking tools, continuous delivery tools, cloud management tools, and so on).
A DevOps engineer has the capability to install and configure the DevOps tools to facilitate development operations. Hence, the name DevOps. Let's see some of the important DevOps activities pertaining to Continuous Integration.
This is the most basic and the most important requirement to implement Continuous Integration. A version control system, sometimes also called a revision control system, is a tool used to manage your code history. It can be centralized or distributed. Two of the most famous centralized version control systems are SVN and IBM Rational ClearCase. In the distributed segment, we have tools such as Git. Ideally, everything that is required to build the software must be version controlled. A version control tool offers many features, such as labeling, branching, and so on.
When using a version control system, keep the branching to a minimum. A few companies have only one main branch, with all development activity happening on it. Nevertheless, most companies follow some branching strategy. This is because there is always a possibility that part of a team may work on one release while others work on another. At other times, there is a need to support older release versions. Such scenarios always lead companies to use multiple branches.
For example, imagine a project that has an Integration branch, a release branch, a hotfix branch, and a production branch. The development team will work on the release branch, checking code out of and into it. There can be more than one release branch where development runs in parallel; let's say these are sprint 1 and sprint 2.
Once sprint 2 is near completion (assuming that all the local builds on the sprint 2 branch were successful), it is merged to the Integration branch. Automated builds run when there is something checked-in on the Integration branch, and the code is then packaged and deployed in the testing environments. If the testing passes with flying colors and the business is ready to move the release to production, then automated systems take the code and merge it with the production branch.
From here, the code is then deployed in production. The reason for maintaining a separate branch for production comes from the desire to maintain clean code with fewer versions. The production branch is always in sync with the hotfix branch. Any instant fix required on the production code is developed on the hotfix branch. The hotfix changes are then merged to the production branch as well as the Integration branch. The moment sprint 1 is ready, it is first rebased with the Integration branch and then merged into it, and it follows the same steps thereafter.
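The flow above can be sketched with Git. The branch names are illustrative, and the commands show only the sprint 2 merge described; a real workflow would add automated builds and checks around each step:

```shell
# Illustrative sketch of the branching model above, using Git.
# Branch names are examples, not a prescribed convention.
set -e
repo=$(mktemp -d) && cd "$repo" && git init -q
git -c user.name=ci -c user.email=ci@example.com \
    commit -q --allow-empty -m "initial commit"

git branch integration      # automated builds run on check-ins here
git branch production       # deployed code, kept in sync with hotfix
git branch hotfix           # instant fixes on production code

git checkout -q -b sprint-2 integration     # a release branch for parallel work
git -c user.name=ci -c user.email=ci@example.com \
    commit -q --allow-empty -m "sprint-2 work"

# Sprint 2 is near completion: merge it to the Integration branch.
git checkout -q integration
git merge -q --no-edit sprint-2
```

From this point, a Continuous Integration tool watching the Integration branch would build, package, and deploy the merged code.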
To modify the file, I have to check out the file. This is more like reserving the file for edit. Why reserve? In a development environment, a single file may be used by many developers. Hence, in order to facilitate organized use, we have the option to reserve a file using the check-out operation. Let's assume that I check out the file and modify it by adding another line.
We have already seen that a version control system is a tool used to record changes made to a file or set of files over time. The advantage is that you can recall specific versions of your file or a set of files. Almost every type of file can be version controlled. It's always good to use a Version Control System (VCS) and almost everyone uses it nowadays. You can revert an entire project back to a previous state, compare changes over time, see who last modified something that might be causing a problem, who introduced an issue and when, and more. Using a VCS also generally means that if you screw things up or lose files, you can easily recover.
Local version control systems
Centralized version control systems
Distributed version control systems
Initially, when VCSs came into existence some 40 years ago, they were mostly personal, like the one that comes with Microsoft Office Word, wherein you can version control the file you are working on. The reason was that in those times software development activity was minuscule in magnitude and was mostly done by individuals. But, with the arrival of large software development teams working in collaboration, the need for a centralized VCS was felt. Hence came VCS tools such as ClearCase and Perforce. Some of the advantages of a centralized VCS are as follows:
All the code resides on a centralized server. Hence, it's easy to administrate and provides a greater degree of control.
These newer VCSs also brought with them some new features, such as labeling, branching, and baselining to name a few, which help people collaborate better.
In a centralized VCS, the developers must always be connected to the network. As a result, the VCS at any given point of time always represents the up-to-date code.
The following diagram illustrates a centralized VCS:
Another type of VCS is the distributed VCS. Here, there is a central repository containing all the software solution code. Instead of creating a branch, the developers completely clone the central repository on their local machine and then create a branch out of the local clone repository. Once they are done with their work, the developer first merges their branch with the Integration branch, and then syncs the local clone repository with the central repository.
You can argue that this is a combination of a local VCS plus a central VCS. An example of a distributed VCS is Git.
As part of the software development life cycle, the source code is continuously built into binary artifacts using Continuous Integration. Therefore, there should be a place to store these built packages for later use. The solution is to use a repository tool. But, what is a repository tool?
A repository tool is a version control system for binary files. Do not confuse this with the version control system discussed in the previous sections. The former is responsible for versioning the source code and the latter for binary files, such as .msi files, and so on.
As soon as a build is created and passes all the checks, it should be uploaded to the repository tool. From there, the developers and testers can manually pick the packages, deploy them, and test them; or, if automated deployment is in place, the build is automatically deployed in the respective test environment. So, what's the advantage of using a build repository?
Every time a build gets generated, it is stored in a repository tool. There are many advantages of storing the build artifacts. One of the most important advantages is that the build artifacts are located in a centralized location from where they can be accessed when needed.
It can store the third-party binary plugins and modules that are required by the build tools. Hence, the build tool need not download the plugins every time a build runs. The repository tool stays connected to the online source and keeps the plugin repository updated.
It records what was built, when, and by whom.
It creates a staging area to manage releases better. This also helps in speeding up the Continuous Integration process.
In a Continuous Integration environment, each build generates a package and the frequency at which the build and packaging happen is high. As a result, there is a huge pile of packages. Using a repository tool makes it possible to store all the packages in one place. In this way, developers get the liberty to choose what to promote and what not to promote in higher environments.
What is a Continuous Integration tool? It is nothing more than an orchestrator. A Continuous Integration tool sits at the center of the Continuous Integration system and is connected to the version control system, build tool, repository tool, testing and production environments, quality analysis tool, test automation tool, and so on. All it does is orchestrate these tools, as shown in the next image.
There are many Continuous Integration tools: Jenkins, Build Forge, Bamboo, and TeamCity to name a few.
Basically, Continuous Integration tools consist of various pipelines, each with its own purpose. There are pipelines that take care of Continuous Integration; some take care of testing, some take care of deployments, and so on. Technically, a pipeline is a flow of jobs, and each job is a set of tasks that run sequentially. Scripting is an integral part of a Continuous Integration tool and performs various kinds of tasks. A task may be as simple as copying a folder or file from one location to another, or it can be a complex Perl script that monitors a machine for file modifications.
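This "pipeline as a flow of jobs" idea can be sketched in plain shell. The stage names and tasks below are placeholders; real tools such as Jenkins model the same structure declaratively:

```shell
# Minimal sketch of a pipeline: a flow of jobs, where each job is a set of
# tasks that run sequentially. Stage names and task commands are placeholders.
set -e

run_stage() {
    # Run every task of a stage in order; abort the pipeline on first failure.
    stage="$1"; shift
    echo "=== stage: $stage ==="
    for task in "$@"; do
        $task || { echo "stage '$stage' failed at task '$task'" >&2; return 1; }
    done
}

run_stage build  "true"          # e.g. compile the code
run_stage test   "true" "true"   # e.g. unit tests, then integration tests
run_stage deploy "true"          # e.g. copy artifacts to a test server
echo "pipeline finished"
```

The key property is the same as in a real pipeline: a failing task stops the flow, so later stages never run on a broken build.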
The next important thing is the self-triggered automated build. Build automation is simply a series of automated steps that compile the code and generate the executables. Build automation can take the help of build tools, such as Ant and Maven. Self-triggered automated builds are among the most important parts of a Continuous Integration system. There are two main factors that call for an automated build mechanism:
Catching integration or code issues as early as possible
There are projects where 100 to 200 builds happen per day. In such cases, speed is an important factor. If the builds are automated, then a lot of time can be saved. Things become even more interesting if the triggering of the build is made self-driven, without any manual intervention. An auto-triggered build on every code change saves further time.
When builds are frequent and fast, the probability of finding errors (a build error, compilation error, and integration error) is also greater and faster.
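The self-triggering can be sketched as a polling loop: hash the source tree and build only when the hash changes. This is a toy illustration; real Continuous Integration tools such as Jenkins achieve the same thing through SCM polling or commit hooks:

```shell
# Toy sketch of self-triggered builds: compute a checksum of the source
# tree and trigger a build only when it differs from the last seen one.
src_checksum() {
    # Checksum over the contents of all files under the given directory.
    find "$1" -type f -exec cat {} + | cksum | cut -d' ' -f1
}

build_if_changed() {
    dir="$1"; state="$2"
    new=$(src_checksum "$dir")
    old=$(cat "$state" 2>/dev/null || echo none)
    if [ "$new" != "$old" ]; then
        echo "$new" > "$state"
        echo "change detected: triggering build"   # placeholder build step
    else
        echo "no change: skipping build"
    fi
}
```

Run periodically (for example, from cron), this gives the self-driven behavior described above without anyone pressing a build button.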
There is a possibility that a build may have many components. Let's take, for example, a build that has a .rar file as an output. Along with this, it has some Unix configuration files, release notes, some executables, and also some database changes. All these different components need to be together. The task of creating a single archive or a single media out of many components is called packaging.
This again can be automated using the Continuous Integration tools and can save a lot of time.
IT projects can be on various platforms, such as Java, .NET, Ruby on Rails, C, and C++ to name a few. Also, in a few places, you may see a collection of technologies. No matter what, every programming language, excluding the scripting languages, has compilers that compile the code. Ant and Maven are the most common build tools used for projects based on Java. For the .NET lovers, there is MSBuild and TFS build. Coming to the Unix and Linux world, you have omake, and also clearmake in case you are using IBM Rational ClearCase as the version control tool. Let's see the important ones.
Maven is a build tool used mostly to compile Java code. It uses Java libraries and Maven plugins in order to compile the code. The project to be built is described using an XML file that contains information about the project, its dependencies, and so on.
Maven can be easily integrated into Continuous Integration tools, such as Jenkins, using plugins.
MSBuild is a tool used to build Visual Studio projects. MSBuild is bundled with Visual Studio and is a functional replacement for nmake. MSBuild works on project files, which have an XML syntax similar to that of Apache Ant. Its fundamental structure and operation are similar to those of the Unix make utility. The user defines what the inputs will be (the various source code files) and the output (usually, a .msi file), but the utility itself decides what to do and the order in which to do it.
Consider an example where the automated packaging has produced a package that contains .war files, database scripts, and some Unix configuration files. Now, the task here is to deploy all three artifacts into their respective environments. The .war files must be deployed in the application server, the Unix configuration files should sit on the respective Unix machine, and lastly, the database scripts should be executed on the database server. The deployment of such packages containing multiple components is usually done manually in almost every organization that does not have automation in place. Manual deployment is slow and prone to human error. This is where an automated deployment mechanism is helpful.
Automated deployment goes hand in hand with the automated build process. The previous scenario can be achieved using an automated build and deployment solution that builds each component in parallel, packages them, and then deploys them in parallel. Using tools such as Jenkins, this is possible. However, there are some challenges, which are as follows:
There is a considerable amount of scripting required to orchestrate the build, packaging, and deployment of a release containing multiple components. These scripts are themselves a large amount of code to maintain, requiring time and resources.
In most of the cases, deployment is not as simple as placing files in a directory. For example, there are situations where the deployment activity is preceded by steps to configure the environment.
Testing is an important part of a software development life cycle. In order to maintain quality software, it is necessary that the software solution goes through various test scenarios. Giving less importance to testing can result in customer dissatisfaction and a delayed product.
Since testing is a manual, time-consuming, and repetitive task, automating the testing process can significantly increase the speed of software delivery. However, automating the testing process is a bit more difficult than automating the build, release, and deployment processes. It usually takes a lot of effort to automate nearly all the test cases used in a project. It is an activity that matures over time.
Hence, when we begin to automate the testing, we need to take a few factors into consideration. Test cases that are of great value and easy to automate must be considered first. For example, automate the testing where the steps are the same, but they run every time with different data. You can also automate the testing where a software functionality is being tested on various platforms. In addition, automate the testing that involves a software application running on different configurations.
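The "same steps, different data" case can be sketched as a single test procedure driven by a table of inputs and expected outputs. The `add` function here is just a stand-in for the real functionality under test:

```shell
# Sketch of data-driven test automation: identical steps, varying data.
# "add" stands in for the real functionality being tested.
add() { echo "$(( $1 + $2 ))"; }

run_data_driven_tests() {
    failures=0
    while read -r a b expected; do      # each row: input1 input2 expected
        got=$(add "$a" "$b")
        if [ "$got" != "$expected" ]; then
            echo "FAIL: add $a $b -> $got (expected $expected)"
            failures=$((failures + 1))
        fi
    done
    return "$failures"
}

# The same test steps run once per data row:
printf '%s\n' "1 2 3" "10 5 15" "0 0 0" | run_data_driven_tests
```

Adding a new test case is then a matter of adding a data row, not writing new test code, which is exactly what makes this class of testing cheap to automate.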
Previously, the world was mostly dominated by desktop applications, and automating the testing of a GUI-based system was quite difficult. This called for scripting languages in which the manual mouse and keyboard entries were scripted and executed to test the GUI application. Nevertheless, today the software world is dominated by web and mobile applications, which are easy to test through an automated approach using a test automation tool.
Once the code is built, packaged, and deployed, testing should run automatically to validate the software. Traditionally, the process followed is to have an environment each for SIT, UAT, PT, and pre-production. First, the release goes through SIT, which stands for system integration testing. Here, testing is performed on the integrated code to check its functionality as a whole. If it passes, the code is deployed in the next environment, that is, UAT, where it goes through user acceptance testing, and then, similarly, it can lastly be deployed in PT, where it goes through performance testing. In this way, the testing is prioritized.
It is not always possible to automate all of the testing, but the idea is to automate whatever testing is possible. The method discussed previously requires many environments and also a number of automated deployments into the various environments. To avoid this, we can follow another method where there is only one environment in which the build is deployed; the basic tests are run first, and after that, the long-running tests are triggered manually.
Static code analysis, also commonly called white-box testing, is a form of software testing that looks at the structural qualities of the code. For example, it reveals how robust or maintainable the code is. Static code analysis is performed without actually executing programs. It is different from functional testing, which looks into the functional aspects of software and is dynamic.
Static code analysis is the evaluation of software's inner structures. For example, is there a piece of code used repetitively? Does the code contain lots of commented lines? How complex is the code? Using the metrics defined by a user, an analysis report can be generated that shows the code quality in terms of maintainability. It doesn't question the code functionality.
Some static code analysis tools, such as SonarQube, come with a dashboard that shows various metrics and statistics of each run. Usually, as part of Continuous Integration, static code analysis is triggered every time a build runs. As discussed in the previous sections, static code analysis can also be included before a developer tries to check in their code; hence, code of low quality can be stopped right at the initial stage.
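As a toy illustration of the "lots of commented lines" metric, the check below computes the share of comment lines in a shell-style file without executing it. Real analyzers such as SonarQube compute far richer metrics than this:

```shell
# Toy static analysis: percentage of comment lines in a shell-style file,
# computed without executing the code under inspection.
comment_percentage() {
    total=$(grep -c '' "$1")                    # count every line
    comments=$(grep -c '^[[:space:]]*#' "$1")   # lines that start with '#'
    [ "$total" -gt 0 ] || { echo 0; return; }
    echo $(( comments * 100 / total ))
}
```

A Continuous Integration job could run such a check on every build, or as a gate before check-in, and fail the build when a metric crosses a threshold.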
One of the most important parts of Continuous Integration, or shall we say its backbone, is the scripting languages. Using these, we can reach where no tool reaches. In my own experience, there are many projects where build tools such as Maven, Ant, and the others don't work. For example, the SAS Enterprise application has a GUI interface to create packages and perform code promotions from environment to environment. It also offers a few APIs to do the same through the command line. If one has to automate the packaging and code promotion process in a project that is based on SAS, then one ought to use a scripting language.
It comes free and preinstalled with any Linux and Unix OS
It's also freely available for Windows
It is simple and fast to script using Perl
It works on Windows, Linux, and Unix platforms alike
Though it was meant to be just a scripting language for processing files, it has nevertheless seen a wide range of usage in the areas of system administration; build, release, and deployment automation; and much more. One of the other reasons for its popularity is its impressive collection of third-party modules.
I would like to expand on the advantages of the multiple platform capabilities of Perl. There are situations where you will have Jenkins servers on a Windows machine, and the destination machines (where the code needs to be deployed) will be Linux machines. This is where Perl helps; a single script written on the Jenkins Master will run on both the Jenkins Master and the Jenkins Slaves.
However, there are various other popular scripting languages that you can use, such as Ruby, Python, and Shell to name a few.
Ideally, testing such as SIT, UAT, and PT, to name a few, is performed in an environment that is different from production. Hence, there is every possibility that code that has passed these quality checks may still fail in production. Therefore, it's advisable to perform end-to-end testing of the code in a production-like environment, commonly referred to as a pre-production environment. In this way, we can be best assured that the code won't fail in production.
However, there is a challenge to this. For example, consider an application that runs in various web browsers, on both mobiles and PCs. To test such an application effectively, we would need to simulate the entire production environment used by the end users. This calls for multiple build configurations and complex deployments, which are manual. A Continuous Integration system needs to take care of this: at the click of a button, various environments should be created, each reflecting an environment used by the customers, followed by deployment and testing thereafter.
By introducing automated notifications after each build. The moment a build is completed, the Continuous Integration tools automatically respond to the development team with the report card.
As seen in the Scrum methodology, the software is developed in pieces called backlogs. Whenever a developer checks in the code, they need to apply a label on the checked-in code. This label can be the backlog number. Hence, when the build or a deployment fails, it can be traced back to the code that caused it using the backlog number.
Labeling each build also helps in tracking back the failure.
Defect tracking tools are a means to track and manage bugs, issues, tasks, and so on. Earlier, projects mostly used Excel sheets to track their defects. However, as the magnitude of projects increased in terms of the number of test cycles and the number of developers, it became absolutely important to use a defect tracking tool. Two of the most popular defect tracking tools are Atlassian JIRA and Bugzilla.
The quality analysis market has seen the emergence of various bug tracking systems or defect management tools over the years.
It allows you to raise or create defects and tasks that have various fields to define the defect or the task.
It allows you to assign the defect to the concerned team or an individual responsible for the change.
It allows a defect to progress through the stages of a life cycle workflow.
It provides you with the feature to comment on a defect or a task, watch the progress of the defect, and so on.
It provides metrics. For example, how many tickets were raised in a month? How much time was spent on resolving the issues? All these metrics are of significant importance to the business.
It allows you to attach a defect to a particular release or build for better traceability.
The way software is developed always affects the business. The code quality, the design, and the time spent in developing and planning features all affect the promises that a company has made to its clients.
Continuous Integration helps developers help the business. While going through the previous topics, you might already have figured out the benefits of implementing Continuous Integration. Nevertheless, let's see some of the benefits that Continuous Integration has to offer.
When every small change in your code is built and integrated, the possibility of catching integration errors at an early stage increases. Rather than integrating once in 6 months, as seen in the waterfall model, and then spending weeks resolving merge issues, it is better to integrate frequently and avoid the merge hell. A Continuous Integration tool such as Jenkins automatically builds and integrates your code upon check-in.
Continuous Delivery enables you to release deployable features at any point in time. From a business perspective, this is a huge advantage. The features are developed, deployed, and tested within a timeframe of 2 to 4 weeks and are ready to go live with a click of a button.
How frequent are the releases? What is the success rate of builds? What is most often causing a build failure? Real-time data is always a must in making critical decisions. Projects are always in need of recent data to support decisions. Usually, managers collect this information manually, which requires time and effort. Continuous Integration tools, such as Jenkins, provide the ability to see trends and make decisions. A Continuous Integration system provides the following features:
Real-time information on the recent build status and code quality metrics.
Since integrations occur frequently with a Continuous Integration system, the ability to notice trends in builds and overall quality becomes possible.
Continuous Integration tools, such as Jenkins, provide the team members with metrics about the build health. As all the build, packaging, and deployment work is automated and tracked using a Continuous Integration tool, it is possible to generate statistics about the health of all the respective tasks. These metrics can be the build failure rate, the build success rate, the number of builds, who triggered the build, and so on.
Also, Continuous Integration incorporates static code analysis, which on every build gives a static report of the code quality. Some of the metrics of great interest are code style, complexity, length, and dependencies.
This is the most important advantage of having a carefully implemented Continuous Integration system. Any integration issue or merge issue gets caught early. The Continuous Integration system has the facility to send notifications as soon as the build fails.
In the past, development teams performed the build, release, and deployments. Then, came the trend of having a separate team to handle build, release, and deployment work. Yet again that was not enough, as this model suffered from communication issues between the development team and the release team.
However, using Continuous Integration, all the build, release, and deployment work gets automated. Therefore, the development team need not worry about anything other than developing features. In most cases, even the testing is completely automated.
From a technical perspective, Continuous Integration helps teams work more efficiently. This is because Continuous Integration works on agile principles. Projects that use Continuous Integration follow an automatic and continuous approach while building, testing, and integrating their code. This results in faster development.
Since everything is automated, developers spend more time developing their code and zero time on building, packaging, integrating, and deploying it. This also helps teams, which are geographically distributed, to work together. With a good software configuration management process in place, people can work on large teams. Test Driven Development (TDD) can further enhance the agile development by increasing its efficiency.
"Behind every successful agile project, there is a Continuous Integration server."
Looking at the evolutionary history of the software engineering process, we now know how Continuous Integration came into existence. Truly, Continuous Integration is a process that helps software projects go agile.
The various concepts, terminologies, and best practices discussed in this chapter form a foundation for the upcoming chapters. Without these, the upcoming chapters are mere technical know-how.
In this chapter, we also learned how various DevOps tools go hand-in-hand to achieve Continuous Integration, and of course, help projects go agile. We can fairly conclude that Continuous Integration is an engineering practice where each chunk of code is immediately built and unit-tested, then integrated and again built and tested on the Integration branch.
We also learned how feedback forms an important part of a Continuous Integration system.
Continuous Integration depends heavily on the automation of various software development processes. This also means that using a Continuous Integration tool alone doesn't help in achieving Continuous Integration. And while Continuous Integration does not guarantee zero bugs, it does guarantee their early detection.