Docker is a technology that allows entire applications and their environments to be encapsulated within individual containers. When multiple versions of these containers are run on a single machine, they are sandboxed from one another as if running on their own dedicated machines.
Docker is open source, which fits well with running Linux in containers, as well as numerous available open source components that help build complex systems. It is the logical progression of technologies used for hosting and backend development over the past decade or longer. This progression has moved from a physical kind of hosting to a logical one and has been driven by several requirements. These requirements include reliability, reachability, scalability, and security.
This book is divided into three sections. The first is an introduction to Docker, focusing on local development. The second describes the methodology for testing, deploying, and scaling applications. The third goes into detail about security when using a container-based design.
In this chapter, we will review the history of hosting and backend solutions with a focus on how Docker came to be a widely used technology.
The following topics will be covered in this chapter:
- Origins of hosting services
- Types of hosting services – co-location
- Types of hosting services – self-hosting
- The benefits of data centers
- How virtualization works
- The power requirements at data centers
- How virtualization is a solution for data centers and the invention of the cloud
- How containers are a bigger win for data centers and hosting
The drivers for Docker
The range of hosting services was originally limited to self-hosted servers, co-located server hosting, and shared hosting. In 1994 and 1995, Best Internet Communications rose from nothing to hosting 18,000+ websites on a pair of Pentium servers, which were the most powerful servers of the time. Best also offered dedicated server-hosting through co-location, dedicated broadband connectivity, and upscale premium services.
Most of the websites hosted by Best were of the shared-hosting variety. All of these sites shared the same server, the same hard drives, the same filesystem, the same RAM, the same CPUs, the same network connections, and so on.
It was not uncommon for any one of these websites to be slashdotted, meaning that a very popular site published a link to the hosted site. This would cause a large spike in traffic to that one site out of the approximately 18,000, and a performance hit to the others. As the sites grew in quality and demanded more resources, their administrators would move to dedicated co-located hosting or self-hosting.
With co-location, the customer can install and manage the machines of their choice in a cage at the hosting company's facility. Some co-location facilities offer, for additional fees, a remote hands service, where the customer can call the hosting company and one of its engineers performs whatever work the customer requires on the hosted servers. The cages are locked so that customers can't gain access to one another's equipment.
The benefits of a professional data center are numerous. Ultimately, the trend became that just a few companies, relative to all the companies with an internet presence, provided data centers, and the remaining companies paid rent for dedicated, shared, or premium hosting. A professional data center provides rich internet connectivity (more than one provider, faster connections), clean power, battery backup power for 24/7/365 uptime, backup generators for longer brownouts or blackouts, fire-suppression systems, a controlled climate suitable for keeping equipment at the proper operating temperatures, multiple physical locations, a professionally managed Network Operations Center (NOC) and technical support, and security in the form of guards, cameras, and fingerprint, handprint, and/or retina scanners.
The companies that ended up building and running the majority of data centers are Google (Google Cloud Platform), Microsoft (Azure), Amazon (Amazon Web Services (AWS)), Yahoo! (once upon a time), and lesser players, which include boutique hosting companies, regional hosting companies, and companies that require security beyond what a hosting company can provide (for example, banks and financial institutions, governments, and so on).
Amazon has a unique need for data centers. They are one of the largest online retailers in the world, as well as the largest data center developer/owner. The number of servers, the uptime, the security, and the reach that they require drove them to build data centers throughout the country and then the world.
Google has a unique need for data centers as well. They are the largest search engine and advertising company in the world. In order to be reachable, Google needs servers in as many physical places as possible. In order to be fast, Google needs many servers—at least enough servers for distributed search index processing in each of its geo-locations.
Companies such as RackSpace and Level 3 were originally built as data center providers. Their specialties included co-location facilities, dedicated server hosting, remote hands, NOCs, nationwide-dedicated fiber-optic backbones, clean and blackout resistant power, and very rich connectivity to various other networks, including AT&T, Verizon, and Comcast. They found themselves with the infrastructure to follow the trend toward virtualization and began to offer these cloud services.
The highest cost of providing data center services, which was passed on to the customer, was initially bandwidth. The providers paid for bandwidth by the megabit, plus a monthly cost of maintaining the physical connections that carried this bandwidth. As the providers built their own private infrastructure to carry data between their own data centers around the world, the cost became a flat rate, or a fixed cost, for a significant amount of the total bandwidth used. This allowed the price of bandwidth to decline to the point where it became a minimal consideration for hosting.
These companies ended up building a comprehensive infrastructure for dedicated hosting. It turns out that this infrastructure is ideally suited for virtualized product offerings, too.
Using virtualization to economize resource usage
Virtualization is the process of exposing a portion of a physical machine as a logical or virtual machine that acts enough like a real machine that it supports the installation of whole operating systems, their filesystems, and the software that runs on the operating system. For example, a machine with 64 GB of RAM and 4 CPUs could run virtualization software that masquerades as four 16 GB RAM machines with 1 CPU each. This machine could run four instances of Linux.
Virtualization is not a new concept, having been implemented by IBM in the mid-1960s. It likely gained in overall popularity during the 1980s, when it was used to run MS-DOS, and then Windows, on computer systems such as the original Apple Macintosh (Mac) and on Unix computers such as the Sun and Silicon Graphics workstations.
Initial virtualization software used what features were available on CPUs of the time, but often simply emulated the instruction set of the x86 on the 68000 family or custom CPUs of the professional Unix workstations. SoftPC was one of the most popular offerings in the 1980s.
SoftPC was quite slow, but the ability to run Windows or MS-DOS applications on a Mac computer allowed the use of these machines in business and education environments. Instead of adding Microsoft Office compatibility to all the programs on the Mac to support exchanging files between Windows/MS-DOS users and Mac users, users could run Microsoft Office.
People saw it in action and saw the value in it. Windows was the dominant operating system for home and business, and to fit in with Windows in the corporate environment, something like SoftPC was needed. The problem with SoftPC is that it was pure software emulation, which was quite slow in actual use. Virtualization is superior to emulation in terms of performance!
Entire companies were founded around providing consumer or business virtualization solutions. VMware, founded in 1998, was one of the first of these companies.
Innotek developed VirtualBox and released it as open source in 2007, and was then acquired by Sun Microsystems in 2008. Then, Sun was acquired by Oracle in 2010. Parallels, a virtualization solution for Mac, was developed in 2004 and became mainstream in 2006.
The value of virtualization encouraged chip makers to gradually add CPU support for virtualization. With CPU support, an x86-based system could run virtualized machines or software at close enough to native speed to be much more tolerable. This, in turn, led the workstation companies (such as Apple, Sun, and Silicon Graphics) to move to x86 CPUs.
A key component of virtualization software is the hypervisor. The hypervisor presents the virtual machine to the chosen operating system and then manages the resources and execution of the virtual machines over time. The virtual machines themselves are configurable, at least regarding the amount of RAM, the number of logical CPU cores, the graphics card memory, the host operating system files that act as virtual disk drives in the virtual machine, the mounting and unmounting of CD-ROMs in the virtual CD-ROM drive, and so on. The hypervisor ensures that these resources are truly available and that no virtual machine starves the other virtual machines of the host machine's resources.
For the enterprise, the requirements were somewhat different. Instead of providing virtual machines via a general-purpose host operating system such as Linux, the entire operating system itself could be optimized just for being the hypervisor. VMware offered its Elastic Sky X Integrated (ESXi) operating system in 2004. The University of Cambridge computer laboratory developed the Xen hypervisor in the late 1990s, and the first stable version was released in 2003. Xen was originally the hypervisor used by Amazon for its Elastic Compute Cloud offering, before Amazon moved to KVM.
KVM (Kernel-based Virtual Machine) is a virtualization solution supported directly by the Linux kernel. The kernel can act as the hypervisor under KVM. When paired with QEMU, the same tooling can additionally emulate processors other than the host's native CPU, which is typically x86. This allows it to be used to emulate targets such as the Raspberry Pi.
Scaling a dedicated hosted website can be problematic. It's possible to simply upgrade to a larger and more powerful server to handle growing traffic and services. At some point, however, there is no server large and powerful enough! Scaling up from that point requires distributing services across multiple servers.
Addressing the increasing power requirements
The trend toward virtualization created a demand for a new breed of servers to be housed at the data centers. Where a customer might have rented or installed their own dedicated server with 16 GB of RAM, the virtual server provider could rent a portion of a 128 GB RAM server and share that server with multiple customers. These bigger servers required more CPU cores, so the virtual servers could have reasonable computing capabilities.
Fitting these specialized servers into the same space as the smaller and less capable dedicated servers created a new challenge: power. Instead of using 400 watts of power for the dedicated server, the cloud servers might use 1,600 watts, four times as much. In addition to the power requirements of the machines themselves, it took more power to run the air conditioners to cool the machines.
The cost of power changed the equation for dedicated hosting: bandwidth became virtually free, while the power consumed by the servers was charged at a very high price.
To help mitigate the cost of power, data centers have been built to provide some of their own power. Solar panels, building near a river that can drive turbines, wind turbines, and building in places with cool or cold climates are among some of the techniques used. Data centers do use batteries for back-up power, and diesel-powered generators as well.
Energy efficiency is another way to mitigate power costs. The use of lower-powered CPUs and other computer parts is one means to this end. The CPU manufacturers have had a heavy focus on producing lower-powered CPUs for both data center and laptop use.
The hosting companies would supply a 60 amp power supply for each co-location cage. If you needed more than 60 amps, you could pay extra to have additional 60 amp lines for your cage. You'd pay for the construction and then the monthly power usage.
Hosting at one of these facilities was problematic for most customers. It required purchasing physical machines and other hardware, designing the infrastructure required for the services to be provided, physical access to the cage and hardware from time to time, and potential failures, which meant downtime.
The growth and popularity of services require scalability, which means more and bigger machines. You could repurpose old machines, but they take up space and power. Customer costs soared when the current cage filled up and more presence was required.
The next step, and the solution to these hassles, is virtualization and running your servers and services within the cloud.
Virtualization and cloud computing
Most customers don't need dedicated servers. What they really need is the security of a filesystem that only their software can read and write to, a guarantee that CPU time is dedicated to their purpose, and throughput and computing power that are identifiable and delivered as expected.
The appeal of virtual servers offered by companies such as AWS drove many administrators to move away from dedicated hosting and self-hosting. AWS keeps growing its offerings to add more value to virtual hosting, so its customers get the benefit of Amazon's developers' efforts.
It's relatively cheap to duplicate the customer-designed infrastructure to create a testing environment that is separate from the live/deployed applications. It's easy to scale services that grow with popularity, or when the services are slashdotted. This is a term that describes what happens when a very popular site adds a link to another site, driving a lot more traffic to that site—perhaps more traffic than the site was designed to handle.
The design and deployment of a virtualized infrastructure can be done from the comfort of your office. There is no need to physically visit a data center. If you need to scale horizontally, you only need to spin up additional virtual machine instances. If you need to scale vertically, you only need to spin up a more powerful virtual machine and substitute it for the one that is too slow or too small.
If hardware fails at a cloud-hosting facility, the hosting company's employees install new hardware. This is done transparently to you, the customer. A feature known as live migration (called teleporting in VirtualBox) allows the hosting company to move a running virtual machine to a different physical machine without interrupting service.
Along with virtual servers, hosting companies can also offer virtual disks, elastic IPs, load balancers, DNS, backup solutions, and so on. Virtual disks are handy because you can back them up by simply copying the file that is the image. You can also boot new instances from an existing virtual disk, saving the time required to install a whole operating system on a virtual machine.
The ability to use elastic IPs and virtual load balancers allows a scalability that is as easy as the click of a mouse.
You can assign an elastic IP to any virtual instance or load balancer. If the instance is stopped, you can reassign that IP to another instance. If this were handled only with DNS, there could be days' worth of delays for the DNS to propagate through the many DNS servers at the ISPs. The load balancer allows you to create virtual server farms and balance incoming requests between the virtual servers in the farm. You can trivially spin up and add additional virtual servers to the load balancer as you need to scale. The hosting companies can even provide software triggers that will automatically spin up and add new servers when traffic increases, and then spin them down and remove them when traffic is reduced.
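The load balancing and automatic scaling described here can be illustrated with a small Python sketch. It is a toy model, not any provider's API: the ServerFarm class, the web-N server names, the round-robin policy, and the capacity figure of 100 requests per second per server are all assumptions made for illustration.

```python
import math
from itertools import count


class ServerFarm:
    """A toy round-robin load balancer over a resizable pool of servers."""

    def __init__(self, initial_servers: int = 1) -> None:
        self._ids = count(1)  # source of unique server numbers
        self.servers = [f"web-{next(self._ids)}" for _ in range(initial_servers)]
        self._next = 0  # index of the next request to be routed

    def route(self) -> str:
        """Return the server that should handle the next request."""
        server = self.servers[self._next % len(self.servers)]
        self._next += 1
        return server

    def autoscale(self, requests_per_second: float, capacity_per_server: float) -> None:
        """Grow or shrink the pool so that its capacity just covers the load."""
        needed = max(1, math.ceil(requests_per_second / capacity_per_server))
        while len(self.servers) < needed:
            self.servers.append(f"web-{next(self._ids)}")  # spin up
        while len(self.servers) > needed:
            self.servers.pop()  # spin down


farm = ServerFarm(initial_servers=2)
first_three = [farm.route() for _ in range(3)]

# A traffic spike arrives, then traffic falls away again.
farm.autoscale(requests_per_second=450, capacity_per_server=100)
peak_size = len(farm.servers)
farm.autoscale(requests_per_second=80, capacity_per_server=100)
quiet_size = len(farm.servers)

print(first_three, peak_size, quiet_size)
```

Real load balancers typically also track server health and may weight servers differently; the sketch only shows the two ideas from the text, spreading requests across a farm and resizing the farm as traffic changes.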
A popular technology stack at the time that AWS was made available to the public was LAMP, which is short for Linux, Apache, MySQL, and PHP. A typical setup would be to install these four software packages on a dedicated Linux server. AWS offered the Relational Database Service (RDS), a dedicated virtual server equivalent to MySQL, which allowed the database to be offloaded and the LAMP application to be scaled. AWS offered virtual load balancers, which act like logical Ethernet switches that balance traffic among two or more web servers. They offered domain name hosting and elastic IPs, so a site could stay up almost continuously. AWS continues to develop new software and services to benefit its customers.
AWS and its competitors allow a cost-effective and dynamic way to grow an internet presence as it gains popularity. The price structure is common among most providers. The cost is based on the number of elastic load balancers, the number of virtual server instances, the amount of RAM, the number of virtual CPUs, the size of persistent storage, and the bandwidth. There are also optional additional services that can increase the price.
Virtual servers provide the benefits of a physical one, but this comes at the cost of dedicating physical RAM on the host machine, plus the power required to run the machine. A host machine might have 64 GB of RAM; it can run some combination of virtual machines that, combined, use up that RAM. Examples include four 16 GB virtual machines, two 32 GB virtual machines, or two 16 GB virtual machines and one 32 GB virtual machine.
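The RAM arithmetic can be checked with a short Python sketch that lists every combination of virtual machine sizes adding up to exactly the host's RAM. The function name vm_layouts and the restriction to 16 GB and 32 GB machines are assumptions made for illustration; a real hypervisor also reserves some memory for itself, which is ignored here.

```python
from itertools import combinations_with_replacement


def vm_layouts(host_ram_gb: int, vm_sizes_gb: list[int]) -> list[tuple[int, ...]]:
    """List every mix of VM sizes whose RAM adds up to exactly the host's RAM."""
    sizes = sorted(vm_sizes_gb, reverse=True)
    max_vms = host_ram_gb // min(sizes)  # the most VMs that could possibly fit
    layouts = []
    for vm_count in range(1, max_vms + 1):
        for combo in combinations_with_replacement(sizes, vm_count):
            if sum(combo) == host_ram_gb:
                layouts.append(combo)
    return layouts


layouts = vm_layouts(64, [16, 32])
for layout in layouts:
    print(layout)
```

For a 64 GB host, the sketch finds the same three layouts mentioned in the text: two 32 GB machines, one 32 GB machine with two 16 GB machines, and four 16 GB machines.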
A risk of virtual machines is that when the host machine is rebooted or fails, all the virtual machines hosted on it will go offline.
Using containers to further optimize data center resources
Docker is a clever use of OS-level virtualization support that allows multiple Docker containers to execute on a single machine. A container is a running instance of a container image. The containers are, by default, isolated from the host machine, as well as from one another.
They can be configured to expose resources, such as networking ports, to the host network (for example, the internet) or to one another. The following diagram illustrates the basic structure of containers on a host:
Containers share their Linux kernel with the host, so you do not need to install complete operating systems within the container as you do with virtual machines. The Docker daemon manages the containers and the resources they use, as well as the images, networks, volumes, and so on.
An important distinction between virtual servers and containers is that containers share the resources, directly, of the host, whereas virtual servers require duplicate resources. For example, two identical containers use the host's RAM, rather than a block of RAM configured before booting the virtual machine. If you need to constrain the resources (the CPU, memory, swap, and so on) of a container, you can do so, but the default is to have no resource constraints on any container.
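As a rough illustration of optional resource constraints, the following Python sketch assembles a docker run command line with and without limits. It only builds and prints the command and does not start a container. The helper name docker_run_command and the httpd image are assumptions chosen for illustration; --memory, --memory-swap, --cpus, and --detach are standard docker run options.

```python
import shlex
from typing import Optional


def docker_run_command(
    image: str,
    memory: Optional[str] = None,
    memory_swap: Optional[str] = None,
    cpus: Optional[float] = None,
) -> list[str]:
    """Build a `docker run` argument list, adding limits only when requested."""
    args = ["docker", "run", "--detach"]
    if memory is not None:
        args += ["--memory", memory]  # hard cap on RAM, for example "512m"
    if memory_swap is not None:
        args += ["--memory-swap", memory_swap]  # cap on RAM plus swap
    if cpus is not None:
        args += ["--cpus", str(cpus)]  # share of CPU time, for example 1.5
    args.append(image)
    return args


# The default: no resource constraints at all.
unconstrained = docker_run_command("httpd")

# The same container limited to 512 MB of RAM and one and a half CPUs.
constrained = docker_run_command("httpd", memory="512m", cpus=1.5)

print(shlex.join(unconstrained))
print(shlex.join(constrained))
```

Returning an argument list rather than a single string keeps the command safe to pass to subprocess.run without a shell, should you want to execute it on a machine where Docker is installed.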
Unlike with virtual servers, you deal with an application image, rather than a virtual disk. You can copy the image to back it up, but there is no virtual disk file to copy. These application images are built progressively, in layers, on top of other images. When you build an image, only the layers of the application image that change need to be dealt with.
When designing services that use containers, you will not likely install many components within any one container. For a virtual machine running a LAMP application, you might install Apache, MySQL, and PHP all within one virtual machine. When designing the same LAMP application for containers, you might configure one container just for MySQL and another for Apache and PHP. You can then scale your application by running additional Apache and PHP containers and additional MySQL instances in a cluster configuration.
If we consider the use of containers for the LAMP application discussed earlier, we can implement MySQL in a dedicated container, and Apache and PHP in another; all this running on top of the host's Linux kernel. To scale the LAMP application, a second, third, fourth, and so on instance of the Apache/PHP container can be spun up, and the same is true for the MySQL container. MySQL containers can be configured for master-subordinate operations.
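The container layout for the LAMP application might be sketched as follows. This Python snippet only generates the docker commands for one MySQL container and any number of Apache/PHP containers; it does not run them. The network name lamp-net, the container names, the host ports starting at 8080, the placeholder password, and the mysql and php:apache images are all assumptions for illustration, and the MySQL cluster configuration is not shown.

```python
def lamp_commands(web_replicas: int, first_host_port: int = 8080) -> list[list[str]]:
    """Build docker commands for one database and several web containers."""
    if web_replicas < 1:
        raise ValueError("web_replicas must be at least 1")

    network = "lamp-net"  # hypothetical user-defined network shared by all containers
    commands = [
        ["docker", "network", "create", network],
        [
            "docker", "run", "--detach",
            "--name", "db",
            "--network", network,
            # Placeholder only; use a real secret in practice.
            "--env", "MYSQL_ROOT_PASSWORD=change-me",
            "mysql",
        ],
    ]
    for index in range(web_replicas):
        commands.append([
            "docker", "run", "--detach",
            "--name", f"web-{index + 1}",
            "--network", network,
            # Each web container publishes port 80 on its own host port.
            "--publish", f"{first_host_port + index}:80",
            "php:apache",
        ])
    return commands


commands = lamp_commands(web_replicas=3)
for command in commands:
    print(" ".join(command))
```

Scaling the web tier is then a matter of asking for more replicas, while the database remains a single container on the same network. In practice, a load balancer would sit in front of the published ports.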
If the host operating system is not Linux kernel-based, there are two options. The first option is to run host OS native containers (for example, Windows containers on a Windows host). The second option is to run a Linux virtual machine on the host and run the containers within that virtual machine.
Containerization is a boon for hosting companies and their customers. No longer is it required to dedicate a fixed amount of RAM per container as is required with virtual machines. A physical machine is limited only by its resources when it comes to the number of containers it can run concurrently. The pricing model for containers can save customers on monthly costs. Thus, containerization is a big win.
In the next chapter, we'll look at how to use virtual machines and Docker to develop applications locally. Later in this book, we'll look at how to deploy our locally developed software to publicly accessible internet/cloud infrastructure.
In this chapter, we saw how Docker and containerization were a natural result of the progression of hosting requirements since the start of the commercial internet. We reviewed the history of hosting and how we got to today's hosting configurations. You should now have a decent understanding of the difference between virtualization and containerization.
In the next chapter, we'll look at VirtualBox and Docker. This is a good way to explore the differences between virtual machines and Docker containers.
If you would like to look into some of the subjects discussed so far in-depth, refer to the following links:
- This link partially describes how Google's search algorithm is implemented: https://www.google.com/search/howsearchworks/
- This link describes Google's search infrastructure: https://netvantagemarketing.com/blog/how-does-google-return-results-so-damn-fast/
- This link also describes Google's search infrastructure: https://www.ctl.io/centurylink-public-cloud/servers/
- This link describes IBM's early technology to support virtualization: https://en.wikipedia.org/wiki/IBM_CP-40
- This link describes an old program that emulates a PC to run Windows on a non-Windows host: https://en.wikipedia.org/wiki/SoftPC
- This link provides an introduction to the VMware company: https://en.wikipedia.org/wiki/VMware
- This link describes Oracle's VirtualBox: https://en.wikipedia.org/wiki/VirtualBox
- This link introduces Parallels: https://en.wikipedia.org/wiki/Parallels_(company)
- This link discusses the role of the hypervisor in virtualization and containerization: https://en.wikipedia.org/wiki/Hypervisor
- This link describes VMware's standalone operating system designed specifically to run virtual machines: https://en.wikipedia.org/wiki/VMware_ESXi
- This link describes the Xen hypervisor: https://15anniversary.xenproject.org/#Intro
- This link describes Amazon's AWS virtual machines: https://en.wikipedia.org/wiki/Amazon_Elastic_Compute_Cloud
- This link describes kernel features to support virtualization and containerization: https://en.wikipedia.org/wiki/Kernel-based_Virtual_Machine
- This link describes using QEMU to emulate Raspberry Pi on a workstation: https://azeria-labs.com/emulate-raspberry-pi-with-qemu/