As the internet has grown, people have become used to having access to content all of the time, from a variety of devices. This means that the reputation of a brand has slowly become connected with the responsiveness and reliability of its products. People choose Google for searching because it always returns relevant and useful results quickly. People share content on Twitter because their message will be seen in real time by their followers. Netflix's great content selection is useless if it cannot deliver consistently on a variety of network speeds. As this reliability has become more important to businesses, a specialization focused on software reliability has emerged: Site Reliability Engineering (SRE). This chapter will introduce you to the field and also describe what you will learn from this book, helping you to write software to navigate the ever-changing internet landscape.
Before we explain what the field and role of SRE entail, let us start with a thought experiment. Imagine that it's early in the morning and you wake up to a text message from a friend containing a screenshot of a blank web page and the caption: "I can't load your website."
If your personal website is indeed down, maybe you will message back with "I'll check it after breakfast," or "Oh yeah, been meaning to look into that." If it is your company's website, or the page hosting the resume you just sent to 15 possible employers, then a stream of expletives and indecipherable emojis will probably erupt from your mouth and spill into your reply. This is because, for many businesses, websites have become the main source of incoming business. For some companies, like Facebook, Amazon, or iFixit, the website is the entire business. For other businesses, like restaurants or advertising agencies, a website acts as a way for people interested in the organization to learn more. It is often part of the marketing flow that helps companies grow.
It is probably impossible to completely remove the adrenaline spike that comes from discovering a website is down if you are responsible for fixing it. However, we can work to set up a framework to limit how often things break. We can create a world where responding to outages is easy, and transition from, "Oh god, everything is on fire, what do I do?!" to "Oh hey, a page isn't loading, so let's check out what's having a rough day."
This chapter is our introduction to the book and the field of SRE. We will cover the following topics in the next few pages:
Exploring a brief history of the people who work on information systems
Defining what SRE is
Describing what is in the book and providing a rough framework for SRE
SRE is a relatively new field, but it is built from a slightly different take on many existing ideas. In 1958, the Harvard Business Review coined the term information technology (IT), which eventually became the descriptor for the maintenance of technology used for collecting, storing, and distributing data and information. At that time, computers were transitioning toward integrated circuits, but they were still the size of a room and were maintained and programmed by a team of people. As computers shrank, that team started focusing on multiple computers. Over time, some people started to specialize in programming those computers, and others focused on keeping them running. "Dumb terminals" would connect to a single computer, which was maintained by a team while programmers and users worked from the terminals.
Eventually, these maintainers started taking care of both the machines that individuals used, as well as large arrays of machines that provided services. Users would use a word processor on their local machine, and then upload files to a remote machine. Those who maintained the remote machines became known as system engineers, system administrators, and system operators.
As computers became smaller and more commoditized, programmers began spending more time interacting with infrastructure and configuring their software and infrastructure to work well together. On the other end, system admins were writing more and more complex code to maintain infrastructure. The closer these groups became, the more they began working together. On smaller teams, people would often work on both infrastructure code and business code. In larger organizations, dedicated teams were created to build tools for managing infrastructure in reliable ways, so that product teams could quickly and easily manage the infrastructure they needed. These joint teams were often described as SRE or DevOps (development and operations) teams.
Benjamin Treynor Sloss of Google, often referred to as just Treynor, says in Google's Site Reliability Engineering book, "SRE is what happens when you ask a software engineer to design an operations team." He is often credited with the idea that operations work is just a specialization of software engineering. Given Google's success with reliability, the idea has caught on at many companies.
SRE is still a burgeoning field and, like DevOps, the title is often used to describe roles that cover a wide diversity of work. Some companies give the title of SRE to a position that is much closer to a traditional system admin role. You can use this book's framework to evaluate a job before you apply for it; however, the goal of this book is to introduce you to the SRE mindset and help you to apply it to an organization, regardless of your past experience in the tech world.
SRE is an exciting field. As mentioned earlier, it has evolved from a long line of roles and, as a relatively new field, its definition is still changing. SRE is an extension and evolution of many past concepts and, as such, concepts relevant to SRE apply to many roles, including, but not limited to, backend engineering, DevOps, systems engineering, systems administration, and operations. Depending on the company, these roles can involve very similar or very different responsibilities. The point is that, no matter what your job title is, you can apply SRE principles to your role.
In an attempt to define the field, we can learn a lot from its full name, Site Reliability Engineering:
Site: As in website
Reliability: Defined as, "The quality of being trustworthy or of performing consistently well." (Oxford Living Dictionary, 2017, https://en.oxforddictionaries.com/definition/reliability)
Engineering: Defined as, "The action of working artfully to bring something about." (Oxford Living Dictionary, 2017, https://en.oxforddictionaries.com/definition/engineering)
Merging these three definitions, we get something like, "The field focused on working artfully to bring about a website that performs consistently well." While this definition could use some brushing up, it suits our needs for now. If you work, or know people who work, in the web development or software engineering world and you ask them what SRE means, then they may ask you, "Isn't that like X?" To someone from that background, X might be "DevOps," "ops," "platform engineering," "infrastructure engineering," "24/7 engineering," "a sysadmin," and so on.
This variation of answers presents the first problem we will see throughout this book: every organization is different. SRE's primary goal is making a website perform consistently according to our previous definition, which is difficult because it is dependent on the organization, the business around that organization, and the website's (or product's) requirements. One of the primary goals of this book is to present a framework that you can apply even if you do not belong to an organization with any of the aforementioned roles. The framework should be effective if you work for yourself, and it should also work if you are employed by some gigantic international multi-headed Hydra organization, and anything in between.
In 2016, I worked as an SRE for Hillary for America, the lead organization (but definitely not the only one) working to help elect Hillary Clinton as President of the United States of America. We were not successful, and while this example immediately dates this book, the campaign had the most concrete separation of concerns between the parts of a website that I have ever worked with. The organization was hyper-focused on one goal (electing Hillary Clinton as president), so it had a very explicit list of priorities, which made my job a lot easier.
There were many separate parts of the campaign that the technology team worked on, including a mobile application, different websites, data pipelines, and large databases. To keep this simple though, and to explain what I mean by a separation of concerns, let me use three separate websites that we built and maintained as an example:
The home page for Hillary Clinton's campaign: https://www.hillaryclinton.com/
The donate page for her campaign: https://www.hillaryclinton.com/donate/
The voter registration page for her campaign: https://www.hillaryclinton.com/iwillvote/
The home page was a general landing page. It needed to be available during the hours that people in North America were awake (as our target audience was mostly based in the United States), but very few people visited the home page unless driven there.
The main reason you would go to https://www.hillaryclinton.com/ was because you were sent there, not because it was part of your daily browsing the way Twitter or Reddit might be. Surrogates speaking at rallies, on the radio, or on television supporting Hillary Clinton would often say things like, "Go to hillaryclinton.com now to sign up," or "hillaryclinton.com has more details on her policies on this topic." Because traffic arrived in these semipredictable spikes, a five-minute outage here and there was OK, but, like many media organizations, we had no guarantee of when a large spike of traffic would occur.
The donate page always needed to be up. According to our product team and senior leadership, the donate page's availability was priority number one. If people could not give money, then the campaign might not be able to pay people's salaries or get the candidate to her speaking engagements. The donation site was not the only way that the campaign made money, but it was a significant source of income.
The voter registration page only needed to be fully available when an election was coming up soon. This was because the page let people say they were going to vote for Hillary Clinton and find their nearest polling location. While the donate page needed to be available for the majority of the campaign (May 2015 through November 2016), the voter registration page only really needed to be available in the lead-up to the general election (September through November of 2016). If we had built the voter registration page earlier in the campaign, it also would have been needed in the days leading up to the primaries, but then only for the states that were voting on those days. Primary elections are a precursor to the general election and happen from February to June, with different states voting on different days.
The key here is that different websites and features have different requirements and a different definition of being reliable. Nothing will ever be perfect, nor is 100% uptime achievable on the internet, because things are always breaking. So, all we can do is figure out what sort of failures we might have and optimize our product to be resilient in a way that is useful for us. SRE isn't just the analysis of systems; it is also the architecting and building of systems so that they meet the requirements of the product.
Software on the internet can never be fully reliable for two reasons. The first reason is that the internet is a distributed system and, often, parts fail, which will affect your service's availability. The second reason is that humans write software, and that software will often have bugs, which will also cause outages.
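Because 100% reliability is off the table, reliability targets are usually stated as an availability percentage, and it is worth internalizing how little downtime each extra "nine" allows. Here is a quick back-of-the-envelope sketch in Python; it is a deliberately simplified model that ignores leap years and partial outages:

```python
def downtime_per_year(availability_percent):
    """Minutes of downtime a given availability target allows in a 365-day year."""
    minutes_in_year = 365 * 24 * 60  # 525,600 minutes
    allowed_fraction = 1 - availability_percent / 100
    return minutes_in_year * allowed_fraction

print(round(downtime_per_year(99.0)))   # 5256 minutes, roughly 3.7 days
print(round(downtime_per_year(99.9)))   # 526 minutes, roughly 8.8 hours
print(round(downtime_per_year(99.99)))  # 53 minutes
```

The jump from "two nines" to "four nines" shrinks the yearly error budget from days to under an hour, which is why picking a target should be a business decision, not a reflex.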
Often, the job of someone working in SRE is to take in reliability requirements for software, and its infrastructure, and then figure out how to make the infrastructure meet those requirements. Steps toward this often require figuring out if existing infrastructure is meeting those needs, collaborating with teams (or people writing software that will run on the infrastructure), evaluating external tools, or just designing and writing what you need yourself.
As I mentioned at the beginning of the chapter, an SRE role can be very diverse. The requirements of an SRE position at a Fortune 500 company can be very different from those at a 20-person video game company, and the role at a bank in the USA can differ from the same role at a bank on the other side of the world, because each organization, and each organization's needs, are different. For smaller organizations, someone working as an SRE may handle everything related to infrastructure and reliability. Larger organizations, on the other hand, may have multiple teams of SREs working with many diverse teams of developers.
A local bank may only need someone to improve the reliability of tools for the people who work at the bank, while a much larger bank in London may need someone who can make sure its systems can make trades at very high speeds with the London Stock Exchange or support millions of individual customers. This book will provide a structure for anyone interested in becoming an SRE. The goal is to empower you, no matter your background or current situation. It will not be a panacea, but it will provide a knowledge base and a framework for making sites more reliable and moving your career forward.
I worked as an SRE at Google for four years, and that is where I started specializing, moving away from being a full stack engineer, and instead considering myself an SRE. Google had lots of internal education courses, and when I left, I found it difficult to continue my education. I also quickly discovered that SRE at Google is a very different beast than SRE at much smaller organizations. I decided to write this book for people interested in starting with SRE or applying it to organizations that are much smaller than Google.
To do this, the book is broken up into two parts. The first eight chapters walk through the hierarchy of reliability. This hierarchy was originally designed by Mikey Dickerson of the United States Digital Service (and, surprise, surprise, Google). The hierarchy says that as you add reliability to a system, you need to address each level before you can move on to the next one.
The following diagram shows a slightly modified version of Mikey's original pyramid. I have updated it to include the all-encompassing aspect of communication:
Let us walk through the layers as a preview of what you can expect in each chapter.
Chapter 2, Monitoring: The first level is monitoring, which makes sure that you have insight into a system, tracking health, availability, and what is happening internally in the system. Monitoring is not just tools though, because it also requires communication. Monitoring is a very contentious part of SRE and operations because, depending on implementation, it can either be very useful or very pointless. Figuring out what to monitor, how to monitor it, where to store the monitoring data, who can access historical monitoring data, and how to look at data often takes time. Many people in your engineering organization will have opinions on these points based on past experiences.
Some engineers will have had bad experiences and will not think monitoring is worth the investment, whereas others will have religious zealotry toward certain tools, and some will just ignore you. This chapter will help you to navigate all of these competing opinions and find and create the implementation that is best for your project and team.
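As a taste of what instrumentation means in practice, here is a deliberately tiny, hypothetical sketch of a request handler wired up to an in-process metrics registry, using only the standard library. The `Metrics` class and the metric names are made up for illustration; a real system would export these numbers to a dedicated monitoring tool rather than keep them in memory:

```python
import time
from collections import defaultdict

class Metrics:
    """A toy in-process metrics registry: counters and raw latency samples."""

    def __init__(self):
        self.counters = defaultdict(int)   # monotonically increasing counts
        self.timings = defaultdict(list)   # latency samples, in seconds

    def incr(self, name, value=1):
        self.counters[name] += value

    def observe(self, name, seconds):
        self.timings[name].append(seconds)

metrics = Metrics()

def handle_request(path):
    """A stand-in request handler, instrumented with the registry above."""
    start = time.monotonic()
    try:
        if path == "/boom":
            raise ValueError("simulated failure")
        return "ok"
    except ValueError:
        metrics.incr("requests_failed_total")
        return "error"
    finally:
        # The finally block guarantees every request is counted and timed,
        # whether it succeeded or failed.
        metrics.incr("requests_total")
        metrics.observe("request_latency_seconds", time.monotonic() - start)

handle_request("/")
handle_request("/boom")
print(metrics.counters["requests_total"])         # 2
print(metrics.counters["requests_failed_total"])  # 1
```

Even this toy version surfaces the questions the chapter tackles: which names to use, how long to keep the samples, and who gets to query them.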
Chapter 3, Incident Response: The next level is incident response. If something is broken, how do you alert people and respond? While tools help with this, as they define the rules by which to alert humans, most of incident response is about defining policy and setting up training so humans know what to do when they get alerts. If team members see an automated message in Slack, what should they do? If they get a phone call, how quickly do they need to respond? Will employees be paid extra if they have to work on a Saturday due to an outage? These are all questions we will address in the What is incident response section. Setting up on-call rotations, best practices for working together as a team, and building infrastructure to make incidents as low-stress as possible will also be covered.
Chapter 4, Postmortems: The third level is postmortems. Once you have had an outage, how do you make sure the problem does not happen again? Should you have a meeting about your incident? Does there need to be documentation? In this chapter, we will consider how to talk about past incidents and make it an enjoyable process for all involved. Postmortems are the act of recording for history how an incident happened, how the team fixed it, and how the team is working to prevent another similar incident in the future. We want to set up a culture of blameless and transparent postmortems, so people can work together.
Individuals should not be afraid of incidents, but rather feel confident that if an incident happens, the team will respond and improve the system for the future, instead of focusing on the shame and anger that can come with failure. Incidents are things to learn from, not things to be afraid and ashamed of!
Chapter 5, Testing and Releasing: The fourth level is testing and releasing your software. In this chapter, we will talk about the tooling and strategies that can be used to test and release software. This is the first level of the hierarchy where, instead of focusing on things that have already happened, we focus on prevention: limiting the number of incidents that occur and making sure that infrastructure and services stay stable when new code is released. The chapter will cover the different types of testing that exist and how to make them useful for you and your team. It will also explore releasing software, when to use methodologies like continuous deployment, and some tools you can use.
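As a small illustration of the kind of test that gates a release, here is a hedged sketch: a hypothetical `normalize_path` helper and a few assertions that, in a real project, would live in a test file and run in CI before every deploy:

```python
def normalize_path(path):
    """Collapse duplicate slashes and strip a trailing slash (except for root)."""
    while "//" in path:
        path = path.replace("//", "/")
    if len(path) > 1 and path.endswith("/"):
        path = path[:-1]
    return path

# Unit tests: cheap to run, and they catch regressions before users do.
assert normalize_path("/donate/") == "/donate"
assert normalize_path("//iwillvote") == "/iwillvote"
assert normalize_path("/") == "/"
```

The point is less the helper itself than the habit: every behavior you rely on gets a test, and the release pipeline refuses to ship when one fails.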
Chapter 6, Capacity Planning: The fifth level is capacity planning. While Chapter 5, Testing and Releasing focused on the current world, this chapter is all about predicting the future and finding the limits of your system. Capacity planning is also about making sure you can grow over time. Once you are monitoring your system, and running a reliable system, you can start thinking about how to grow it over time, and how to find and anticipate bottlenecks and resource limits. In this chapter, we will talk about planning for long-term growth, writing budgets, communicating with outside teams about the future, and things to keep in mind as your service shrinks and grows.
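As a tiny illustration of finding a limit before you hit it, here is a hedged sketch that projects compounding monthly traffic growth against a fixed capacity ceiling. The numbers and the simple compounding model are made up for illustration; real capacity planning uses measured growth and load-test results:

```python
def months_until_capacity(current_qps, monthly_growth, capacity_qps):
    """Whole months until traffic exceeds capacity, assuming traffic
    compounds by monthly_growth each month (requires monthly_growth > 0)."""
    months = 0
    qps = current_qps
    while qps <= capacity_qps:
        qps *= 1 + monthly_growth
        months += 1
    return months

# e.g. 1,000 QPS today, growing 10% per month, on a system that
# load tests showed tops out around 2,000 QPS:
print(months_until_capacity(1000, 0.10, 2000))  # 8
```

Eight months sounds like plenty of time, until you subtract the lead time for ordering hardware or renegotiating a cloud commitment, which is why this level of the hierarchy is about predicting, not reacting.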
Chapter 7, Building Tools: The sixth level is the development of new tools and services. SRE is not only about operations but also about software development. We hope SREs will spend around half of their time developing new tools and services. Some of these tools will exist to automate tasks that an employee has been doing by hand, while others will exist to improve another part of the hierarchy, such as automated load testing, or services to improve performance. In this chapter, we will talk about finding these projects, defining them, planning them, and building them. We will also talk about communicating their usefulness to your fellow engineers.
Chapter 8, User Experience: The final tier is user experience, which is about making sure the user has a good experience. We'll talk about measuring performance, working with user researchers, and defining what a good experience means to your team. We will also discuss how the experience of a tool and processes can cause outages. The goal is to make sure that, no matter the tool, or the user, people enjoy using it, understand how to use it, and cannot easily hurt themselves with it.
Nori Heikkinen, an SRE at Google with many years of experience, adds that "the hierarchy does not include prevention, partly because 100% uptime is impossible, and partly because the bottom three needs in the hierarchy must be addressed within an organization before prevention can be examined." (https://www.infoq.com/news/2015/06/too-big-to-fail)
The last two chapters of this book are a cheat sheet of sorts: an introduction to common, useful foundational topics.
Chapter 9, Networking Foundations: This is a selection of tools and definitions of important ideas in networking. We discuss network packets, DNS, UDP and TCP, and much more. After this chapter, you should know the basics of networking and be able to research more advanced topics on your own.
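As a preview of the UDP versus TCP discussion in that chapter, this small sketch sends a datagram between two UDP sockets on the loopback interface. Notice there is no connection handshake and no acknowledgment: the client fires the packet and hopes it arrives, which is exactly the trade-off UDP makes:

```python
import socket

# Two UDP sockets on the loopback interface: one "server", one "client".
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))   # port 0 asks the OS to pick a free port
addr = server.getsockname()

client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client.sendto(b"ping", addr)    # fire-and-forget: no handshake, no retries

data, _ = server.recvfrom(1024)
print(data)  # b'ping'

server.close()
client.close()
```

On the loopback interface the datagram will essentially always arrive; across a real network, UDP gives you no such guarantee, and that difference drives much of the chapter.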
Chapter 10, Linux and Cloud Foundations: This is a selection of tools and important concepts involved in Linux and modern cloud products. We cover what the Linux kernel is, the common parts of public clouds, and other topics. After this chapter, you should know the basics of Linux and most public cloud products, and feel comfortable researching specific clouds and more advanced Linux topics.
One way to use this book is as a framework for working on a new project. As each chapter is about a different level of the hierarchy, you can work through the book to figure out where in the hierarchy your project sits. If it is a new project, then often it will be right at the bottom of the hierarchy, with no, or very little, monitoring implemented.
At each level, if there are others on the team, then you should begin a conversation to figure out what exists, and if it meets the team's needs. Each chapter will provide a rough rubric for that discussion, but remember that every team and project is unique. If you are the only person who is thinking about reliability and infrastructure, then you may end up spending a significant amount of time proposing solutions and pushing the project in a certain direction. Just remember that the point is to improve the reliability of the service, help the business, and improve the user's experience of the service.
You may find yourself distracted by each thing that you could fix. Before diving in, it is highly recommended that you document the problems you see. Diving in is very satisfying, but it may lead you to skip over requirements or spend too much time on a solution that doesn't work for your business (for example, integrating your system with a monitoring service you can't afford, or building a distributed job scheduler when you could have just used an existing piece of open source software).
So, when joining a new project, or evaluating a new service, here is a set of steps to follow:
Figure out the team structure. Who owns what? Who is in charge?
Find any documentation the team has for their service or the project.
Get someone to draw out the system architecture. Have them show you what connects to which service, what depends on the project, how data flows through the service, and how the project is deployed.
Role | Things they know/specializations
Junior Full Stack Dev | Seems pretty new and jumps around a lot.
Senior Frontend Dev | Does a lot of initial design prototyping and built most of the frontend originally.
Senior Mobile Dev | Wrote both mobile apps.
Senior Backend Dev | TO DO: Set up a one-on-one to understand mobile backend.
Full Stack Dev | Animation wizard who knows the database for CMS better than anyone.
Full Stack Dev | Frontend architecture, made initial protocol buffers and knows sync queue best.
Table 1: An example table with notes on people in the project. With this, we have a reference on team structure. If we need to know who to talk to about mobile apps, we can look at our handy chart and see that we need to talk to Kareem or the manager, Melissa.
Now that you have context for the project, or service, start working through each chapter of the book and ask:
Does the service have monitoring?
Does the team have plans for incident response?
Does the team create postmortems? Are they stored anywhere?
How is the service tested? Does the project have a release plan?
Has anyone done any capacity planning?
What tools could we build to improve the service?
Is the current level of reliability providing a positive user experience?
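One illustrative way to treat these questions as a working checklist is to encode the hierarchy as data and ask which level a service has not yet addressed. The structure below is just a sketch of that idea, not a tool the book prescribes:

```python
# The hierarchy of reliability as an ordered checklist we can run against
# any service, brand new or years old. Levels follow the chapter order.
HIERARCHY = [
    "monitoring",
    "incident response",
    "postmortems",
    "testing and releasing",
    "capacity planning",
    "building tools",
    "user experience",
]

def next_gap(levels_done):
    """Return the lowest level of the hierarchy not yet addressed."""
    for level in HIERARCHY:
        if level not in levels_done:
            return level
    return None  # every level is covered

# A brand-new project usually sits at the very bottom:
print(next_gap(set()))           # monitoring
print(next_gap({"monitoring"}))  # incident response
```

Because the hierarchy is ordered, the answer is always the next thing to work on, which mirrors how the chapters are meant to be applied.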
The trick to note here is that these questions could be asked about a piece of software that has been running for years, as well as one that is just being created.
The service you are investigating could be a large project with many pieces of software (a service-oriented architecture (SOA), for example) or a single monolithic application. If you are working on a project with many services, then work through each service one at a time. The downside of this approach is that if you want to build a framework that fits all of the services you are interacting with, you will not know how best to meet their needs until after you have done a lot of research and work. The upside is that you will not be pulled in many directions at once and will be able to focus on one specific service's problems.
Your time and energy are limited resources, and you will always need to work with more people than you have time for, so make sure to take it slow. Going slow means that things do not fall through the cracks, and that you do not burn out before each service has the base levels of its hierarchy filled in.
Alright! We made it through the introduction. We learned what SRE is at a high level, and we talked about the sorts of problems people in the role tend to focus on. We discussed the structure of the book, and also how to apply that structure to a software project.
In the next chapter, we will be diving into the world of monitoring! Monitoring is the foundation of learning about a system. It is how you record historical data about a system and learn about what is actually going on by analyzing the data you collect. By the end of the chapter, you'll know the basics of instrumenting an application, aggregating that data, storing that data, and displaying it.
Oxford Living Dictionary, 2017, https://en.oxforddictionaries.com/definition/reliability
Oxford Living Dictionary, 2017, https://en.oxforddictionaries.com/definition/engineering