Home

Becoming a Rockstar SRE

By Jeremy Proffitt , Rod Anami

Book + AI Assistant

eBook + AI Assistant $35.99 $24.99

Print $44.99

Subscription $15.99 $10 p/m for three months

BUY NOW

$10 p/m for first 3 months. $15.99 p/m after that. Cancel Anytime!

What do you get with a Packt Subscription?

Gain access to our AI Assistant (beta) for an exclusive selection of 500 books, available during your subscription period. Enjoy a personalized, interactive, and narrative experience to engage with the book content on a deeper level.

This book & 7000+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with a Packt Subscription?

This book & 6500+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with eBook + Subscription?

Download this book in EPUB and PDF formats, plus a monthly download credit

This book & 6500+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with a Packt Subscription?

This book & 6500+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with eBook?

Along with your eBook purchase, enjoy AI Assistant (beta) access in our online reader for a personalized, interactive reading experience.

Download this book in EPUB and PDF formats

Access this title in our online reader

DRM FREE - Read whenever, wherever and however you want

Online reader with customised display settings for better reading experience

What do I get with Print?

Along with your Print book purchase, enjoy AI Assistant (beta) access in our online reader for a personalized, interactive reading experience.

Get a paperback copy of the book delivered to your specified Address*

Download this book in EPUB and PDF formats

Access this title in our online reader

DRM FREE - Read whenever, wherever and however you want

Online reader with customised display settings for better reading experience

What do I get with Print?

Get a paperback copy of the book delivered to your specified Address*

Access this title in our online reader

Online reader with customised display settings for better reading experience

What do you get with video?

Download this video in MP4 format

Access this title in our online reader

DRM FREE - Watch whenever, wherever and however you want

Online reader with customised display settings for better learning experience

What do you get with video?

Stream this video

Access this title in our online reader

DRM FREE - Watch whenever, wherever and however you want

Online reader with customised display settings for better learning experience

What do you get with Audiobook?

Download a zip folder consisting of audio files (in MP3 Format) along with supplementary PDF

What do you get with Exam Trainer?

Flashcards, Mock exams, Exam Tips, Practice Questions

Access these resources with our interactive certification platform

Mobile compatible-Practice whenever, wherever, however you want

BUY NOW $10 p/m for first 3 months. $15.99 p/m after that. Cancel Anytime!

eBook + AI Assistant $35.99 $24.99

Print $44.99

Subscription $15.99 $10 p/m for three months

What do you get with a Packt Subscription?

This book & 7000+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with a Packt Subscription?

This book & 6500+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with eBook + Subscription?

Download this book in EPUB and PDF formats, plus a monthly download credit

This book & 6500+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with a Packt Subscription?

This book & 6500+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with eBook?

Along with your eBook purchase, enjoy AI Assistant (beta) access in our online reader for a personalized, interactive reading experience.

Download this book in EPUB and PDF formats

Access this title in our online reader

DRM FREE - Read whenever, wherever and however you want

Online reader with customised display settings for better reading experience

What do I get with Print?

Along with your Print book purchase, enjoy AI Assistant (beta) access in our online reader for a personalized, interactive reading experience.

Get a paperback copy of the book delivered to your specified Address*

Download this book in EPUB and PDF formats

Access this title in our online reader

DRM FREE - Read whenever, wherever and however you want

Online reader with customised display settings for better reading experience

What do I get with Print?

Get a paperback copy of the book delivered to your specified Address*

Access this title in our online reader

Online reader with customised display settings for better reading experience

What do you get with video?

Download this video in MP4 format

Access this title in our online reader

DRM FREE - Watch whenever, wherever and however you want

Online reader with customised display settings for better learning experience

What do you get with video?

Stream this video

Access this title in our online reader

DRM FREE - Watch whenever, wherever and however you want

Online reader with customised display settings for better learning experience

What do you get with Audiobook?

Download a zip folder consisting of audio files (in MP3 Format) along with supplementary PDF

What do you get with Exam Trainer?

Flashcards, Mock exams, Exam Tips, Practice Questions

Access these resources with our interactive certification platform

Mobile compatible-Practice whenever, wherever, however you want

About this book

Site reliability engineering is all about continuous improvement, finding the balance between business and product demands while working within technological limitations to drive higher revenue. But quantifying and understanding reliability, handling resources, and meeting developer requirements can sometimes be overwhelming. With a focus on reliability from an infrastructure and coding perspective, Becoming a Rockstar SRE brings forth the site reliability engineer (SRE) persona using real-world examples. This book will acquaint you the role of an SRE, followed by the why and how of site reliability engineering. It walks you through the jobs of an SRE, from the automation of CI/CD pipelines and reducing toil to reliability best practices. You’ll learn what creates bad code and how to circumvent it with reliable design and patterns. The book also guides you through interacting and negotiating with businesses and vendors on various technical matters and exploring observability, outages, and why and how to craft an excellent runbook. Finally, you’ll learn how to elevate your site reliability engineering career, including certifications and interview tips and questions. By the end of this book, you’ll be able to identify and measure reliability, reduce downtime, troubleshoot outages, and enhance productivity to become a true rockstar SRE!

Publication date:: April 2023
Publisher: Packt
Pages: 420
ISBN: 9781803239224
Download code from GitHub

SRE Job Role – Activities and Responsibilities

A lot has been said about site reliability engineering, what it is, what it is not, and the multiple practices and techniques that we should apply to adopt the site reliability engineering model. Who site reliability engineers (SREs) are is often put aside even though it is a crucial aspect. Moreover, how people from various parts of information technology (IT) become SREs and how some of them are recognized as thought leaders in this domain.

However, little has been said about the site reliability engineer persona, as detailed in the following list:

What do they know?
Which skills have they developed?
What do they do daily?
What are their primary responsibilities?

Those characteristics would explain, at a bare minimum, why someone should start the journey to becoming an SRE rockstar. That’s precisely why we decided to start this book by outlining the SRE job role.

In this chapter, we’re going to cover the following main topics:

Making this journey personal
Understanding the mindset and hobbies of an SRE
DevOps engineers versus SRE versus others
Describing an SRE’s main responsibilities
An overview of the daily activities of an SRE
People that inspire

Making this journey personal

Unfortunately, often when an enterprise starts to adopt SRE into their IT governance processes, they don’t use a people-processes-tools (PPT) model to transform their operations and software development areas, having a clear vision of these pillars. Even more often, they don’t emphasize or focus on the people element of PPT in such transformations. We want to change that by making this learning journey personal and centered on the individuals rather than the involved processes or technologies.

It’s critical to understand (and learn) what drives typical SREs forward, which fundamental skills they have developed, and how they hone their skills over time to go above and beyond at work. For that purpose, we will divide this subject into three sections:

SRE driving forces
SRE skills
SRE traits

Let’s start this personal journey by understanding why you should become an SRE.

SRE driving forces

We want to explore what motivates or incentivizes site reliability engineers. There’s no journey of any nature if there is no driving force pushing you through. As a word of advice, we should warn you that learning about site reliability engineering is more of an expedition than a tourism trip. In other words, it’s more a marathon than a sprint. Having clarified that, we’ll begin by putting the possible rewards of this journey on the table. Let’s depict each driving force as a mockup code snippet (JavaScript) to make it fun.

Money

If we could represent in the form of an algorithm how money drives people when they don’t earn enough, it would look like the following:

// money
if (money < MyMinimumSalary) {
motivated = false;
excitement--;
}
doMyWork();
if (motivated && jobSatisfaction) {
    honeSRESkills();
    doExtraWork();
} else lookForAnotherJob();

Site reliability engineers make more money than most other technical professionals. According to a Glassdoor (2022) report, they can earn more than USD 118K per year on average. In similar reports, SREs are even noted to have surpassed DevOps engineers in a salary comparison. Nevertheless, not making enough money can be a key demotivating factor. It is hard for anyone to move forward with their career if they are preoccupied with expenses.

Although SREs have a notorious income on average, their salaries will vary per country, years of experience, and employer. Companies justify SRE salary levels based on the reliability value they bring to the table. Rest assured, the site reliability engineering career is well paved in the compensation field.

Job satisfaction

What affects our job satisfaction can be depicted as code logic as follows:

// jobSatisfaction
if (interestingJob || purposefulWorkActivities || challengingSkillDevelopment || technicalAppreciation) {
    jobSatisfaction = true;
    excitement++;
}

Job satisfaction is another driving force of site reliability engineers, and it has many factors. We usually translate job satisfaction to employee happiness at work. Site reliability engineering leads to job satisfaction when we look at the following profession characteristics: exciting job content, purposeful work activities, challenging skill development, and technical appreciation.

The job content of site reliability engineering spans multiple domains. You can work with developers one day and help systems administrators the next. You may need to assist in redesigning an app to increase its service reliability. As with any generalist model job with technical depth in many subject areas, you will never get bored for sure.

As we will see later in this chapter, SRE work activities have clear business value. They improve not just the service quality, availability, and resiliency, but also the system’s reliability. Reliable services might help with customer loyalty, bringing additional revenue to the service provider. There is a direct relationship between SRE work and business metrics improvement, making their efforts purposeful.

Since site reliability engineering is a cross-technology domain engineering discipline, any skills acquisition is challenging. SREs have knowledge and skills that a systems administrator or software developer doesn’t have. They are required to keep those skills updated and hone them over time. This necessity to keep learning brings the always-moving-forward feeling that may not happen if you only need to master a single product or technology.

The last factor on our list is technical appreciation. According to Boston Consulting Group (BCG) research, appreciation is the number one job happiness factor. Being an SRE, you will aid customers, users, and other technical professionals because of your keen holistic view of the systems. Consequently, technical appreciation for the job you do is common, and who doesn’t like that?

Innovative solutions

The following code gives you an idea of how exciting exploring uncharted terrains is:

If (!solutionExists) {
    deviseNewSolution();
    excitement++;
}

Site reliability engineers are natural trailblazers as they explore new technologies and processes to obtain better reliability and eliminate toil (manual and repetitive tasks that are devoid of value). They face many scenarios and situations that are a first of their kind. Moreover, they are responsible for paving the path for others by documenting procedures in runbooks when none exist. There’s nothing more exciting than devising new solutions or improving existing ones. Imagine how you would feel if they named a technical operating procedure after you.

Nevertheless, SREs want to minimize complexity and reduce technical debt. They don’t create a solution just for the sake of doing it unless it adds value and resolves or prevents events that impact customers.

Good relationships

The following code snippet is a representation of how good relationships are a result of an exciting working environment:

If (excitement > HIGH) {
    motivateOthers();
    relationships.healthy = true;
}

Also, good work environment relationships are one of the top 10 factors contributing to employee happiness. SREs have good relationships in their work environment. The reason is straightforward; they act as integration hubs among different tribes and have the mission to break company siloes. SREs need cooperation from both development and operations teams. They are technical diplomats and have strong communication skills. Since they are usually excited about their work, they tend to socialize more with colleagues and leaders, potentially helping to improve the social environment around them. That doesn’t mean they need to be extroverts with progressive public-speaking skills, but certainly, SREs are good teachers because they are excited and compelled to talk about what they do.

SRE skills

Now that you know what’s in it for you, it’s time to check which skillsets SREs must develop throughout their careers. Site reliability engineers have a good mix of knowledge, skills, and experience that are shared with other roles and those that are unique to them. SREs have technical skills that span the entire solution life cycle, from the design to the manage step.

Figure 1.1 – SRE skills

The preceding Venn diagram shows how SREs acquire skills common to other professions and how SRE skills connect the various steps of the solution life cycle. In essence, site reliability engineers are senior technical resources that follow a generalist proficiency model with good depth at certain areas of expertise.

There’s no consensus in the market about the canonical set of skills for SREs. It would not make sense for this to be the case because as soon as any technology-based skill becomes obsolete, we would need to remodel the whole profession. Instead, SRE core skills should be as technology-agnostic as possible.

We recommend a blend of distinct expertise from the IT architect, software developer, data scientist, DevOps engineer, and systems administrator roles. The proportion of each skill level varies per a multitude of factors. You will need to determine which skills are more in demand than the others, but an organization should have all of them in its toolkit.

Systems thinking

Site reliability engineers have a holistic view of the system’s reliability by understanding the availability, resiliency, and performance of each solution component at both the application and infrastructure levels.

Software engineering

SREs develop code and software. They know how to utilize algorithms and software development techniques such as agile frameworks. SREs need to be proficient enough in instrumenting the app code to increase its manageability and observability. SREs know how to use software development life cycle tools and technologies, including DevOps (continuous integration/continuous delivery – CI/CD) pipelines. They can provide testers with better test cases that consider service reliability targets.

Systems management

SREs know how to manage, administer, and operate systems. They share most of the skills from the systems administrator role on multiple technologies. Their technology knowledge spans the cloud, containers, storage, networking, operating systems, middleware, and databases. They have the skills to implement monitoring, event management, logging, tracing, service levels, observability, DevOps toolchains, and automation of toil.

Data science

SREs work with huge amounts of structured data. They must acquire the knowledge and skills to make sense of such datasets by using mathematical models. SREs know how to analyze data to uncover trends, anomalies, and insights – always from the user’s perspective.

We recommend that every SRE has the following selection of core skills:

Systems thinking for focusing on the reliability of the system
The ability to develop and test software
The ability to deploy and release apps
IT service management
Systems monitoring and observability
Working with DevOps tools and automation
The application of data science for reliability of systems

Although we didn’t explicitly mention security in any of the knowledge domains, enforcing security across multiple layers is present in all of them.

Important note

All fundamental SRE skills are covered in this book’s chapters. The chapters have been organized to optimize your learning journey, so they don’t follow the preceding order of skills. We structured this book based on our own experience acquired from a multitude of site reliability engineering coaching and mentoring sessions.

We provide a manifesto model in the Appendix A, The Site Reliability Engineer Manifesto, that acts as a more structured guide for site reliability engineering adoption, including the fundamental skills. We hope that helps your company in joining the site reliability engineering movement.

SRE traits

Besides what the SREs know and which skills they must develop, it’s relevant to know their other good traits.

Software is everywhere

Site reliability engineers have a software engineering mindset. The idea of approaching any issue as a software problem may be disruptive at first; however, there is a good reason for it. Imagine that you need to restart and verify a system by manually issuing a specific set of commands and parameters many times per week. If you handle it as a software development problem, the solution will be developing and scheduling a simple program or automation to execute this task instead. SREs embrace automation over toil as one of their best tools.

Comfortable to code

SREs are not just able to develop code; they really like doing it. As we will see later in this chapter, they develop code as a frequent activity and main responsibility. It’s not just a question of learning how to code or program when someone asks; SREs always feel confident in constructing good pieces of software.

Change as a constant

Frequent releases of new features, code enhancements, and reliability improvements are vital for any business. SREs are the first to accept calculated risks to provide more value to the system users. They are not risk averse but bring visibility to inherent risks so they can throttle the speed of change. They are always prepared to make progress and go above and beyond for service reliability.

Handle complexity and scale

They are not afraid of complexity or scale. They know that modern workloads are intrinsically complex and must be scalable horizontally and vertically. SREs work with large systems with multiple components running in hybrid multi-cloud environments. They understand the application’s full-stack design, its moving parts, and how they connect to each other.

Problems as opportunities

SREs participate in on-call rotations and schedules to respond to service disruptions. They see incidents and problems as opportunities to learn and advance the reliability of the system. Not just that, they also have the competence to translate technology into business language to measure the impact on users and customers. They advocate for a blameless culture by prioritizing answers to questions, such as how to detect and repair incidents faster next time. They also consider how technical challenges may affect business results.

We have just gone over what makes the SRE persona: their motivations, skills, and traits. Now we are going to understand how site reliability engineers think.

Understanding the mindset and hobbies of an SRE

It’s not rare for site reliability engineers to have a broader and divergent view of their surroundings. We are not saying that SREs are weird; well, they are in a certain sense, as they employ a relentless search for improving reliability in all things. However, we are referring to their mindset and how they approach the world.

In this section, we will explore different aspects of their thought process in the work environment and what they like to do in the job and outside it. We have divided this topic into three sections:

SRE affinity game
SRE guiding principles
SRE hobbies

You may have asked whether site reliability engineering is the right profession for you. Let’s examine that next.

SRE affinity game

Let’s play a game! What do you think your affinity or compatibility is with the site reliability engineering profession? We will present a series of scenarios that SREs face. You need to answer them with either love, like, dislike, or hate indicating how much you see yourself doing it and how you would feel about it. Try to be as honest as possible.

Disclaimer

This is not an anthropological scientific survey based on a human behavioral model or theory by any means. It’s a simple questionnaire to help you understand your own affinity to the SRE job role.

The scenarios are in the following list. Get a piece of paper, write down the question number, and answer it. Good luck!

Your boss asks you to resolve a problem that no one else has ever resolved.
You need to spend a few hours looking through logs, metrics, graphs, and events to verify whether there are any new anomalies that were not detected automatically.
You need to participate in an on-call rotation or schedule where you might be called late in the night to respond to a service disruption that has a business impact.
You need to work on a backend system or software that is not visible to external users.
You need to devise new ways to increase a large system’s overall reliability.
You are asked to work on a large-scale problem, which affects hundreds of users and has dozens of components and dependencies, that runs on a hybrid multi-cloud environment.
You are diagnosing a system problem that is making users from a certain geography unable to access their services, and there is great pressure on you.
You need to approach problems with a selected scientific method or data model to uncover facts instead of guessing.
You constantly ask yourself how you could make things around you better and more reliable.
You need to classify and categorize systems information and functionalities so you can isolate causes from effects.
You must diagnose and fix a system problem by investigating components that are not usually visible by going deep into each component configuration as debugging mode is not available.
You need to design a detailed diagram of how the user interacts with a system or software so you can point out where to observe for symptoms.

After you complete this exercise, assign points to each of the answers. If you replied to a scheme with a love answer, assign 5 points to it. For a like answer, you get 3 points. Dislike has a value of 0, and hate is -3 (negative!). Sum your points across all 12 scenarios to get your score, and check the result against the following list:

Over 34 points: Your affinity is very high; this is the right career for you
From 21 to 34 points: Your affinity is high; you should consider this profession
From 13 to 20 points: Your affinity is medium; this may be a good job role for you
Below 13 points: SRE may not be your best option

This may be a game, but it will have made you imagine yourself in an SRE’s shoes. We have started to understand the SRE mindset, so let’s check what guides them in the convoluted scenarios listed previously.

SRE guiding principles

Everyone has a conjunction of principles (and values) that acts as their compass. SREs also follow a set of values; they embrace guiding principles to advise them on technical decisions and act as a reliability compass.

Google® coined most of those principles in its site reliability engineering books (https://sre.google/books/), but others appeared later in conference sessions at SREcon (https://www.usenix.org/srecon) and blog posts on many websites.

Again, we have selected some of them as canonical guiding principles based on our experience in assisting customers and organizations in enabling site reliability engineering in their IT shops. The following is the set of guiding principles that are rooted in the SRE persona:

Scalable operations
Engineering fidelity
Observability to the core
Well-designed service levels
User-perspective notification trigger
Blameless postmortems
Simplicity

We must remark that such principles are not procedures or prescriptive instructions to accomplish something but guidelines. Don’t worry if you are not familiar with the terminology applied here; we dig into them in a detailed manner throughout the book. Let’s investigate each of them along with their most familiar patterns and anti-patterns.

Scalable operations

The operations team, which includes site reliability engineers, is responsible for managing production systems. They are the first responders for any service disruption when something goes wrong. The scalable operations principle states that this team will not grow proportionally to the system as its load increases. Another way to say that is if the number of active users for the determined service doubles, the operations team size will not double. A more mathematically accurate way to visualize this is through a logarithm growth curve. As the operations team gains technical maturity, eliminates repetitive manual tasks, and adopts automation at large, they will need fewer resources to manage more system load:

Figure 1.2 – A logarithm growth curve

It is worth mentioning that SREs employ a proactive approach as they strive to identify the root cause of issues and devise solutions to detect or prevent problems. The patterns for this principle are as follows:

Identify and eliminate toil whenever possible
Document operational procedures as runbooks
Train operations teams to use and refine runbooks
Adopt automation platforms and automated procedures documented in runbooks at large

The anti-patterns are as follows:

Have linear (or exponential) growth for operations teams when the system load rises
Operational knowledge is tacit or not documented
Automation is the end goal and not merely a way to eliminate toil

Engineering fidelity

This tenet asserts the obvious: site reliability engineers do engineering. Yet it’s not uncommon to see SREs only working on incident, problem, and change management processes. We are not telling you that site reliability engineers don’t get their hands dirty; on the contrary, they do operational and engineering work. This principle exists to guarantee that SREs will have time to excel in both.

The patterns are as follows:

Cap operational work at 50% of the available SRE time. The other half is dedicated to engineering solutions and increasing reliability.
Share some of the operational work with the development team. Sharing 5% of the operational work is usually recommended, so the development team is prepared to take on SRE work.
Send operational overflow work to the development team as they share the same goals.

The anti-patterns are as follows:

SREs only work on operational work, resolving incidents, implementing changes, and running root cause analysis (RCA)
SREs spend most of their time doing firefighting (incident resolution)
Development teams don’t share any responsibilities with the operations team

Observability to the core

Observability is the ability to comprehend the internal state of a system by inspecting its outputs. It extends the monitoring concept by adding layers to expand the system visibility and allows a more proactive posture by detecting anomalies before they become disruptions. This guiding principle craves visibility and discernability of what’s happening inside a system or application by measuring certain signals.

The patterns of observability are as follows:

Observe the system behavior through the golden signals; this can be either four (LETS) or five (STELA) signals depending on the school of thought you follow. The LETS acronym stands for latency, error rate, traffic, and saturation. STELA stands for saturation, traffic, error rate, latency, and availability.
Have monitoring metrics, events, logs, and traces (MELT) at the SRE’s disposal. These are the fundamental data components of any observability platform.
Run synthetic user testing from time to time. This is a method where a bot mimics a user to test system functionality and response times.

The anti-patterns are as follows:

Observe only the liveness of the system components, but not from the user’s perspective. For example, checking that components are running versus checking that users can use the system as designed.
Lack of user experience monitoring. You don’t have visibility of what’s happening in the user interface.

Well-designed service levels

There’s no way to verify whether a service is being delivered to the target user within the expected and agreed-to parameters without established service levels. Part of the undeniable success of site reliability engineering is due to this redefinition of what a good service level is and how we document it. This tenet aims to have not just well-defined service levels but also well-designed ones that measure the system’s reliability.

The patterns are as follows:

Define service-level indicators (SLIs) from the system user angle, then delineate service-level objectives (SLOs) as an aggregation of the former
Set the SLO target to less than 100%, so there’s some room for errors (error budget) between 100% and the SLO target to launch new features and enhance overall reliability
Establish service-level agreements (SLAs), with penalties and fines if they are not met after the measured SLOs
Improve SLOs and increase their targets over time through engineering work carried out by the site reliability engineering team

The anti-patterns are as follows:

Define the SLAs first, then measure the SLOs to see whether they are feasible with the current workings of the system.
Establish a target of 100% for the SLAs or SLOs. This anti-pattern reduces the team’s ability to release new features (or develop system reliability further) as there’s no space for testing them in production. Soon enough, the whole system will become obsolete or non-competitive in its market.

User-perspective notification trigger

Notification is the process of alerting on-call first responders about service or system performance deterioration or downtime. It translates to when a site reliability engineer must be engaged to resolve an incident. This principle states that triggering a notification of an issue should only happen when this issue is affecting the system user. For instance, we never alert an SRE if the CPU load is high, but the user is not feeling any service degradation.

The pattern is as follows:

Alerts are triggered if there are any symptoms at the user level, and if such warnings are actionable, SREs can resolve them

The anti-patterns are as follows:

Alerting noise. SREs cannot differentiate between alerts that are mere informative events and ones that affect the system user.
Lack of alerting. End users engage with the help desk to notify them that there are problems in the system.

Blameless postmortems

Postmortems are in essence root cause analysis (RCA) acts. They receive a peculiar name to avoid running into the same old pitfall: finding someone or something to be blamed (the root cause) rather than improving the system’s quality and learning from mistakes. Postmortems also focus on questions, such as how to detect, respond to, and repair disruption in the service faster than just uncovering the root cause alone. This tenet is one of the hardest to deploy for a new organization if it has been doing traditional RCAs for some time and requires a blameless culture to support it.

The pattern is as follows:

Infinite hows. Ask multiple questions, starting with the term how, to determine enhancements to the system (infrastructure and applications), processes, and knowledge base.

The anti-pattern is as follows:

Go back to traditional RCAs where no progress is made on reliability

Simplicity

This guiding principle was imported from the Agile Manifesto. We can’t explain it better than the manifesto (https://agilemanifesto.org/principles.html): “the art of maximizing the amount of work not done.” In other words, it dictates that site reliability engineers are always looking for ways of simplifying and avoiding unnecessary work. They are eager to eliminate toil, that is, repetitive, manually intensive, or low or no business-value tasks. However, inherently as humans, we tend to complicate everything, so ensuring runbooks are kept easy to observe and readable is a good example of this principle.

The pattern is as follows:

Keep it simple, stupid (KISS) is a proven design principle from the US Navy that says most systems work better if they are simple to use or follow

The anti-pattern is as follows:

Too elaborate processes for SRE work

We just explained our preferred seven guiding principles that site reliability engineers follow in their profession. They are an integral chunk of the SRE mindset. Let’s now cover what SREs do in their free time to overcome learning limitations.

SRE hobbies

Jeremy and I couldn’t agree more about what makes a site reliability engineer rockstar: their hobbies. What you do in your free time for leisure or as a second profession leads to greater levels of conceptual and practical knowledge. The trick is finding a hobby that you have a passion for and that helps in the SRE role.

We can’t tell you what the best-fit extra-curricular activity that will pump up your SRE skills is, but here we list some examples that may interest you, grouped by the skills that they enhance.

Analytical thinking

Site reliability engineers have a good analytical processing capacity. They need to analyze big amounts of data and detect patterns, trends, and anomalies by correlating different data sources. Some engaging hobbies that leverage your analytical thinking are as follows:

Chess: Without saying too much about it, this game has its own set of theories and algorithms. It is a good way to practice thinking multiple steps ahead while focusing on the present.
Board games: There are plenty of board games that make you analyze information to win. And they make it enjoyable and social.
Rubik’s cube: This fun toy is also a good example of simplicity in operation and shape. It presents a complex challenge with a plain design.
Video games: Strategy and role-playing games will train your mind in thinking analytically.

Creativity

SREs need to forge new algorithms for observability. They also need to construct new ways of measuring system reliability, as applications and infrastructure components have an uncountable number of arrangements, technologies, and architectures. This may sound cliché, but thinking outside the box is where site reliability engineers shine. Here are some hobbies that may help with your creativity:

Algorithms development: Although this may be part of your daily work life, you can find fun in it by developing 2D or 3D video games, for instance. Another option is to contribute to open source software projects in the wider community.
Drawing or painting: This is a relaxing and artistic example. It also gets you used to finding inspiration.
LEGO®: An across-the-globe famous construction toy, LEGO makes you think about new forms, shapes, structures, and ways of assembly. It also has a robotics range that gives you programming skills as well.
Internet-of-Things prototyping: How about developing embedded projects with Arduino® or Raspberry Pi® boards? You need to build both the hardware and firmware for the project to come together.
Video games: The ones where you need to build something with blocks and basic structures, such as Minecraft®, are especially useful in nurturing creativity.
3D printing: Author Jeremy is a 3D printing master. Like painting, you can express your art in three dimensions.

Troubleshooting

Troubleshooting is not exclusive to site reliability engineers. Systems administrators must also figure out why a service or system is down and how to repair it. However, SREs use systems thinking and scientific approaches to troubleshoot differently. You need to train your mind to resolve problems logically and calmly, and you’re going to need it. There are plenty of hobbies that can stimulate you to excel in this area. Let’s list some of them:

Crossword or jigsaw puzzles: People are addicted to this type of entertainment. It’s an excellent choice to keep the mind sharp and trained
Sudoku: This was a trend not long ago, but it is still an excellent way to polish up troubleshooting skills.
Video games: The ones full of puzzles, such as Portal, as especially good for troubleshooting practice.

We have now covered the aspects of the site reliability engineer persona. Next, we will look at what makes site reliability engineering professionals unique by comparing them to other roles and listing responsibilities and activities.

DevOps engineers versus SRE versus others

This is one of the most frequently asked questions we receive from customers and organizations: how does the site reliability engineering profession differ from other existing technical roles? We already talked about how SREs are the connection between the different steps of the solution life cycle. Here, we’ll focus our discussion on the DevOps engineer role, and later, we’ll broaden it. We have split this discussion into two sections:

DevOps and site reliability engineers
Software and site reliability engineers

DevOps and site reliability engineers

Google described the relationship between DevOps and SRE with a famous subtitle in their The Site Reliability Workbook publication:

Class SRE implements interface DevOps

This statement is an elegant way to define this link and refers to Java programming. It implies that site reliability engineering describes and deepens the implementation of whatever DevOps is. Moreover, we can say that site reliability engineering has commonalities with DevOps as a logically derived conclusion. However, what exactly does site reliability engineering implement from DevOps, or what are the differences between a site reliability engineer and a DevOps engineer? We have visualized these similarities and divergences in an infographic as follows:

Figure 1.3 – An infographic on SRE and DevOps

Notice that they have shared values. Both SREs and DevOps engineers require those values in the orange (bottom right in the above diagram) box. In the bottom-left table, you can see the difference between those roles. Typically, site reliability engineers resolve operational problems by applying the right software engineering disciplines. On the other hand, DevOps engineers resolve development and delivery pipeline issues with systems management techniques mainly by using automation and infrastructure-as-code. They also concentrate different levels of effort on distinct phases of the solution life cycle, as depicted in the infographic.

It’s not rare to hear that DevOps is a shift-right transformation while site reliability engineering is a shift-left one. That implies moving from the left (development side of the equation) to the right (operations side of the equation), and vice versa. Another term we hear a lot is DevSecOps, which has the addition of security. Since security has always been implied in these roles, we think including new letters in the middle is confusing and redundant.

SREs and DevOps engineers are, in our opinion, different sides of the same coin. They should be more like best friends forever than opposing roles as they share values. Let’s check how SREs fulfill those values from the five main areas of DevOps:

Reduce organizational silos: SREs use the same tooling as developers or DevOps engineers. They also share objectives and performance metrics with them.
Accept failure as normal: SREs embrace risks using the error budget for new features. They quantify failure through SLIs and SLOs. And they run postmortems in a blameless culture.
Implement gradual changes: SREs work to increase reliability, and more reliable systems allow more frequent changes and releases.
Leverage tooling and automation: SREs eliminate toil by automating operational tasks at a constant pace.
Measure everything: SREs measure reliability by implementing MELT data and observability. They also have ways to identify and size toil.

Software and site reliability engineers

Another frequently asked question is how site reliability engineers differ from software engineers (SWEs). The short answer is simple: they have the same core skills but specific work scopes.

What are SWEs? SWEs design, engineer, and architect applications using modeling languages and requirements analysis techniques. They implement an integrated development environment (IDE) and develop code for use cases using one of the multiple available programming languages. They create test cases and testing suites. Also, they integrate software and service components and handle their dependencies. SWEs work with many software development life cycle tools and processes.

Site reliability engineers may execute the same activities, but they intend to improve reliability when doing so. For instance, developing code for an SRE translates much more to instrumenting the application code, so it generates more logs, than coding a use case. Also, SREs treat operations as a software problem and see daily systems management tasks as possible software coding opportunities. Besides that, SREs have other core skills, relating to systems thinking, systems management, and data science.

Indeed, an SRE could become an SWE and vice versa, and that leads us to another principle that we find in the Google materials.

Common staffing pool

Another principle is hiring site reliability engineers and SWEs from the same staffing pool. This principle works well for companies where most employees are software developers and engineers, and having a shared pool means that site reliability and software engineering job roles are interchangeable. However, this principle may be much more challenging for enterprises with a mix of systems administrators and developers. Hence, we left it out of our list in the previous section.

We could compare the SRE’s unique profession to many others, but we limited this topic to the most common comparisons. SREs are not architects, developers, systems administrators, or data scientists; they are more than all of these roles combined. Up next, we are going to understand the primary responsibilities of an SRE.

Describing an SRE’s main responsibilities

We hope the SRE job role mission and scope are less foggy at this point. As an SRE, what would you be responsible for? In this section, we will investigate the most trivial duties that SREs are accountable for. We’ve divided these responsibilities into two sections:

Operational work responsibilities
Engineering work responsibilities

Let’s start by reviewing the operational group first.

Operational work responsibilities

Site reliability engineers have work duties related to the process of managing systems. Such tasks are called operational work. SREs are not just accountable for operational work together with the operations team, but they also have the authority to execute their management processes.

First, they are responsible for the ITIL® processes, including incident, problem, and change management. That means they actively participate in on-call schedules for critical services downtime as first responders. They need to isolate the faulty components of the service, troubleshoot the causes of the component issues, repair them or provide a workaround, reestablish the affected service to nominal performance, and verify whether the service has been restored from the user’s perspective. After significant service disruptions, SREs must determine their root causes and contributing factors. They implement change requests to the systems, backing services, delivery pipelines, integrations, infrastructure, and applications.

Second, they are accountable for maintaining systems, services, applications, and infrastructure. They may need to patch a bug into production or assist the development team. SREs may have to deploy a new software version using a canary release, A/B testing, or blue-green deployment.

Third, SREs have the responsibility of taking care of the observability platform. That includes installing, configuring, maintaining, and monitoring the observability tools. Yes, we monitor the monitoring.

Engineering work responsibilities

SREs do engineering work to reach higher levels of availability, resiliency, performance, quality, and scalability on a system. They work on each configuration item or component to increase its reliability. The overall system delivers more trustable services and SLOs by handling each component reliability index.

Site reliability engineers are responsible for reliability metrics, such as the mean time to detect (MTTD) and mean time to repair (MTTR). MTTD indicates how fast the monitoring system can detect a service problem or an anomaly that will lead to a problem if nothing is done. MTTR indicates how swiftly an incident is repaired after it’s detected. Those metrics make SREs accountable for the effectiveness of the observability platform and tools, and the runbooks documentation.

The mean time between failure (MTBF) is another reliability metric under SRE accountability. That indicates how much time it takes for a system failure. SREs must adopt the blameless postmortems principle to improve this metric every time a failure happens. And that translates to multiple reliability enhancements to different parts of the system as a result of these postmortems.

SREs are accountable for toil management. The less toil we have in systems management, the better the metrics mentioned previously. Site reliability engineers work tirelessly to detect and eliminate repetitive tasks devoid of business value.

We described the ordinary responsibilities of an SRE with the intent of giving you an idea of what to expect in this career. Of course, this is not a comprehensive list of duties or intended to suggest a constraint to their responsibilities. As long as they work to fulfill the guiding principles, they are doing SRE work. We are going to review which activities SREs execute daily next.

An overview of the daily activities of an SRE

Now that we have examined SRE responsibilities, it’s time to check what you, as an SRE, should be performing on a frequent basis. There’s no better way to understand a profession than by asking what someone does in it. When you go to a job interview, you probably want to know the activities a person in that position will carry out. SREs will have a list of assignments as sticky notes on their displays. We have separated those notable activities into two sections:

Reactive work activities
Proactive work activities

We’ll start by understanding reactive activities.

Reactive work activities

SREs execute many tasks that don’t lift (or shift) system reliability directly; they are usually operational types of work. Nevertheless, those activities either lessen the service downtime or mitigate risks. Examples of jobs that SREs perform daily in this category are as follows:

Repair or restore a system or multiple services to their original state
Follow and execute instructions from a runbook (standard operating procedure) during an incident to diagnose the application
Implement a change request to apply a patch to a software component
Attend a meeting to run a postmortem with system administrators and developers about the recent service or system outage
Install a new Kubernetes cluster for a new application according to the development team’s specifications and enable monitoring of it
Configure a new cloud-based service for a new application following the architecture design and include it in cloud monitoring
Deploy a new software release to VMs and execute the testing scripts

Proactive work activities

SREs also carry out jobs that improve the quality, scalability, observability, manageability, resiliency, or availability of a system or service. Since those tasks increase the reliability levels of specific systems or services, they are considered proactive and mostly engineering type of work. Such assignments affect toil and technical debt. Examples of this category are as follows:

Maintain a runbook on how to diagnose problems with a specific application
Design and develop an automaton to execute procedures previously documented in a runbook automatically
Establish, together with the DevOps team, the release strategy, such as a canary release, A/B testing, or blue-green deployment
Work with the SWE to add management code to the application so SREs can instruct the application to do self-administration or self-healing operations
Work with the development team to adopt an immutable infrastructure philosophy into the application-building process
Instrument the application code to increase its observability with logs and traces
Design and implement observability to obtain good metrics, events, logs, and traces from a critical application

Note

Site reliability engineers perform many more activities than the ones listed here. This is not a comprehensive list; the only intention is to show you how SREs work across multiple dimensions and aspects of systems and services.

We listed what an SRE does frequently. We wanted to give you a good sense of their day-to-day activities and how it differs from other roles. Again, this is not a complete or closed list. We want to close this chapter by telling you who our SRE rockstars are.

People that inspire

We want to finalize this chapter by pointing out other SREs that have inspired us and have been encouraging the wider community. We couldn’t even think about starting this book without the work of the parents of site reliability engineering at Google. We are immensely grateful to them. Site reliability engineering would probably not exist outside Google if they had chosen not to share their thoughts, principles, techniques, and practices through the site reliability engineering foundation books. They are mandatory reading for anyone following this career path. If you haven’t read them yet, please check out Google’s site reliability engineering books at this site: https://sre.google/books/.

We want to recognize a few other rockstar SREs that have really made a difference in our professional lives as individuals. They are trailblazers of site reliability engineering outside Google.

Jeremy’s recognition – Paul Tyma, former CTO, LendingTree

In technology, finding your way can be difficult. The constant struggle of being an SRE leads us into discussions of what went wrong; often, we have to say what some don’t want to hear – that a negative thing happened due to what a person or team did or didn’t do. We are, in fact, often the bearers of bad news. Paul opened the door for me to become an SRE, and we drove a great reliability revolution together. Most importantly, he taught me that there is a balance to all things, and we have a choice in that balance. And what we often consider a responsibility or duty can have its limits.

Rod’s recognition – Ingo Averdunk, Distinguished Engineer, IBM, and Gene Brown, Distinguished Engineer, Kyndryl

Ingo and Gene triggered a small revolution inside IBM by designing and deploying site reliability engineering principles, practices, professions, and methodologies to its organizations across the globe. They first transformed many internal teams to adopt such extraordinary tenets, then later, they helped external customers in doing the same. Of course, they didn’t accomplish this alone, but they were (and are) paramount examples of technical executive leadership. They shaped the site reliability engineering profession from within IBM, which later spread to Kyndryl after its spin-off.

Summary

In this chapter, we learned what the site reliability engineering persona looks like – what they know and how they think. We divided their mindset into smaller traits so you can understand what’s expected of you on this journey. Also, we explained why the site reliability engineering profession is unique. Then, we entered the practical knowledge sections and looked at their main responsibilities and daily activities.

By now, you should be able to explain why someone would become an SRE and the path for that. You can differentiate the site reliability engineering job role from others and understand what the site reliability engineering persona is. Finally, you know the typical duties of SREs, and the activities most SREs perform. As we close this chapter, we hope everything we say here resonates with your career aspirations.

In the next chapter, we dive into the world of numbers and understand how they are intrinsically part of the site reliability engineering domain. You will learn why SREs see production systems like no one else.

Becoming a Rockstar SRE

SRE Job Role – Activities and Responsibilities

Making this journey personal

SRE driving forces

Money

Job satisfaction

Innovative solutions

Good relationships

SRE skills

Systems thinking

Software engineering

Systems management

Data science

SRE traits

Software is everywhere

Comfortable to code

Change as a constant

Handle complexity and scale

Problems as opportunities

Understanding the mindset and hobbies of an SRE

SRE affinity game

SRE guiding principles

Scalable operations

Engineering fidelity

Observability to the core

Well-designed service levels

User-perspective notification trigger

Blameless postmortems

Simplicity

SRE hobbies

Analytical thinking

Creativity

Troubleshooting

DevOps engineers versus SRE versus others

DevOps and site reliability engineers

Software and site reliability engineers

Common staffing pool

Describing an SRE’s main responsibilities

Operational work responsibilities

Engineering work responsibilities

An overview of the daily activities of an SRE

Reactive work activities

Proactive work activities

People that inspire

Jeremy’s recognition – Paul Tyma, former CTO, LendingTree

Rod’s recognition – Ingo Averdunk, Distinguished Engineer, IBM, and Gene Brown, Distinguished Engineer, Kyndryl

Summary

Further reading