SOC Basics – Structure, Personnel, Coverage, and Tools
In this chapter, we will cover the landscape of what your average security operation center (SOC) looks like. We’ll discuss the structure of the specific roles within the SOC and possible sub-teams that can feed into or be part of the SOC environment. We’ll discuss strategies for alert triage, creating detections, incident response, and other important functions such as “trust but verify,” and how these functions can promote cross-team collaboration and apply to all aspects of the business. Having a strong understanding of the SOC can be critical to applying the various aspects of the ATT&CK framework. This will also allow you to evaluate the SOC environments that you might work in or interact with, and suggest possible changes, improvements, or areas for expansion. This chapter is comprised of the following sections:
- SOC environments and roles
- SOC environment responsibilities
- SOC coverage
- SOC cross-team collaboration
SOC environments and roles
The SOC environment is one of the most critical teams that comprises your organization’s security posture. At the same time, there is no true one-size-fits-all for any SOC environment, which is part of the fun when evaluating yours for improvements. So how do you evaluate how your SOC should be set up? Well, that depends on the purpose of your organization. To test that theory, we will discuss a few hypothetical companies and the make-up of a few different SOC teams that I have seen first-hand. We’ll also talk through important traits for any SOC employee and drill down to specific roles.
SOC environments, as mentioned, all look different, but at the core, they are responsible for alert triage and incident response. In some cases, they can also be responsible for red team activities, security engineering, and network monitoring. That means that your basic roles are SOC analysts, incident responders, and SOC managers, but again that changes depending on scope. Expanded SOC environments might have network operations center (NOC) managers, NOC analysts, security engineers, threat intelligence analysts, or red team operators.
The first hypothetical SOC that we will evaluate is for a company called Blair Consulting. Blair Consulting has a SOC that is responsible for the management of security tools for building detections and for tabletop exercises, and they also have a hybrid environment, which means that there is a combination of on-premises systems and cloud systems. A SOC at that company would be comprised of a SOC manager, incident responder, security engineers, a red team specialist or engineer, and a cloud security engineer. The specific number of people would depend on the size of the organization, but as a rule of thumb, I try to have 1 incident responder for every 200–300 employees. That number primarily takes effect after you have a base set of employees for your team, which is done in the scaling-up period. This also changes if you are a public or private company because if you are public, or depending on the types of data your organization has, you might have reporting requirements for incidents, such as reporting to credit card bureaus in the case of a breach that concerns payment card industry data.
As you can see, any environment will require different teams and personnel, and no one environment looks the same. You should base it on the resources your organization has, the risks that have been identified as high and critical for your organization, and the projected growth/scalability of both the team and the organization.
SOC environment responsibilities
A SOC environment can have varying responsibilities. At the core, you have an incident response and alert triage. To accomplish comprehensive incident response, you want to have a strategy for 24/7 coverage, which can be accomplished by having a larger team and having shift patterns, partnering with a managed service security provider (MSSP), implementing a follow-the-sun model, or using an on-call roster. The option that you choose depends on the resources and infrastructure your team has. For example, if the organization has multiple offices located around the world, you would want to implement a follow-the-sun model, where you have multiple SOC analysts located at each office so that they can work their respective 9–5 type shift, and due to the various time zones and the addition of a few weekend shifts, that will give you the 24x7 coverage envisaged by the follow-the-sun model. The advantages of using an MSSP would give you the advantage of having a smaller in-house team while having support for alert triage and 24x7 coverage, but of course, that comes with a typically high price tag.
If you do everything out of one office and in multiple shifts, you would typically see 4 different shifts of 12 hours, A Day, A Night, B Day, B Night, where in 1 week, members assigned to A Day and A Night work for 4 days with 12-hour shifts, and the ones assigned to B Day and B Night work 3 days with 12-hour shifts, Each week, it would flip for the team that worked weekends.
The following chart will help you envision this:
Figure 1.1 – 2-week 4-shift rotation schedule
The benefit of that is that in a 14-day time period, you only work 7 days, but again they are 12-hour shifts. You’ll see this type of setup in a lot of government SOC environments. The final option is to just use an on-call roster. In theory, if a high-severity alert is triggered or someone needs to trigger the incident response process, they would use a call roster or a tool such as Ops Genie, which would page whoever is on call. While this is the cheapest option for coverage, it does come with a higher level of risk that alerts will go untriaged, or incidents might take longer to contain or mitigate. In my experience, I’ll typically start a new SOC environment following the on-call method while processes are established, and then as the organization and SOC mature, I’ll either contract with an MSSP or hire enough staff to use the follow-the-sun method.
In addition to incident response, another core responsibility is alert triage. To have alerts, you need to also establish detections that can be completed by SOC analysts or dedicated detection engineers. Alert triaging can sound boring and monotonous, but it is a situation where you are constantly solving a different puzzle. It’s the responsibility of the SOC to create alerts to be able to effectively analyze the logs to find potentially suspicious actions or traffic. The purpose of that is that it is not effective or scalable to be analyzing raw logs all day; the rate at which suspicious actions would be missed is astronomical. Therefore, in theory, it is a key responsibility to have the ability to create complex alerts, which ideally pull in information from various data sources to essentially complete the first few steps of triage for your analysts. You also need to test alerts and review them regularly for efficiency because you do not want your analysts to spend all day triaging false positive alerts. Another aspect that can help with this responsibility is setting up some type of case management system to ensure that analysts do not duplicate efforts by triaging the same alert, and that would allow you to track metrics such as time to ticket, time to triage, and time to mitigation, which can help you drive initiatives for your team.
Threat intelligence analysts can be like security engineers or other types of roles in the SOC, the same way that a SOC analyst might perform some threat intelligence activities such as conducting operational intelligence research, which would provide contextual information for potentially suspicious IPs, domains, URLs, among other types of activities. A threat intelligence analyst conducts research on a larger scale. Some of their responsibilities might include integrating threat feeds such as ones from Abuse.ch, FireEye, and GoPhish (there are literally hundreds of both public and private threat feeds) or creating a deny list of hashes or domains based on triage findings, which can be applied to your organization’s firewall or endpoint detection and response tools. Another responsibility of a threat intelligence analyst is to stay current on all cyber threats, advanced persistent threat groups, and vulnerabilities, which may include writing threat reports on vulnerabilities that apply to the organization or industry.
The NOC is a part of the SOC or can be its own team. The NOC contains team members called NOC analysts, who have similar responsibilities to SOC analysts. They are responsible for monitoring networks; however, they can be less security-focused. A great example of a NOC is where it monitors smaller managed networks and remediates or investigates changes that might not align with a golden image, or reviews whether a change management process was followed properly for any network changes. If a network change might be suspicious, the NOC would work closely with the SOC to triage and assist where needed in the incident response process.
Security engineers can have a multitude of responsibilities. First, they can assist with creating detections and alerts for analysts. Next, they work heavily on the concept of “trust but verify.” This means that you trust employees with the best of intentions, but review configurations to ensure that they are secure. One common example is that when testing in a development environment, it’s very easy to stand up a new Amazon Elastic Cloud Computer (EC2) instance or create a new security group and in the interest of time, leave its firewall rules wide open to the internet. A security engineer would ideally have alerts set up to trigger when a new overly permissive group is stood up and work with that initial person to restrict the group as needed for the project.
Red team operators similarly can be their own team and work closely with security engineers and the SOC or fully be part of the SOC. Red team engineers or operators typically work on offensive operations tasks such as penetration testing, purple team testing, and assisting with trust-but-verify. The penetration testing responsibility and purple teaming go hand in hand. A purple team exercise is where a penetration test is being conducted via a set of pre-specified use cases, and the SOC analysts are notified when the testing is going to begin, its status, and so on. That way, it is a collaborative effort to test out any detections and coverage and provide an opportunity to improve.
Managers of sub-teams or of the SOC, or really any leadership role as part of the overall SOC, needs to be unique. They of course must handle administrative tasks, such as capturing metrics, creating reports, and personnel management, but they also must have a strong technical background to be effective. They must plan out a roadmap that addresses the risk of the organization, allows the team to grow and follow a scalable model, while also supporting the team members. To provide for proper growth and planning, the managers should be able to project the next 6 months of roadmaps. I recommend planning out no more than 60% of your time for projects depending on the role, to account for interrupts such as incidents or unforeseen complications and be flexible in the amount of work or types of projects projected to allow for necessary pivots.
SOC environments clearly have many varying roles and responsibilities, which really depend on the specifics of your organization and environment. You should identify the expected responsibilities of your SOC and ensure that you have enough personnel with the correct skill set or potentially provide training to develop the needed skill sets to fill any role. For scaling purposes, I typically try to add one SOC analyst and one security engineer per 500 employees, 1 NOC analyst per 5,000 network assets, 1–3 total red team engineers, 1 threat intelligence analyst for every 1,000 employees, 1 manager for every 5–10 SOC employees, and a senior manager or director as needed. Of course, this is just a general rule of thumb and can be tweaked to meet your needs.
Expanding coverage and using your data effectively is critically important to any SOC’s ability to carry out its responsibilities. With a lack of coverage or visibility, the SOC will have nothing to monitor or detect, and your organization will have a significantly higher risk level because of it. That’s why it is an important factor within a SOC to determine the current level of coverage, where the gaps are, how to prioritize the gaps, and how to fill those gaps by increasing coverage.
The first stage is to determine what level of coverage you currently have. Regardless of whether your organization is a newer SOC or a more established one, I make it a point to baseline coverage before making any other roadmap decisions. To do so, I take note of all data sources that are being ingested into whatever security information event management (SIEM) is in use, such as Splunk or ELK, and I write down and organize all alerts by function. In addition to this, I review or create a risk registry that categorizes the highest risks that face the organization and, if time permits, conduct a purple team exercise. As explained in the SOC environment responsibilities section for a red team engineer role, and will be further explained in future chapters, a purple team exercise is a joint project where offensive operations are carried out to test the visibility and efficiency of alerts in place. This can also be useful in proving the negative, which means proving that attacks can be conducted without any visibility, which helps push the importance of increasing coverage.
After you have determined where the gaps are, you need to prioritize them and create a plan on how to fix them. One of the strategies for prioritizing is to list out the types of logs and visibility each missing source would give. For example, if you are not capturing logs from your organization’s Virtual Private Network (VPN), and it’s a remote work environment, then you have limited visibility into User browsing logs. You would then write out the risks or potential attacks that could be discovered with those logs and write down any mitigations. For example, while you might not capture user browsing logs but capture endpoint detection and response logs, you’ll still have some visibility into the applications used, and downloads, so that might lower the risk of not capturing those initial logs. Another strategy is after writing down the missing sources and suspected size of the logs, to prioritize them based on an effort to ingest them. For example, if you can download an app on your SIEM, utilize an API key and creds, and everything is managed in-house, then as long as the size of the logs isn’t the problem, you should be able to easily configure the new ingest. I typically recommend picking a mix of harder-to-implement logs to start the process for ingest while also implementing a few that are considered a quick win. Another way to prioritize coverage gaps is based on what you are prepared to write detections for or base it on team expertise. That way, you know whatever source of data that gets added will immediately be used to increase coverage because, in theory, you don’t want to waste licensing space, time, or money if you can’t take advantage of the additional coverage.
Filling any coverage gaps after prioritization can be done by expanding your ingest license for your current SIEM tool or using a data lake to capture additional logs without ingest in your SIEM tool. Therefore, you’ll still be able to run analytics and only ingest absolutely necessary data within your SIEM tool. Expanding your resources and current license seems like the simplest path, but a common issue with that path is that it can be cost-prohibitive. That’s where other solutions such as log tiering or using a data lake come into play. By implementing a different technical solution, such as a data lake through providers such as Snowflake, you are able to have logs transferred and collected in a centralized location, and then run analytics over those logs, and only ingest detections on a subset of the logs. That way, you have the coverage, but your ingest costs are lower. One of the downsides to this approach is that you need to implement a new technical implementation, connectors, and so on, and that, of course, has its own costs associated with it. Of course, any approach that is taken to fill a coverage gap is reliant on proper cross-team collaboration.
Coverage is always an ongoing battle as your company grows and gains capabilities. It’s important to regularly evaluate your coverage for improvements and to document any gaps, as this will help your SOC team mature. As mentioned, there are multiple ways to evaluate coverage and identify gaps, which will allow you to make an action plan for how to remediate those gaps.
SOC cross-team collaboration
This quote applies to the SOC environment as much as it does to most of life. One of the key functions that the SOC needs to be involved in will be promoting cross-team collaboration. This can be applied to all individual teams and functions of SOCs. For example, an analyst might triage an alert about suspicious DNS calls. While they can speculate on whether it’s DNS poisoning or a possible misconfiguration, they really need to reach out and work with a networking team to be able to fully triage and get to the root cause of what caused that alert. Similarly, in a trust-but-verify role, a cloud security engineer in the SOC would need to work with the infrastructure team, or potential development operations teams, which will be the ones standing up the initial cloud resources.
One of the key responsibilities of the SOC is “trust but verify,” and that relies on the principle of least privilege. The reasoning is that you want someone independent of the change to have the ability to validate the security behind it, so there is a lack of conflict of interest. In general, you would want security implemented by design, and the SOC or subsequent teams should coordinate to discuss security implementations prior to implementing any changes, which would shorten the validation period.
Other responsibilities, such as purple teaming exercises, are an inherently collaborative effort because they require multiple personnel of different roles to work together. As mentioned in the previous section, they primarily require a red team engineer to work collaboratively with other teams. We’ll also cover purple team exercises in depth in future chapters. Similarly, incident response efforts regularly require cross-team collaboration work in the mitigation stages. The reason is that in the case of a distributed denial-of-service (DDoS) attack, the SOC team might not be the one to implement firewall rule changes; they might have to work with a network engineer to implement changes. Or if the SOC works in an engineering environment, they might have to work with the respective engineering teams to implement CAPTCHAs, restrict logins, and so on.
There are many cases where the SOC is only able to conduct their work through effective cross-team collaboration. That effort provides a more security-informed workforce, and while issues might arise due to differing opinions on technical implementations, restraints, or even personalities, it’s critical that the lines of communication remain open and that all teams continue to work together.
Cross-team collaboration is critical for any team, but especially for a SOC. The reason is that the SOC must work with time-sensitive and potentially damaging incidents and data. The SOC is also responsible for maturing the security posture of the organization and ensuring that the principle of least privilege is in place; that can only be accomplished through cross-team collaboration. There are many times when a suspected incident turns out to be a misconfiguration, and by reaching out to other teams such as a network engineering team, your SOC will be more efficient and have the ability to triage more alerts.
To summarize, no two SOCs will ever be the same, and this can be inherently complex at first. Once you break it down and identify the needs of the organization, you will be able to determine what roles are needed and what the minimum size for your SOC is so that you can scale appropriately. Once you have at least some of the roles in place, you must evaluate your coverage so that you can identify your coverage gaps, and what the risks are for your organization. From there, you can remediate those gaps and risks through increased logging, increased detection engineering, or making technical changes. The increase in coverage is typically only accomplished through cross-team collaboration, which also makes the SOC team more efficient and ensures that proper access controls such as the principle of least privilege remain in place.
In the next chapter, we’ll take a deeper look at identifying risks through creating a risk registry, making a test plan, and conducting a purple team exercise, and hear from an industry analyst about common gaps and shortfalls. All of this will involve practical information, which you will be able to implement in your SOC environment to help increase your organization’s security maturity.