Q: What inspired you to write System Design Guide for Software Professionals? What key concepts or gaps did you aim to address?
Dhirendra:
I’ve been in the industry for more than two decades, working across startups and large organisations on complex, large-scale system designs. Around seven or eight years ago, I started teaching system design. There were two reasons for this: first, I wanted to give back to the community; second, moving into management was taking me away from core technology, and teaching helped me stay connected.
I always thought I should write a book but never quite found the courage or time. When Packt approached me—actually, one of my mentors recommended my name—I saw it as a great opportunity. However, I insisted on having a co-author. I didn’t want to take on the whole process alone. Tejas was recommended, and we clicked immediately. He’s been a great collaborator.
Our primary motivation was to go deeper into system design concepts and also prepare senior candidates for system design interviews. These interviews not only decide whether you get hired but also determine your levelling within a company. That was our intent when writing the book.
Tejas:
Like Dhirendra, I’ve seen first-hand, at companies such as Box and Netflix, how software systems can fail in non-obvious ways as they scale. Even senior engineers often struggle to design systems that keep working as load grows. Closing that gap was one of my motivations for exploring this field further.
We noticed that system design is often treated as an afterthought, mainly something you brush up on for interviews. But Dhirendra and I agreed that this shouldn’t be the case. We wanted the book to be more than a set of interview questions and answers. It’s meant to serve as a reference that explains why certain design choices are made, and how to think about architecture beyond the interview, through to actual implementation in organisations operating at scale.
We aimed to demystify distributed system principles and avoid the trap of just providing a checklist. Of course, there’s much more we could have written, but we felt this book lays a strong foundation to build upon.
Q: What best practices do you apply in big tech to approach scalability and system robustness?
Tejas:
The first and most important principle is designing for failure. At Netflix, we assume the worst-case scenario—that everything will eventually fail. This mindset led to the creation of Chaos Monkey, which intentionally disrupts services to ensure systems are resilient enough to recover.
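As an illustration of the design-for-failure idea, here is a minimal Python sketch of fault injection with a fallback path. It is not how Chaos Monkey works internally; the names `flaky`, `InjectedFailure`, `fetch_recommendations` and `recommendations_with_fallback` are hypothetical and exist only for this example.

```python
import random


class InjectedFailure(Exception):
    """Raised by the wrapper to simulate a dependency outage."""


def flaky(func, failure_rate, rng):
    """Wrap func so that it fails with probability failure_rate."""
    def wrapper(*args, **kwargs):
        # rng.random() is in [0.0, 1.0), so a rate of 1.0 always fails
        # and a rate of 0.0 never fails.
        if rng.random() < failure_rate:
            raise InjectedFailure(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def fetch_recommendations(user_id):
    """Hypothetical downstream call that normally succeeds."""
    return [f"title-for-{user_id}"]


def recommendations_with_fallback(fetch, user_id):
    """Degrade to a generic list instead of propagating the failure."""
    try:
        return fetch(user_id)
    except InjectedFailure:
        return ["popular-title"]


rng = random.Random(0)  # seeded so runs are repeatable
always_down = flaky(fetch_recommendations, 1.0, rng)
never_down = flaky(fetch_recommendations, 0.0, rng)
```

Calling `recommendations_with_fallback(always_down, "u1")` exercises the degraded path on every call, which is the point of injecting failures deliberately: the fallback is tested before a real outage tests it for you.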
Some of the key best practices we follow include:
Automating routine tasks: This reduces manual effort and human error.
Monitoring and observability at scale: We invest heavily in observability to ensure we can trace issues through our complex microservices architecture.
Explicitly defining boundaries: It’s critical to be clear about how many users or requests a system can handle. Most failures stem from faulty assumptions about system capacity.
Incremental rollouts: At Netflix, when deploying a new algorithm or feature—say, a recommendation engine—we roll it out to a small cohort first. We gather feedback, monitor performance, and only then scale it to larger user groups. This reduces risk and allows for adjustments along the way.
These practices ensure that even when something goes wrong, the impact is contained, and recovery is swift.
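The incremental rollout practice described above can be sketched as deterministic cohort bucketing. This is a generic technique, not a description of Netflix's tooling; the function name `in_rollout`, the feature key `"new-recs"` and the user ID format are assumptions made for the example.

```python
import hashlib


def in_rollout(user_id: str, feature: str, percent: float) -> bool:
    """Return True if the user falls inside the first `percent` of buckets."""
    # Hash feature and user together so each feature gets its own cohort.
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    # Map the first 8 bytes to one of 10,000 buckets (0.01% granularity).
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100


users = [f"user-{i}" for i in range(100_000)]
small_cohort = {u for u in users if in_rollout(u, "new-recs", 1)}
large_cohort = {u for u in users if in_rollout(u, "new-recs", 10)}
```

Because the bucket depends only on the feature and the user, raising the percentage widens the cohort without reshuffling it: everyone in the 1% cohort is still included at 10%, so users do not flip in and out of the new behaviour as the rollout grows.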
Dhirendra:
I completely agree with Tejas. When I first heard about Chaos Monkey, I found it fascinating—this idea of deliberately causing failures to test system resilience.
One example from my experience at Yahoo: An engineer once dismissed a corner case, saying it would only happen once in a million. The chief architect responded, “At our scale, that happens every hour.” That really stayed with me. Scale changes everything. Small assumptions that hold in low-scale systems can completely fall apart when you’re dealing with millions or billions of users.
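The arithmetic behind that remark is easy to check. The request rates below are assumptions chosen for illustration, not Yahoo's actual traffic figures, and `expected_per_hour` is a hypothetical helper.

```python
ONE_IN_A_MILLION = 1e-6
SECONDS_PER_HOUR = 3600


def expected_per_hour(requests_per_second: float, probability: float) -> float:
    """Expected number of occurrences per hour of an event with the given
    per-request probability."""
    return requests_per_second * SECONDS_PER_HOUR * probability


# Assumed request rates for a small, a medium and a large service.
for rps in (10, 300, 50_000):
    per_hour = expected_per_hour(rps, ONE_IN_A_MILLION)
    print(f"{rps} requests/s: about {per_hour:.3f} occurrences per hour")
```

At roughly 300 requests per second, about a million requests arrive each hour, so a one-in-a-million case is expected about once an hour. At 50,000 requests per second the same case is expected about 180 times an hour.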
Another principle I encourage in my teams is thinking beyond launching features to landing them successfully. Launching is when you complete the code and push it out. Landing is about ensuring the feature operates smoothly in production, is maintainable, and doesn’t create operational burdens. I tell my engineers to focus on adoption, operational challenges, and long-term performance.
Automation is crucial here too, not just for deployments but also for monitoring, alerting, and scaling. We use declarative tools such as Terraform and Kubernetes to define the desired state of the system and let the tooling converge the running system towards it. But this automation must itself be well tested to ensure it works reliably.
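The idea of declaring a desired state and letting automation converge on it can be sketched as a small reconciliation step. This is a toy model, not the Terraform or Kubernetes API; `plan`, `apply_actions`, and the service names are hypothetical.

```python
from typing import Dict, List, Tuple

Action = Tuple[str, str, int]  # (operation, service name, instance count)


def plan(desired: Dict[str, int], actual: Dict[str, int]) -> List[Action]:
    """Compute the actions needed to move `actual` to `desired`."""
    actions: List[Action] = []
    for name in sorted(set(desired) | set(actual)):
        want = desired.get(name, 0)
        have = actual.get(name, 0)
        if want > have:
            actions.append(("scale_up", name, want - have))
        elif want < have:
            actions.append(("scale_down", name, have - want))
    return actions


def apply_actions(actual: Dict[str, int], actions: List[Action]) -> Dict[str, int]:
    """Return the new state after applying the planned actions."""
    result = dict(actual)
    for operation, name, count in actions:
        delta = count if operation == "scale_up" else -count
        result[name] = result.get(name, 0) + delta
        if result[name] == 0:
            del result[name]  # a service with no instances is gone
    return result


desired_state = {"api": 3, "worker": 2}
actual_state = {"api": 1, "cache": 1}
actions = plan(desired_state, actual_state)
converged_state = apply_actions(actual_state, actions)
```

Two properties are worth testing in any such automation: applying the plan reaches the declared state, and planning again against the converged state yields no further actions.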
Q: During system design interviews, what do you look for in candidates? What makes someone stand out?
Dhirendra:
System design interviews typically become relevant after around three to five years of experience. For fresh graduates, the focus is more on coding and algorithms. But for more experienced candidates, system design becomes crucial—not just for hiring, but also for levelling.
In these interviews, I look for structured thinking. System design problems are open-ended and ambiguous—there’s no single correct answer. The way a candidate approaches and structures the problem, the questions they ask, and how they break it down are all important signals.
Trade-offs are a key area I assess. It’s easy to choose between a good and a bad option. But at the senior level, you’re often choosing between two good options. I want to understand why a candidate makes a particular choice. What’s their reasoning? How do they evaluate different approaches under real-world constraints?
I also like to dig into candidates’ past projects: the trade-offs they made, how they handled failures, and what lessons they learned. Ultimately, I’m looking for engineers who can make practical, informed decisions in real-world scenarios.
Tejas:
I completely agree. For me, structured thinking and the ability to handle trade-offs are critical. But I also make interviews conversational. I expect candidates to ask questions and challenge assumptions. For instance, they should ask me how many users we’re designing for, or clarify the must-haves versus nice-to-haves. That’s what happens in real-world system design.
I keep the problem intentionally broad to see how candidates scope it down. If they go too broad, they risk staying shallow. If they narrow it down, there’s an opportunity to go deeper into specific trade-offs.
One area I like to probe is database selection. I’ll ask if they’d choose SQL or NoSQL and why. Then I might introduce a scenario where the user base grows tenfold—how does that affect their choice? Another area I like to explore is consistency models—strong consistency versus eventual consistency, and how they handle CAP theorem trade-offs.
Q: Where do candidates typically struggle, and what advice would you give them?
Tejas:
One common struggle is jumping straight into diagrams without clarifying the problem. The first five or ten minutes should be about asking questions, defining the functional and non-functional requirements, and scoping the problem. Many candidates skip this and start designing based on assumptions.
Another issue is lack of structure. Some candidates jump between different parts of the system without a coherent plan. Others over-engineer certain areas and lose track of the bigger picture. My advice: start with a simple, working solution. Once that’s established, layer in complexity as needed.
Dhirendra:
I’ve seen similar patterns. Many candidates don’t spend enough time clarifying and scoping the problem. They see something familiar and jump straight into designing everything that comes to mind. But without clear boundaries, they often design something different from what was asked.
Time management is another pitfall. Some candidates get so caught up in one area that they run out of time to cover the core pieces. My advice: practise with mock interviews and time yourself. Make sure you pace the conversation and don’t get stuck in the weeds.
Another key point is listening to interviewer hints. As interviewers, we want candidates to succeed. If we suggest moving on or exploring a different area, it’s important to pick up on that. Ignoring those cues can limit your opportunity to showcase your thinking.