
CloudPro

76 Articles
Apramit from Packt
01 Jun 2026

MCP's security crisis isn't new. It's just faster.

It's the non-human identity problem you already know, surfacing where you didn't look

June 20th 9 AM EDT | EXCLUSIVE OFFER 40% OFF - USE CODE LIMITED40
BOOK YOUR TICKETS
150+ engineers from 30+ countries attended our last cohort. Two of the biggest network automation experts, William Collins, Director of Tech Evangelism, and John Capobianco, Head of AI & Developer Relations, both at Itential, are here! Get your tickets right away! Offer ends soon.

June 11th 11:30 AM EDT | EXCLUSIVE OFFER 40% OFF - USE CODE SAVE40
BOOK YOUR TICKETS
It's a hands-on cohort for teams already running cloud-native platforms who now have to evolve them into AI-native ones without weakening the controls underneath. Great experts and panelists are joining, along with other platform engineers, SREs, architects, and the leads who own those decisions. (P.S. Free Platform Engineering for Architects e-book exclusively for you!)

Hi, Shreyans here.

Before we get into this week's issue, I wanted to share a small but important update. Apramit will be taking over as the new editor-in-chief of CloudPro from this issue onward. Apramit has been with Packt for over 2 years, working closely with technical content, authors, and practitioner-focused communities, and he brings exactly the kind of editorial judgment this newsletter needs: useful over noisy, practical over hyped, and grounded in what engineers actually care about.

Over to you, Apramit.

Hi, I'm Apramit.

I'm excited to take over as editor-in-chief of this newsletter and grateful for the direction already set by Shreyans. My focus will be simple: to keep this newsletter useful, sharp, and grounded in what readers actually need.
We'll continue to look past the noise, ask practical questions, and make space for clear, thoughtful conversations around technology, publishing, and the people building with it.

Now, let's continue!

MCP's security crisis isn't new. It's just faster.

If you've stood up an MCP server in the last year, this one's worth two minutes.

The reporting around agentic AI security keeps framing it as a brand-new threat. The more useful way to see it, adapted here from Operational AI with Docker by Ajeet Singh Raina and Harsh Manvar, is as the non-human identity problem we already know how to solve, surfacing in a place most teams haven't looked yet.

Here's the version I'd send to anyone running agents in production.

MCP's credential problem is the old IAM problem in new clothes

Most teams are securing their agents with shared API keys scattered across env vars and config files. That's the non-human-identity problem at machine scale, and there's a clean pattern that closes it.

Every few years, the industry rediscovers a problem it already solved, gives it a new name, and acts surprised. The "agentic AI security crisis" is this year's edition. When I read that only 22% of teams treat their agents as independent identities, and that 88% have already had or suspected a security incident, my honest reaction wasn't alarm. It was recognition. We've seen this exact shape before, with service accounts and CI runners and every other non-human thing we handed a credential and then forgot about. The CISO line making the rounds, that MCP will be the AI security issue of 2026, is probably right. It just isn't new.

What changed is the speed. MCP made it trivially easy to give an agent real capabilities (a filesystem, a database, a GitHub account), and teams wired those up the way you wire things up when you're trying to ship. The credentials went wherever was convenient. That convenience is the whole problem.

Two failure modes matter. The first is the obvious one: secrets exported as environment variables.
The moment you do that, the key is sitting in your process list, in the output of a container inspect, in logs and stack traces, and in Git history the instant a setup script gets committed. The second is quieter and worse. Each server manages its own credentials, so three servers that all need a GitHub token mean three copies of that token, stored in three different ways. There's no single place to rotate them, and no single action that revokes access. If one leaks, you're hunting.

The fix is a mediation layer. Route credentials through one broker so the agent never holds the raw secret, scope each agent to only the tools it actually needs, and make revocation a single action instead of a scavenger hunt. The book uses Docker's MCP gateway as its worked example: one endpoint every client connects through, backed by a single secret store. But the product isn't the point; the pattern is. Giving an agent an "identity" isn't a philosophical move; it just means you can grant, scope, and revoke its access in one place, the same way you would for a person.

None of this needs new technology. It needs the access discipline you'd never skip for a human user, applied to the agents you quietly haven't.

This article has been adapted from Operational AI with Docker by Ajeet Singh Raina and Harsh Manvar. If you want the full playbook, the book takes you from a model running on your laptop to secure, scaled agentic systems in production, with hands-on coverage of Docker Model Runner, MCP, multi-agent architectures, and Kubernetes orchestration.
GET THE BOOK

THE ULTIMATE LINUX & SYSADMIN BUNDLE | 24 BOOKS | FROM $18
GET YOUR BUNDLE!
The Humble Bundle is here! 24 Packt titles covering everything you need across Linux, SysAdmin, security, and infrastructure. Total MSRP across all 24 books is >$1,000. Bundle starts at $18. Part of every purchase supports the Prevent Cancer Foundation.
Offer ends in 15 days.

New here? Packt CloudPro is a newsletter from Packt for senior cloud, DevOps, and platform engineers who want the call, not the concept. We focus on decision frameworks and real trade-offs for readers who already know the fundamentals and don't need another explainer.

If something landed or missed, hit reply and tell me. I read every response, and it's how I figure out what's worth running more of.

Want to subscribe, or promote your product to this audience? Reach out to me directly.
SUBSCRIBE
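The broker pattern from this issue (one place to grant, scope, and revoke, with the agent never holding the raw secret) can be sketched in a few lines of Python. This is a toy in-memory illustration, not Docker's MCP gateway or any real product's API; `CredentialBroker`, its methods, the agent id, and the tool name are all hypothetical names chosen for the example.

```python
class CredentialBroker:
    """Toy in-memory broker (hypothetical): the only component that sees raw secrets."""

    def __init__(self):
        self._secrets = {}      # secret name -> raw value (single store)
        self._tool_secret = {}  # tool name -> name of the secret it needs
        self._grants = {}       # agent id -> set of tools it may call

    def store_secret(self, name, value):
        # Storing under an existing name is also how rotation works: one place.
        self._secrets[name] = value

    def register_tool(self, tool, secret_name):
        self._tool_secret[tool] = secret_name

    def grant(self, agent_id, tools):
        self._grants[agent_id] = set(tools)

    def revoke(self, agent_id):
        # Revocation is a single action, not a hunt across servers.
        self._grants.pop(agent_id, None)

    def call_tool(self, agent_id, tool, handler, **kwargs):
        if tool not in self._grants.get(agent_id, set()):
            raise PermissionError(f"{agent_id} is not allowed to call {tool}")
        # The secret is injected into the handler; it is never returned to the agent.
        secret = self._secrets[self._tool_secret[tool]]
        return handler(secret, **kwargs)


def list_issues(token, repo):
    # A real handler would call the upstream API with `token`.
    return f"listed issues for {repo} (token injected: {bool(token)})"


broker = CredentialBroker()
broker.store_secret("github_token", "placeholder-not-a-real-token")
broker.register_tool("github.list_issues", "github_token")
broker.grant("triage-agent", ["github.list_issues"])
print(broker.call_tool("triage-agent", "github.list_issues", list_issues, repo="acme/app"))
broker.revoke("triage-agent")  # any further call from this agent raises PermissionError
```

The design choice worth noticing is that the agent only ever passes its own id and a tool name. Three servers needing the same token would all resolve it from the one store, so rotating or revoking touches a single entry.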

Shreyans from Packt
22 May 2026

Why most AI agents break in production

Weekend Sale is LIVE
Join 100+ Engineers and build agentic network automation with MCP.
Book Your Seat
Use Code WEEKEND40 to get 40% Off

If you've been trying to build agentic workflows on top of MCP and hitting the usual walls (agents that stall halfway through a task, tools that don't compose well, loops with no real recovery path), this is the event for that.

It's 3.5 hours, hands-on, in a Containerlab environment. William Collins and John Capobianco are running it. Both are at Itential and have been working on this stuff longer than most people in the space. John wrote Automate Your Network. William runs their tech evangelism. They know what breaks in production because they've watched it break.

What you'll actually do in the session:
  • Take raw MCP tools and turn them into skills you can reuse across agents
  • Use spec-driven development so the agent stays bounded to what you asked for
  • Build recovery into your loops instead of catching exceptions after the fact
  • Work through all of it on OpenClaw, NetClaw, and Containerlab

It's not an intro session. This one assumes you've already tried to build something and want to do it properly.
100+ engineers have signed up already; join them.
Book Your Seat
Use Code WEEKEND40 to get 40% Off

Building AI-Native Platform Engineering Systems: Last 10 seats
Book Your Seat
Use Code WEEKEND40 to get 40% Off

Browse Our Networking Titles
  • AI Networking Cookbook (GET 40% OFF)
  • Network Automation Cookbook (GET 40% OFF)
  • Network Automation with NautoBot (GET 40% OFF)
  • Mastering Python Networking (GET 40% OFF)

Shreyans from Packt
18 May 2026

Your AI Assistant Doesn't Need More Access. It Needs a Tier List

A simple model for deciding which platform actions an AI can run on its own, and which it shouldn't

Cloud attacks have a new entry point. It's your running applications. That's why a new category is emerging: Cloud Application Detection and Response (CADR). This guide breaks down what CADR is, why runtime is the only place real attacks can be detected, and how security teams are protecting applications, cloud infrastructure, and AI systems in production. If you're responsible for securing modern cloud workloads, this is a concept you'll want to understand.
Get the Guide

I wrote a while back about how MCP turns your internal platform into something an AI can act on, not just answer questions about. The follow-up question I kept getting was the practical one: fine, but how much do you actually let it do? This issue is the answer: a straightforward way to draw that line before the assistant does something you can't walk back.

Cheers,
Shreyans Singh
Editor-in-Chief

Your AI Assistant Doesn't Need More Access. It Needs a Tier List.

Workshop: Building AI-Native Platform Engineering Systems
Saturday, May 30 · Online · live and hands-on (5 hours)

If the guardrails question in this issue is live for your team, the workshop is where it gets the full treatment. It's a hands-on cohort from Packt, in collaboration with FAUN.dev(), on evolving cloud-native platforms into AI-native ones: internal developer portals, golden paths, telemetry, and a dedicated session on policy-as-code, governance, and guardrails (exactly what's below).

It's built for platform engineers, SREs, architects, and tech leads already running cloud-native platforms who want to add AI deliberately, without giving up control. Speakers: Asanka Abeysinghe, Dr. Gautham Pallapa, Mark Peter, and Thiago Shimada Ramos.
Every attendee also gets a free Platform Engineering for Architects e-book.
Use code SAVE50 for 50% off
Book Your Seat

Once you wire an AI assistant into your platform through MCP, it stops being a chat window. It can deploy, scale, and roll things back; it can actually do the work. Which is great, right up until you notice nobody decided what it's allowed to do on its own. On most teams that call never gets made deliberately; it just happens, one engineer and one service at a time.

MCP is good at the wiring. It exposes your operations as tools the assistant can call. What it doesn't hand you is the judgment about which of those tools should sit one approval away and which shouldn't. That part you build yourself.

The model that works is simpler than it sounds. Take any action the assistant could perform and ask two things: if it goes wrong, can you undo it, and how far does the damage spread? That gives you three tiers.

Low-risk actions are reversible and contained: querying logs, reading metrics. Let the assistant just do those. Making someone approve a log query is the kind of friction that teaches people to stop using the tool.

Medium-risk actions are reversible but have real blast radius. Scaling a service is the obvious one. You can scale it back, but in the meantime you've moved cost and capacity for everything downstream. These should draft a plan and route to an approver.

High-risk actions are the ones you can't take back: deleting a database is the standard example. Those stay blocked by default, and the way through is a formal approval path, not a quick thumbs-up in a chat thread.

The tiers themselves aren't really the interesting part, though. The interesting part is deciding the line once, as policy, for the whole org.
Skip that and every team draws its own boundary: one ships with approvals, the next one skips them, and you've rebuilt the exact inconsistency you adopted a platform to kill.

The other thing worth saying: this only holds if the safe path is also the easy path. If approval is slow and annoying, people find ways around it, and your guardrails quietly turn into a cage. So auto-execute the genuinely safe stuff generously. That's what makes the gated stuff feel worth the wait. And log everything, every tier, no exceptions.

Decide the tiers before you connect the assistant. Doing it afterward usually means doing it in response to something you wish hadn't happened.

Move from raw MCP tools to composable skills, apply spec-driven development to constrain agent behavior, and design agentic loops that recover when they get stuck. Hands-on with OpenClaw, NetClaw, and Containerlab.
Book Your Seat
Use Code EARLY40 to get 40% Off
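The two-question tier model from this issue is small enough to write down as code, which is one way to make it org-wide policy rather than a per-team habit. The following is a minimal Python sketch under stated assumptions: the `Action` fields, the tier labels, and the example action names are hypothetical, and a real platform would enforce this in its policy engine rather than in an in-process list.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    LOW = "auto_execute"         # reversible and contained
    MEDIUM = "require_approval"  # reversible, but real blast radius
    HIGH = "blocked_by_default"  # cannot be undone


@dataclass(frozen=True)
class Action:
    name: str
    reversible: bool  # if it goes wrong, can you undo it?
    contained: bool   # does the damage stay local?


def classify(action: Action) -> Tier:
    # Irreversibility dominates: nothing makes an undoable-never action low risk.
    if not action.reversible:
        return Tier.HIGH
    if not action.contained:
        return Tier.MEDIUM
    return Tier.LOW


audit_log = []  # every decision is recorded, whatever the tier


def decide(action: Action) -> str:
    tier = classify(action)
    audit_log.append((action.name, tier.name))
    return tier.value


for example in (
    Action("query_logs", reversible=True, contained=True),
    Action("scale_service", reversible=True, contained=False),
    Action("delete_database", reversible=False, contained=False),
):
    print(example.name, "->", decide(example))
```

Keeping `classify` as one shared function is the point: teams can add actions, but the line between tiers is drawn once.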
Subscribe to Packt CloudPro
Our mission is to bring you the freshest updates in Cloud, Identity and Access Management, CI/CD, DevSecOps, Cloud Security, and adjacent domains.

Shreyans from Packt
11 May 2026

Get 50% off on AI Platform Engineering Workshop

50% OFF for 72 Hours Only
Book Your Seat
Use Code SAVE50

Most platform teams trying to add AI right now are running into the same problem: bolting LLMs onto a cloud-native platform creates more issues than it solves. Governance gets weaker, control surfaces get fuzzier, and the platform that worked at scale yesterday starts looking fragile.

AI-native platforms aren't cloud-native platforms with AI on top. They're a different architecture.

What you'll walk away with:
  • A working approach to designing the platform intelligence layer: data, inference, telemetry, control
  • The OWASP LLM Top 10 mapped to internal platforms: guardrails, trust boundaries, jailbreak prevention
  • Hands-on exposure to Backstage, OpenChoreo, Crossplane, and Kiro/Kiro-cli used together
  • A practical roadmap framework to define your team's AI-native platform plan and tie it to developer productivity and business outcomes

Running it: Asanka Abeysinghe, Dr. Gautham Pallapa, Mark Peter, and Thiago Shimada Ramos. In collaboration with FAUN.dev(), where engineers from GitHub, Netflix, and Shopify go to stay ahead.
Book Your Seat

Move from raw MCP tools to composable skills, apply spec-driven development to constrain agent behavior, and design agentic loops that recover when they get stuck. Hands-on with OpenClaw, NetClaw, and Containerlab.
Book Your Seat
Use Code EARLY40 to get 40% Off

Join Spec-Driven Development Cohort 2. If you're already using AI to code and want more reliable, scalable outputs, this is worth your time.
Seats are filling fast; we'll be closing registrations soon.
Book Your Seat
Use code SDD45 to get 45% off.

Shreyans from Packt
08 May 2026

24 hours until our Flagship AI Agents Workshop

LAST 10 SEATS | WORKSHOP in 24 HOURS
Get 40% Off
USE CODE FINAL40

In 24 hours, 100+ network engineers will spend 4 hours building AI agents for network operations the right way, from the architecture up. 10 seats left. This is your last chance to join them.

  • You'll build a working AI agent from scratch using LLM tool calling and agentic loops.
  • You'll get production-adapted code you can run in your own environment the same week.
  • You'll deploy it against real Arista cEOS devices in Lab 4.
  • You'll join a live panel with Sif Baksh and Eduard Dulharu.

The workshop is led by Sif Baksh: 15 years across NetOps and SecOps. He's spent a decade and a half turning operational chaos into systems people can reason about: SOAR workflows, DDI migrations, and lately AI-assisted automation that survives contact with production.

Joining him on the panel: Eduard Dulharu, CTO and Co-Founder of vExpertAI GmbH in Germany. He'll be bringing the founder-CTO lens on what production AI systems actually look like in operational environments.

What's different on Monday if you attend:
  • You'll know exactly how to design AI agents that don't hallucinate device commands or break running configurations under load.
  • You'll have agentic patterns that account for how network operations actually work: state, ordering, idempotency, rollback.
  • And you'll have a working code template you can adapt to your own environment, against your own devices, the same day.

LAST 10 SEATS | WORKSHOP in 24 HOURS
Book Your Seat Now at 40% Off
USE CODE FINAL40

After this workshop, you can actually start deploying AI agents in network operations with confidence.

Hope to see you there,
- Packt Conferences

Shreyans from Packt
04 May 2026

A Tines principal architect is teaching network engineers to build AI agents this Saturday.

Last Chance to Join
Join 100+ network engineers this Saturday and build a working AI agent from scratch, deployed against real Arista cEOS devices.
LAST CHANCE TO JOIN | GET 40% OFF
Use Code FINAL40 | Expires in 48 Hours

Sif Baksh (Principal Solutions Architect, Tines) is running this workshop. He built this curriculum from production work, not slides.

What you get from this workshop:
  • Lab 4: you ship a working agentic bot against real network infrastructure
  • The P.E.N.E. framework: built specifically for how network engineers communicate intent to LLMs
  • Reusable code you own, can modify, and can debug at 2am when something breaks

P.S. You leave with the code. Not demos or homework. That's the whole point of the workshop.
LAST CHANCE TO JOIN

Walk away with patterns for where AI fits, how governance holds up, and a 6–12 month roadmap to evolve your platform deliberately.
Book Your Seat
Use Code SAVE30 to get 30% Off

Move from raw MCP tools to composable skills, apply spec-driven development to constrain agent behavior, and design agentic loops that recover when they get stuck.
Hands-on with OpenClaw, NetClaw, and Containerlab.
Book Your Seat
Use Code SAVE30 to get 30% Off

Shreyans from Packt
27 Apr 2026

Most "I Need Intune Admin Rights" Requests Aren't About Admin Rights

A clean way to decide when users need elevation, and when they just need the right app delivery.

Local admin rights on Intune-managed devices are one of those problems every Windows shop runs into eventually, and Endpoint Privilege Management (EPM) is usually the first answer that comes up. But how you roll it out, and who you actually license for it, makes a real difference to both your security posture and your budget. Today's CloudPro issue is adapted from Mastering Endpoint Management using Microsoft Intune Suite by Saurabh Sarkar and Rahul Singh, and it lays out a clean way to think through the decision.

Cheers,
Shreyans Singh
Editor-in-Chief

Early Bird Offer: Use code EARLY40 to get 40% Off
Book Your Seat Now

CloudPro #124: Most "I Need Intune Admin Rights" Requests Aren't About Admin Rights

Every security team eventually gets there: local admin rights have to go. Fair enough: they're a known weakness, and leaving users as admins on their own machines isn't defensible anymore. So you start looking at EPM, and the easiest thing to do is buy licenses for everyone, turn on self-elevation, and call it solved.

Pause before you do that. You'll spend more than you need to and you'll be using the tool for something it isn't really designed for.

EPM gets pitched, and often understood, as "the safe way to give users admin rights." That framing is the source of most of the trouble. EPM isn't an admin-rights replacement and it isn't an app delivery mechanism. It's elevation control. It lets a specific application run with admin privileges in a specific moment, under a rule you've defined. That's a much narrower job than "make this user an admin," and once you see the difference, the licensing decision gets a lot easier.

Start with app delivery, because most "I need admin" requests aren't actually about admin. They're about getting an app installed. If you push your common business apps via Intune as Required, they install in the system context and the user doesn't need elevation at all.
For the long tail of apps where you're not sure who needs what, make them Available through Company Portal. The user installs them on demand, still in system context, still without elevation. Get this layer right and a huge chunk of the supposed "need for admin rights" disappears.

What's left after that is the actual EPM territory. Someone needs to install something niche that isn't in your catalog and never will be. A support engineer needs to run ProcMon elevated to debug a real issue. A developer needs an elevated PowerShell window. A user needs to restart a stuck service. These are the cases EPM was built for: controlled, rule-based elevation for specific apps and specific moments. Worth assigning a license for. Worth setting up properly.

One distinction worth being clean about: EPM is not application control. If your goal is to stop certain apps from running on your devices, EPM doesn't do that. It decides what gets elevated, not what gets to run in the first place. App Control for Business is the right tool for that, and conflating the two leads to policies that don't do what you think they do.

Which brings the licensing piece into focus. EPM licensing is per-user, and the population that genuinely runs elevated workloads is almost always a fraction of your fleet: engineers, IT support, certain power users. The rest of your users get their apps through Required and Available and never hit an elevation prompt. Licensing everyone "just in case" is the most common way teams overspend on the Intune Suite, and it usually happens because the team skipped the app-delivery question and went straight to the elevation question.

Get the delivery layer right first. EPM is for what's left over.
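The decision order in this issue (delivery first, elevation for what's left, application control as a separate question) can be written as a small triage function. This is only an illustrative Python sketch: the `Request` fields and the goal strings are hypothetical, nothing here calls Intune or any Microsoft API, and the last lines simply count how many users would need an EPM license under this triage.

```python
from dataclasses import dataclass


@dataclass
class Request:
    user: str
    goal: str                     # "install", "run_elevated", or "block_app"
    in_catalog: bool = False      # the app is packaged for Intune delivery
    needed_by_most: bool = False  # a common business app


def triage(req: Request) -> str:
    if req.goal == "block_app":
        # Deciding what may run at all is not an elevation question.
        return "App Control for Business"
    if req.goal == "install" and req.in_catalog:
        # Both delivery modes install in system context: no elevation needed.
        return "Required" if req.needed_by_most else "Available (Company Portal)"
    # Niche installs outside the catalog and genuinely elevated tasks.
    return "EPM elevation rule"


requests = [
    Request("ana", "install", in_catalog=True, needed_by_most=True),
    Request("ben", "install", in_catalog=True),
    Request("cho", "install"),
    Request("dev", "run_elevated"),
    Request("eli", "block_app"),
]

for req in requests:
    print(req.user, "->", triage(req))

# Per-user licensing: only users who end up with an elevation rule need EPM.
epm_users = {r.user for r in requests if triage(r) == "EPM elevation rule"}
print("EPM licenses needed:", len(epm_users), "of", len(requests), "users")
```

Running requests through the delivery checks before the elevation check mirrors the article's point that skipping the app-delivery question is how teams end up licensing everyone.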
FLASH SALE: 40% OFF | 24 HOURS ONLY
Book Your Seat Now

Catch the latest HubSpot Developer Platform updates in Spring Spotlight
Explore Now

📢 If your company is interested in reaching an audience of developers, technical professionals, and decision makers, you may want to advertise with us.

If you have any comments or feedback, just reply to this email.

Thanks for reading and have a great day!

Shreyans from Packt
20 Apr 2026

Three Rules for Designing an MCP Server You Won't Regret

Split read and write. Build specific tools. Validate locally.

Early Bird Offer: Use code EARLY40 to get 40% Off
Book Your Seat Now

CloudPro #123: Three Rules for Designing an MCP Server You Won't Regret

There's a decent chance you're about to build an MCP server for your observability platform, or your team just shipped one. Before it becomes the thing everyone depends on, three design choices will save you a rewrite later.

This matters because once agents start relying on your MCP server, the contract gets sticky. You can change the implementation, but renaming tools or restructuring scopes breaks every instruction file and agentic workflow already pointing at them. The first version tends to become the permanent version.

Rule one: don't mix read and write

It's tempting to expose everything from a single server: get_logs and list_problems alongside the handy create_workflow and send_notification stuff. Don't. Put read-only data access in one server and anything that modifies state or triggers actions in a second, separately installed one.

Two reasons. The obvious one is accidental writes. The less obvious but bigger one is prompt injection. If an agent is connected to both, a poisoned log line or a malicious doc in your RAG pipeline can talk the agent into calling an admin action it had no business calling. Splitting servers shrinks that attack surface. And yes, most MCP clients let users disable individual tools, but realistically most people leave everything on.

Rule two: don't ship only a generic execute_query tool

Every observability backend has a query language: NRQL, DQL, PromQL, whatever yours is. You could expose just execute_query and let the agent figure out the syntax. It'll work, but badly. The agent guesses, gets a syntax error, retries, refines, retries again. Every round-trip costs API calls, tokens, and latency.

Build purpose-specific tools for your top use cases alongside the generic one.
A get_logs tool taking a timespan and workload identifier will run a clean, optimized query on the first try. No guessing.

Don't overcorrect though. get_logs_from_k8s, get_logs_from_hosts, get_logs_from_apps: now you're maintaining ten tools and the agent picks the wrong one half the time anyway. Aim for the middle.

Rule three: validate locally before hitting the backend

When the agent sends a query, check it inside the MCP server first. Is the syntax valid? Does the caller actually have permission to access that data? These are cheap checks and they kill the trial-and-error loop at the source. Without them, every bad query becomes a billable backend call.

Instrument your MCP server and you'll see the pattern fast: most failures are the same two or three mistakes. Catch them locally.

These aren't the only mistakes you can make, but they're the ones that compound quietly until you're rebuilding the server at version two. Worth getting right the first time.

This article was adapted from Observability in the AI-Native Era.
50% Off for the Next 24 Hours: Get The Book

Cheers,
Shreyans Singh
Editor-in-Chief

Early Bird Offer: Use code EARLY40 to get 40% Off
Book Your Seat Now
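Rule one from this issue can be made mechanical. The Python sketch below uses a hypothetical `ToolServer` class as a stand-in for an MCP server (it is not the MCP SDK, and the tool bodies are placeholders) to show the shape: two separately registered servers, with the read-only one refusing to register anything that mutates state.

```python
class ToolServer:
    """Hypothetical stand-in for an MCP server: a named registry of tools."""

    def __init__(self, name, read_only):
        self.name = name
        self.read_only = read_only
        self._tools = {}

    def tool(self, mutates=False):
        def decorator(fn):
            if self.read_only and mutates:
                # Fail at registration time, long before an agent can call it.
                raise ValueError(f"{fn.__name__} mutates state; use the write server")
            self._tools[fn.__name__] = fn
            return fn
        return decorator

    def list_tools(self):
        return sorted(self._tools)

    def call(self, tool_name, **kwargs):
        if tool_name not in self._tools:
            raise KeyError(f"{self.name} has no tool named {tool_name}")
        return self._tools[tool_name](**kwargs)


read_server = ToolServer("observability-read", read_only=True)
write_server = ToolServer("observability-write", read_only=False)


@read_server.tool()
def get_logs(workload, minutes):
    return f"logs for {workload}, last {minutes}m"  # placeholder result


@read_server.tool()
def list_problems():
    return []  # placeholder result


@write_server.tool(mutates=True)
def send_notification(channel, message):
    return f"sent to {channel}"  # placeholder side effect


print(read_server.list_tools())
print(write_server.list_tools())
```

An agent connected only to `read_server` has no path to `send_notification`, whatever a poisoned log line tells it to do, which is the attack-surface reduction the rule is after.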
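Rules two and three from this issue fit together: a purpose-specific get_logs tool has few enough parameters that it can be validated completely before anything reaches the backend. Here is a hedged Python sketch of that idea. The permission table, the limits, and the query string are all made up for illustration (the query text is not NRQL, DQL, or PromQL), and `backend` is any callable you pass in.

```python
import re

# Hypothetical permission table: caller -> workloads it may read.
ALLOWED_WORKLOADS = {"agent-a": {"checkout", "payments"}}
WORKLOAD_RE = re.compile(r"^[a-z0-9][a-z0-9-]{0,62}$")
MAX_MINUTES = 24 * 60


def validate_get_logs(caller, workload, minutes):
    """Return a list of problems; an empty list means the request is safe to send."""
    errors = []
    if not WORKLOAD_RE.match(workload):
        errors.append("workload must be lowercase letters, digits, or hyphens")
    if not isinstance(minutes, int) or not 1 <= minutes <= MAX_MINUTES:
        errors.append(f"minutes must be an integer between 1 and {MAX_MINUTES}")
    if workload not in ALLOWED_WORKLOADS.get(caller, set()):
        errors.append(f"{caller} may not read logs for {workload}")
    return errors


def get_logs(caller, workload, minutes, backend):
    errors = validate_get_logs(caller, workload, minutes)
    if errors:
        # Rejected locally: no billable backend call, and the agent gets
        # every problem at once instead of one syntax error per retry.
        return {"ok": False, "errors": errors}
    # Placeholder query text; a real server would emit its backend's language.
    query = f"logs where workload = '{workload}' since {minutes}m"
    return {"ok": True, "rows": backend(query)}


backend_calls = []


def fake_backend(query):
    backend_calls.append(query)
    return ["example log line"]


print(get_logs("agent-a", "checkout", 15, fake_backend))
print(get_logs("agent-a", "Bad Name!", 0, fake_backend))
print("backend calls made:", len(backend_calls))
```

Returning all validation errors together is a deliberate choice: it removes the guess, fail, retry loop the article describes, since the agent can correct everything in one pass.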

Shreyans from Packt
06 Mar 2026

Still Using DeploymentConfig? Here's Why It's Time to Move On

Learn OpenShift 4
CloudPro #122: Still Using DeploymentConfig? Here's Why It's Time to Move On

We're building something new. A dedicated newsletter for network engineers working with AI, where we will talk about real challenges of bringing AI into production networks. Before we launch, we want to hear from you. This is a 30-second survey, and your answers will directly shape what we cover.
Take the Survey

If you've been running OpenShift for a few years, you almost certainly have DeploymentConfigs somewhere in your cluster. They work fine. They've always worked fine.

But they've been officially deprecated since OpenShift 4.14, and it's worth understanding why. Not just because Red Hat says so, but because the underlying technology has genuinely moved on.

Some context. When OpenShift 3 shipped back in 2015, Kubernetes didn't have great deployment management. DeploymentConfig filled that gap: lifecycle hooks, image change triggers, custom rollout strategies. It was ahead of its time, honestly.

But Kubernetes caught up. Deployments now do automated rollbacks, HPA-based autoscaling, pause and resume during rollouts. And they're built on ReplicaSets, not ReplicationControllers. That bit matters because ReplicationController development has stopped upstream entirely. The foundation DeploymentConfig sits on isn't getting any love anymore.

There's a design difference worth knowing about too. DeploymentConfig leans toward consistency: if the node running your deployer pod dies, it just waits. Waits for the node to recover, or for someone to manually step in. Deployments lean toward availability: the controller manager runs across multiple masters, so another one picks up the work. In production, you usually want that.

The other thing is portability. DeploymentConfigs only exist in OpenShift. Deployments are standard Kubernetes.
If you ever need to move workloads between clusters or providers, that distinction starts to matter a lot.

And by sticking with DeploymentConfig, you're also cutting yourself off from tooling. No Argo Rollouts for canary or blue-green. No native HPA. No automated rollback. Manual scaling, manual rollback. It's fine until you're doing it at 2am during an incident and wishing you weren't.

Nobody's flipping a switch on you tomorrow. DeploymentConfigs still run. But "it still works" isn't really a strategy. If you haven't started thinking about this, now's a good time.

Cheers,
Shreyans Singh
Editor-in-Chief

Read The Article

This article was adapted from Learn OpenShift.

Get The Book

Early Bird Offer LIVE Now: 40% Off. Use code EARLYBIRD40
Book Your Seat Now

Early Bird Offer LIVE Now: 50% Off. No Code Needed.
Book Your Seat Now

📢 If your company is interested in reaching an audience of developers, technical professionals, and decision makers, you may want to advertise with us.

If you have any comments or feedback, just reply to this email.

Thanks for reading and have a great day!
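To make the DeploymentConfig-to-Deployment move discussed in this issue a little more concrete, here is a minimal Python sketch of the manifest-level differences. It is not taken from the book or from any Red Hat tooling: the function name `dc_to_deployment` and the "manual follow-up" list are hypothetical, and the sketch assumes the manifest has already been parsed into a dict. It only covers the fields that map directly (API group, selector shape, strategy name) and flags the DeploymentConfig-only features (triggers, lifecycle hooks) rather than trying to translate them.

```python
import copy

# Strategy names differ between the two APIs; "Custom" has no direct equivalent.
_STRATEGY_MAP = {"Rolling": "RollingUpdate", "Recreate": "Recreate"}


def dc_to_deployment(dc: dict) -> tuple[dict, list[str]]:
    """Translate a DeploymentConfig manifest (already parsed into a dict) into a
    Deployment manifest, plus a list of features that need manual follow-up."""
    if dc.get("kind") != "DeploymentConfig":
        raise ValueError("expected a manifest with kind: DeploymentConfig")

    spec = dc.get("spec", {})
    strategy = spec.get("strategy", {})
    strategy_type = strategy.get("type", "Rolling")  # Rolling is the DC default
    if strategy_type not in _STRATEGY_MAP:
        raise ValueError(f"strategy {strategy_type!r} has no direct Deployment equivalent")

    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        # Copied as-is; strip server-set fields first if this came from a live cluster.
        "metadata": copy.deepcopy(dc.get("metadata", {})),
        "spec": {
            "replicas": spec.get("replicas", 1),
            # DeploymentConfig uses a flat label map; Deployment wraps it in matchLabels.
            "selector": {"matchLabels": dict(spec.get("selector", {}))},
            "strategy": {"type": _STRATEGY_MAP[strategy_type]},
            "template": copy.deepcopy(spec.get("template", {})),
        },
    }

    # Features that exist only on DeploymentConfig are reported, not translated.
    manual = []
    if spec.get("triggers"):
        manual.append("triggers")
    for params in ("rollingParams", "recreateParams"):
        hooks = [h for h in ("pre", "mid", "post") if h in strategy.get(params, {})]
        if hooks:
            manual.append(f"lifecycle hooks ({', '.join(hooks)})")
    return deployment, manual


# Illustrative input only; the names and labels are made up.
sample_dc = {
    "apiVersion": "apps.openshift.io/v1",
    "kind": "DeploymentConfig",
    "metadata": {"name": "web"},
    "spec": {
        "replicas": 3,
        "selector": {"app": "web"},
        "strategy": {"type": "Rolling", "rollingParams": {"pre": {"failurePolicy": "Abort"}}},
        "triggers": [{"type": "ConfigChange"}],
        "template": {"metadata": {"labels": {"app": "web"}}},
    },
}
sample_deployment, needs_review = dc_to_deployment(sample_dc)
print(sample_deployment["spec"]["selector"], needs_review)
```

The sketch deliberately refuses to guess at triggers and hooks: those are the parts of a migration where someone has to decide what replaces them, so surfacing them as a review list seems safer than silently dropping them.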

Shreyans from Packt
19 Feb 2026

Your Kubernetes cluster wasn't built for AI workloads

By Shadab Hussain and Sandeep Raghuvanshi

CloudPro #120

Most AI workshops teach you how to deploy a model. This one teaches you what happens after: when the traffic spikes, the GPU scheduling fails, and your platform team is debugging it at 2 AM.

Early Bird Offer LIVE Now: 40% Off With Code EARLY40
Book Your Seat Now
Offer Ends in 72 Hours

Learn from engineers who run AI infrastructure at scale:
  • Shadab Hussain: Google Developer Expert (AI/ML), Lead Engineer at MathCo
  • Sandeep Raghuwanshi: Former Kubernetes SME at Microsoft, Head of DevOps & InfoSec at Bureau
  • Nicolas Vermandé: Senior Developer Advocate at ScaleOps, CCIE #47363 & VCDX #55
  • Derek Ashmore: Agentic AI Enablement Principal at Asperitas Consulting

What you'll walk away with:
  • Infrastructure patterns for safely running AI workloads and agents on K8s
  • GPU scheduling and scaling strategies that hold up under real load
  • Zero-trust security controls for AI agent traffic
  • A tested playbook for debugging AI-related production failures
  • First-hand experience from a capstone incident simulation: traffic spike, resource contention, partial failure

Cheers,
Shreyans Singh
Editor-in-Chief

📢 If your company is interested in reaching an audience of developers, technical professionals, and decision makers, you may want to advertise with us.

If you have any comments or feedback, just reply to this email.

Thanks for reading and have a great day!
Shreyans from Packt
13 Feb 2026

How to Build Always-On Applications on Azure

By Stephane Eyskens

CloudPro #119

WATCH NOW

Today’s CloudPro Expert Article comes from Stéphane Eyskens, a Microsoft Azure MVP and seasoned solution architect with over a decade of experience designing enterprise-scale cloud systems. Stéphane is the author of The Azure Cloud Native Architecture Mapbook (2nd Edition), a comprehensive guide featuring over 40 detailed architecture maps that has earned 5.0 stars on Amazon and become an essential resource for cloud architects and platform engineers.

In the article below, Stéphane tackles one of the most challenging aspects of Azure architecture: building truly resilient multi-region systems, with concrete examples using Azure SQL, Cosmos DB, and Azure Storage, complete with code samples and Terraform scripts you can adapt for your own DR testing.

Happy reading!
Shreyans Singh
Editor-in-Chief

SAVE THE ARTICLE FOR LATER

The Azure Cloud Native Architecture Mapbook, Second Edition
Design and build Azure architectures for infrastructure, applications, data, AI, and security.
Get 40% off eBook and 20% off Paperback for the next 72 hours
GET THE BOOK

I like to say that Azure is simple… until you go multi-region. The transition from a well-designed single-region architecture to a truly resilient multi-region setup is where simplicity gives way to nuance. Concepts that seemed abstract (high availability versus disaster recovery, failover semantics, DNS behavior, data replication guarantees) suddenly become very real, very concrete, and sometimes painfully operational.

This article is written for architects and senior platform engineers who already understand the fundamentals but are required to build solutions that must remain available despite regional outages, service failures, or infrastructure-level incidents. The scope is intentionally narrowed to Recovery Time Objective (RTO). Data corruption, ransomware, and backup-based recovery are explicitly out of scope.
Instead, the focus is on how applications and data services behave during live failover scenarios, and how architectural decisions, sometimes subtle ones, can make the difference between a seamless transition and a prolonged outage.

Through concrete examples using Azure SQL, Cosmos DB, and Azure Storage, this article explores how replication models, DNS design, private endpoints, and SDK behavior interact at runtime, and what architects must do to ensure their applications remain functional when regions fail.

Rather than focusing on theoretical patterns, the goal here is pragmatic: minimizing downtime and operational friction when things do go wrong. You’ll see diagrams, Terraform and deployment scripts, plus .NET code samples you can adapt for your own DR tests and game days.

Before getting into the details, let’s briefly revisit the difference between high availability (HA) and disaster recovery (DR).

HA and DR exist on a spectrum, with increasing levels of resilience depending on the type of failure you want to withstand:
  • Application-level failures: In some cases, you may simply want to tolerate application bugs, for example a memory leak introduced by developers. Running multiple instances of the application on separate virtual machines, even on the same physical host, can already prevent a full outage when one instance exhausts its allocated memory. That is, for instance, what you would get if you spin up two instances of an Azure App Service within the same zone (no zone redundancy).
  • Hardware failures: To handle hardware failures, workloads should be distributed across multiple racks. That is what you would get if you hosted virtual machines in availability sets.
  • Data center-level outages: To withstand more severe incidents, workloads should be spread across multiple data centers, such as by deploying them across multiple availability zones. You can achieve this by turning on zone redundancy on Azure App Service or by using zone-redundant node pools in AKS. With such a setup, you should survive a local disaster such as a fire or flooding.
  • Regional outages: Finally, to survive major outages, such as a major earthquake or a country-level power supply issue, workloads must be deployed across geographically distant data centers. You can achieve this by deploying workloads across multiple Azure regions in active/active or active/passive mode.

Looking at Azure SQL

Let’s first analyse the different data replication possibilities with Azure SQL. Table 1 summarizes the different capabilities.

Table 1 – Replication capabilities

We’ll set aside named replicas and geo-restore, as the former does not contribute to disaster recovery and the latter is likely to introduce significant downtime and potential data loss. This leaves geo-replication as the remaining option. As you might have understood by now, using Azure SQL’s built-in capabilities, you cannot achieve a full active/active setup since it doesn’t support multi-region writes. This means that you can only have one read-write region; the secondary region(s) are read-only.

Table 2 outlines the two available geo-replication techniques.

Table 2 – Geo-replication options

Active geo-replication may require updates to connection strings or DNS records to point to the new primary after a failover. That said, the actual impact depends on where (*) the client application is located as well as how you deploy to both regions. Let’s look at this in more detail.
Figure 1 illustrates an active geo-replication setup between Belgium Central and France Central.

Figure 1 – SQL geo-replication with active geo-replication

In such a setup, under normal circumstances:
  • Workloads in the primary region (Belgium Central) can connect to the primary server in read/write mode.
  • Workloads in the primary region can perform read-only activities against the secondary replica, provided they tolerate the extra latency incurred by the round trip to the remote region (France Central).
  • Workloads in the secondary region (if any) can perform read-only operations against the read replica with no extra latency.

The configuration shown in Figure 1 supports a database-only failover. Both regions expose private endpoints to both SQL servers and rely on region-scoped DNS zones.

Although Private DNS zones are global by design, keeping them regional allows each region to resolve both the primary and secondary servers. This requires four DNS records in total: primary and secondary endpoints registered in each regional zone.

With a single shared DNS zone, this would not be possible: while all four private endpoints could be deployed, only two DNS records would be registered, since the endpoints map to just two FQDNs (primary and secondary). While the single-zone approach works, it keeps the regions siloed and prevents any cross-region traffic. From a resilience standpoint, it is preferable to provide as many fallback paths as possible.

Moreover, as we will see later with other resources such as Storage Accounts, a single DNS zone would force us to update the DNS records upon failover, causing brief downtime. Bottom line: using multiple DNS zones prevents issues during failover.

Back to active geo-replication! In case of failover, SQL servers switch roles: the primary becomes secondary and vice versa.
This concretely means that the connection string primary.database.windows.net targets the read/write region in a normal situation but a read-only or unavailable one after failover. Workloads using this connection string would either stop working (if the regional outage persists) or talk to a read-only database instead of a read-write one once the failover has completed. Similarly, the connection string secondary.database.windows.net, which targets the read-only region under normal circumstances, targets the read-write one after failover.

Knowing this, a few options exist:
  • You may choose to fail over everything (database and compute). In that scenario, workloads running in the secondary region can use their default secondary connection string, which will automatically target the new primary after failover. This approach requires the deployment pipeline to be region-aware, detect the target region, and apply the appropriate connection string. When deployed in the primary region, the application should use primary.database.windows.net, while in the secondary region it should already be configured with secondary.database.windows.net. This design eliminates the need for any connection string changes after failover. If your web apps, K8s pods, etc. are already up and running, the only thing you still have to do is route traffic to them. Any other SQL client not running in the secondary region (e.g. on-premises) would have to update its connection string to target the new primary.
  • You may choose to redeploy the compute infrastructure (web apps, etc.) to the secondary region only in case of regional outage. This approach is cheaper but risky, as you’re not guaranteed to have the available capacity and it causes significant downtime. However, such an approach allows you to adjust your pipelines, specify the right connection string, and simply redeploy your infrastructure and/or application package.
  • If you want to deploy the application with the exact same settings in both regions, you’ll need to update the connection string used by workloads in the secondary region, since primary.database.windows.net will now resolve to an unavailable server after failover. If the original primary later comes back online, it will return as a secondary (read-only) replica, which would not support write operations. You can also make your application failover-aware (**).

You can’t simply update DNS, that is, make secondary target primary and vice versa, because the FQDN (primary-or-secondary.database.windows.net) is validated by the target server and the names must match, so redirecting it to a different server would simply fail.

In conclusion, when using active geo-replication as the replication technique, you should make your applications failover-aware (**), pre-provision both connection strings, and implement the failover/retry logic in the application code itself. You may wrap your Entity Framework context in a factory to abstract away the retry logic. Given we typically use a scoped lifetime, you may expect some HTTP requests to fail (in the case of an API), but new instances targeting the right server would ultimately succeed without having to restart the application. You may also use a geo-redundant Azure App Configuration and fail it over along with SQL, then switch the primary server connection string after failover.
The SDK allows you to monitor a sentinel key and to reload the configuration without having to restart the application.

Read The Full Article by Stephane Here

18 cloud architecture books in one bundle, including AWS for Solutions Architects, Kubernetes for Generative AI Solutions, and more. 2000+ bundles already sold.
Get The Bundle at $858 $5.90

48-Hour Flash Sale: 40% off with code FLASH40
Book Your Seat Now

Early Bird Offer LIVE Now: 40% Off With Code EARLY40
Book Your Seat Now

Webinar: How to Build Faster with AI Agents
Save Your Seat

📢 If your company is interested in reaching an audience of developers, technical professionals, and decision makers, you may want to advertise with us.

If you have any comments or feedback, just reply to this email.

Thanks for reading and have a great day!
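As a rough illustration of the failover-aware approach described in the Azure SQL article above (pre-provision both connection strings, then let the application find the writable server), here is a minimal Python sketch. It is not Stéphane's code, which the article says is written in .NET; the class `FailoverAwareFactory`, the exception `NoWritableServer`, and the injected `is_writable` probe are all hypothetical names. In a real deployment the probe might open a connection and check a property such as `DATABASEPROPERTYEX(DB_NAME(), 'Updateability')`; here it is simulated with a dict so the example runs on its own.

```python
from typing import Callable, Optional, Sequence


class NoWritableServer(RuntimeError):
    """Raised when none of the pre-provisioned servers accepts writes."""


class FailoverAwareFactory:
    """Remembers which pre-provisioned server is writable and re-probes on failure."""

    def __init__(self, servers: Sequence[str], is_writable: Callable[[str], bool]):
        self._servers = list(servers)
        self._is_writable = is_writable  # hypothetical probe supplied by the caller
        self._current: Optional[str] = None

    def server(self) -> str:
        """Return the server new connections should target."""
        if self._current is None:
            self._current = self._probe()
        return self._current

    def report_failure(self) -> None:
        """Call after a write fails so the next request re-probes both servers."""
        self._current = None

    def _probe(self) -> str:
        for server in self._servers:
            try:
                if self._is_writable(server):
                    return server
            except ConnectionError:
                continue  # server unreachable, try the next candidate
        raise NoWritableServer("no pre-provisioned server accepts writes")


# Simulated server roles standing in for a real connectivity check.
roles = {
    "primary.database.windows.net": "read_write",
    "secondary.database.windows.net": "read_only",
}


def fake_is_writable(server: str) -> bool:
    role = roles[server]
    if role == "down":
        raise ConnectionError(server)
    return role == "read_write"


factory = FailoverAwareFactory(list(roles), fake_is_writable)
before_failover = factory.server()

# Simulate a regional outage followed by failover: the roles switch.
roles["primary.database.windows.net"] = "down"
roles["secondary.database.windows.net"] = "read_write"
factory.report_failure()
after_failover = factory.server()
print(before_failover, "->", after_failover)
```

Requests already in flight when the roles switch would still fail, which matches the article's point about scoped lifetimes: the factory only guarantees that new instances pick up the right server without an application restart.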

Shreyans from Packt
06 Feb 2026

A blueprint for cyber resilience...

CloudPro #118

Attackers are actively trying to keep you from recovering

In the event of a cyberattack, the cost of downtime is measured not just in financial terms, but in operational disruption and reputational damage. While prevention strategies are crucial, they are not a substitute for a robust recovery plan.

Backups alone do not guarantee a clean restoration.

We invite you to our virtual event, Foundations of Cyber Resilience, on 11 February, where we will provide a practical framework for what happens after a breach.

You will learn:
  • Why traditional recovery strategies can fail when they are needed most.
  • How to detect and eliminate threats within your backups to prevent reinfection.
  • Key components of a modern, orchestrated, and clean recovery process.

REGISTER NOW

Next week in CloudPro, we're dropping something special: a deep-dive from Microsoft Azure MVP Stéphane Eyskens that every cloud architect needs to read.

If you've ever wondered why your multi-region Azure setup feels more complex than it should, or if you're still figuring out what actually happens when a region goes down, this one's for you.

Stéphane, author of the 5-star rated Azure Cloud Native Architecture Mapbook, is sharing battle-tested patterns for building truly resilient systems using Azure SQL, Cosmos DB, and Storage. We're talking real code, Terraform scripts, and the kind of insights you only get from years in the trenches.

Here's a sneak peek into what's coming...

Cheers,
Shreyans Singh
Editor-in-Chief

How to Build Always-On Applications on Azure
By Stephane Eyskens

A Sneak Peek

Before getting into the details, let's briefly revisit the difference between high availability (HA) and disaster recovery (DR).

HA and DR exist on a spectrum, with increasing levels of resilience depending on the type of failure you want to withstand:
  • Application-level failures: In some cases, you may simply want to tolerate application bugs, for example a memory leak introduced by developers. Running multiple instances of the application on separate virtual machines, even on the same physical host, can already prevent a full outage when one instance exhausts its allocated memory. That is, for instance, what you would get if you spin up two instances of an Azure App Service within the same zone (no zone redundancy).
  • Hardware failures: To handle hardware failures, workloads should be distributed across multiple racks. That is what you would get if you hosted virtual machines in availability sets.
  • Data center-level outages: To withstand more severe incidents, workloads should be spread across multiple data centers, such as by deploying them across multiple availability zones. You can achieve this by turning on zone redundancy on Azure App Service or by using zone-redundant node pools in AKS. With such a setup, you should survive a local disaster such as a fire or flooding.
  • Regional outages: Finally, to survive major outages, such as a major earthquake or a country-level power supply issue, workloads must be deployed across multiple Azure regions in active/active or active/passive mode.

Next week, Stéphane walks through exactly how to architect for each scenario, with diagrams, code, and real failover examples you can test yourself. Don't miss it.

Early Bird closes in 72 hours. Last Few Seats At This Price.
Book Your Seat Now
Use code EARLYBIRD40 to get 40% Off

We're running a 5-hour workshop on architecting production-grade GenAI systems on AWS. Hands-on, practical, built for cloud architects and engineers.

Here's the deal:

Most GenAI content is either toy demos that work once or vendor pitches.
This isn't that.

We took real production problems (models breaking after launch, RAG pipelines failing silently, agents that cost too much or hallucinate in production) and turned them into architectural patterns using AWS services and real-world trade-offs.

You'll learn how to pick the right AWS model for quality, cost, and latency.
You'll build and tune RAG pipelines that don't break when data changes.
And you'll understand when to use agents versus when they'll create more problems than they solve.

Early Bird Offer LIVE Now: Get 40% Off Tickets
Book Your Seat Now
Use code EARLY40 to get 40% Off

📢 If your company is interested in reaching an audience of developers, technical professionals, and decision makers, you may want to advertise with us.

If you have any comments or feedback, just reply to this email.

Thanks for reading and have a great day!

Shreyans from Packt
20 Jan 2026

Your AI data is everywhere. Here’s how to actually see and secure it on Feb 3.

CloudPro #117

Every week, AI helps your team work faster, but it also increases your data’s exposure. Files move between new tools, models use sensitive data, and traditional DLP often misses the most important context.

On February 3 at 11:00 AM PT, we’ll introduce Cyberhaven’s data lineage-powered, unified DSPM and DLP platform. You’ll see how one AI-native solution can finally keep up with the way data really moves.

Join us live for:
  • The first public demo of our unified AI and data security platform, designed for the challenges of 2026 and beyond, including SaaS sprawl, shadow AI tools, and constantly moving data.
  • A look at how security teams gain x-ray vision into data usage, so they can spot the risky handful of actions hidden in millions of “normal” events and stop them in real time, not after the damage is done.
  • Honest stories from security leaders about where legacy DLP and standalone DSPM fall short, and how they are rethinking data protection by focusing on context instead of fixed rules.
  • A preview of what’s next for DLP, insider risk, AI security, and DSPM from Cyberhaven’s product and leadership teams, along with our future investment plans.

Register Now

Don’t wait for another AI-related incident to reveal gaps in your data security. Reserve your spot and be among the first to see how a unified DSPM and DLP platform can change how your organization protects its most important data.

The official Kubernetes Dashboard is getting archived after a decade. No active contributors or maintainers left. End of an era for one of the earliest K8s UI projects.

Meanwhile, someone trained LLMs on three years of incident postmortems and built systems that predict outages 15-45 minutes before alerts fire.
We're also covering K8s 1.35's in-place pod restarts, why learning Linux primitives makes Kubernetes finally click, and a Palo Alto DoS flaw that crashes firewalls into maintenance mode.

Plus: 20+ tools that auto-generate K8s diagrams and a game where you fix 50 broken clusters to learn.

Cheers,
Shreyans Singh
Editor-in-Chief

3 Days Remaining: Book Your Seat Now
Get 30% Off
Use code FINAL30

This Week in Cloud

Kubernetes 1.35 lets you restart entire pods in-place

K8s 1.35 adds in-place pod restart (alpha, behind the RestartAllContainersOnContainerExits feature gate), which is huge for AI/ML workloads. Previously, if an init container corrupted the environment or a sidecar failed, you had to delete the entire pod and let the scheduler recreate it, which is slow and expensive. Now you can trigger a full restart that preserves pod UID, IP, network namespace, sandbox, and volumes: everything except ephemeral containers. All init containers rerun from scratch, giving you a clean state.

Training AI on your incident history predicts outages 15-45 minutes early

Someone trained LLMs on three years of incident postmortems and built systems that predict failures 15-45 minutes before traditional alerts fire.

The trick is extracting causal embeddings. Not just "symptom and cause are related" but learning the transformation from "what we observed" to "what was actually wrong." They decompose incidents into structured reasoning chains, create separate vector spaces for symptoms/causes/resolutions/precursors, then continuously pattern-match current system state against historical precursor embeddings.

Every tool that generates Kubernetes architecture diagrams

Huge GitHub repo comparing 20+ tools that generate K8s architecture diagrams from manifests, APIs, Helm charts, etc.

KubeDiagrams leads with 47+ resource types supported, reads from manifests/kustomize/Helm/API, outputs to PNG/SVG/PDF/DOT, and supports namespace/label clustering. Most tools use Python with the Diagrams library; some use Go/TypeScript/Java.
Common pattern: 60% support KIS (Kubernetes Icons Set), 45% do namespace clustering, 95% show Services, 80% show Deployments.

Learn Kubernetes by fixing 50 broken clusters

Open source game-based K8s training with 50 progressive challenges across 5 worlds (Core Basics, Deployments, Networking, Storage, Security). Each level breaks something in K8s and you fix it using kubectl. Has real-time monitoring with a "check" command, progressive hints, step-by-step guides, and post-mission debriefs explaining why your fix worked.

Palo Alto patched a DoS flaw that crashes firewalls into maintenance mode

Palo Alto patched CVE-2026-0227 (CVSS 7.7), a DoS vulnerability in PAN-OS firewalls with GlobalProtect enabled that lets unauthenticated attackers crash firewalls into maintenance mode. PoC code already exists and a researcher reported it, though there is no active exploitation yet. This is almost identical to CVE-2024-3393 from late 2024, which was a zero-day.

Early Bird Offer: 40% Off for 72 Hours
Get 40% Off
Use code EARLY40

Deep Dive

Why you should learn Linux before diving into Kubernetes

Docker didn't invent containers. It wrapped existing Linux features (cgroups, namespaces) that Google had been using for years into a simple interface anyone could use. Every K8s feature relies on Linux primitives: pod isolation uses namespaces (PID, network, mount, user, IPC), resource limits use cgroups, networking uses iptables/nftables for ClusterIP services and NAT, network policies use packet filtering, images use OverlayFS for layered filesystems, and Cilium uses eBPF for high-performance networking instead of iptables. When you create a Pod, you're orchestrating Linux isolation and resource management tools. Understanding namespaces, cgroups, and network filtering makes K8s and Docker click: you realize they're just convenient wrappers over powerful Linux capabilities.
Learn the foundation first; the abstractions make way more sense after.

Auto-comment K8s manifest changes on PRs

Go tool that receives GitHub webhooks for PRs, auto-discovers ArgoCD apps configured with that repo as source, generates diffs against live state using the ArgoCD CLI, and comments on PRs with markdown showing what would change. No per-repo configuration needed.

How etcd actually works (and why Kubernetes uses it)

etcd is a strongly consistent distributed key-value store using the Raft consensus algorithm. All writes go through an elected leader, changes replicate to followers, and a new election happens if the leader dies. Production clusters typically run 3 or 5 nodes (odd sizes, because writes need a majority and an extra even node adds no fault tolerance). K8s stores everything under the /registry prefix with naming like /registry/pods/<namespace>/<pod-name>, and uses prefix queries and watch subscriptions for real-time updates. This is how controllers and operators subscribe to resource changes.

Kubernetes Dashboard is being archived after a decade

The official Kubernetes Dashboard project is getting archived: it has no active contributors, and the maintainers have run out of time to work on it. Started in 2015 when K8s was still new, it served the community for over a decade, but ecosystem needs have changed significantly. End of an era for one of the earliest K8s UI projects, but it makes sense given how much the tooling landscape has evolved since 2015.

Self-healing infrastructure is running in production right now

Autonomous healing infrastructure isn't science fiction. It's operational in production serving millions of users, and the difference from past attempts is reasoning capability.
The architecture needs four pieces:
  • a decision engine combining rule-based policies with LLM reasoning for edge cases;
  • a safety sandbox that never executes directly in prod (snapshots state, enhanced monitoring, automatic rollback on any degradation);
  • a graduated action library (green/yellow/red based on risk);
  • a learning loop where every action generates training data to improve confidence scores.

📢 If your company is interested in reaching an audience of developers, technical professionals, and decision makers, you may want to advertise with us.

If you have any comments or feedback, just reply to this email.

Thanks for reading and have a great day!
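The etcd item in this issue mentions majority quorums and the /registry key layout. As a small worked example (not from the linked article), the Python sketch below shows the standard majority arithmetic behind the "3 or 5 nodes" advice and builds a key in the /registry/pods/<namespace>/<pod-name> shape quoted above. The function names are made up for illustration, and the generic `registry_key` helper simply assumes other namespaced resources follow the same pattern as the pods example.

```python
def quorum(members: int) -> int:
    """Smallest number of members that forms a majority."""
    return members // 2 + 1


def failures_tolerated(members: int) -> int:
    """How many members can fail while a majority is still reachable."""
    return members - quorum(members)


def registry_key(resource: str, namespace: str, name: str) -> str:
    """Build a key in the /registry/<resource>/<namespace>/<name> shape."""
    return f"/registry/{resource}/{namespace}/{name}"


# Compare cluster sizes: going from 3 to 4 members raises the quorum
# but does not raise the number of failures the cluster can survive.
for size in range(1, 7):
    print(f"members={size} quorum={quorum(size)} tolerates={failures_tolerated(size)}")

print(registry_key("pods", "default", "web-0"))
```

A 4-member cluster needs 3 votes and survives one failure, the same as a 3-member cluster, which is why the even sizes are usually skipped.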
Shreyans from Packt
16 Jan 2026

[New IT leader’s guide] Your blueprint for cloud resilience

CloudPro #116

Attackers are actively trying to keep you from recovering

It’s a hard truth, but recent intelligence confirms that cloud-native backup is now a primary target for groups like Storm-0501.

To survive these threats, you need more than just infrastructure and data durability. You need a strategy built for an active adversary, one that includes mindset, architecture, and preparation.

Read The Four Levels of Cloud Cyber Resilience: An IT Leader’s Guide to:
  • Understand why relying on your cloud provider’s “uptime” gives you false confidence against targeted attacks
  • Uncover the blind spots in your current security stack that will prevent fast recovery when seconds count
  • Get the blueprint for upleveling your cloud cyber resilience

Make sure your cloud can survive a cyberattack.

Read Now

In today's CloudPro, we'll look at self-healing infrastructure that actually works and is already running in production.

Grafana's taking a similar approach with AI agents that investigate incidents in 13 minutes instead of hours, which could save your team about $90k/year in senior engineering time.

Meanwhile, DORA's latest research on 5,000 tech professionals figured out which AI capabilities actually separate high performers from struggling teams.

We've also got the real reason your network automation keeps failing, AWS's new pentesting agent, and why the Orca-Wiz patent war finally ended.

Cheers,
Shreyans Singh
Editor-in-Chief

24 Hours Remaining: Book Your Seat Now
Get 40% Off
Use code FINAL40

This Week in Cloud

Network automation keeps failing because your data is a mess

Network teams keep kicking off "source of truth" projects to consolidate scattered data, but EMA found these are "long and painful endeavors." The blockers: execs don't get why you need $60k for a database when apps are running fine, your network data lives in spreadsheets and random IPAMs with everyone doing their own thing, and even after you build it, engineers keep making CLI changes that drift everything out of sync.
The fixes are obvious but hard: get exec buy-in, use discovery tools, integrate with everything, and lock down CLI access until people actually trust it more than their spreadsheets.

Kubernetes 1.35 adds structured debugging endpoints

K8s 1.35 enhances z-pages debugging endpoints like /statusz and /flagz with structured JSON responses instead of just plain text. Now you can programmatically query component state for automated health checks and better debugging tools without parsing text output. It's still alpha and requires feature gates, but if you're building internal tooling or want to automate component validation, it's worth experimenting with in test environments.

Google wants gRPC as an official MCP transport

Model Context Protocol uses JSON-RPC, but enterprises running gRPC-based services need transcoding gateways. So Google's working with the MCP community to support gRPC as a pluggable transport directly in the SDK. gRPC gives you binary encoding (10x smaller messages), full duplex streaming, built-in flow control, mTLS, and method-level authorization. MCP maintainers agreed to support pluggable transports and Google will contribute a gRPC package soon.

Grafana built AI agents that investigate incidents for you

Grafana's Assistant Investigations deploys specialized AI agents in parallel during incidents. They analyze metrics, logs, traces, and profiles simultaneously to build a comprehensive picture in 13 minutes instead of the 2-4 hours a human takes. Real example: for a payment service latency issue, it detected connection pool exhaustion and traced it to a recent deployment in minutes, with zero PromQL knowledge needed. A conservative estimate puts the savings at 50 hours/month of senior engineering time, or $90k/year in reclaimed expertise.
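The $90k figure is easy to sanity-check. Here's a rough sketch that back-solves the hourly rate from the two numbers in the Grafana item; the $150/hour value is implied by those numbers rather than stated anywhere, and the function name is just for illustration.

```python
# Back-of-the-envelope check of the reclaimed-time estimate.
HOURS_SAVED_PER_MONTH = 50     # from the item above
ANNUAL_SAVINGS_USD = 90_000    # from the item above

hours_per_year = HOURS_SAVED_PER_MONTH * 12
# Assumption: the estimate is simply hours * rate, so the rate falls out by division.
implied_hourly_rate = ANNUAL_SAVINGS_USD / hours_per_year


def annual_savings(hours_per_month: float, hourly_rate: float) -> float:
    """Yearly value of reclaimed engineering time (hypothetical helper)."""
    return hours_per_month * 12 * hourly_rate


print(hours_per_year)            # 600
print(implied_hourly_rate)       # 150.0
print(annual_savings(50, 150))   # 90000
```

Swap in your own loaded rate and hours to see whether the math holds for your team.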
Free during public preview, worth trying for three weeks to prove ROI.

Self-healing infrastructure is here and it's not about replacing SREs

Autonomous healing infrastructure is running in production serving millions of users, and the difference from past attempts is reasoning capability. Systems can finally understand context and make decisions that used to need human judgment. Most orgs are stuck at Level 2 (automated detection, human fixes) but we've deployed Level 5 (predictive prevention) for specific failure classes. Real results: memory leak auto-remediation in 7 minutes vs 35 minutes with humans, 73% autonomous resolution rate, 81% reduction in after-hours pages. The architecture needs four pieces: decision engine, safety sandbox, action library, and learning loop. The future of infrastructure is autonomous; the question is whether you can afford not to build it.

7 Days Remaining: Book Your Seat Now. Get 30% Off. Use code FINAL30

Deep Dive

DORA figured out which AI capabilities actually matter

DORA's 2025 report on 5,000 tech professionals found AI adoption is universal but success varies wildly because AI amplifies what you already are: it makes high performers better and struggling teams worse. They identified seven capabilities that determine if AI helps or hurts: clear AI stance, healthy data ecosystems, AI-accessible internal data, strong version control, small batches, user-centric focus, and quality platforms.

AWS Security Agent does automated pentesting (and it's actually useful)

AWS launched Security Agent at re:Invent: an AI agent that runs continuous penetration testing on your apps, currently in free preview. A test against DVWA took ~2 hours and found tons of vulns with actual PoC steps to reproduce, not vague scanner output.
It definitely helps reduce pentest time, but you still need your own manual testing: think of it as a teammate, not a replacement.

AWS Direct Connect now supports chaos engineering with FIS

AWS Direct Connect now integrates with Fault Injection Service so you can run controlled chaos experiments testing BGP session disruptions on your Virtual Interfaces. You can validate that traffic actually routes to redundant VIs when the primary BGP session fails and your apps keep working as expected. Basically, it's chaos engineering for your Direct Connect architecture before a real outage proves your failover doesn't work.

Orca and Wiz dropped their patent lawsuit slugfest

Orca and Wiz agreed to dismiss all claims in their dueling patent lawsuits after the US Patent Board invalidated three of Orca's six asserted patents for lacking novelty. The whole mess started in July 2023 when Orca accused Wiz of copying their architecture, Wiz countersued, and now it's over, 10 months after Google agreed to acquire Wiz for $32 billion. Orca's worth $1.8B by comparison and has shrunk headcount 7% while Wiz nearly tripled to 3,150 employees.

📢 If your company is interested in reaching an audience of developers, technical professionals, and decision makers, you may want to advertise with us.

If you have any comments or feedback, just reply back to this email.

Thanks for reading and have a great day!

Shreyans from Packt
16 Dec 2025

How Google Built a Kubernetes Cluster with 130,000 nodes

AWS launched a DevOps Agent that actually debugs production for you

CloudPro #115

Elevate Your Cloud Security Strategy with Dark Reading

As a cloud security professional, you need cutting-edge insights to stay ahead of evolving vulnerabilities. The Dark Reading daily newsletter provides in-depth analysis of cloud vulnerabilities, advanced threat detection, and risk mitigation strategies. Stay informed on zero-trust architecture, compliance frameworks, and securing complex multi-cloud and hybrid environments. Sign up for the newsletter

In today's issue, we'll look at: Google pushes Kubernetes to 130K nodes (yes, really), AWS launches an AI agent that debugs production while you sleep, and the network jobs market sends mixed signals: AI certs pay 12% more while automation threatens to eliminate a fifth of IT roles. Plus, hard lessons from recent AWS and Cloudflare outages that went global from single subsystem failures.

Cheers,
Shreyans Singh
Editor-in-Chief

Ransomware Just Hit your AWS Cloud. What Happens Next?

Join us for an immersive simulation that'll let you experience a fictionalized ransomware attack, without any of the actual consequences. You'll witness:
  • The first suspicious alert
  • The shocking depth of the breach
  • A heart-stopping realization about compromised backups
  • The impossible choice: pay or rebuild?

Don't just hear about ransomware. Experience it. Learn how to be truly cyber resilient. Save My Spot

This Week in Cloud

How Google Built a Kubernetes Cluster with 130,000 nodes

Google has been testing GKE at 130,000 nodes, twice their official support limit. They're hitting 1,000 pods/sec scheduling throughput with P99 startup under 10 seconds, used Kueue to preempt 39K pods in 93 seconds when priorities shifted, and kept the control plane stable with 1M+ objects in the datastore. The architectural wins: consistent reads from cache (KEP-2340), snapshottable API server cache (KEP-4988), and Spanner-backed storage handling 13K QPS just for lease updates.
This matters because we're moving from chip-limited to power-limited infrastructure. One GB200 pulls 2.7 kW, so at 100K+ nodes you're talking hundreds of megawatts across multiple data centers. Google's betting on multi-cluster orchestration becoming the norm (MultiKueue, managed DRANET). Gang scheduling via Kueue now, native Kubernetes support coming (KEP-4671).

Sneak Peek into Kubernetes v1.35

K8s 1.35 is finally killing off cgroup v1 support. If you're still running nodes on ancient distros without cgroup v2, your kubelet won't start. It's also deprecating ipvs mode in kube-proxy since maintaining feature parity became impossible; nftables is the way forward on Linux. On the features side:
  • in-place pod resource updates hitting GA (no more pod restarts for cpu/memory changes)
  • native pod certificates for mTLS without needing SPIFFE/SPIRE
  • numeric comparisons for taints (finally can do SLA-based scheduling with Gt/Lt operators)
  • user namespaces maturing through beta (container root remapped to unprivileged host UID)
  • image volumes likely enabled by default (mount OCI artifacts directly as volumes)

Node declared features are going alpha too: nodes will publish supported capabilities to avoid version skew scheduling failures.

AWS launched a DevOps Agent that actually debugs production for you

No more 3am war rooms might actually be realistic now. DevOps Agent is a "frontier agent" that runs autonomously for hours investigating incidents while you sleep. It connects to CloudWatch, Datadog, Dynatrace, GitHub/GitLab, ServiceNow, and builds an application topology map automatically. When stuff breaks, it correlates metrics/logs/deployments, identifies root causes, updates Slack channels, and suggests mitigations. It has a web app for operators to manually trigger investigations or steer the agent mid-investigation. The interesting part: it analyzes past incidents to recommend systematic improvements (multi-AZ gaps, monitoring coverage, deployment pipeline issues).
It also creates detailed mitigation specs that work with agentic dev tools. Supports custom tool integration via MCP servers for your internal systems.

AWS will manage your Argo CD, ACK, and KRO now

AWS just launched EKS Capabilities: fully managed versions of Argo CD, AWS Controllers for Kubernetes (ACK), and Kube Resource Orchestrator (KRO) that run in AWS-owned accounts, not your cluster. They handle scaling, patching, upgrades, and breaking change analysis automatically. You get SSO with IAM Identity Center for Argo CD, ACK has resource adoption for migrating from Terraform/CloudFormation, and KRO is there for building reusable resource bundles. This is basically AWS saying "stop running your own GitOps infrastructure." Makes sense given 45% of K8s users already run Argo CD in production (per 2024 CNCF survey).

Early Bird Offer: Get 40% Off. Use code EARLY40

Deep Dive

Controlling Kubernetes Network Traffic

Ingress NGINX is retiring and it got me thinking about how convoluted network traffic control has become in Kubernetes. You've got your CNI for connectivity, network policies for security, ingress controllers or Gateway API for north-south routing, maybe a service mesh for east-west traffic, and honestly most apps don't need all of this. The real decision most people face is simpler: ingress controller vs Gateway API.

Here's the thing: if you just need basic HTTP/HTTPS routing and you're already comfortable with nginx or Traefik, stick with ingress controllers. They work, they're stable, tooling is mature. Gateway API makes sense if you need advanced stuff like protocol-agnostic routing, cross-namespace setups, or you're running multi-team environments where role separation matters. All three clouds (AWS ALB Controller, Azure AGIC, GKE Ingress) have solid managed options for both approaches now. Gateway API is clearly the future, but "future-proof" doesn't mean you need to migrate today.
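To make the cross-namespace and role-separation point concrete, here's a minimal sketch that builds a Gateway API HTTPRoute manifest as a Python dict and prints it as JSON (which kubectl can apply). Every name in it (the infra and team-a namespaces, shared-gateway, shop.example.com, the shop-svc Service) is hypothetical, and it assumes the platform team's Gateway already allows routes from the app namespace via its listeners' allowedRoutes setting.

```python
import json


def http_route(name, namespace, gateway_name, gateway_namespace,
               hostname, service, port):
    """Build a Gateway API HTTPRoute manifest as a plain dict (illustrative helper)."""
    return {
        "apiVersion": "gateway.networking.k8s.io/v1",
        "kind": "HTTPRoute",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # The route lives in the app team's namespace but attaches to a
            # Gateway owned by the platform team in another namespace.
            "parentRefs": [
                {"name": gateway_name, "namespace": gateway_namespace}
            ],
            "hostnames": [hostname],
            "rules": [
                {
                    "matches": [
                        {"path": {"type": "PathPrefix", "value": "/"}}
                    ],
                    "backendRefs": [{"name": service, "port": port}],
                }
            ],
        },
    }


# All values below are made up for the example.
route = http_route(
    name="shop-route",
    namespace="team-a",
    gateway_name="shared-gateway",
    gateway_namespace="infra",
    hostname="shop.example.com",
    service="shop-svc",
    port=8080,
)

print(json.dumps(route, indent=2))
```

The split is the point: the platform team owns the Gateway, app teams own their routes. If you don't need that separation, a plain Ingress is still less to manage.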
Network jobs roundup: AI certs pay, skills gap persists, mixed employment signals

The network jobs market is weird right now. AI certifications are commanding 12% higher pay year-over-year while overall IT skills premiums dropped 0.7%. CompTIA just launched AI Infrastructure and AITECH certs, and Cisco added wireless-only tracks (CCNP/CCIE Wireless launching March 2026). Meanwhile unemployment for tech workers sits at 2.5-3% depending on who's counting, but large enterprises keep announcing layoffs while small/midsize companies are actually hiring. The skills gap is real though: 68% of orgs say they're understaffed in AI/ML ops, 65% in cybersecurity. Telecom lost 59% of positions to automation, and survey data shows 18-22% of IT workforce could be eliminated by AI in the next 5 years. But demand for AI/ML, cloud architecture, and security skills keeps growing. The takeaway: upskill in AI and automation or get left behind, especially if you're in support, help desk, or legacy infrastructure roles.

Three Lessons from the Recent AWS and Cloudflare Outages

AWS US-EAST-1 went down for 15 hours in October (DNS race condition in DynamoDB), and Cloudflare ate it in November (oversized Bot Management config file crashed proxies globally). Both followed the same pattern: small defect in one subsystem cascaded everywhere. The lessons are obvious but worth repeating:
  • design out single points of failure with multi-region/multi-cloud by default
  • use AI-powered monitoring to correlate signals and automate rollback (monitoring without automated response is just expensive alerting)
  • actually practice your DR plan regularly, because you fall to the level of your practice, not rise to your runbook

The deeper point: complexity keeps growing with every new region and service, multiplying ways a small change can blow up globally. The answer is designing for failure: limit blast radius, decouple planes, automate validation.
No provider is immune, so your architecture needs to assume failures will happen and route around them automatically.

Test your DR plan with chaos engineering, not hope: Google SRE Practice Lead

Google's SRE team wrote a piece on why your disaster recovery plan probably doesn't work and how chaos engineering proves it. The premise: systems change constantly (microservices, config updates, API dependencies), so that DR doc you wrote last quarter is already outdated. Chaos engineering lets you run controlled experiments (simulate database failovers, regional outages, resource exhaustion) and measure whether you actually meet your SLOs during the disaster. It's not about breaking things randomly. You define steady state, form a hypothesis (like "traffic will failover to secondary region in 3 minutes with <1% errors"), inject a specific failure, and measure what happens. The key insight is connecting chaos to SLOs. Traditional DR drills might "pass" because backup systems came online, but if it took 20 minutes and burned your entire error budget, customers saw you as down. Start small with one timeout or retry test, build confidence, scale from there.

Stelvio: AWS for Python devs

Stelvio is a Python framework that lets you define AWS infrastructure in pure Python with smart defaults handling the annoying bits. Run stlv init, write your infra in Python (DynamoDB tables, Lambda functions, API Gateway routes), hit stlv deploy and you're done. No Terraform, no CDK yaml hell, no mixing infrastructure code with application code.

📢 If your company is interested in reaching an audience of developers, technical professionals, and decision makers, you may want to advertise with us.

If you have any comments or feedback, just reply back to this email.

Thanks for reading and have a great day!
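The hypothesis-then-measure loop from the chaos engineering item above can be sketched in a few lines. This is a rough illustration, not Google's tooling: the class and function names are invented, the observations are made-up numbers, and the 99.95% monthly availability SLO is an assumption picked to show how a 20-minute failover eats almost the whole budget.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    max_failover_seconds: float   # e.g. "failover within 3 minutes"
    max_error_rate: float         # e.g. "<1% errors" expressed as 0.01


@dataclass
class Observation:
    failover_seconds: float
    total_requests: int
    failed_requests: int

    @property
    def error_rate(self) -> float:
        if self.total_requests == 0:
            return 0.0
        return self.failed_requests / self.total_requests


def hypothesis_holds(h: Hypothesis, o: Observation) -> bool:
    """True only if both the timing and the error-rate bounds were met."""
    return (o.failover_seconds <= h.max_failover_seconds
            and o.error_rate < h.max_error_rate)


def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    return (1 - slo) * window_days * 24 * 60


# Hypothesis quoted in the item: failover in 3 minutes with <1% errors.
hypothesis = Hypothesis(max_failover_seconds=180, max_error_rate=0.01)

# Made-up experiment results for illustration.
fast_run = Observation(failover_seconds=150, total_requests=100_000, failed_requests=400)
slow_run = Observation(failover_seconds=1200, total_requests=100_000, failed_requests=400)

print(hypothesis_holds(hypothesis, fast_run))   # True
print(hypothesis_holds(hypothesis, slow_run))   # False

# Assumed SLO of 99.95% over 30 days; treats the whole failover as downtime.
budget = error_budget_minutes(0.9995)
burned = (slow_run.failover_seconds / 60) / budget
print(round(budget, 1))   # 21.6
print(round(burned, 2))   # 0.93
```

The slow run is the drill that "passes" on paper: the backup came online, but one 20-minute failover used up about 93% of the month's budget.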