CloudPro

01 Jul 2026

The Next Agent Problem is the Bill

Once agents become part of daily engineering work, platform teams need to manage usage like shared production capacity

Before we continue, a quick word from our sponsor

Social engineering is about manipulating people's emotions. Identify the susceptibilities that hackers use to exploit people. This NINJIO Insights Report dives into the key emotional susceptibilities that make social engineering work and offers concrete steps that your security team can take to equip your workforce to resist cyberattacks.

SEE MORE

Hi, welcome back.

Last time, we talked about least agency: giving agents the smallest useful amount of autonomy, instead of handing them every tool and hoping discipline appears later.

This week, I want to add the next constraint: cost.

GitHub gave us a useful signal here. Copilot moved to usage-based billing in June, with AI credits tied to token consumption. Soon after, GitHub reportedly had its best month ever, driven by demand for AI-assisted coding. That is not just a GitHub story. It is a preview of where engineering work is going.

Jensen Huang has also been talking about AI token budgets for engineers. Whether or not that becomes common compensation language, the point is hard to ignore: tokens are becoming working capacity.

For years, most developer tools behaved like office software. You bought seats, assigned licenses, and forgot about it until renewal season. Agents do not work that way. They consume tokens, context, model capacity, tool calls, retries, logs, CI minutes, and review time.

So the useful question for platform teams is not “is AI getting expensive?” Of course it is. The useful question is: are we operating agent usage like shared production capacity, or are we treating it like a bunch of harmless subscriptions?

Here are five things I’d fix before agent usage becomes another mystery bill.
1. Break the bill down by workflow

The worst version of AI spend is one large monthly number called “Copilot,” “Claude,” “OpenAI,” or “AI tools.” That number will start a finance conversation, but it will not help an engineering team make a decision.

You need to know which workflows are consuming the money: code review, incident summaries, release notes, test generation, deployment helpers, log analysis, ticket triage, documentation updates, runbook execution. Once you see that split, the conversation changes. A workflow that saves ten engineers an hour every week may be worth the cost. A workflow that writes long summaries nobody reads probably is not.

You already do this elsewhere. Shared infra gets tags. CI jobs get owners. Cloud spend gets split by service, team, or environment. Agents should not be exempt just because the invoice arrives under one vendor name.

Start simple. Track the workflow name, owner, model used, average cost per run, success rate, failed runs, retries, and whether human review was needed. That is enough to stop guessing.

If you cannot connect spend to a workflow, you cannot tell whether the agent is creating value or just making the bill more interesting.

2. Give agents budgets before they become popular

No one runs production services with unlimited CPU, unlimited memory, unlimited retries, and unlimited runtime. Agents should not be the exception. Every serious agent workflow needs a budget. Not just a money budget, but a behavior budget: max tokens per run, max tool calls, max retries, max runtime, max files pulled into context, max logs included, and max model tier allowed by default.

The small leaks are usually the ones that hurt. A code review agent reads too much context. A troubleshooting agent keeps retrying the same weak path. A release agent generates a long report, then generates three polished versions of the same report. A helper tool uses the most expensive model because nobody changed the default. None of this looks dramatic in one run.
But at team scale, it becomes capacity.

This is where platform engineering habits help. You do not need to ban usage. You need sane defaults. Most workflows should start with limits, then earn higher limits when the value is clear.

3. Route work to the model it deserves

A lot of agent cost comes from using the strongest model for the weakest job. A deployment summary does not need the same model as a multi-step incident investigation. A formatting task does not need the same model as risky code generation. A first-pass log explanation does not need the same model as cross-service root-cause analysis.

Create tiers. Use cheaper models for summarization, classification, formatting, and routine explanations. Reserve the expensive models for work where reasoning quality actually changes the outcome: incident analysis, architecture trade-offs, complex code changes, migration planning, and workflows that touch production state. This is not about being cheap. It is about not using a crane to move a laptop.

You already right-size infrastructure. You choose instance types, storage classes, queue sizes, and retention windows based on the workload. Agent workflows need the same treatment.

The question is not “which model is best?” The better question is “which model is enough for this step?”

4. Put retry loops on a leash

Retries are where agent workflows quietly become expensive and annoying. A failed request is one thing. An agent that keeps re-reading logs, re-planning, re-calling tools, expanding context, and trying again can burn tokens without moving the problem forward.

This is also where cost and safety meet. When an agent is stuck, you do not want it to spend more money becoming more confident about the wrong path. You want it to stop, summarize what it tried, and hand the problem back with evidence.

So, define the loop rules before the loop runs. How many retries are allowed? What counts as progress? Which failures stop the run?
When does the workflow move from “act” to “suggest”? When does a human need to step in?

If you use Ansible, this instinct is already familiar. A playbook with bad exit behavior is not resilient. It is noisy. An agent loop has the same problem, except the noise now comes with token cost.

A good agent workflow needs a circuit breaker.

5. Add cost review to the rollout checklist

Before an agent workflow moves beyond a small group, ask the boring questions.

What should a successful run cost? What does a failed run cost? What happens if fifty engineers use it every day? What happens during an incident when everyone runs it at once? Who owns the budget? Who gets alerted when usage spikes? What gets turned off first?

Put these beside the safety questions. If the agent can change systems, review its permissions. If the agent can consume shared capacity, review its limits. Both belong in the rollout conversation.

This does not have to become a committee. It just needs an owner and a threshold. If usage doubles, someone should know. If a workflow starts burning budget through retries, someone should see it before the month ends. If a premium model is being used for low-value work, someone should be able to move it down a tier.

The worst time to discover the cost model is after everyone likes the workflow.

The bigger point is simple. Last week’s issue asked how much autonomy an agent should get. This week’s question is how much it should be allowed to consume. Those two questions belong together. An agent with no permission boundary can break things. An agent with no consumption boundary can quietly become expensive, slow, and hard to defend. If it is part of the platform, operate it like part of the platform.

Meter it. Cap it. Route it. Attribute it. Review it.

Before approving an agent workflow, ask two questions. Is it safe enough to run? And is it worth repeating at scale? Because once agents move into daily work, the bill is not a surprise.
It is telemetry.

One last thing before I go: we have a couple of upcoming events on Claude-driven DevOps and on GitOps and platform engineering. Both are a good way to put your AI credits to work building systems at scale.

Agentic DevOps with Claude | July 23rd

Early Bird Live Now, 40% Off, last 48 hours before it’s sold out!

Claude Code is the engineer. You’re watching it work.

Four hours. A 33-component AI-native IDP built live on a real Kubernetes cluster, with ArgoCD, Backstage, kgateway, and an observability stack included. The cluster is provisioned for you. You leave with the repo and a working reference architecture to take back to your team.

Michael Rishi Forrester from Accenture, previously at KodeKloud, is running this one. Limited seats.

📅 Thursday, July 23rd | 11:00 AM EDT

BOOK YOUR TICKETS

And if last week's platform engineering conversation left you wanting to go deeper, not just to understand where AI belongs in the stack but to actually watch it build the stack, that is what July 23rd is for.

AI-Powered GitOps and Platform Engineering Workshop | July 30th

Early Bird Live Now, 40% Off, last 4 days before it’s gone!

Your AI agent doesn’t know your manifests are stale. That’s the whole problem.

Three hours. Real ArgoCD and Flux workflows, live demos comparing fresh vs. stale context, and hands-on labs turning repeated agent tasks into tooling your team actually keeps. You’ll walk through drift detection, change review, and validation, the parts of platform engineering AI tools usually get wrong because nobody fed them current context.

Taylor Dolezal from Dosu, an AI-native knowledge infrastructure for agents and humans, is running this one, drawing on patterns from 100,000+ repos.
Limited seats.

📅 Thursday, July 30th | 11:00 AM EDT

BOOK YOUR TICKETS

If you’re a regular CloudPro reader, I’d like to hear what you want covered next: agent cost models, MCP security, platform AI governance, agentic DevOps workflows, or the messy parts of actually getting these systems into production.

Hit reply and tell me what would be most useful for your team.

Cheers,
Apramit Bhattacharya
Editor-in-Chief
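
For readers who want to see points 2 and 4 of this issue in concrete form, here is a minimal Python sketch of a per-run behavior budget with a circuit breaker that stops a stalled run and hands it back with evidence. Everything in it is an assumption made for illustration: the class names (RunBudget, RunMeter), the field names, the limit values, and the simulated steps are hypothetical and do not come from any vendor's API or billing system.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class RunBudget:
    """Per-run limits. Every number here is an illustrative default, not a recommendation."""
    max_tokens: int = 50_000
    max_tool_calls: int = 20
    max_retries: int = 3
    max_stalled_steps: int = 2  # consecutive steps without progress


@dataclass
class RunMeter:
    """Tracks one agent run against its budget and trips like a circuit breaker."""
    workflow: str
    owner: str
    budget: RunBudget
    tokens: int = 0
    tool_calls: int = 0
    retries: int = 0
    stalled_steps: int = 0
    attempts: List[str] = field(default_factory=list)

    def record(self, what: str, tokens: int, tool_calls: int = 0,
               is_retry: bool = False, made_progress: bool = True) -> None:
        self.tokens += tokens
        self.tool_calls += tool_calls
        if is_retry:
            self.retries += 1
        # Reset the stall counter on progress, otherwise count another stalled step.
        self.stalled_steps = 0 if made_progress else self.stalled_steps + 1
        self.attempts.append(what)

    def tripped(self) -> Optional[str]:
        """Return the reason the run must stop, or None if it may continue."""
        b = self.budget
        if self.tokens > b.max_tokens:
            return "token budget exceeded"
        if self.tool_calls > b.max_tool_calls:
            return "tool-call budget exceeded"
        if self.retries > b.max_retries:
            return "retry budget exceeded"
        if self.stalled_steps >= b.max_stalled_steps:
            return "no progress"
        return None

    def handoff(self, reason: str) -> dict:
        """Summarize what was tried so a human can take over with evidence."""
        return {
            "workflow": self.workflow,
            "owner": self.owner,
            "stopped_because": reason,
            "tokens_used": self.tokens,
            "attempts": list(self.attempts),
        }


# Simulated troubleshooting run: the agent keeps retrying the same weak path.
meter = RunMeter("log-analysis", "platform-team", RunBudget(max_tokens=10_000))
steps = [
    ("read service logs", 3_000, True),
    ("re-read logs with more context", 3_000, False),
    ("re-plan and re-read logs", 3_000, False),
    ("expand context again", 3_000, False),
]
report = None
for what, tokens, progressed in steps:
    meter.record(what, tokens, tool_calls=1,
                 is_retry=not progressed, made_progress=progressed)
    reason = meter.tripped()
    if reason:
        report = meter.handoff(reason)
        break

print(report)
```

In this simulated run the breaker trips on the third step because two consecutive steps made no progress, before the token limit is reached. The design choice worth noting is that "no progress" is a stop condition in its own right: the run ends with a summary of what was tried instead of spending the rest of the budget on the same path. How progress is measured is left to the workflow owner and is not modeled here.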


Apramit from Packt

23 Jun 2026

The gap between you and the next role is probably an event away

Details shared, maybe worth a look?

Least Agency Beats More Agency

What the AI-native platform discussion got right, and why it matters next for network and Linux operations

Hi, welcome back.

Starting with the next edition, you’ll receive this newsletter directly from our Substack, so keep an eye out for it there.

Last week, we looked at AI-native platform engineering, and the conversations around it raised a useful argument for anyone working with agents: the future is not "give agents more autonomy." It is "give them the smallest useful amount of autonomy." That is a more uncomfortable idea than it sounds, since most AI demos move the opposite way: more tools, more context, more access, more actions. A demo is impressive because the agent can do a lot, but real infrastructure optimizes for known scope, clear ownership, safe failure, and boringly good recovery.

The phrase that stuck with me was least agency, in the same spirit as least privilege. Least privilege asks, "What is the minimum access this thing needs?" Least agency asks, "What is the minimum decision-making power this thing needs?"
That is the real split: autonomous by default vs constrained by design.

You can see that problem play out in two places most CloudPro readers know well: network operations and the Linux command line.

Engineering Agentic Network Operations, LAST 10 SEATS!

That's why, on June 30th, we're bringing together William Collins and John Capobianco for a live, hands-on workshop to help you move from AI experimentation to production-ready Agentic NetOps.

You'll learn how to:
• Build MCP skills that are composable, reusable, and production-ready, not just proofs of concept
• Apply spec-driven development to define and constrain agent behavior before it creates operational risk
• Design resilient agentic loops with failure recovery patterns and safe handoffs for real-world network environments

Through live instruction, demonstrations, and guided labs, you'll gain practical experience across a modern AI networking stack including MCP, OpenClaw, NetClaw, Python, Claude, Selector AI, and Containerlab.

🚨 Only 10 seats remain for the live session.

Use code LIMITED40 at checkout to receive 40% off your registration, available exclusively for you.

BOOK YOUR SEAT NOW

📅 Tuesday, June 30 | 9:00 AM–12:00 PM EDT

Agentic Linux – From Commands to AI Agents

This event is on 27 June. Taught by Imran Afzal, a best-selling Udemy instructor who's trained 1M+ students and spent 25+ years running infrastructure at Fortune 500 companies before he started teaching it, this session brings the same question to the shell. AI can write commands and explain logs in seconds, but Linux does not forgive blind execution.

You'll work hands-on with ChatGPT CLI, Shell Genie, Aider, and Goose CLI, with review and human approval kept at the center throughout.
If you're a Linux admin, DevOps engineer, SRE, or technical learner, this is a way to try agentic CLI workflows without treating your terminal like a toy, guided by someone who's actually run that terminal for a living.

Join it live while there's still room.

Use code EARLY40 to get 40% off

BOOK YOUR SEAT NOW

Remember, an agent should know when it is missing context, when it is guessing, when the next step changes state, and when to hand control back. Otherwise, you get the ops version of the "Automation is great" meme.

That is where "human in the loop" gets exposed: if the agent moves faster than the review path, the human is not in the loop, just watching the replay.

The platform engineering conversation made one thing clear to me: AI-native operations will not be won by the teams that hand agents the biggest set of keys. It will be won by the teams that design the smallest useful loop, then make that loop observable, reviewable, and safe enough to run again.

I’ll be attending these sessions as well.

If this is already showing up in your team’s roadmap, this is the wrong one to bookmark and forget. The useful questions usually come up live, in the labs, when the agent does something slightly awkward and everyone has to talk through what should happen next.

Cheers,
Apramit Bhattacharya
Editor-in-Chief

P.S. If you attended the platform engineering session or have an opinion on today’s topic, do reply to this mail.
A good question from the chat counts.


Apramit from Packt

16 Jun 2026

Your AI agents don't need more autonomy

What it means for network and Linux ops


Apramit from Packt

10 Jun 2026

Before You Give AI More Access, Build the Boundaries

Three hands-on sessions for platform, network, and Linux teams trying to make agentic workflows useful without making them reckless.

Hi!

In this issue, I want to point you toward three sessions worth your time if you work near platforms, networks, Linux, automation, or incident response.

They are connected by one theme: AI in operations is only useful when the boundaries are engineered first.

That has been the thread running through recent CloudPro issues, even when the topics looked unrelated.

In the last few issues, we’ve kept coming back to the same engineering instinct: slow down, separate the moving parts, and understand what should happen before you let anything change. Whether we were talking about broken containers, MCP design, or admin rights, the point was never just the tool. It was the discipline around the tool.

AI does not remove that habit. It raises the cost of skipping it. The teams that get value from agents will not be the ones that connect the most tools. They will be the ones that know what the agent can see, what it can change, when it must stop, and how a human verifies the next step.

Taken together, these sessions give you a practical way to think about AI where it actually shows up in your work: inside the platform, inside network operations, and at the Linux command line.

FINAL CALL - Building AI-Native Platform Engineering Systems

Last 24 hours!

Start at the platform layer.

If you run an internal platform, you've probably felt the pressure: AI in the developer portal, agents that explain failures, golden paths that do more than point to templates.

The challenge is that platform agents need context, telemetry, policy, safe actions, approvals, and clear trust boundaries. AI can't just be bolted on.

That's why Building AI-Native Platform Engineering Systems stands out.
It covers Backstage, OpenChoreo, CI/CD, golden paths, control planes, policy-as-code, observability, guardrails, and platform intelligence - with AI treated as part of the architecture, not an afterthought.

A solid fit for platform engineers, DevOps engineers, SREs, architects, and engineering leaders asking: "Can we make our platform AI-native?"

The answer may be yes, but only with the right control surfaces.

Use code FINAL50 to get 50% off

BOOK YOUR SEAT NOW

The session is on June 11, which is 24 hours from now, and you can use the discount code FINAL50 when booking.

Engineering Agentic Network Operations

Once platform boundaries are in place, the next challenge is network operations, where many AI demos fall apart.

It's easy to build an agent that summarizes data or suggests a config. It's much harder to build one that works within real constraints, recovers from failures, and knows when to hand control back to a human.

That's the focus of Engineering Agentic Network Operations.

The workshop covers MCP, OpenClaw, NetClaw, FastMCP, Containerlab, Arista cEOS, and more. But the real value is learning how to design agentic workflows with clear scope, guardrails, recovery paths, and human oversight.

For network automation, SRE, platform, and infrastructure teams, those questions matter long before an agent is trusted with production operations.

Use code SPECIAL50 to get 50% off

BOOK YOUR SEAT NOW

Agentic Linux – From Commands to AI Agents

Then there's the base layer.

No matter how advanced the platform, a lot of real work still ends up in Linux - reading logs, checking services, fixing permissions, troubleshooting errors, and deciding whether a generated command is actually safe to run.

That's why Agentic Linux – From Commands to AI Agents is a useful third piece.

The workshop covers ChatGPT CLI, Shell Genie, Aider, Goose CLI, Bash scripting, troubleshooting, automation, and system administration tasks.
What I like is that it treats AI as an assistant, not a replacement for judgment.

Generating a command is easy. Knowing whether it's safe, spotting destructive actions before they run, and keeping humans in the approval loop is the hard part.

For Linux administrators, DevOps engineers, SREs, and IT professionals, it's a practical look at using AI at the command line without turning the shell into a roulette wheel.

Use code EARLY40 to get 40% off

BOOK YOUR SEAT NOW

A few books to keep beside these sessions

One more thing worth mentioning: these sessions pair well with a few books from our list if you want to go deeper after the labs.

For the first workshop, I’d start with Platform Engineering for Architects, which is especially relevant because attendees of Building AI-Native Platform Engineering Systems get the book for free. Mastering Enterprise Platform Engineering is the natural follow-up if you’re thinking about platform engineering, delivery workflows, and generative AI at enterprise scale. For the network operations side, AI Networking Cookbook fits neatly with the agentic network operations session. And if the Linux workshop is closer to your day-to-day work, The Ultimate AI Guide for Linux Engineers is a good companion for thinking about what AI-generated commands are actually touching underneath.

The Ultimate AI Guide for Linux Engineers
BUY NOW ON AMAZON

Mastering Enterprise Platform Engineering
BUY NOW ON AMAZON

AI Networking Cookbook
BUY NOW ON AMAZON

Platform Engineering for Architects
BUY NOW ON AMAZON

The useful question is not whether AI can be added to these workflows. It can, and most teams are already experimenting with it somewhere. The harder question is whether the workflow still makes sense once AI is inside it. Can the team understand what happened? Can they review the output? Can they stop a bad change before it lands? Can they trust the system more after adding AI, not less?

That is the thread running through all three sessions.
They are not about chasing the newest tool. They are about taking the work many of us already do across platforms, networks, and Linux environments, and asking what has to change when AI becomes part of that work.

That is what makes these sessions useful. They are not abstract AI talks. They sit close to the places many of us actually work: the platform layer, network operations, and the Linux command line.

I’ll be in attendance as well, because these are exactly the conversations CloudPro needs to keep having: what is useful, what is unsafe, what is ready, and what still needs a human in the loop.

This is time-sensitive. The first session is on June 11, and the other two follow later this month. If one of these areas is on your roadmap, I would not skip it.

Cheers,
Apramit Bhattacharya
Editor-in-Chief

P.S. If you attend any of these sessions, I’d be interested to hear what actually helped: the demo, the lab, the framework, or even a question someone asked. Reply with one thing you took away from it.


Apramit from Packt

01 Jun 2026

MCP's security crisis isn't new. It's just faster.

It's the non-human identity problem you already know, surfacing where you didn't look

June 20th 9 AM EDT | EXCLUSIVE OFFER 40% OFF - USE CODE LIMITED40

BOOK YOUR TICKETS

150+ engineers from 30+ countries attended our last cohort. The biggest network automation experts, William Collins, Director of Tech Evangelism, and John Capobianco, Head of AI & Developer Relations at Itential, are here! Get your tickets right away! Offer ends soon.

June 11th 11:30 AM EDT | EXCLUSIVE OFFER 40% OFF - USE CODE SAVE40

BOOK YOUR TICKETS

It's a hands-on cohort for teams already running cloud-native platforms who now have to evolve them into AI-native ones without weakening the controls underneath. Great experts and panelists are joining, along with other platform engineers, SREs, architects, and the leads who own those decisions.

(P.S. A free Platform Engineering for Architects e-book, exclusively for you!)

Hi, Shreyans here.

Before we get into this week’s issue, I wanted to share a small but important update.

Apramit will be taking over as the new editor-in-chief of CloudPro from this issue onward.

Apramit has been with Packt for over 2 years, working closely with technical content, authors, and practitioner-focused communities, and he brings exactly the kind of editorial judgment this newsletter needs: useful over noisy, practical over hyped, and grounded in what engineers actually care about.

Over to you, Apramit.

Hi, I’m Apramit.

I’m excited to take over as editor-in-chief of this newsletter and grateful for the direction already set by Shreyans. My focus will be simple: to keep this newsletter useful, sharp, and grounded in what readers actually need.
We’ll continue to look past the noise, ask practical questions, and make space for clear, thoughtful conversations around technology, publishing, and the people building with it.

Now, let's continue!

MCP's security crisis isn't new. It's just faster.

If you've stood up an MCP server in the last year, this one's worth two minutes.

The reporting around agentic AI security keeps framing it as a brand-new threat. The more useful way to see it, adapted here from Operational AI with Docker by Ajeet Singh Raina and Harsh Manvar, is as the non-human identity problem we already know how to solve, surfacing in a place most teams haven't looked yet.

Here's the version I'd send to anyone running agents in production.

MCP's credential problem is the old IAM problem in new clothes

Most teams are securing their agents with shared API keys scattered across env vars and config files. That's the non-human-identity problem at machine scale, and there's a clean pattern that closes it.

Every few years, the industry rediscovers a problem it already solved, gives it a new name, and acts surprised. The "agentic AI security crisis" is this year's edition. When I read that only 22% of teams treat their agents as independent identities, and that 88% have already had or suspected a security incident, my honest reaction wasn't alarm. It was recognition.

We've seen this exact shape before, with service accounts and CI runners and every other non-human thing we handed a credential and then forgot about. The CISO line making the rounds, that MCP will be the AI security issue of 2026, is probably right. It just isn't new.

What changed is the speed. MCP made it trivially easy to give an agent real capabilities (a filesystem, a database, a GitHub account), and teams wired those up the way you wire things up when you're trying to ship. The credentials went wherever was convenient. That convenience is the whole problem.

Two failure modes matter. The first is the obvious one: secrets exported as environment variables.
The moment you do that, the key is sitting in your process list, in the output of a container inspect, in logs and stack traces, and in Git history the instant a setup script gets committed. The second is quieter and worse. Each server manages its own credentials, so three servers that all need a GitHub token means three copies of that token, stored in three different ways. There's no single place to rotate them, and no single action that revokes access. If one leaks, you're hunting.

The fix is a mediation layer. Route credentials through one broker so the agent never holds the raw secret, scope each agent to only the tools it actually needs, and make revocation a single action instead of a scavenger hunt. The book uses Docker's MCP gateway as its worked example: one endpoint every client connects through, backed by a single secret store. But the product isn't the point; the pattern is. Giving an agent an "identity" isn't a philosophical move; it just means you can grant, scope, and revoke its access in one place, the same way you would for a person.

None of this needs new technology. It needs the access discipline you'd never skip for a human user, applied to the agents where you quietly skipped it.

This article has been adapted from Operational AI with Docker by Ajeet Singh Raina and Harsh Manvar. If you want the full playbook, the book takes you from a model running on your laptop to secure, scaled agentic systems in production, with hands-on coverage of Docker Model Runner, MCP, multi-agent architectures, and Kubernetes orchestration.

GET THE BOOK

THE ULTIMATE LINUX & SYSADMIN BUNDLE | 24 BOOKS | FROM $18
GET YOUR BUNDLE!
The Humble Bundle is here! 24 Packt titles covering everything you need across Linux, SysAdmin, security, and infrastructure. Total MSRP across all 24 books is >$1,000. Bundle starts at $18. Part of every purchase supports the Prevent Cancer Foundation.
Offer ends in 15 days.

New here? Packt CloudPro is a newsletter from Packt for senior cloud, DevOps, and platform engineers who want the call, not the concept. We focus on decision frameworks and real trade-offs for readers who already know the fundamentals and don’t need another explainer.

If something landed or missed, hit reply and tell me. I read every response, and it's how I figure out what's worth running more of.

Want to subscribe, or promote your product to this audience? Reach out to me directly.

SUBSCRIBE
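
For readers who want the mediation-layer pattern from the MCP credentials piece above in concrete form, here is a minimal sketch. It is not Docker's MCP gateway or any real product's API: `CredentialBroker`, `AccessDenied`, and `_fake_upstream` are hypothetical names, and the in-memory dictionaries stand in for a real secret store and policy table. The idea it illustrates is only that the agent asks the broker to call a tool, never sees the raw secret, and can be cut off in one action.

```python
from dataclasses import dataclass, field


class AccessDenied(Exception):
    """Raised when an agent asks for a tool it was never granted."""


def _fake_upstream(tool: str, request: str, secret: str) -> str:
    # Placeholder for a real API call. It uses the secret but never echoes it.
    return f"{tool} handled {request!r}"


@dataclass
class CredentialBroker:
    # Hypothetical in-memory stand-ins for a secret store and a grant table.
    secrets: dict[str, str] = field(default_factory=dict)      # tool -> raw secret
    grants: dict[str, set[str]] = field(default_factory=dict)  # agent -> allowed tools

    def store_secret(self, tool: str, secret: str) -> None:
        # One copy per tool, so rotation is a single write.
        self.secrets[tool] = secret

    def grant(self, agent_id: str, tool: str) -> None:
        # Scope the agent to exactly the tools it needs.
        self.grants.setdefault(agent_id, set()).add(tool)

    def revoke_agent(self, agent_id: str) -> None:
        # A single action removes everything the agent could reach.
        self.grants.pop(agent_id, None)

    def call_tool(self, agent_id: str, tool: str, request: str) -> str:
        if tool not in self.grants.get(agent_id, set()):
            raise AccessDenied(f"{agent_id} is not scoped to {tool}")
        # The broker attaches the secret on the agent's behalf.
        return _fake_upstream(tool, request, self.secrets[tool])


if __name__ == "__main__":
    broker = CredentialBroker()
    broker.store_secret("github", "example-token")
    broker.grant("triage-agent", "github")
    print(broker.call_tool("triage-agent", "github", "list issues"))
    broker.revoke_agent("triage-agent")
```

After `revoke_agent`, any further `call_tool` from that agent raises `AccessDenied`, which is the "single action instead of a scavenger hunt" property the article describes. A production broker would also need authentication of the agent itself, which this sketch leaves out.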

Shreyans from Packt

22 May 2026

Why most AI agents break in production

Weekend Sale is LIVE

Join 100+ Engineers and build agentic network automation with MCP.

Book Your Seat
Use Code WEEKEND40 to get 40% Off

If you've been trying to build agentic workflows on top of MCP and hitting the usual walls (agents that stall halfway through a task, tools that don't compose well, loops with no real recovery path), this is the event for that.

It's 3.5 hours, hands-on, in a Containerlab environment. William Collins and John Capobianco are running it. Both are at Itential and have been working on this stuff longer than most people in the space. John wrote Automate Your Network. William runs their tech evangelism. They know what breaks in production because they've watched it break.

What you'll actually do in the session:
- Take raw MCP tools and turn them into skills you can reuse across agents
- Use spec-driven development so the agent stays bounded to what you asked for
- Build recovery into your loops instead of catching exceptions after the fact
- Work through all of it on OpenClaw, NetClaw, and Containerlab

It's not an intro session. This one assumes you've already tried to build something and want to do it properly.
100+ engineers have signed up already, join them.

Book Your Seat
Use Code WEEKEND40 to get 40% Off

Building AI-Native Platform Engineering Systems: Last 10 seats
Book Your Seat
Use Code WEEKEND40 to get 40% Off

Browse Our Networking Titles
- AI Networking Cookbook: GET 40% OFF
- Network Automation Cookbook: GET 40% OFF
- Network Automation with NautoBot: GET 40% OFF
- Mastering Python Networking: GET 40% OFF

Shreyans from Packt

18 May 2026

Your AI Assistant Doesn't Need More Access. It Needs a Tier List

A simple model for deciding which platform actions an AI can run on its own, and which it shouldn't.

Cloud attacks have a new entry point. It's your running applications. That’s why a new category is emerging: Cloud Application Detection and Response (CADR). This guide breaks down what CADR is, why runtime is the only place real attacks can be detected, and how security teams are protecting applications, cloud infrastructure, and AI systems in production. If you’re responsible for securing modern cloud workloads, this is a concept you’ll want to understand.

Get the Guide

I wrote a while back about how MCP turns your internal platform into something an AI can act on, not just answer questions about. The follow-up question I kept getting was the practical one: fine, but how much do you actually let it do? This issue is the answer: a straightforward way to draw that line before the assistant does something you can't walk back.

Cheers,
Shreyans Singh
Editor-in-Chief

Workshop: Building AI-Native Platform Engineering Systems
Saturday, May 30 · Online · live and hands-on (5 hours)

If the guardrails question in this issue is live for your team, the workshop is where it gets the full treatment. It's a hands-on cohort from Packt, in collaboration with FAUN.dev(), on evolving cloud-native platforms into AI-native ones: internal developer portals, golden paths, telemetry, and a dedicated session on policy-as-code, governance, and guardrails (exactly what's below).

It's built for platform engineers, SREs, architects, and tech leads already running cloud-native platforms who want to add AI deliberately, without giving up control. Speakers: Asanka Abeysinghe, Dr. Gautham Pallapa, Mark Peter, and Thiago Shimada Ramos.
Every attendee also gets a free Platform Engineering for Architects e-book.

Use code SAVE50 for 50% off
Book Your Seat

Once you wire an AI assistant into your platform through MCP, it stops being a chat window. It can deploy, scale, roll things back, actually do the work. Which is great, right up until you notice nobody decided what it's allowed to do on its own. On most teams that call never gets made deliberately; it just happens, one engineer and one service at a time.

MCP is good at the wiring. It exposes your operations as tools the assistant can call. What it doesn't hand you is the judgment about which of those tools should sit one approval away and which shouldn't. That part you build yourself.

The model that works is simpler than it sounds. Take any action the assistant could perform and ask two things: if it goes wrong, can you undo it, and how far does the damage spread. That gives you three tiers.

Low-risk actions are reversible and contained: querying logs, reading metrics. Let the assistant just do those. Making someone approve a log query is the kind of friction that teaches people to stop using the tool.

Medium-risk actions are reversible but have real blast radius. Scaling a service is the obvious one. You can scale it back, but in the meantime you've moved cost and capacity for everything downstream. These should draft a plan and route to an approver.

High-risk actions are the ones you can't take back: deleting a database is the standard example. Those stay blocked by default, and the way through is a formal approval path, not a quick thumbs-up in a chat thread.

The tiers themselves aren't really the interesting part, though. The interesting part is deciding the line once, as policy, for the whole org.
Skip that and every team draws its own boundary: one ships with approvals, the next one skips them, and you've rebuilt the exact inconsistency you adopted a platform to kill.

The other thing worth saying: this only holds if the safe path is also the easy path. If approval is slow and annoying, people find ways around it, and your guardrails quietly turn into a cage. So auto-execute the genuinely safe stuff generously. That's what makes the gated stuff feel worth the wait. And log everything, every tier, no exceptions.

Decide the tiers before you connect the assistant. Doing it afterward usually means doing it in response to something you wish hadn't happened.

Move from raw MCP tools to composable skills, apply spec-driven development to constrain agent behavior, and design agentic loops that recover when they get stuck. Hands-on with OpenClaw, NetClaw, and Containerlab.

Book Your Seat
Use Code EARLY40 to get 40% Off
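
The two-question tier model from this issue (can you undo it, and how far does the damage spread) is small enough to write down as policy. The sketch below is one way to do that under stated assumptions: the `Tier` names, the `ACTIONS` catalogue, and the `audit_log` list are all hypothetical, and a real platform would keep the catalogue in versioned policy rather than in a Python dict.

```python
from enum import Enum


class Tier(Enum):
    AUTO = "auto-execute"
    APPROVAL = "draft a plan and route to an approver"
    BLOCKED = "blocked by default, formal approval path only"


def classify(reversible: bool, contained: bool) -> Tier:
    """Map the two questions from the article onto the three tiers."""
    if not reversible:
        return Tier.BLOCKED
    return Tier.AUTO if contained else Tier.APPROVAL


# Hypothetical org-wide catalogue: action -> (reversible, contained).
ACTIONS = {
    "query_logs": (True, True),
    "read_metrics": (True, True),
    "scale_service": (True, False),
    "delete_database": (False, False),
}

# Stand-in for a real audit sink.
audit_log: list[tuple[str, str]] = []


def decide(action: str) -> Tier:
    # Actions nobody has classified fall into the most restrictive tier.
    reversible, contained = ACTIONS.get(action, (False, False))
    tier = classify(reversible, contained)
    audit_log.append((action, tier.name))  # log every decision, every tier
    return tier


if __name__ == "__main__":
    for name in ("query_logs", "scale_service", "delete_database"):
        print(name, "->", decide(name).value)
```

Defaulting unknown actions to `BLOCKED` is a design choice of this sketch, not something the article prescribes; it keeps a newly exposed tool from silently landing in the auto-execute tier before anyone has classified it.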

Shreyans from Packt

11 May 2026

Get 50% off on AI Platform Engineering Workshop

50% OFF for 72 Hours Only

Book Your Seat
Use Code SAVE50

Most platform teams trying to add AI right now are running into the same problem: bolting LLMs onto a cloud-native platform creates more issues than it solves. Governance gets weaker, control surfaces get fuzzier, and the platform that worked at scale yesterday starts looking fragile.

AI-native platforms aren't cloud-native platforms with AI on top. They're a different architecture.

What you'll walk away with:
- A working approach to designing the platform intelligence layer: data, inference, telemetry, control
- The OWASP LLM Top 10 mapped to internal platforms: guardrails, trust boundaries, jailbreak prevention
- Hands-on exposure to Backstage, OpenChoreo, Crossplane, and Kiro/Kiro-cli used together
- A practical roadmap framework to define your team's AI-native platform plan and tie it to developer productivity and business outcomes

Running it: Asanka Abeysinghe, Dr. Gautham Pallapa, Mark Peter, and Thiago Shimada Ramos. In collaboration with FAUN.dev(), where engineers from GitHub, Netflix, and Shopify go to stay ahead.

Book Your Seat

Move from raw MCP tools to composable skills, apply spec-driven development to constrain agent behavior, and design agentic loops that recover when they get stuck. Hands-on with OpenClaw, NetClaw, and Containerlab.

Book Your Seat
Use Code EARLY40 to get 40% Off

Join Spec-Driven Development Cohort 2. If you’re already using AI to code and want more reliable, scalable outputs, this is worth your time.
Seats are filling fast, and we’ll be closing registrations soon.

Book Your Seat
Use code SDD45 to get 45% off.

Shreyans from Packt

08 May 2026

24 hours until our Flagship AI Agents Workshop

LAST 10 SEATS | WORKSHOP in 24 HOURS
Get 40% Off
USE CODE FINAL40

In 24 hours, 100+ network engineers will spend 4 hours building AI agents for network operations: the right way, from the architecture up. 10 seats left. This is your last chance to join them.

- You'll build a working AI agent from scratch using LLM tool calling and agentic loops.
- You'll get production-adapted code you can run in your own environment the same week.
- You'll deploy it against real Arista cEOS devices in Lab 4.
- You'll join a live panel with Sif Baksh and Eduard Dulharu.

The workshop is led by Sif Baksh: 15 years across NetOps and SecOps. He's spent a decade and a half turning operational chaos into systems people can reason about: SOAR workflows, DDI migrations, and lately AI-assisted automation that survives contact with production.

Joining him on the panel: Eduard Dulharu, CTO and Co-Founder of vExpertAI GmbH in Germany. He'll be bringing the founder-CTO lens on what production AI systems actually look like in operational environments.

What's different on Monday if you attend:
- You'll know exactly how to design AI agents that don't hallucinate device commands or break running configurations under load.
- You'll have agentic patterns that account for how network operations actually work: state, ordering, idempotency, rollback.
- You'll have a working code template you can adapt to your own environment, against your own devices, the same day.

LAST 10 SEATS | WORKSHOP in 24 HOURS
Book Your Seat Now at 40% Off
USE CODE FINAL40

After this workshop, you can actually start deploying AI agents in network operations with confidence.

Hope to see you there,
- Packt Conferences

Shreyans from Packt

04 May 2026

A Tines principal architect is teaching network engineers to build AI agents this Saturday.

Last Chance to Join

Join 100+ network engineers this Saturday and build a working AI agent from scratch, deployed against real Arista cEOS devices.

LAST CHANCE TO JOIN | GET 40% OFF
Use Code FINAL40 | Expires in 48 Hours

Sif Baksh (Principal Solutions Architect, Tines) is running this workshop. He built this curriculum from production work, not slides.

What you get from this workshop:
- Lab 4: you ship a working agentic bot against real network infrastructure
- The P.E.N.E. framework: built specifically for how network engineers communicate intent to LLMs
- Reusable code you own, can modify, and can debug at 2am when something breaks

P.S. You leave with the code. Not demos or homework. That's the whole point of the workshop.

LAST CHANCE TO JOIN

Walk away with patterns for where AI fits, how governance holds up, and a 6–12 month roadmap to evolve your platform deliberately.

Book Your Seat
Use Code SAVE30 to get 30% Off

Move from raw MCP tools to composable skills, apply spec-driven development to constrain agent behavior, and design agentic loops that recover when they get stuck.
Hands-on with OpenClaw, NetClaw, and Containerlab.

Book Your Seat
Use Code SAVE30 to get 30% Off

Shreyans from Packt

27 Apr 2026

Most "I Need Intune Admin Rights" Requests Aren't About Admin Rights

A clean way to decide when users need elevation, and when they just need the right app delivery.

Local Intune admin rights are one of those problems every Windows shop runs into eventually, and EPM (Endpoint Privilege Management) is usually the first answer that comes up. But how you roll it out, and who you actually license for it, makes a real difference to both your security posture and your budget. Today's CloudPro issue is adapted from Mastering Endpoint Management using Microsoft Intune Suite by Saurabh Sarkar and Rahul Singh, and it lays out a clean way to think through the decision.

Cheers,
Shreyans Singh
Editor-in-Chief

Early Bird Offer: Use code EARLY40 to get 40% Off
Book Your Seat Now

CloudPro #124: Most "I Need Intune Admin Rights" Requests Aren't About Admin Rights

Read on Web

Every security team eventually gets there: local admin rights have to go. Fair enough: they're a known weakness, and leaving users as admins on their own machines isn't defensible anymore. So you start looking at EPM, and the easiest thing to do is buy licenses for everyone, turn on self-elevation, and call it solved.

Pause before you do that. You'll spend more than you need to and you'll be using the tool for something it isn't really designed for.

EPM gets pitched, and often understood, as "the safe way to give users admin rights." That framing is the source of most of the trouble. EPM isn't an admin-rights replacement and it isn't an app delivery mechanism. It's elevation control. It lets a specific application run with admin privileges in a specific moment, under a rule you've defined. That's a much narrower job than "make this user an admin," and once you see the difference, the licensing decision gets a lot easier.

Start with app delivery, because most "I need admin" requests aren't actually about admin. They're about getting an app installed. If you push your common business apps via Intune as Required, they install in the system context and the user doesn't need elevation at all.
For the long tail of apps where you're not sure who needs what, make them Available through Company Portal. The user installs them on demand, still in system context, still without elevation. Get this layer right and a huge chunk of the supposed "need for admin rights" disappears.

What's left after that is the actual EPM territory. Someone needs to install something niche that isn't in your catalog and never will be. A support engineer needs to run ProcMon elevated to debug a real issue. A developer needs an elevated PowerShell window. A user needs to restart a stuck service. These are the cases EPM was built for: controlled, rule-based elevation for specific apps and specific moments. Worth assigning a license for. Worth setting up properly.

One distinction worth being clean about: EPM is not application control. If your goal is to stop certain apps from running on your devices, EPM doesn't do that. It decides what gets elevated, not what gets to run in the first place. App Control for Business is the right tool for that, and conflating the two leads to policies that don't do what you think they do.

Which brings the licensing piece into focus. EPM licensing is per-user, and the population that genuinely runs elevated workloads is almost always a fraction of your fleet: engineers, IT support, certain power users. The rest of your users get their apps through Required and Available and never hit an elevation prompt. Licensing everyone "just in case" is the most common way teams overspend on the Intune Suite, and it usually happens because the team skipped the app-delivery question and went straight to the elevation question.

Get the delivery layer right first. EPM is for what's left over.
Read on Web

FLASH SALE: 40% OFF | 24 HOURS ONLY
Book Your Seat Now

Catch the latest HubSpot Developer Platform updates in Spring Spotlight
Explore Now

📢 If your company is interested in reaching an audience of developers, technical professionals, and decision makers, you may want to advertise with us.

If you have any comments or feedback, just reply to this email.

Thanks for reading and have a great day!
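
The decision order in the Intune piece above (delivery first, elevation for what is left, application control as a separate concern) can be sketched as a small routing function. This is an illustration only, not an Intune API: the `Request` fields, the `route` labels, and `users_needing_epm` are hypothetical names, and real environments would derive these inputs from their own ticket or inventory data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    user: str
    kind: str                 # "install", "run_elevated", or "block_app"
    in_catalog: bool = False  # the app is packaged for Intune delivery
    common_app: bool = False  # a standard business app for this population


def route(req: Request) -> str:
    """Pick the mechanism the article recommends for each kind of request."""
    if req.kind == "block_app":
        # Deciding what may run at all is application control, not EPM.
        return "App Control for Business"
    if req.kind == "install" and req.in_catalog:
        # Both delivery modes install in system context, so no elevation.
        return "Required" if req.common_app else "Available (Company Portal)"
    # Niche installs and elevated tools are what remains for EPM.
    return "EPM elevation rule"


def users_needing_epm(requests: list[Request]) -> set[str]:
    """Only these users need a per-user EPM license in this sketch."""
    return {r.user for r in requests if route(r) == "EPM elevation rule"}


if __name__ == "__main__":
    sample = [
        Request("ana", "install", in_catalog=True, common_app=True),
        Request("ben", "install", in_catalog=True),
        Request("cy", "run_elevated"),
        Request("dee", "install"),
    ]
    print(sorted(users_needing_epm(sample)))
```

Run over a quarter of real "I need admin" tickets, a tally like this gives a rough size for the population that actually needs an EPM license, which is the number the licensing decision depends on.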

Shreyans from Packt

20 Apr 2026

Three Rules for Designing an MCP Server You Won't Regret

Split read and write. Build specific tools. Validate locally.

Early Bird Offer: Use code EARLY40 to get 40% Off
Book Your Seat Now

CloudPro #123: Three Rules for Designing an MCP Server You Won't Regret

Read on Web

There's a decent chance you're about to build an MCP server for your observability platform, or your team just shipped one. Before it becomes the thing everyone depends on, three design choices will save you a rewrite later.

This matters because once agents start relying on your MCP server, the contract gets sticky. You can change the implementation, but renaming tools or restructuring scopes breaks every instruction file and agentic workflow already pointing at them. The first version tends to become the permanent version.

Rule one: don't mix read and write

It's tempting to expose everything from a single server: get_logs and list_problems alongside the handy create_workflow and send_notification stuff. Don't. Put read-only data access in one server and anything that modifies state or triggers actions in a second, separately installed one.

Two reasons. The obvious one is accidental writes. The less obvious but bigger one is prompt injection. If an agent is connected to both, a poisoned log line or a malicious doc in your RAG pipeline can talk the agent into calling an admin action it had no business calling. Splitting servers shrinks that attack surface. And yes, the MCP spec lets users disable individual tools, but realistically most people leave everything on.

Rule two: don't ship only a generic execute_query tool

Every observability backend has a query language: NRQL, DQL, PromQL, whatever yours is. You could expose just execute_query and let the agent figure out the syntax. It'll work, but badly. The agent guesses, gets a syntax error, retries, refines, retries again. Every round-trip costs API calls, tokens, and latency.

Build purpose-specific tools for your top use cases alongside the generic one.
A get_logs tool taking a timespan and workload identifier will run a clean, optimized query on the first try. No guessing.

Don't overcorrect though. get_logs_from_k8s, get_logs_from_hosts, get_logs_from_apps: now you're maintaining ten tools and the agent picks the wrong one half the time anyway. Aim for the middle.

Rule three: validate locally before hitting the backend

When the agent sends a query, check it inside the MCP server first. Is the syntax valid? Does the caller actually have permission to access that data? These are cheap checks and they kill the trial-and-error loop at the source. Without them, every bad query becomes a billable backend call.

Instrument your MCP server and you'll see the pattern fast: most failures are the same two or three mistakes. Catch them locally.

These aren't the only mistakes you can make, but they're the ones that compound quietly until you're rebuilding the server at version two. Worth getting right the first time.

Read on Web

This article was adapted from Observability in the AI-Native Era.
50% Off for the Next 24 Hours: Get The Book

Cheers,
Shreyans Singh
Editor-in-Chief

Early Bird Offer: Use code EARLY40 to get 40% Off
Book Your Seat Now
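
Rules two and three from the MCP server issue above fit together naturally: a purpose-specific `get_logs` tool has few enough parameters that it can be validated in-process before anything reaches the backend. The sketch below shows that shape under loud assumptions. It does not use any MCP SDK; `READ_ACCESS`, `LocalValidationError`, the identifier pattern, the 24-hour cap, and especially the query string syntax are made up for illustration, and `run_query` stands in for whatever client your backend provides.

```python
import re

# Hypothetical access table: caller -> workloads they may read.
READ_ACCESS = {"alice": {"checkout", "payments"}}
# Assumed identifier format; adjust to your platform's naming rules.
WORKLOAD_PATTERN = re.compile(r"[a-z0-9-]{1,63}")
MAX_MINUTES = 24 * 60


class LocalValidationError(ValueError):
    """Rejected inside the MCP server, before any backend call."""


def validate(caller: str, workload: str, minutes: int) -> None:
    # Cheap syntax checks first, then the permission check.
    if not WORKLOAD_PATTERN.fullmatch(workload):
        raise LocalValidationError("malformed workload identifier")
    if not 1 <= minutes <= MAX_MINUTES:
        raise LocalValidationError("timespan out of range")
    if workload not in READ_ACCESS.get(caller, set()):
        raise LocalValidationError("caller may not read this workload")


def get_logs(caller: str, workload: str, minutes: int, run_query):
    """Purpose-specific read tool: builds one known-good query."""
    validate(caller, workload, minutes)
    # Made-up query syntax; substitute your backend's query language.
    query = f'logs | where workload == "{workload}" | last {minutes}m'
    return run_query(query)


if __name__ == "__main__":
    sent = []
    fake_backend = lambda q: sent.append(q) or ["log line"]
    print(get_logs("alice", "checkout", 15, fake_backend))
    try:
        get_logs("alice", "checkout", 0, fake_backend)
    except LocalValidationError as err:
        print("rejected locally:", err)
    print("backend calls:", len(sent))
```

Because the workload identifier is validated against a strict pattern before it is interpolated, the tool never forwards free-form agent text into the query. Counting how often each `LocalValidationError` message fires is one way to get the "same two or three mistakes" signal the article mentions.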

Shreyans from Packt

06 Mar 2026

Still Using DeploymentConfig? Here's Why It's Time to Move On

Learn OpenShift 4

CloudPro #122: Still Using DeploymentConfig? Here's Why It's Time to Move On

We're building something new. A dedicated newsletter for network engineers working with AI, where we will talk about the real challenges of bringing AI into production networks. Before we launch, we want to hear from you. This is a 30-second survey, and your answers will directly shape what we cover.

Take the Survey

If you've been running OpenShift for a few years, you almost certainly have DeploymentConfigs somewhere in your cluster. They work fine. They've always worked fine.

But they've been officially deprecated since OpenShift 4.14, and it's worth understanding why. Not just because Red Hat says so, but because the underlying technology has genuinely moved on.

Some context. When OpenShift 3 shipped back in 2015, Kubernetes didn't have great deployment management. DeploymentConfig filled that gap: lifecycle hooks, image change triggers, custom rollout strategies. It was ahead of its time, honestly.

But Kubernetes caught up. Deployments now cover rollbacks to earlier revisions, HPA-based autoscaling, and pause and resume during rollouts. And they're built on ReplicaSets, not ReplicationControllers. That bit matters because ReplicationController development has stopped upstream entirely. The foundation DeploymentConfig sits on isn't getting any love anymore.

There's a design difference worth knowing about too. DeploymentConfig leans toward consistency: if the node running your deployer pod dies, it just waits. Waits for the node to recover, or for someone to manually step in. Deployments lean toward availability: the controller manager runs across multiple masters, so another one picks up the work. In production, you usually want that.

The other thing is portability. DeploymentConfigs only exist in OpenShift. Deployments are standard Kubernetes.
If you ever need to move workloads between clusters or providers, that distinction starts to matter a lot.

And by sticking with DeploymentConfig, you're also cutting yourself off from tooling built around standard Deployments, such as Argo Rollouts for canary or blue-green. It's fine until you're handling a rollout by hand at 2am during an incident and wishing you weren't.

Nobody's flipping a switch on you tomorrow. DeploymentConfigs still run. But "it still works" isn't really a strategy. If you haven't started thinking about this, now's a good time.

Cheers,
Shreyans Singh
Editor-in-Chief

Read The Article

This article was adapted from Learn OpenShift.
Get The Book

Early Bird Offer LIVE Now: 40% Off. Use code EARLYBIRD40
Book Your Seat Now

Early Bird Offer LIVE Now: 50% Off. No Code Needed.
Book Your Seat Now

Shreyans from Packt

19 Feb 2026

Your Kubernetes cluster wasn't built for AI workloads

By Shadab Hussain and Sandeep Raghuvanshi

CloudPro #120

Most AI workshops teach you how to deploy a model. This one teaches you what happens after: when the traffic spikes, the GPU scheduling fails, and your platform team is debugging it at 2 AM.

Early Bird Offer LIVE Now: 40% Off With Code EARLY40
Book Your Seat Now
Offer Ends in 72 Hours

Learn from engineers who run AI infrastructure at scale:
- Shadab Hussain: Google Developer Expert (AI/ML), Lead Engineer at MathCo
- Sandeep Raghuwanshi: Former Kubernetes SME at Microsoft, Head of DevOps & InfoSec at Bureau
- Nicolas Vermandé: Senior Developer Advocate at ScaleOps, CCIE #47363 & VCDX #55
- Derek Ashmore: Agentic AI Enablement Principal at Asperitas Consulting

What you'll walk away with:
- Infrastructure patterns for safely running AI workloads and agents on K8s
- GPU scheduling and scaling strategies that hold up under real load
- Zero-trust security controls for AI agent traffic
- A tested playbook for debugging AI-related production failures
- First-hand experience from a capstone incident simulation: traffic spike, resource contention, partial failure

Early Bird Offer LIVE Now: 40% Off With Code EARLY40
Book Your Seat Now
Offer Ends in 72 Hours

Cheers,
Shreyans Singh
Editor-in-Chief


Shreyans from Packt

13 Feb 2026

How to Build Always-On Applications on Azure


By Stéphane Eyskens

CloudPro #119

Today’s CloudPro Expert Article comes from Stéphane Eyskens, a Microsoft Azure MVP and seasoned solution architect with over a decade of experience designing enterprise-scale cloud systems. Stéphane is the author of The Azure Cloud Native Architecture Mapbook (2nd Edition), a comprehensive guide featuring over 40 detailed architecture maps that has earned 5.0 stars on Amazon and become an essential resource for cloud architects and platform engineers.

In the article below, Stéphane tackles one of the most challenging aspects of Azure architecture: building truly resilient multi-region systems, with concrete examples using Azure SQL, Cosmos DB, and Azure Storage, complete with code samples and Terraform scripts you can adapt for your own DR testing.

Happy reading!
Shreyans Singh
Editor-in-Chief

I like to say that Azure is simple… until you go multi-region. The transition from a well-designed single-region architecture to a truly resilient multi-region setup is where simplicity gives way to nuance. Concepts that seemed abstract (high availability versus disaster recovery, failover semantics, DNS behavior, data replication guarantees) suddenly become very real, very concrete, and sometimes painfully operational.

This article is written for architects and senior platform engineers who already understand the fundamentals and need to build solutions that remain available despite regional outages, service failures, or infrastructure-level incidents. The scope is intentionally narrowed to Recovery Time Objective (RTO). Data corruption, ransomware, and backup-based recovery are explicitly out of scope. Instead, the focus is on how applications and data services behave during live failover scenarios, and how architectural decisions, sometimes subtle ones, can make the difference between a seamless transition and a prolonged outage.

Through concrete examples using Azure SQL, Cosmos DB, and Azure Storage, this article explores how replication models, DNS design, private endpoints, and SDK behavior interact at runtime, and what architects must do to ensure their applications remain functional when regions fail.

Rather than focusing on theoretical patterns, the goal here is pragmatic: minimizing downtime and operational friction when things do go wrong. You’ll see diagrams, Terraform and deployment scripts, plus .NET code samples you can adapt for your own DR tests and game days.

Before getting into the details, let’s briefly revisit the difference between high availability (HA) and disaster recovery (DR).

HA and DR exist on a spectrum, with increasing levels of resilience depending on the type of failure you want to withstand:

- Application-level failures: In some cases, you may simply want to tolerate application bugs, for example a memory leak introduced by developers. Running multiple instances of the application on separate virtual machines, even on the same physical host, can already prevent a full outage when one instance exhausts its allocated memory. That is, for instance, what you get if you spin up two instances of an Azure App Service within the same zone (no zone redundancy).
- Hardware failures: To handle hardware failures, workloads should be distributed across multiple racks. That is what you get when you host virtual machines in availability sets.
- Data center-level outages: To withstand more severe incidents, workloads should be spread across multiple data centers, for example by deploying them across multiple availability zones. You can achieve this by turning on zone redundancy on Azure App Service or by using zone-redundant node pools in AKS. With such a setup, you should survive a local disaster such as a fire or a flood.
- Regional outages: Finally, to survive major outages, such as a major earthquake or a country-level power supply issue, workloads must be deployed across geographically distant data centers. You can achieve this by deploying workloads across multiple Azure regions in active/active or active/passive mode.

Looking at Azure SQL

Let’s first analyse the different data replication possibilities with Azure SQL. Table 1 summarizes the different capabilities.

Table 1 – Replication capabilities

We’ll set aside named replicas and geo-restore, as the former does not contribute to disaster recovery and the latter is likely to introduce significant downtime and potential data loss. This leaves geo-replication as the remaining option. As you might have understood by now, Azure SQL’s built-in capabilities do not allow a full active/active setup, since the service doesn’t support multi-region writes. You can only have one read-write region; the secondary region(s) are read-only.

Table 2 outlines the two available geo-replication techniques.

Table 2 – Geo replication options

Active geo-replication may require updates to connection strings or DNS records to point to the new primary after a failover. That said, the actual impact depends on where (*) the client application is located as well as how you deploy to both regions. Let’s look at this in more detail.
Figure 1 illustrates an active geo-replication setup between Belgium Central and France Central.

Figure 1 – SQL geo replication with active geo replication

In such a setup, under normal circumstances:

- Workloads in the primary region (Belgium Central) can connect to the primary server in read/write mode.
- Workloads in the primary region can perform read-only activities against the secondary replica, provided they tolerate the extra latency incurred by the round trip to the remote region (France Central).
- Workloads in the secondary region (if any) can perform read-only operations against the read replica with no extra latency.

The configuration shown in Figure 1 supports a database-only failover. Both regions expose private endpoints to both SQL servers and rely on region-scoped DNS zones.

Although Private DNS zones are global by design, keeping them regional allows each region to resolve both the primary and secondary servers. This requires four DNS records in total: primary and secondary endpoints registered in each regional zone.

With a single shared DNS zone, this would not be possible: while all four private endpoints could be deployed, only two DNS records would be registered, since the endpoints map to just two FQDNs (primary and secondary). A single shared zone works, but it keeps the regions siloed and prevents any cross-region traffic. From a resilience standpoint, it is preferable to provide as many fallback paths as possible.

Moreover, as we will see later, with other resources such as Storage Accounts, a single DNS zone would force us to update the DNS records upon failover, causing brief downtime. Bottom line: using multiple DNS zones prevents issues during failover.

Back to active geo-replication! In case of failover, the SQL servers switch roles: the primary becomes secondary and vice versa. Concretely, the connection string primary.database.windows.net targets the read/write region in a normal situation but a read-only or unavailable one after failover. Workloads using this connection string would either stop working (if the regional outage persists) or talk to a read-only database instead of a read-write one once the failover has completed. Similarly, the connection string secondary.database.windows.net, which targets the read-only region under normal circumstances, targets the read-write one after failover.

Knowing this, a few options exist:

- You may choose to fail over everything (database + compute). In that scenario, workloads running in the secondary region can use their default secondary connection string, which will automatically target the new primary after failover. This approach requires the deployment pipeline to be region-aware, detect the target region, and apply the appropriate connection string. When deployed in the primary region, the application should use primary.database.windows.net, while in the secondary region it should already be configured with secondary.database.windows.net. This design eliminates the need for any connection string changes after failover. If your web apps, K8s pods, etc. are already up and running, the only thing you still have to do is route traffic to them. Any other SQL client not running in the secondary region (e.g., on-premises) would have to update its connection string to target the new primary.
- You may choose to redeploy the compute infrastructure (web apps, etc.) to the secondary region only in case of a regional outage. This approach is cheaper but risky: you’re not guaranteed to have the available capacity, and it causes significant downtime. However, it allows you to adjust your pipelines, specify the right connection string, and simply redeploy your infrastructure and/or application package.
- If you want to deploy the application with the exact same settings in both regions, you’ll need to update the connection string used by workloads in the secondary region, since primary.database.windows.net will resolve to an unavailable server after failover. If the original primary later comes back online, it will return as a secondary (read-only) replica, which would not support write operations. You can also make your application failover-aware (**).

You can’t simply update DNS (that is, make secondary target primary and vice versa), because the FQDN (primary-or-secondary.database.windows.net) is validated by the target server and the names must match, so redirecting it to a different server would simply fail.

In conclusion, when using active geo-replication as the replication technique, you should make your applications failover-aware (**), pre-provision both connection strings, and implement the failover/retry logic in the application code itself. You may wrap your Entity Framework context in a factory to abstract away the retry logic. Given we typically use a scoped lifetime, you may expect some HTTP requests to fail (in the case of an API), but new instances targeting the right server would ultimately succeed without having to restart the application. You may also use a geo-redundant Azure App Configuration store, fail it over along with SQL, then switch the primary server connection string after failover.
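The failover-aware approach described above (pre-provisioned connection strings plus retry logic behind a factory) can be sketched roughly as follows. This is a minimal Python illustration, not the article's .NET code: `FailoverAwareConnectionProvider`, `is_writable` and `simulate_failover` are hypothetical names, and the probe is injected so the sketch runs without a database. Against a real Azure SQL database, one possible probe is to query `DATABASEPROPERTYEX(DB_NAME(), 'Updateability')` and check for `READ_WRITE`.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence

# Server names taken from the article; everything else is illustrative.
PRIMARY = "primary.database.windows.net"
SECONDARY = "secondary.database.windows.net"


class NoWritableServerError(RuntimeError):
    """Raised when none of the pre-provisioned servers accepts writes."""


@dataclass
class FailoverAwareConnectionProvider:
    # Pre-provisioned servers, in this region's order of preference.
    candidates: Sequence[str]
    # Hypothetical probe: True if the server is reachable and read-write.
    # It should raise ConnectionError when the server is unreachable.
    is_writable: Callable[[str], bool]
    _current: Optional[str] = field(default=None, init=False)

    def get_server(self) -> str:
        """Return a read-write server, trying the last known good one first."""
        ordered: List[str] = []
        if self._current is not None:
            ordered.append(self._current)
        ordered.extend(c for c in self.candidates if c != self._current)
        for server in ordered:
            try:
                if self.is_writable(server):
                    self._current = server
                    return server
            except ConnectionError:
                continue  # unreachable: try the next candidate
        self._current = None
        raise NoWritableServerError("no read-write server among candidates")


def simulate_failover() -> List[str]:
    """Walk through normal operation, failover, and the old primary's return."""
    roles = {PRIMARY: "read_write", SECONDARY: "read_only"}

    def probe(server: str) -> bool:
        if roles[server] == "down":
            raise ConnectionError(server)
        return roles[server] == "read_write"

    provider = FailoverAwareConnectionProvider([PRIMARY, SECONDARY], probe)
    seen = [provider.get_server()]  # normal operation
    roles[PRIMARY], roles[SECONDARY] = "down", "read_write"  # outage + failover
    seen.append(provider.get_server())
    roles[PRIMARY] = "read_only"  # old primary comes back as a secondary
    seen.append(provider.get_server())
    return seen
```

Calling `get_server()` once per request scope mirrors the scoped-lifetime behaviour mentioned above: a request that started just before the failover may still fail, while later scopes resolve to the new read-write server without restarting the application.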
The Azure App Configuration SDK allows you to monitor a sentinel key and to reload the configuration without having to restart the application.
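To make the sentinel-key idea concrete, here is a small Python sketch of the pattern only. It does not use or reproduce the Azure App Configuration SDK: `SentinelRefreshedConfig`, `fetch_all`, `fetch_key` and the key names are hypothetical, and an in-memory dictionary stands in for the configuration store. The point is that the application polls one cheap key and reloads every setting only when that key's value changes.

```python
from typing import Callable, Dict, Optional


class SentinelRefreshedConfig:
    """Reloads all settings only when a single sentinel key changes."""

    def __init__(
        self,
        fetch_all: Callable[[], Dict[str, str]],      # hypothetical: read every setting
        fetch_key: Callable[[str], Optional[str]],    # hypothetical: read one key
        sentinel_key: str = "sentinel",
    ) -> None:
        self._fetch_all = fetch_all
        self._fetch_key = fetch_key
        self._sentinel_key = sentinel_key
        self._settings = fetch_all()
        self._sentinel_value = self._settings.get(sentinel_key)

    def refresh(self) -> bool:
        """Poll the sentinel; reload everything if it changed. True if reloaded."""
        latest = self._fetch_key(self._sentinel_key)
        if latest == self._sentinel_value:
            return False  # nothing signalled, keep the cached settings
        self._settings = self._fetch_all()
        self._sentinel_value = latest
        return True

    def get(self, key: str) -> str:
        return self._settings[key]


def demo() -> list:
    # In-memory stand-in for the configuration store (an assumption).
    store = {"sentinel": "1", "sql_server": "primary.database.windows.net"}
    config = SentinelRefreshedConfig(lambda: dict(store), store.get)

    observed = []
    # Operator switches the server after failover but has not bumped the sentinel.
    store["sql_server"] = "secondary.database.windows.net"
    observed.append((config.refresh(), config.get("sql_server")))
    # Bumping the sentinel signals that the new values are ready to be picked up.
    store["sentinel"] = "2"
    observed.append((config.refresh(), config.get("sql_server")))
    return observed
```

Updating the sentinel last lets an operator change several settings and have the application pick them up together, rather than observing a half-updated configuration.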


