By Saurabh Shrivastava

CloudPro #110

82% of data breaches happen in the cloud. The reality is you can’t stop every single attack, so survival depends on how fast you can recover.

Join us for the Cloud Resilience Summit on December 10th to:
- Build true cyber resilience by shifting to an “assume breach” strategy
- Gain practical, real-world cloud insights
- Ensure rapid business recovery and minimal financial impact with a cloud restoration strategy

Save My Spot

This week’s CloudPro Special comes from Saurabh Shrivastava, Global Solutions Architect Leader at AWS and author of the bestselling Solutions Architect’s Handbook. With over two decades in the industry, Saurabh has helped shape how enterprises build and secure cloud systems.

And in today’s article, he explores a radical idea: AI that runs entirely offline. No APIs, no data leaving your network. Just private, local intelligence built for sensitive environments. Sounds interesting? Read on for the full article.

If you want to learn directly from him, Saurabh is hosting a live AWS Solutions Architect Associate (SAA-C03) Workshop on January 17. It’s a hands-on, fast-paced session that strips the exam down to what really matters. CloudPro readers get an exclusive 40% early-bird discount with the code CLOUDPRO. Reserve your seat.

Cheers,
Shreyans Singh
Editor-in-Chief

Early Bird Offer: Get 40% Off
Use code CLOUDPRO

AI That Runs Entirely Offline: How to Build an Offline Enterprise Assistant

By Saurabh Shrivastava

Working in defense, finance, law, or a heavily regulated industry means you can’t just plug into ChatGPT and call it a day. Cloud-based AI tools aren’t built for environments where data leakage isn’t just bad. It’s catastrophic.

You can’t send classified intel or proprietary financial models to someone else’s servers. And if you’re operating in an air-gapped network? Forget about it.

That’s the problem this Offline Enterprise Assistant solves.

It’s a local AI setup that runs entirely on your own hardware. No cloud dependencies. No API keys. No data leaving your perimeter. You choose a model, LlamaCpp, Ollama, whatever fits your needs, and run it directly on your machine. Every prompt, every response, every log file stays inside your infrastructure.

This matters when you’re reviewing sensitive legal contracts, running R&D analyses, or automating workflows that involve confidential information. You get the productivity boost of modern AI without opening the door to external risk. It’s built for teams that need full control over their tools and can’t afford to trust a third party with their data.

Why This Architecture Stands Out

- Runs Without Internet: Operates 100% offline, making it ideal for air-gapped networks or classified infrastructure.
- Keeps Data on Your Device: Nothing is sent out, nothing is tracked. You stay in control, always.
- Fast and Responsive: Local inference means no lag and no rate limits, just consistently quick responses.
- Built for Sensitive Workflows: Legal reviews, research, compliance, and internal tooling are all handled securely.

Most teams are realizing that AI doesn’t always belong in the cloud. When you’re dealing with internal systems, sensitive data, or strict compliance rules, you need something that stays inside your walls. That’s where a local-first approach makes sense: it gives you the benefits of AI without the exposure.

This Offline Enterprise Assistant is built around that idea. It’s your own assistant, running entirely on your hardware, tuned to your environment, and never sending a single request outside your network. You control how it works, how it’s updated, and what data it touches.

Let’s break down how the architecture fits together.

Architecture Explanation

The offline MCP Client architecture is designed to deliver end-to-end private and local AI capability, without any reliance on cloud APIs or outbound network traffic. Here’s how it works (a short code sketch of the core request path follows the list):

- Developer: Prepares prompts or workflows using a local development environment (such as a secure IDE or terminal). All interactions originate and remain on the local device.
- MCP Client: Acts as the interface between the developer’s inputs and the AI model. It routes prompts to the embedded LLM, orchestrates the workflow, and handles results.
- Offline LLMs (LlamaCpp / Ollama): Large language models are loaded and executed directly on the local hardware. No external API calls; all model inference and response generation happen on the device, fully offline.
- Local SQLite Database: Stores chat logs, prompts, and results securely and privately. Provides an audit trail and the ability to revisit past interactions, entirely within the local infrastructure.
- Secure UI/API: Presents results to the developer via a local web interface or terminal UI. Enables further integration with internal systems while ensuring data never leaves the trusted environment.
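To make that request path concrete, here is a minimal sketch of the inference hop: a prompt leaves your terminal, reaches a model served on the same machine, and the answer comes back with no outbound traffic. It assumes Ollama is running locally with a model already pulled; the model name (llama3) and the default loopback port (11434) are assumptions about your setup, so adjust both to match it.

```python
# Minimal sketch of the local inference hop. Everything stays on localhost:
# the prompt goes to a model served on this machine and the reply comes back
# without any external API call. Assumes an Ollama server with "llama3" pulled.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # loopback only, no internet

def ask_local_llm(prompt: str, model: str = "llama3") -> str:
    """Send a prompt to the locally served model and return its reply."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False})
    req = urllib.request.Request(
        OLLAMA_URL,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    print(ask_local_llm("Summarize the confidentiality clause in this contract: ..."))
```

The same shape works for a llama.cpp server or any other local runtime; only the endpoint and payload format change.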
Think about it. You don’t want your data, your prompts, or your workflows slipping out into the cloud. With this architecture, nothing leaves your machine. Zero external exposure. No tokens. No API keys. No hidden traffic.

If you’re in a regulated industry, whether it’s defense, legal, healthcare, or any air-gapped environment, this setup checks every box. It keeps you compliant, private, and secure while still giving you the power of modern AI. And here’s the best part: it’s extensible by design.

Want to add another LLM? Done. Need to customize workflows? Easy. Ready to experiment with agentic AI? Go ahead. You can build without ever breaking the privacy barrier.

Most importantly, this isn’t a short-term solution. It’s future-proof. As on-device AI models become larger and smarter, this architecture will scale with you, handling more automation, more intelligence, and more complexity.

Now it’s time to get our hands dirty and implement it.

Implementation

Using LM Studio, Streamlit, and Python, you’ll set up and run local open-source models directly on your machine. Unlike online AI assistants such as ChatGPT or Google Bard, which need constant internet connectivity and send data back to external servers, this approach runs completely offline.

Along the way, you’ll gain hands-on experience with the full cycle: you’ll understand how local LLMs really work, set up all the required software and dependencies, download and run an open-source model in LM Studio, and then build a simple yet powerful chat interface using Streamlit. From there, you’ll integrate your local LLM into the Streamlit app and learn how to securely store and review chat history using a local database.

Before you dive into building your offline Enterprise Assistant, it’s important to get familiar with a few key concepts. At the heart of this setup is the Offline Assistant itself: an AI system that runs entirely on your computer, performing all language model inference locally without ever needing an internet connection. Powering this is an LLM (Large Language Model), a type of AI trained on massive datasets to generate human-like text responses. To make it simple to use, you’ll rely on LM Studio, a desktop app that lets you download, run, and serve open-source LLMs on your machine, exposing them through a local API. For the interface, you’ll use Streamlit, a Python framework that makes it easy to build interactive web apps and quickly prototype AI-driven tools. And finally, for securely managing chat history, you’ll work with SQLite, a lightweight local database that keeps all your interactions private and fully stored on your device.
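With those concepts in hand, here is a minimal sketch of what the Streamlit chat interface can look like when it talks to LM Studio. LM Studio serves loaded models through an OpenAI-compatible local API (port 1234 by default); the base URL and the model identifier below are assumptions, so substitute whatever your LM Studio instance reports.

```python
# streamlit_app.py -- minimal sketch of the chat UI, assuming LM Studio's
# local server is running with a model loaded. The URL and MODEL_NAME are
# placeholders for your own setup.
import requests
import streamlit as st

LMSTUDIO_URL = "http://localhost:1234/v1/chat/completions"  # OpenAI-compatible
MODEL_NAME = "local-model"  # LM Studio shows the real identifier for your model

st.title("Offline Enterprise Assistant")

# Keep the running conversation in Streamlit's session state.
if "messages" not in st.session_state:
    st.session_state.messages = []

# Replay earlier turns so the page always shows the full conversation.
for msg in st.session_state.messages:
    with st.chat_message(msg["role"]):
        st.markdown(msg["content"])

if prompt := st.chat_input("Ask your assistant..."):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.markdown(prompt)

    # Send the whole history so the local model keeps conversational context.
    resp = requests.post(
        LMSTUDIO_URL,
        json={"model": MODEL_NAME, "messages": st.session_state.messages},
        timeout=120,
    )
    answer = resp.json()["choices"][0]["message"]["content"]

    st.session_state.messages.append({"role": "assistant", "content": answer})
    with st.chat_message("assistant"):
        st.markdown(answer)
```

Launch it with streamlit run streamlit_app.py and the assistant opens in your browser, with every request staying on localhost.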
By the end of this hands-on exercise, you’ll have your own local Enterprise Assistant running directly in your browser, powered by an open-source LLM that operates fully offline through LM Studio. You’ll interact with it using a simple but effective interface built with Streamlit, making your assistant practical and easy to use. Most importantly, every conversation will be securely stored as local chat logs on your system, never sent to the cloud, never exposed. By the time you’re done, you’ll walk away with a private, offline AI assistant that runs fast and stays entirely under your control.
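As one way to implement those local chat logs, here is a minimal SQLite sketch built on Python’s standard library. The file name and schema are illustrative assumptions; call log_interaction from the Streamlit app after each model response, and recent_history to review past exchanges.

```python
# Minimal sketch of the local audit trail: every prompt/response pair is
# written to a SQLite file on disk, so history never leaves the machine.
import sqlite3
from datetime import datetime, timezone

DB_PATH = "chat_history.db"  # stays on the local filesystem

def init_db() -> None:
    """Create the log table once, if it does not already exist."""
    with sqlite3.connect(DB_PATH) as conn:
        conn.execute(
            """CREATE TABLE IF NOT EXISTS chat_log (
                   id INTEGER PRIMARY KEY AUTOINCREMENT,
                   ts TEXT NOT NULL,
                   prompt TEXT NOT NULL,
                   response TEXT NOT NULL
               )"""
        )

def log_interaction(prompt: str, response: str) -> None:
    """Append one exchange to the local audit trail."""
    with sqlite3.connect(DB_PATH) as conn:
        conn.execute(
            "INSERT INTO chat_log (ts, prompt, response) VALUES (?, ?, ?)",
            (datetime.now(timezone.utc).isoformat(), prompt, response),
        )

def recent_history(limit: int = 10) -> list:
    """Return the most recent exchanges, newest first, for review in the UI."""
    with sqlite3.connect(DB_PATH) as conn:
        return conn.execute(
            "SELECT ts, prompt, response FROM chat_log ORDER BY id DESC LIMIT ?",
            (limit,),
        ).fetchall()
```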
Demo Video and Repo

Lab guide

Conclusion

Congratulations! You’ve just built your very own offline Enterprise Assistant, powered entirely by open-source tools and running fully on your machine. Along the way, you learned how to set up LM Studio to run an LLM locally, how to create a lightweight but effective interface with Streamlit, and how to store all your conversations securely using SQLite. Most importantly, you now understand how to put privacy first, keeping every prompt, response, and workflow under your complete control, with no reliance on external servers or cloud APIs.

This hands-on exercise gave you more than just a working prototype. You gained insight into how local LLMs work, how to integrate them into real-world applications, and how to design AI tools that balance functionality with security. You’ve also seen the bigger picture: how on-device AI can reshape the way enterprises approach sensitive tasks, from R&D to legal reviews to compliance-heavy workflows.

But this is only the beginning. You can now extend your Enterprise Assistant with advanced features:

- Add a smarter UI with more interactive elements.
- Try out different open-source models to experiment with speed, accuracy, and capabilities.
- Layer in analytics and insights to track and optimize your usage.
- Even push towards agentic AI, giving your assistant the ability to automate tasks and workflows while still running securely offline.

With what you’ve built, you’ve proven that you can harness the power of Generative AI without compromise: no data leaks, no internet dependency, no loss of control.

Your private AI journey starts here.

- Saurabh

Sponsored: Build your next app on HubSpot with the flexibility of an all-new Developer Platform. The HubSpot Developer Platform gives you the tools to build, extend, and scale with confidence. Create AI-ready apps, integrations, and workflows faster with a unified platform designed to grow alongside your business.

Start Building Today

📢 If your company is interested in reaching an audience of developers, technical professionals, and decision makers, you may want to advertise with us.

If you have any comments or feedback, just reply to this email.

Thanks for reading and have a great day!