
CloudPro

57 Articles

Kubernetes v1.33 now supports hybrid post-quantum TLS key exchange by default

Shreyans from Packt
28 Jul 2025
7 min read
CloudPro #101
Daily Cloud Insights. Follow Packt SysOps on LinkedIn.

In this week’s issue, there’s a quick fix for bloated Terraform states, a clean Docker Compose alternative with Quadlet, and new AWS features like remote Lambda debugging and native blue/green ECS deploys. You’ll also find a GitOps primer with Argo CD, a free Kubernetes IDE, and real benchmark data on which Gateway API controllers hold up at scale.

If you want updates like these daily, not just weekly, follow Packt SysOps: one practical post every weekday at 9AM ET, with lessons from real cloud teams.

Cheers,
Shreyans Singh
Editor-in-Chief

📦 Kubernetes & Cloud Native

Kubernetes v1.33 now supports hybrid post-quantum TLS key exchange by default, thanks to its upgrade to Go 1.24. This enables X25519+ML-KEM (Kyber) for TLS without explicit configuration. However, mismatched Go versions across clients and servers can cause silent downgrades to classical encryption. PQ signatures aren’t yet production-ready due to large key sizes and limited tooling support.

EKS Auto Mode now supports pod-specific subnets using podSubnetSelectorTerms in EKS Node Classes. This allows developers to assign pods IPs from custom subnet ranges, improving network isolation. Combined with Karpenter Node Pools and Terraform automation, teams can now declaratively manage these configurations at scale.

Intro to GitOps with Argo CD
This beginner-friendly guide explains GitOps and shows how to deploy Argo CD to automate Kubernetes app delivery. It walks through installing Argo CD, exposing it via Ingress, and logging in securely, eliminating complex CD pipelines and simplifying multi-team access. (A minimal install sketch appears at the end of this issue’s summary.)

Free IDE for Kubernetes
Freelens is a free, open-source desktop app for managing Kubernetes clusters, available for macOS, Linux, and Windows via multiple package managers. Built as a fork of OpenLens, it simplifies cluster operations with a clean UI, bundled CLI tools, and extension support.

A new open-source benchmark suite tests seven major Gateway API implementations, including Istio, Envoy Gateway, and Traefik, across route setup, scaling, architecture, and performance. The results show large differences in reliability and scalability, with Istio and Kgateway standing out positively, while Nginx, Cilium, and Traefik suffered critical failures or severe scaling issues. For cloud engineers, this benchmark helps cut through marketing claims and highlights which controllers are production-ready.

⚙️ Infrastructure & DevOps

Google Cloud Run Adds Native Support for Docker Compose AI Deployments
Now in private preview, this simplifies moving multi-container AI apps from local to cloud with GPU support and persistent volumes. Cloud Run’s recent GPU GA and fast scaling make it a strong platform for agentic and LLM workloads.

Google Cloud Expands Cluster Director with GUI, Managed Slurm, and Anomaly Detection
Users can launch optimized clusters with GPU, network, and storage setup in under a day, with built-in topology-aware scheduling and straggler detection. The updates aim to reduce setup time and improve performance for large-scale distributed training.

AWS has launched a developer preview of the API MCP Server, allowing foundation models to convert natural language into valid AWS CLI commands. This tool enables FM-powered agents to inspect and manage AWS resources securely through IAM-based permissions. It’s open source and now available on GitHub for experimentation.

Amazon Bedrock AgentCore is now in preview, offering modular services to help developers run AI agents at scale with production-grade security and observability. It includes tools for session management, memory, API integration, web browsing, code execution, and identity control.

AWS has launched two new features for Lambda: console-to-IDE integration and remote debugging. Developers can now open Lambda functions directly in VS Code with a single click, and debug cloud-deployed functions live from their IDE, including access to VPCs and IAM roles.

Amazon ECS now supports native blue/green deployments, making it easier to roll out application updates safely without custom tooling. You can test new revisions in parallel, use lifecycle hooks for automated validation, and instantly roll back if needed, all with no end-user disruption.

🔐 Cloud Security

AWS Fixes Flaw That Allowed Full Org Takeover via Delegated Admins
Researchers found a way to take over entire AWS Organizations by combining misconfigured delegated admin accounts with an overly permissive managed policy. A user in a compromised account could gain control of every account, including the management account. AWS has released a fixed policy (v2), but the old version is still active if not manually replaced. Teams should audit delegated admin roles and update any remaining v1 policies immediately.

AWS IAM Action Classifications Updated, But Inconsistencies Remain
Fog Security found mismatches between AWS’s new programmatic IAM action listings and the older Service Authorization Reference (SAR) pages. Some actions have multiple classifications; others are missing or categorized differently across the two sources. These inconsistencies could affect IAM tooling and workflows. Teams using SAR data should review the differences before switching to the new programmatic references.

CDK Construct That Syncs Your SOPS Secrets into AWS Secrets Manager
The cdk-sops-secrets project helps developers securely sync SOPS-encrypted secrets into AWS Secrets Manager or SSM Parameter Store using CDK constructs. It supports JSON, YAML, dotenv, and binary formats, with features like batch uploads and automatic IAM permission generation. The tool also allows customization via a singleton Lambda provider.

Serverless Password Manager Built Entirely on AWS Free Tier
RunaVault is an open-source password manager using AWS Cognito, Lambda, DynamoDB, and KMS to store and share secrets securely. It’s built for zero-cost deployments under the AWS free tier, with features like MFA, RBAC, and client-side encryption.

S3 Security Scanner for Access and Ransomware Protection
YES3 is a Python-based tool that scans AWS S3 buckets for security misconfigurations, including public access, encryption gaps, versioning, and object lock issues. It also checks account-wide settings like public access blocks and logs findings in a readable format.

🔍 Observability & SRE

Google Cloud has rolled out a new Application Monitoring feature that auto-generates dashboards, logs, and metrics views for services defined in App Hub. The tool helps teams troubleshoot faster by surfacing golden signals and propagating labels across logs, metrics, and traces. It also integrates with Gemini Cloud Assist.

Microsoft has expanded Project Flash to give Azure users deeper, real-time visibility into VM availability disruptions. New features include a context-aware metric in Azure Monitor that distinguishes between platform- and user-triggered issues, and Event Grid integration for instant alerts.

Amazon EventBridge now supports enhanced logging to CloudWatch, S3, and Kinesis Data Firehose, tracking the full lifecycle of events from receipt to success or failure. Users can choose log levels (error, info, trace), include event payloads, and capture rich metadata, latency breakdowns, and error details. This makes it easier to trace event flows, track rule matches and invocation errors, and pinpoint failures in Lambda targets or API destinations without custom tooling.

AWS S3 Metadata now supports querying metadata for all objects, not just new or updated ones, via fully managed Iceberg tables. Live inventory tables and journal tables enable SQL-based queries to track storage usage, object changes, deletions, and lifecycle activity. This simplifies cost optimization, auditing, and ML pipeline prep by eliminating the need for manual scanning or S3 Inventory jobs.
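As a companion to the Argo CD primer above, here is the standard install sequence from the Argo CD documentation; a minimal sketch, while the guide itself goes on to cover exposing the UI via Ingress and logging in securely.

```bash
# Install Argo CD into its own namespace using the stable manifest from the official repo
kubectl create namespace argocd
kubectl apply -n argocd \
  -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

# Fetch the auto-generated admin password for the first login
kubectl -n argocd get secret argocd-initial-admin-secret \
  -o jsonpath='{.data.password}' | base64 -d; echo
```

From there, the Ingress-based exposure described in the guide replaces the usual kubectl port-forward workaround used for quick local access to the argocd-server service.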


Microsoft engineers contributed a new authentication method to Grafana

Shreyans from Packt
14 Jul 2025
7 min read
CloudPro #99
Daily Cloud Insights. Follow Packt SysOps on LinkedIn.

> Grafana now supports Azure managed identities, so you can skip the usual credential headaches. Really useful if you’re juggling OAuth providers.
> Google is catching leaked credentials in public repos within minutes, which honestly should’ve been standard by now.
> Kubernetes is adding smarter routing for LLM workloads, reducing GPU bottlenecks. Could be worth a look if you’re running GenAI models.
> And there’s finally a practical guide for securing OpenTelemetry collectors with proper mTLS in Kubernetes: cleaner architecture for multi-cluster setups.

We also have some good reads on safer curl installs, scaling Argo CD, debugging Kubernetes deployments, and cutting observability costs without sacrificing coverage.

Already get your weekly CloudPro updates? Packt SysOps keeps you sharp every single day: one quick, practical post every day at 9AM, covering cloud security fixes, Kubernetes tips, DevOps tooling, and scaling lessons from real teams. Follow the page and stay updated in 2 minutes.

Cheers,
Shreyans Singh
Editor-in-Chief

🔐 Cloud Security

Microsoft engineers contributed a new authentication method to Grafana, enabling “managed identity” logins tied to Azure’s identity system. This eliminates the need for credentials or certificate rotation by authenticating users based on identity claims. The change allows Grafana users to mix authentication methods and extends to any OAuth 2.0-based identity provider.

How Google Cloud is securing open-source credentials at scale
Google Cloud has launched automated scanning for leaked Google Cloud credentials in public open-source artifacts like Docker images and package repositories. The system flags credentials within minutes of publication and alerts users via email or product logs. This aims to reduce cloud breaches from credential leaks, which account for 16% of incidents.

Building a cloud security roadmap: tools by layer and when you need them
Grounded Cloud Security published a detailed guide on choosing security tools based on cloud architecture layers: control plane, orchestration, platform, and application. It explains common threats like API key leaks, container misconfigurations, and application exploits, mapping them to tools like CNAPP, CSPM, KSPM, and PAM.

Exposing OpenTelemetry Collector Securely with Gateway API and mTLS
A new guide explains how to securely expose OpenTelemetry Collectors in Kubernetes using the Gateway API with mutual TLS. This setup helps teams aggregate telemetry from external apps, multi-cluster services, or hybrid environments while enforcing strong authentication. The approach uses Istio’s Gateway API implementation and mTLS to protect gRPC endpoints.

AWS published a step-by-step guide on building a secure serverless streaming pipeline using Amazon MSK Serverless and EMR Serverless with IAM authentication. It shows how to ingest data with Kafka, process it via Spark Structured Streaming, and query outputs in S3 using Athena. This design eliminates manual TLS setup, simplifies scaling, and enforces IAM-based access control, ideal for teams seeking managed, low-ops streaming pipelines.

A new CLI tool called vet has launched to secure the common curl | bash install pattern. It fetches remote install scripts, shows diffs from previous runs, runs ShellCheck for linting, and requires user approval before execution. vet targets DevOps teams wanting safer automation workflows, reducing the risk of blind script execution. (A manual version of the same idea is sketched at the end of this issue’s summary.)

⚙️ Infrastructure & DevOps

Google Cloud and Docker are simplifying AI app deployment with native support for Docker Compose on Cloud Run. Developers can now use gcloud run compose up to deploy multi-container AI apps from a local compose.yaml file, including GPU-backed models, with one command.

Google Cloud detailed strategies for optimizing GKE workload scheduling when resources are tight. Techniques include workload priorities, balloon pods for quick scaling, compute classes for fallback node types, and multi-cluster setups to “capacity chase” across regions. This helps platform teams maintain performance while balancing cost and resource availability.

This guide outlines how to deploy a production-ready, self-managed MySQL 8.0 instance on Google Cloud using OpenTofu/Terraform. It emphasizes enterprise-grade practices like secret management with Google Secret Manager, Shielded VM security, automated backups to Cloud Storage, and modular IaC design. Ideal for teams needing fine-grained control over database infrastructure without sacrificing security or operational standards.

Simplifying platform engineering at John Lewis - part two | Google Cloud Blog
John Lewis built a custom Kubernetes controller on Google Cloud to abstract complex Kubernetes configurations for developers. Their Microservice CRD reduces YAML complexity, enforces best practices, and automates features like Prometheus configs and service mesh enrollment.

Apptainer, the open-source container platform for HPC environments, has released version 1.4.1 with improved OCI (Open Container Initiative) build support and better integration with BuildKit. It continues to focus on secure, portable containers with an immutable single-file format, supporting GPUs and parallel filesystems.

📦 Kubernetes & Cloud Native

A new Inference Extension for the Kubernetes Gateway API introduces model-aware traffic routing for LLM and GenAI workloads. It enables smarter request distribution using live model metrics like queue length and GPU load, reducing latency and improving GPU efficiency. Early benchmarks show lower tail latencies compared to standard Kubernetes Services, especially at high QPS levels.

What Would a Kubernetes 2.0 Look Like
Kubernetes should fix long-standing pain points in a future 2.0 version: ditch YAML for HCL to avoid type errors, replace etcd with pluggable backends like SQLite/Raft for smaller clusters, and introduce a native package manager to replace Helm’s fragile templating. Other ideas include IPv6 by default and simpler networking for more scalable and developer-friendly clusters.

How Argo CD Handles 500+ vClusters and Where It Breaks
A new deep dive shows the scaling limits of Argo CD on a control plane managing 1,000 virtual clusters (vClusters) with GitOps. Performance remained stable up to ~500 clusters and ~500 apps, but beyond that, Argo CD controllers hit memory limits and the UI became sluggish. The test highlights practical scaling ceilings and tuning tips for multi-tenant GitOps setups on Kubernetes.

KubeDiagrams, the open-source tool for generating Kubernetes architecture diagrams, released v0.4.0 with a new --namespace option and improved support for custom resources. It now handles over 47 native Kubernetes types and integrates with Helm, Helmfile, and actual cluster states. This update makes it easier for platform teams to auto-document infrastructure directly from manifests or live clusters.

🔍 Observability & SRE

OllyGarden has introduced the Instrumentation Score, a new open-source standard to measure the quality of OpenTelemetry data. It analyzes telemetry streams against best practices and semantic conventions, giving teams a clear numerical score to assess instrumentation health.

A major outage on June 12, 2025, took down Google’s Identity and Access Management (IAM) system, affecting authentication across Firebase and other core services. This follows a similar 2023 incident and highlights the risks of central authentication failures in serverless architectures. For cloud teams, it’s a fresh reminder of the need for multi-region failover and alternative authentication strategies.

Gigapipe has introduced a fixed-cost observability platform that combines logs, metrics, traces, and profiling into a single backend. It offers compatibility with OpenTelemetry, Loki, Prometheus, Tempo, and Pyroscope without requiring custom agents. This could simplify observability stacks for cloud teams while avoiding variable usage-based costs.

Dynatrace now supports querying OpenTelemetry data using natural language via its MCP server and GitHub Copilot integration. Engineers can ask conversational questions in VS Code to retrieve logs, traces, and metrics directly from Dynatrace. This can simplify querying for teams still learning DQL and improve OTel workflows without needing deep query syntax knowledge.

InfraSight is a new open-source observability stack using eBPF for real-time syscall tracing on Linux and Kubernetes. It captures events like process execution, file access, and network connections, streaming data to ClickHouse for fast querying. With gRPC pipelines, Kubernetes CRDs, and Helm charts, it aims to simplify low-level infrastructure observability without application changes.
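The vet item above is about taming the curl | bash pattern; even without the tool, the same discipline can be applied by hand. A minimal sketch; the URL and file paths are placeholders.

```bash
# Fetch the installer to disk instead of piping it straight into a shell
curl -fsSL https://example.com/install.sh -o /tmp/install.sh

# Review and lint it before running anything (shellcheck is optional but useful)
less /tmp/install.sh
shellcheck /tmp/install.sh

# Execute only once you're satisfied with what the script does
bash /tmp/install.sh
```

vet automates these steps and additionally diffs the script against the version you approved on a previous run before asking for confirmation.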


Kubernetes Faces Gaps in Handling Device Failures for AI/ML Pods

Shreyans from Packt
07 Jul 2025
7 min read
CloudPro #98

One of the few GenAI tools that actually feels built for engineers
Most GenAI tools just dress up autocomplete. Shield’s AmplifAI is different. It uses agentic AI, systems that reason and act across steps, to take real work off your plate. Think: auto-surfacing hidden compliance risks, navigating tangled comms threads, explaining every decision clearly. No magic, just well-architected automation with human-in-the-loop guardrails. If you're curious what useful AI looks like in practice, start here.
Learn More

> Attack graphs are redefining IAM risk modeling from the ground up
> Airbnb’s load testing framework bakes chaos into CI/CD
> Kubernetes is still awkward with GPU failures, and no one’s fixed it yet

Plus: SRE agents with $21M backing, mirrord’s new team debugging trick, and visual Kubernetes troubleshooting that finally makes sense.

Cheers,
Shreyans Singh
Editor-in-Chief

Network security that just works: no apps, no friction
Security shouldn’t depend on whether your users remember to install something. That’s why I found Whalebone so interesting: it protects millions of devices from phishing, malware, and scams at the DNS level, no downloads required. It’s cleanly integrated, telco-ready, and surprisingly quick to deploy (2 months). Telcos like O2 and A1 are already using it to boost ARPU while quietly shielding users in the background. For teams building secure, seamless infra:
Learn More

🔐 Cloud Security

Why Default Pod Communication in Kubernetes is a Security Risk
By default, all pods in a Kubernetes cluster can talk to each other, which simplifies app deployment but opens up security risks. Network policies are the main way to restrict this traffic, using labels and namespaces to control ingress and egress. Support for policies depends on your CNI plugin: tools like Calico enable advanced rules, while others like Flannel do not. (A default-deny sketch appears at the end of this issue’s summary.)

Why IAM demands an Attack Graph first approach
Most IAM programs start with static access lists, but attackers exploit paths, not lists. An Attack Graph shows how identities and permissions can be chained for lateral movement and takeover. By modeling these paths first, security teams can prioritize real, exploitable risks and fix what matters. This shift helps align identity security with how attacks actually happen, not just how access is managed.

12-Month Cloud Security Challenge Just Dropped: Practice, Compete, and Get Certified
Wiz has launched Cloud Champions, a monthly CTF challenge series focused on real-world cloud security scenarios. Each challenge is crafted by Wiz researchers and designed to help practitioners sharpen their skills through hands-on problem-solving. The first challenge, “Perimeter Leak,” went live in June, with more slated through May 2026. A leaderboard tracks participant progress and highlights top performers.

Building AI agents that hunt like cloud adversaries
Security researchers are building AI agents that think and act like advanced cloud attackers: chaining permissions, pivoting across services, and executing real-world privilege escalation paths in AWS. These agents outperform traditional tools by reasoning contextually and automating multi-step attack logic.

Simplify Kubernetes Security With Kyverno and OPA Gatekeeper
Kyverno and OPA Gatekeeper help secure Kubernetes by blocking risky configurations before they’re deployed. Kyverno is easier to use, with YAML policies and native Kubernetes integration, while OPA Gatekeeper offers deeper flexibility using Rego for complex rules. Both tools can enforce critical security practices, like banning :latest image tags, to improve cluster safety and compliance.

⚙️ Infrastructure & DevOps

Uber Cuts CI Costs by 53% Using Smarter Build Prioritization
Uber enhanced its SubmitQueue CI system to reduce CPU usage by 53% and cut wait times by 37% across its massive monorepos. The update uses a new probabilistic model to prioritize builds that are more likely to succeed or unblock smaller changes, letting faster commits bypass larger ones.

Figma spends $300,000 on AWS daily
Figma disclosed in its IPO filing that it now spends nearly $300,000 daily on AWS, committing to $545 million over five years. The design platform is fully dependent on AWS infrastructure and policies, highlighting vendor lock-in risks.

Top 10 DevOps Tools in 2025: based on 300 LinkedIn job posts
GitHub Actions, Terraform, Kubernetes, and ArgoCD top the list, praised for integration and power, but not without their quirks. The takeaway: there's no perfect stack, just the right mix for your team’s context and scale.

mirrord Adds Queue Splitting to Enable Shared Debugging in the Cloud
mirrord for Teams now supports queue splitting, letting developers work on the same service in a shared cloud environment without stepping on each other’s toes. With support for AWS SQS (Kafka and RabbitMQ coming soon), devs can apply filters so only their local app receives relevant messages. This enables real-time debugging with zero disruption to live services or teammates.

📦 Kubernetes & Cloud Native

Kubernetes Faces Gaps in Handling Device Failures for AI/ML Pods
As AI/ML workloads relying on GPUs become more common, Kubernetes struggles with device failure modes like partial GPU outages, degraded performance, and scheduling fragility. DIY fixes exist but lack standardization, and core systems don’t correlate device health with pod behavior.

Simplifying platform engineering at John Lewis - part one | Google Cloud Blog
John Lewis replaced its monolithic commerce system with a multi-tenant, microservice-based architecture on Google Kubernetes Engine. A central “paved road” platform now automates provisioning, observability, and security, letting product teams deploy independently while maintaining guardrails. This approach boosts developer velocity, minimizes cognitive load, and balances consistency with flexibility as new services emerge.

A visual guide on troubleshooting Kubernetes deployments

Azure Boosts PostgreSQL Performance on AKS With Local NVMe & CloudNativePG
Microsoft now supports high-performance PostgreSQL on Azure Kubernetes Service using local NVMe via Azure Container Storage and the CloudNativePG operator. Benchmarks show up to 26,000 TPS with sub-5ms latency. For price-sensitive workloads, Premium SSD v2 offers flexible scaling and solid performance.

🔍 Observability & SRE

Airbnb Scales Load Testing with Impulse Framework
Airbnb developed Impulse, a decentralized load-testing framework integrated with CI/CD, to help teams test service reliability at scale. It includes a context-aware load generator, dependency mocker, traffic replay collector, and synthetic API generator for async flows.

How we're building an agentic system to drive Grafana | Grafana Labs
Grafana is moving beyond simple AI chat responses by building agentic systems that can reason and take action, like creating dashboards or debugging metrics, based on real-time context. Powered by the open-source MCP Server, these agents interact with Grafana APIs to perform complex, multi-step workflows.

Ciroos Launches AI SRE Teammate with $21M in Funding
Ciroos has raised $21 million to launch its AI-powered “SRE Teammate,” a multi-agent system that autonomously detects, diagnoses, and resolves incidents across cloud, Kubernetes, and networking environments. Unlike traditional observability tools, it acts like an expert partner, correlating signals and automating root-cause analysis without runbooks.

Benchmarking OpenTelemetry Overhead in Go Applications
A recent benchmark measured the performance impact of enabling OpenTelemetry tracing in a Go app under 10,000 req/s. CPU usage rose ~35% and memory jumped from 10 MB to 15–18 MB, mostly due to span processing. p99 latency increased by ~5 ms, and outbound telemetry added 4 MB/s of network traffic.
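To make the default-pod-communication item above concrete, here is the kind of minimal default-deny policy such guides usually start from; the namespace name is a placeholder, and the policy only takes effect if your CNI plugin enforces NetworkPolicy.

```bash
# Deny all ingress and egress for every pod in the namespace; traffic is then
# re-allowed selectively with additional, more specific NetworkPolicies.
kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: demo
spec:
  podSelector: {}        # an empty selector matches every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
EOF
```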

Migrating Uber’s Compute Platform to Kubernetes

Shreyans from Packt
30 Jun 2025
8 min read
CloudPro #97
All Books $9.99 | 8 Hours Remaining. SHOP NOW

1. AWS’s own security tool introduced a privilege escalation risk
2. Terraliths slowing you down? Here's how to break them up safely
3. Uber’s 3M-core migration to Kubernetes: what it really took

Plus: BitM attacks that bypass MFA, schema migration via CI/CD, and a no-fluff guide to how Kubernetes CRDs actually work.

Cheers,
Shreyans Singh
Editor-in-Chief

🔐 Cloud Security

AWS Launches Threat Technique Catalog to Share Real-World Attack Data
AWS has released the Threat Technique Catalog, a resource mapping real-world attack techniques seen in customer incidents to the MITRE ATT&CK framework. Built from AWS CIRT investigations, it includes detection and mitigation advice for tactics like token abuse and misconfigured encryption. This gives cloud defenders a practical way to strengthen their AWS environments using adversary-informed data.

AWS Launches Preview of Upgraded Security Hub
AWS has released a preview of its revamped Security Hub, now offering integrated dashboards, exposure mapping, and attack path visualizations to better prioritize and respond to security threats. It correlates findings across GuardDuty, Inspector, Macie, and CSPM to highlight critical gaps and risks.

AWS Built a Security Tool. It Introduced a Security Risk.
AWS’s “Account Assessment for AWS Organizations” tool unintentionally introduced a cross-account privilege escalation risk due to insecure deployment instructions. It advised users to avoid the management account without clarifying that deploying the hub role in a less secure account could expose high-sensitivity environments. AWS has since updated its documentation to recommend using a secure account.

Forgotten DNS Records Enable Cybercrime
A threat actor dubbed Hazy Hawk is hijacking abandoned cloud resources, like AWS S3 buckets and Azure endpoints, through dangling DNS records. By taking over subdomains of major organizations, including the CDC, Deloitte, and universities, they reroute users to scams and malware via complex traffic distribution systems. The attacks exploit subtle DNS misconfigurations and show how unmanaged cloud resources can silently expose enterprise users to persistent threats.

Browser-in-the-Middle Attacks Bypass MFA to Steal Sessions in Real Time
Mandiant warns of a growing threat called Browser-in-the-Middle (BitM), where attackers proxy real login pages through their own browsers to steal fully authenticated sessions, even after MFA. BitM tools like Mandiant's internal “Delusion” make this scalable and fast, bypassing traditional phishing protections. Only hardware-backed MFA like FIDO2 or client certificates can reliably block these attacks.

Workshop: Unpack OWASP Top 10 LLMs with Snyk
Join Snyk and OWASP Leader Vandana Verma Sehgal on Tuesday, July 15 at 11:00AM ET for a live session covering:
- The top LLM vulnerabilities
- Proven best practices for securing AI-generated code
- How Snyk’s AI-powered tools automate and scale secure development
See live demos, plus earn 1 CPE credit. Register today.

⚙️ Infrastructure & DevOps

AWS CloudTrail Adds Detailed Logging for S3 Bulk Deletes
AWS CloudTrail now logs individual object deletions made via the S3 DeleteObjects API, not just the bulk operation. This gives teams clearer visibility into which files were removed, improving audit trails and helping meet compliance and security needs. Granular logs also allow finer control via event selectors.

AWS Backup adds multi-party approval for logically air-gapped vaults
AWS Backup now supports multi-party approval for logically air-gapped vaults, allowing secure recovery even if your AWS account is compromised. Admins can assign trusted approval teams to authorize vault access from outside accounts. This provides an independent, auditable recovery path, strengthening ransomware resilience and governance for critical backups.

Inside AWS’s Strategy for Building Bug-Free, High-Performance Systems
AWS shared how it integrates formal and semi-formal methods, like TLA+, model checking, fuzzing, and deterministic simulation, into everyday development to eliminate bugs, boost developer speed, and enable aggressive optimizations. Tools like the P language and PObserve are used across S3, DynamoDB, EC2, and Aurora to model distributed systems, validate runtime behavior, and prove the correctness of critical code paths.

How to Break Up a Terraform Terralith Without Breaking Everything
Large monolithic Terraform setups (“Terraliths”) can slow down deploys and increase risk. This guide lays out a clean migration path, starting with dependency mapping and backups, then moving to new root modules using import and removed blocks (in TF 1.7+), or scripted state mv operations. It also covers real-world lessons on inter-module communication, safe rollouts, automation, and state isolation, helping teams modernize IaC safely and modularly. (A state mv sketch appears at the end of this issue’s summary.)

Why It’s Time to Automate Your Database Schema Migrations
Many teams automate their app deployments but still manage database changes manually, leaving room for human error, schema drift, and security risks. This guide explains how tools like Atlas bring schema migrations into your CI/CD pipelines using declarative definitions, automatic diffs, and linting. The result: safer deployments, fewer production credentials, and consistent environments.

📦 Kubernetes & Cloud Native

Amazon EKS Pod Identity adds cross-account access support
Amazon EKS Pod Identity now supports cross-account resource access without code changes. You can assign a second IAM role from another AWS account when creating a pod identity, enabling secure access to resources like S3 or DynamoDB via IAM role chaining. This simplifies multi-account architectures in EKS and reduces the complexity of credential management.

Amazon GuardDuty expands Extended Threat Detection coverage to Amazon EKS clusters
Amazon GuardDuty now detects advanced attack sequences in EKS clusters by correlating signals across audit logs, runtime activity, and API usage. This helps uncover threats like privilege escalation and secret exfiltration that might be missed by isolated alerts. It gives security teams a complete view of Kubernetes compromises and reduces time to investigate and respond.

How CRDs Extend and Hook into the Kubernetes API
This deep dive explains how Kubernetes Custom Resource Definitions (CRDs) work behind the scenes. It walks through how CRDs register with the Kubernetes API, how schemas validate custom objects, and how controllers fetch and handle them via client-go. You’ll learn how CRDs are serialized, discovered, and routed through the aggregation layer, giving you a detailed mental model for building robust Kubernetes extensions.

Migrating Uber’s Compute Platform to Kubernetes
Uber migrated all stateless services, powering 3M+ cores and 100K daily deployments, from Mesos to Kubernetes to standardize infrastructure and tap into the cloud-native ecosystem. They tackled extreme scale (7,500-node clusters), rebuilt integrations, and automated the shift using their internal “Up” platform. Custom solutions like artifact preservation, gradual scaling, and rollout heuristics ensured reliability, while Kubernetes UI and scheduler tweaks enabled smooth operations.

Stop Building Platforms Nobody Uses: Pick the Right Kubernetes Abstraction with GitOps
This post calls out a common pitfall: over-engineering internal platforms that developers don’t adopt. It argues that real developer pain (context switching, CI/CD complexity, insecure YAML sprawl) must shape the abstraction layer. Tools like Kro and Score can simplify Kubernetes via GitOps, but only when they reduce complexity without hiding critical decisions. The message: build abstractions that solve real problems, not just tick architectural boxes.

🔍 Observability & SRE

Amazon VPC Route Server announces logging enhancements
AWS has added new monitoring features to VPC Route Server, including real-time logs for BGP and BFD sessions, historical data tracking, and flexible delivery via CloudWatch, S3, and Firehose. This helps engineers troubleshoot connectivity issues faster without needing AWS Support.

Amazon Athena adds managed query results with built-in storage and cleanup
Amazon Athena now supports managed query results, eliminating the need to preconfigure S3 buckets or manually clean up old results. This simplifies analysis workflows, especially for teams using automated workgroup creation.

Grepr - Dynamic Observability
Grepr launched an ML-powered observability pipeline that filters, aggregates, and routes telemetry data before it hits your tools, reducing log volumes and storage costs significantly. It can scale automatically, backfill data during incidents, and run alongside existing setups with minimal config. Ideal for teams seeking cost control without losing visibility.

Chip auto-detects root causes without manual alerting or dashboards
Chip is a zero-config monitoring agent that auto-instruments apps and alerts only on real customer-impacting issues. It tracks everything from code commits to Kubernetes events to find root causes fast, using real-time outlier and cohort detection. Built for fast-moving teams who want signal without the noise.

Parseable offers fast, open-source observability on S3 with low resource use
Parseable is a lightweight, S3-first observability platform designed for speed and cost-efficiency. It delivers 90% faster queries than Elastic, uses up to 70% less CPU/memory, and integrates easily with AI and observability tools. Fully open source with no vendor lock-in.

Disclaimer: Some eBooks and videos are excluded from the $9.99 offer. For selected countries, tiered discount pricing may vary.
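The Terralith guide above mentions scripted terraform state mv operations as one migration path; below is a minimal sketch of what that looks like with local state files. Directory names and resource addresses are placeholders, and remote backends need a terraform state pull / push round-trip instead of the -state-out flag.

```bash
# Always snapshot state before any surgery
cd legacy-root
terraform state pull > ../terralith-backup.tfstate

# Move one resource's state into the new root module's local state file
# (the address stays the same here; adjust it if the module structure changes)
terraform state mv \
  -state-out=../network-root/terraform.tfstate \
  aws_vpc.main aws_vpc.main

# Verify the split: the new root should plan clean once config and state agree
cd ../network-root
terraform plan
```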


Which call paths dominate at runtime: using Flame Graphs to visualize it!

Shreyans from Packt
23 Jun 2025
11 min read
CloudPro #97
This week’s CloudPro is a guest special from Kaiwan N Billimoria, the author of Linux Kernel Programming. Kaiwan runs world-class, high-value technical Linux OS training programs (corporate and individual, online) at https://kaiwantech.com.

In today’s issue, Kaiwan walks us through Flame Graphs: a powerful tool to visualize which call paths dominate at runtime and uncover performance bottlenecks. If you want to go deeper, his book Linux Kernel Programming is available for just $9.99 as part of Packt’s Summer Sale.

Cheers,
Shreyans Singh
Editor-in-Chief

GET eBOOK at $9.99

P.S. If you’re into platform engineering, check out Platform Weekly: the world’s largest newsletter for platform engineers with 100,000+ readers. Subscribe here.
P.P.S. DeployCon is happening June 25. An engineer-first GenAI summit featuring teams from Meta, Tinder, DoorDash, and more. Join in person at the AWS Loft SF or online. Register now.

Which Call Paths Dominate at Runtime: Using Flame Graphs to Visualize It!
By Kaiwan N Billimoria

Analyzing workloads is something all engineers end up doing at some point or another (or it’s their job description!). An obvious reason is performance analysis; for example, CPU usage may spike at times, causing issues or even outages. The need of the hour: observe, analyze, and figure out the root cause of the performance issue! Of course, that’s often easier said than done; this kind of work can bog down even experienced professionals...

Borrowing from Brendan Gregg’s wonderful presentation (though old, it’s still relevant), answering the ‘Who’ and the ‘How’ is the simple(r) part:
‘Who?’: well-known tools like top (and its numerous variants: htop, atop, etc.) help answer this question.
‘How?’: lots of system monitoring tools are available (vmstat, dstat, sar, nagios, cacti, nmon, iostat, nethogs, sysmon, etc.).
The harder questions tend to be the ‘Why?’ and the ‘What?’:
‘Why?’: by generating a Flame Graph! (the topic of this short article)
‘What?’: Flame Graphs as well as plain old perf!

Right. So what the heck’s this Flame Graph thingy? Let’s explore! We’ll abbreviate Flame Graphs as FG.

There are several types of FGs (CPU, GPU, memory, off-CPU, etc.); here we keep the focus on just one: CPU FGs via Linux’s powerful perf CPU profiler. The moment a tool can generate profiling data that includes stack traces, it implies that FGs can be generated! Thus, there are several tools besides perf that generate FGs:
Windows: WPA, PerfView, Xperf.exe
Linux: perf, eBPF, SystemTap, ktap
FreeBSD: DTrace
Mac OS X: Instruments
We’ll focus only on using Linux perf; it’s considered one of the best modern CPU profiling tools on the platform.

Motivation for FGs
With perf, you can indeed profile your workload and see where exactly CPU usage shoots up. It’s easy: record something, get the report, and analyze it (well… it sounds easy at least).

Example: record a system-wide profiling session (-a option switch) with stack chain / backtrace (--call-graph dwarf; the old option was -g), at a frequency of 99 Hz, for 10 seconds:
sudo perf record -F 99 -a --call-graph dwarf -- sleep 10
(Instead of the -a option switch, you can use the -p PID option to profile a particular process. The generated perf.data file is owned by root; do a chown to place it under your account if you wish.)

Get the perf report:
sudo perf report --stdio # or --tui
(Try it!)

This begs the question: so why not just use perf? Ah, that’s the thing: on non-trivial workloads, the report can be simply humongous, even going into dozens of (printed) pages! Are you really going to read through all of it, trying to spot the outliers?

Visualization with the CPU Flame Graph
That’s why we use the so-called Flame Graph (FG): to visualize dense textual data and make sense of it; it’s so much clearer (so much more humane, literally).

Installation
First off, ensure both the perf utility and the FlameGraph scripts are installed.
Quick note: to install perf on Ubuntu/Debian, you typically need to be on a distro kernel (not a custom one). Why? Because, unusually for an app, it’s tightly coupled to the kernel it runs on! Assuming you’re on an Ubuntu/Debian distro, do this: sudo apt install linux-perf-$(uname -r) linux-tools-generic (even the linux-tools-generic package might be sufficient).
If you’re on a custom-built kernel, build perf (it’s easy): cd <kernel-src-tree>/tools/perf ; make
Install the FlameGraph scripts from the repo, or do (in an empty folder):
git clone --depth 1 https://github.com/brendangregg/FlameGraph.git

Steps to generate a Flame Graph
1. Profile the workload using perf:
perf record -F 99 --call-graph dwarf [-a]|[-p PID]
-a: all CPUs; if specified, the sample is system-wide.
-p: sample a particular process.
This generates the perf.data binary file.
2. Read from perf.data (the default; otherwise use -i <fname>) to convert the binary data into human-readable stack traces via perf script:
perf script > perfscript_out.dat
3. Generate the FG, a Scalable Vector Graphics (SVG) file. The FlameGraph repo includes several stackcollapse-* scripts; we use stackcollapse-perf.pl:
cat perfscript_out.dat | FlameGraph/stackcollapse-perf.pl | FlameGraph/flamegraph.pl > out.svg
4. Open the SVG in a web browser and move the mouse over stack frames.

A Quick Test Run
We’ll assume you’ve installed both perf and the FlameGraph GitHub repo (the latter under your home dir). Profile: record everything for 10 seconds:
sudo perf record -F 99 -a --call-graph dwarf -- sleep 10
sudo chown ${LOGNAME}:${LOGNAME} perf.data
perf script > perfscript_out.dat
cat perfscript_out.dat | ~/FlameGraph/stackcollapse-perf.pl | ~/FlameGraph/flamegraph.pl > out.svg
Open the SVG file in a web browser. Here’s a screenshot of the Flame Graph. Hmm, better if we zoom in… so I click on one of the rectangles on the lower-left (say the gnome-shell one): ah, better.

Interpreting the Flame Graph
Some really key points regarding how to interpret the Flame Graph:
- Each rectangle represents a single stack frame; read it bottom-up.
- The width represents the frequency of the function call.
- The height represents the depth of the stack.
- The order of rectangles from left to right is just alphabetical; it's not a timeline.
- The colors don’t signify anything special.
- You can (typically) use the browser search (Ctrl-F) to find a function by name.
- Click on a stack frame (a rectangle) to zoom into that tower. Click Reset Zoom (upper-left corner) to zoom back out.
In effect: the hottest code paths, the ones that dominate, are the widest rectangles! The top edge, the rectangle at the very top, is the function on-CPU; beneath it is its ancestry (how it was invoked).

Here’s another FG I captured while SSH was running (truncated screenshot showing the interesting portion). Interesting; the “towers” seem to be inverted! Yes, they’ve become top-down (downward-growing stacks) instead of bottom-up… they’re called icicles! An option to the perf script command sets this up (see the note at the end of this piece).

A fantastic thing about the FG is that both userspace and kernel-space functions are captured! It’s thus called a mixed-mode FG. For example, with the ‘ssh’ FG, you can clearly see the call path leading down to the kernel network protocol stack code: functions from the socket/INET layer sock_*(), followed by L4 tcp_*(), followed by the L3 ip_*() functions; even the invocation of the (network) device transmit, dev_hard_start_xmit() and others, is visible!

My flame_grapher.sh wrapper script
Next, to make this a bit easier to use (no need to remember the syntax, easier options), I wrote a wrapper over the original FlameGraph scripts; the top-level one’s named flame_grapher.sh: https://github.com/kaiwan/L5_user_debug/tree/main/flamegraph (it forms a portion of my ‘Linux Userspace Debugging – Tools & Techniques’ training repo). Its help screen reveals how you can, very easily, use it to generate FGs:

$ ./flame_grapher.sh
Usage:
flame_grapher.sh -o svg-out-filename(without .svg) [options ...]
 -o svg-out-filename(without .svg): name of SVG file to generate (saved under /tmp/flamegraphs/)
Optional switches:
 [-p PID]: PID = generate a FlameGraph for ONLY this process or thread
           If not passed, the *entire system* is sampled...
 [-s <style>]: normal = draw the stack frames growing upward [default]
               icicle = draw the stack frames growing downward
 [-t <type>]: graph = produce a flame graph (X axis is NOT time, merges stacks) [default]
              Good for performance outliers (who's eating CPU? using max stack?); works well for multi-threaded apps
              chart = produce a flame chart (sort by time, do not merge stacks)
              Good for seeing all calls; works well for single-threaded apps
 [-f <freq>]: frequency (Hz) to have perf sample the system/process at [default=99]
              Too high a value here can cause issues
 -h|-?: show this help screen.
Note: After pressing ^C to stop, please be patient... it can take a while to process.
The FlameGraph SVG (and perf.data file) are stored in the volatile /tmp/flamegraphs dir; copy them to a non-volatile location to save them.

Notice a few points:
- The only mandatory option switch is -o fname; it generates an SVG file named fname.svg.
- There are two ‘types’ of FGs we can generate:
  graph [default]: produce an FG (the X axis is NOT time; stacks are merged). This type is good for performance outliers (who's eating CPU? using max stack?) and works well for multi-threaded apps.
  chart: produce a flame chart (sorted by time; stacks are not merged). Good for seeing all calls; works well for single-threaded apps.
- You can optionally specify a particular process to profile (via -p PID), change the style to icicle, and set the profiling frequency.
- The metadata and the SVG are stored under /tmp; copy them to a non-volatile location if you want them saved!
(Do read README.md as well. Hey, this wrapper’s lightly tested; please help me (and everyone!) out by raising Issues as and when you come across them!)

Tip: try the speedscope.app site to interact with your Flame Graph!

Flame Graphs: Caveats/Issues
- Frame pointers being present helps get good stack traces, BUT -fomit-frame-pointer is the GCC flag typically passed! A possible exception is the Linux kernel itself; it has intelligent algorithms to emit accurate stack traces even in the absence of frame pointers.
- Symbols are required (a separate symbol file can be used). A side effect of missing symbols may be ill-formed (or close to zero) stack traces.
- VMs may not support the PMCs (performance measurement counters) that perf requires; in that case, FGs (or perf) don’t really work well.

Bonus material
B Gregg’s Linux Performance Observability Tools diagram across the stack!

Tips
- With [e]BPF becoming a powerhouse for many things, including observability, do look up equivalent eBPF tooling as well: https://www.brendangregg.com/ebpf.html (a similar diagram’s there!).
- Also be sure to check out B Gregg’s (and others’) utility package wrappers: perf-tools and bpfcc-tools.
- Don’t ignore systemd’s systemd-analyze tool (boot-time analysis).
- perf: simply running sudo perf top is itself useful to find outliers; I keep a couple of aliases as well:
alias ptop='sudo perf top --sort pid,comm,dso,symbol 2>/dev/null'
alias ptopv='sudo perf top -r 80 -f 99 --sort pid,comm,dso,symbol \
  --demangle-kernel -v --call-graph dwarf,fractal 2>/dev/null'

GET Linux Kernel Programming at $9.99
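A note on the icicle orientation mentioned above: if you drive the upstream scripts directly rather than the wrapper, the flamegraph.pl renderer itself accepts an --inverted flag for icicle (downward-growing) output. This is an assumption worth checking against your checkout of the FlameGraph repo.

```bash
# Same pipeline as the quick test run, rendered as an icicle graph
# (assumes --inverted is supported by your copy of flamegraph.pl)
cat perfscript_out.dat | ~/FlameGraph/stackcollapse-perf.pl \
  | ~/FlameGraph/flamegraph.pl --inverted > out-icicle.svg
```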


How to Make Sure Your Kubernetes Sidecar Starts Before the Main App

Shreyans from Packt
16 Jun 2025
9 min read
Why Automatic Rollbacks Are Risky and Outdated in Modern DevOpsCloudPro #96Platform Weekly - the world’s largest platform engineering newsletterWith over 100,000 weekly readers Platform Weekly dives into platform engineering best practices, platform engineering news, and highlights, lessons and initiatives from the platform engineering community.Subscribe Now📌 A hidden prompt injection flaw in GitLab Duo that quietly leaked source code📌 Just-in-time AWS access using Entra PIM (yes, that’s possible now)📌 Cloud SQL charging 2TB storage for 6GB of data, because of WAL logs📌 Why automatic rollbacks in DevOps might be doing more harm than goodYou’ll also find sharp reads on scaling Terraform teams, new volume tools for AI/ML in GKE, and a brutally honest take on Kubernetes complexity. On the observability side, AWS added visual dashboards to Network Firewall, and OpenTelemetry clarified how to treat logs vs. events.Hope you find something that helps you ship safer, smarter, or faster.Cheers,Shreyans SinghEditor-in-ChiefPS: If you’re not already reading Platform Weekly, I’d recommend it.It’s one of the few newsletters I make time for every week: focused on platform engineering, cloud native, and the kind of problems teams actually face. 100,000+ people read it, but it still feels like it’s written by someone who gets it.Here’s the link if you want to check it outSubscribe Now🔐 Cloud SecurityJust-in-time AWS Access to AWS with Entra PIMJust‑in‑time privileged access can be implemented by integrating Microsoft Entra PIM with AWS IAM Identity Center using SCIM/SAML, enabling temporary group-based access tied to approval workflows and time limits. By mapping Entra security groups to AWS permission sets (e.g. EC2AdminAccess) and enabling eligibility/activation in PIM, users gain access only when approved, and only for a set duration.On‑Demand Rotation Now Available for KMS Imported KeysAWS KMS now lets you rotate imported symmetric key material on‑demand without needing to create a new key or change its ARN, simplifying compliance and security by avoiding workload disruptions. New API operations, including RotateKeyOnDemand and KeyMaterialId tracking, let you import, rotate, audit, expire, or delete individual key versions while retaining decryption access to older ciphertext.CloudRec: multi-cloud security posture management (CSPM) platformCloudRec is an open‑source, scalable CSPM platform that continuously discovers 30+ cloud services across AWS, GCP, Alibaba, and more, offering real‑time risk detection and remediation.It uses OPA‑based declarative policy management, enabling dynamic, flexible rule definitions without code changes or redeployment.How to use the new AWS Secrets Manager Cost Allocation Tags featureAWS Secrets Manager now supports cost allocation tags, letting you tag each secret (e.g., with CostCenter) and track its costs in Cost Explorer or cost-and-usage reports.Enable tags in Billing → Cost Allocation Tags, then filter or group secrets costs by tag to see spend per department or project.GitLab Duo Prompt Injection Leads to Code and Data ExposureA hidden prompt injection flaw in GitLab Duo allowed attackers to embed secret instructions, camouflaged in comments, code, or MR descriptions, triggering the AI assistant to reveal private source code. The attacker leveraged streaming markdown rendering and HTML injection (like <img> tags) to exfiltrate stolen code via base64-encoded payloads. 
GitLab patched the vulnerability in February 2025, blocking unsafe HTML elements and tightening input handling.⚙️ Infrastructure & DevOpsAmazon API Gateway introduces routing rules for REST APIsAmazon API Gateway now supports routing rules for REST APIs on custom domains, allowing dynamic routing based on HTTP headers, URL paths, or both. This enables direct A/B testing, API versioning, and backend selection, removing the need for proxies or complex URL structures.Amazon EC2 now enables you to delete underlying EBS snapshots when deregistering AMIsEarlier, snapshots had to be removed separately, often leading to orphaned volumes and wasted spend. Now. AWS EC2 will let users automatically delete EBS snapshots when deregistering AMIs, cutting down on manual cleanup and storage costs. This update streamlines resource management with no extra cost and is available across all AWS regions.Why is your Google Cloud SQL bill so high?A developer discovered that their Cloud SQL instance showed 2 TB of usage for only 6 GB of actual data, due to retained Write-Ahead Logs (WAL) from Point-in-Time Recovery. These logs can silently bloat storage costs when frequent transactions occur. To control costs, users should reduce WAL retention or re-provision instances with right-sized storage.Why Automatic Rollbacks Are Risky and Outdated in Modern DevOpsAutomatic rollbacks seem helpful but often fail due to the same issues that break deployments, like expired credentials or partial database changes. Modern practices like Continuous Delivery and progressive deployment (canary, blue/green, feature flags) offer safer, faster recovery paths. Human oversight adds resilience and learning, making manual intervention more effective than rollback automation.How to structure Terraform deployments at scaleAt scale, Terraform deployments require a clear structure that balances control and team autonomy. Scalr’s two-level hierarchy: Account and Environment scopes, lets central DevOps manage policies and modules, while engineers deploy independently within isolated workspaces. This setup encourages reusable code and standardization through a shared module registry.📦 Kubernetes & Cloud NativeMaking Kubernetes Event Management Easier with Custom AggregationAs Kubernetes clusters grow, managing events becomes harder due to high volume, short retention, and poor correlation. This article shows how to build a custom event system that groups related events, stores them longer, and spots patterns: helping teams debug issues faster. It uses Go to watch, process, and store events, and includes options for alerts and pattern detection.GKE Volume Populator Simplifies AI/ML Data Transfers in KubernetesGoogle Cloud’s new GKE Volume Populator helps AI/ML teams automatically move data from Cloud Storage to fast local storage like Hyperdisk ML, no custom workflows needed. It uses Kubernetes-native PVCs and CSI drivers to manage transfers, delays pod scheduling until data is ready, and supports fine-grained access control.How to Make Sure Your Kubernetes Sidecar Starts Before the Main AppIf your app depends on a sidecar, Kubernetes doesn’t guarantee the sidecar is fully ready before the main container starts, even with the new native support. This article shows how to delay the app start using startupProbe or postStart hooks in the sidecar. 
Not every problem needs Kubernetes
Kubernetes promises scalability and flexibility, but for most teams, it adds unnecessary complexity. Many workloads can be handled more easily with VMs, managed cloud services, or simpler container platforms like AWS Fargate or Google Cloud Run. Unless you truly need hybrid cloud, global scale, or run hundreds of services, Kubernetes may just slow you down and drain resources.

What You Actually Need for Kubernetes in Production
Production Kubernetes setups need more than just working clusters. Use readiness, liveness, and startup probes correctly to avoid early traffic issues or restarts. Always define CPU and memory limits, isolate secrets using volumes, and enforce RBAC with least privilege. Use HPA for scaling, avoid local storage, and apply network policies to control traffic. Tools like kube-bench, Trivy, and FluentBit help monitor security, cost, and logs effectively.

🔍 Observability & SRE

AWS Network Firewall launches new monitoring dashboard
AWS Network Firewall now includes a monitoring dashboard that shows key traffic patterns like top flows, TLS SNI, HTTP host headers, long-lived TCP flows, and failed handshakes. This helps teams troubleshoot issues and spot security concerns faster. It's available in all supported regions at no extra cost, but requires Flow and Alert logs to be configured.

Official RCA for SentinelOne Global Service Interruption
SentinelOne's May 29 global service outage was caused by a software flaw in a deprecated infrastructure control system, which accidentally deleted critical network routes. This broke internal connectivity, taking down management consoles and related services. While customer endpoints stayed protected, teams lost visibility and control during the incident.

There's a Lot of Bad Telemetry Out There
Much of today's telemetry is noisy, irrelevant, or misleading, causing higher costs, slow troubleshooting, and poor decisions. Common problems include incomplete traces, outdated metrics, irrelevant logs, and data overload. Engineers often lack clear standards or guidance on good telemetry, especially for newer systems like LLMs. To fix this, teams should define what's useful, apply consistent conventions (e.g. OpenTelemetry), and work closely with devs to improve instrumentation at the source.

OpenTelemetry Clarifies Its Approach to Logs and Events
OpenTelemetry treats logs as structured records sent through its Logs API, with a special focus on events: logs with a defined schema and guaranteed structure. Events are preferred for new instrumentation, as they integrate with context and can correlate with traces and metrics. Unlike spans, events have no duration or hierarchy. OpenTelemetry recommends using logs mainly for bridging existing systems, while semantic instrumentation should rely on events for consistency and context sharing.

Storing all of your observability signals in one place matters!
Treating traces, logs, and metrics as separate "pillars" creates silos and hinders correlation. Many teams still split signals across tools or vendors, leading to fragmented insights and painful debugging.
A centralized “single pane of glass” setup helps correlate signals in one place, making it easier to understand system behavior.
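One low-effort way to act on that advice is to stamp the active trace ID onto every log line, so whichever backend holds your signals can join logs and traces. Here's a small sketch using the OpenTelemetry Python API; the logger and span names are made up, and without an SDK pipeline configured it simply prints a placeholder trace ID:

```python
import logging
from opentelemetry import trace

logging.basicConfig(format="%(levelname)s trace_id=%(otel_trace_id)s %(message)s")
logger = logging.getLogger("checkout")
tracer = trace.get_tracer("checkout")

class TraceIdFilter(logging.Filter):
    """Attach the current trace ID so logs and traces correlate in one backend."""
    def filter(self, record):
        ctx = trace.get_current_span().get_span_context()
        record.otel_trace_id = format(ctx.trace_id, "032x") if ctx.is_valid else "-"
        return True

logger.addFilter(TraceIdFilter())

with tracer.start_as_current_span("charge-card"):
    logger.warning("payment retried after upstream timeout")
```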
Shreyans from Packt
09 Jun 2025
9 min read

Uber built a multi-cloud secrets platform to prevent leaks and automate security at scale

How to Block Up to 95% of Attacks Using AWS WAFCloudPro #95A better way to handle vendor security reviews?If you've ever dealt with vendor onboarding or third-party cloud audits, you know how painful it can be: long email chains, stale spreadsheets, and questionnaires that don’t reflect what’s actually happening in the cloud.We recently came across CloudVRM, and it’s a refreshingly modern take on the problem.Instead of asking vendors to fill out forms or send evidence, CloudVRM connects directly to their AWS, Azure, or GCP environments. It pulls real-time telemetry every 24 hours, flags misconfigs, and maps everything to compliance frameworks like SOC 2, ISO 27001, and DORA.It’s already being used by banks and infra-heavy orgs to speed up vendor approvals by 85% and reduce audit overhead by 90%.Worth checking out if you're building or maintaining systems in regulated environments, or just tired of spreadsheet security.Watch the demoThis week’s CloudPro kicks off with something genuinely useful: a tool that replaces vendor security questionnaires with real-time cloud evidence.📌CloudVRM connects directly to AWS, Azure, or GCP and auto-checks compliance, no spreadsheets, no guesswork📌AWS CloudTrail silently skipping logs if IAM policies get too large (and attackers know it)📌PumaBot is now brute-forcing IoT cameras and stealing SSH credsWe’ve also got sharp engineering writeups: from how Uber rotates 20K secrets a month, to how Netflix handles 140 million hours of viewing data daily, to one team’s story of slicing a $10K Glue bill down to $400 with Airflow.Hope you find something in here that saves you time, money, or migraines.Cheers,Shreyans SinghEditor-in-Chief🔐 Cloud SecurityAWS CloudTrail logging can be bypassed using oversized IAM policiesResearchers at Permiso Security found that AWS CloudTrail fails to log IAM policies between 102,401 and 131,072 characters if they're inflated using whitespace. This gap allows attackers to hide malicious changes from audit logs. The issue stems from undocumented size limits and inconsistent handling of policy data. AWS has acknowledged the problem and plans a fix in Q3 2025.PumaBot targets Linux-based IoT surveillance devices via SSH brute forceA new botnet called PumaBot is targeting IoT surveillance systems by brute-forcing SSH access using IP lists from its command-and-control server. Written in Go, the malware disguises itself as system files, adds persistence through systemd, and installs custom PAM modules to steal credentials. Related binaries in the campaign also auto-update, spread across Linux systems, and exfiltrate login data.How to Block Up to 95% of Attacks Using AWS WAFThis guide explains how to configure AWS Web Application Firewall (WAF) to block threats like SQL injection, XSS, bots, and DDoS attacks with minimal effort. By leveraging pre-built managed rules and setting up a Web ACL, users can protect apps behind ALB, CloudFront, or API Gateway without custom code.CloudPEASS: Toolkit to find and exploit cloud permissions across AWS, Azure, and GCPCloudPEASS helps red teamers and defenders map out permissions in compromised cloud accounts without modifying resources. It supports AWS, Azure, and GCP, detecting privilege escalation paths using API access, brute-force permission testing, and AI-assisted analysis. 
It also checks Microsoft 365 services in Azure and enables Gmail/Drive token access in GCP.Uber built a multi-cloud secrets platform to prevent leaks and automate security at scaleTo manage over 150,000 secrets across services and vendors, Uber developed a centralized secrets management platform. It blocks leaks in code with Git hooks, scans systems in real time, and consolidates 25 vaults into 6. The platform enables auto-rotation, access tracking, and third-party secret exchange via SSX. It now rotates ~20,000 secrets monthly and is evolving toward secretless auth and workload identity federation.BOOK NOW AT 25% OFF⚙️ Infrastructure & DevOpsAWS Cost Explorer now offers a new Cost Comparison featureAWS launched a new Cost Comparison feature in Cost Explorer that highlights key changes in cloud spend between two months. It automatically identifies top cost drivers, like usage shifts, discounts, or refunds, without needing manual spreadsheets. A new “Top Trends” widget shows the biggest changes at a glance, and deeper insights are now available through the Compare view.Go-based Git Add Interactive tool adds advanced staging and patch filteringThis Go port of git add -i/-p enhances Git’s interactive staging with features like global regex filters, auto-hunk splitting, and multi-mode patch operations (stage, reset, checkout). It supports keyboard shortcuts, color-coded UI, and fine-grained hunk control across all files.GitLab-based monorepo streamlines Terraform module versioning and securityThis setup uses a GitLab CI pipeline to manage Terraform modules in a monorepo, with automated versioning, linting, and security scans via tools like TFLint, tfsec, and Checkov. Git tags handle module versions without extra auth tokens. The workflow enforces changelogs, labels, and approvals, and publishes docs and tags post-merge.A fully automated fix for Terraform’s backend bootstrapping problem on AzureThis guide solves the common issue where Terraform needs a backend to store state, but can’t create it without an existing backend. It automates the creation of an Azure Blob backend using Terraform itself, then seamlessly switches to that backend by generating partial config files and migrating state. The setup includes secure access via managed identity and GitHub OIDC, enabling CI/CD workflows without manual secrets or scripts.Using Terraform to automate disaster recovery infrastructure and failoversThis post explains DR strategies like Pilot Light and Active/Passive, and shows how Terraform enables flexible, cost-efficient deployments using conditionals and modular IaC. A working AWS example demonstrates DNS failover and dynamic EC2 provisioning using a toggle variable. This lets teams switch between production and DR environments with minimal effort, reducing downtime and idle resource costs.📦 Kubernetes & Cloud NativeGateway API v1.3.0 Adds Smart Mirroring and New Experimental ControlsGateway API v1.3.0 is now GA with percentage-based request mirroring, letting teams test blue-green deployments without full traffic duplication. The release also debuts experimental support for CORS filters, retry budgets, and listener merging via new X-prefixed APIs. These features help fine-tune request handling, scale listener configs across namespaces, and manage retry spikes, without upgrading Kubernetes itself.Introducing Gateway API Inference ExtensionThe new Gateway API Inference Extension introduces model-aware routing for GenAI and LLM services running on Kubernetes. 
It adds InferenceModel and InferencePool resources to better match requests with the right GPU-backed model server based on real-time load. Early benchmarks show reduced latency under heavy traffic compared to standard Services, helping ops teams optimize resource usage and avoid contention.Deep Dive into VPA 1.3.0: Smarter Resource Tuning for Kubernetes PodsThis post explores how the Vertical Pod Autoscaler (VPA) v1.3.0 uses historical and real-time metrics to recommend CPU and memory resource requests. It focuses on the Recommender component, which aggregates usage into decaying histograms to auto-tune workloads and reduce resource waste.Default Helm Charts Leave Kubernetes Clusters at RiskMicrosoft researchers warn that many open-source Helm charts deploy with insecure defaults, exposing services like Apache Pinot, Meshery, and Selenium Grid to the internet without proper authentication. These misconfigurations often include LoadBalancers or NodePorts with no access controls, making them easy targets for attackers. Teams should avoid "plug-and-play" setups and review YAML/Helm configs before deploying to production.Batch Scheduling in Kubernetes: YuniKorn vs Volcano vs KueueKubernetes lacks native support for batch workloads like ML training and ETL jobs, prompting the rise of tools like Apache YuniKorn, Volcano, and Kueue. YuniKorn replaces the default scheduler with strong multi-tenancy support; Volcano focuses on high-performance use cases with gang scheduling; and Kueue integrates natively to manage job queues without altering core scheduling.🔍 Observability & SREWhat's new in Grafana v12.0Grafana v12.0 introduces Git-based dashboard versioning, dynamic layouts, and experimental APIs for managing observability as code. Drilldowns for metrics, logs, and traces are now GA, enabling queryless deep dives across signals. SCIM support simplifies team provisioning, and a new “Recovering” alert state reduces flapping.Sentry Launches Logs in Open Beta to Boost Debugging ContextSentry now supports direct log ingestion in open beta, letting developers view application logs alongside errors and traces in a single interface. This integration adds vital context, like retry attempts or upstream responses, to help identify root causes faster without switching tools.How to use Prometheus to efficiently detect anomalies at scaleGrafana Labs has built and open-sourced an anomaly detection system using only PromQL: no external tools or services required. It computes dynamic bands using rolling averages, standard deviation, and seasonal patterns, with tunable sensitivity and smoothing to reduce false positives. The framework scales across tenants and works with any Prometheus-compatible backend, making it easy to plug into SLO-based alerts for better incident context.Beyond API uptime: Modern metrics that matterTraditional uptime checks fall short in today’s fast-paced environments where even minor API delays can cause major user churn. Catchpoint’s Internet Performance Monitoring (IPM) combines global synthetic tests, percentile-based metrics, and user-centric objectives to detect slowdowns before they escalate. With features like API-as-code, chaos engineering, and CI/CD integration, IPM helps teams catch latency issues early and simulate real-world failures.Microservices Monitoring: Metrics, Challenges, and Tools That MatterMonitoring microservices requires more than just uptime: it demands insight into latency, throughput, error rates, resource use, and inter-service communication. 
Tools like Middleware, Prometheus-Grafana, and Dynatrace help track these metrics at scale, support alerting, and simplify root cause analysis. Best practices include centralized logging, distributed tracing, automation, and continuous optimization to maintain performance in complex distributed systems.
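The PromQL-only anomaly detection approach mentioned above comes down to a rolling average plus a standard-deviation band. Here's a toy, plain-Python rendering of the same idea; the window size, sensitivity, and sample series are arbitrary:

```python
from statistics import fmean, pstdev

def anomaly_band(samples, window=12, k=3.0):
    """Flag points outside mean +/- k*stddev of the preceding window of samples."""
    flagged = []
    for i in range(window, len(samples)):
        recent = samples[i - window:i]
        mean, sd = fmean(recent), pstdev(recent)
        lower, upper = mean - k * sd, mean + k * sd
        if not (lower <= samples[i] <= upper):
            flagged.append((i, samples[i], round(lower, 2), round(upper, 2)))
    return flagged

# Request latency in ms: steady traffic, then a spike the band should catch.
latency = [120, 118, 125, 119, 121, 123, 117, 122, 120, 124, 119, 121, 310]
for idx, value, lower, upper in anomaly_band(latency):
    print(f"sample {idx}: {value}ms outside [{lower}, {upper}]")
```

In Prometheus itself the same shape falls out of avg_over_time and stddev_over_time; the point is that the band, not a fixed threshold, decides what counts as anomalous.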

Shreyans from Packt
26 Sep 2025
2 min read

24 Hours Left: AI Powered Platform Engineering Workshop starts tomorrow

Your last chance to join. Hurry!

24 Hours Left - AI Platform Engineering Workshop starts tomorrow! This is your last chance to join.

FINAL CALL: 24 Hours Remaining
Use code FINAL40 to get 40% Off

This is your last chance to join 4 industry-leading instructors in this live 5-hour intensive workshop:
George Hantzaras: Director of Engineering, MongoDB
Ajay Chankramath: Founder & CEO, Platformetrics
Dr. Gautham Pallapa: Principal Director, Cloud, Data & AI, Scotiabank
Max Körbächer: Founder, Liquid Reply

Fellow attendees already secured their spots:
Principal Product Manager, Walmart Global Tech
Technical Manager, Mercedes Benz
Systems Engineer, eBay
Staff DevOps Engineer, Flix
Senior Software Engineer, NBCUniversal

FINAL CALL: 24 Hours Remaining
Use code FINAL40 to get 40% Off

What you'll walk away with:
AI-driven platform frameworks for immediate use
Your personalized 90-day roadmap
Free ebook: Mastering Enterprise Platform Engineering ($31.99 value)
Exclusive community access with all speakers & attendees
Exclusive offer on CNPA, in partnership with the Linux Foundation

Tomorrow, September 27 | 9am - 2pm EDT

Don't let this opportunity slip away. Secure your spot now, at 40% off. We'd be delighted to see you there.
From all of us at Packt

FINAL CALL: 24 Hours Remaining
Use code FINAL40 to get 40% Off
Get the Bundle for $18

Shreyans from Packt
14 Oct 2025
8 min read

How MCP Turns Your IDP into an Actual Teammate

Tired of TicketOps? Here’s how to break free. CloudPro #109: How MCP Turns Your IDP into an Actual Teammate Your runbooks live in wikis. Your SLOs live in monitoring dashboards. Your approval workflows live in ticketing systems. And you, the developer who actually needs to ship things, you're stuck stitching all of this together manually, one step at a time. This is where most cloud teams still operate. And it's exhausting. I realized this after our AI-Powered Platform Engineering workshop last week. If you were there, you know it was packed. We had George Hantzaras (Director of Engineering, MongoDB) and Ajay Chankramath (Founder, Platformetrics)up there walking through how AI, golden paths, and internal developer platforms are completely reshaping the way modern teams ship software. The energy in the session was incredible. And the conversations didn’t stop when the session ended. But here’s the thing that stuck with me the most: MCP (Model Context Protocol) turns your platform into something an AI can actually do things with, not just talk about. This is the difference between TicketOps and brittle runbooks versus workflows like provisioning, rolling back, and scaling that become assistant-triggered actions with guardrails and full visibility into what’s happening. Your AI goes from being a really smart search engine to being an actual teammate. That’s the jump I think a lot of us have been waiting for. So in this issue, I’m going to dig into what that actually looks like, and give you a practical starter pack you can adapt to your own stack. Because if you’ve been wondering how to move from AI as “a chat window” to AI as “a teammate who actually does the work,” this is a good place to start. Cheers, Shreyans Singh Editor-in-Chief Bookmark This Article for Later Join Snyk on October 22, 2025 at DevSecCon25 - Securing the Shift to AI Native Save Your Spot The Problem We’re All Facing Let me paint a picture of how most cloud teams still operate today. You write a ticket. You wait for approvals. You hunt through runbooks to find the right commands to copy-paste. You watch CI/CD spin up. You cross your fingers that the safety nets catch anything that could go wrong. Documentation and internal portals help you find things, sure. But they rarely do things. That gap, between knowing what to do and actually doing it, creates this constant context switching, slow feedback loops, and inconsistent safety guardrails. The result is messy. Really messy. Brittle handoffs. Workflows duplicated across teams. And what should be “golden paths” that empower teams end up becoming “golden cages” that block velocity instead of enabling it. Here’s what I think is the real core issue: we’ve separated knowledge from action. Your runbooks live in wikis. Your SLOs live in monitoring dashboards. Your approval workflows live in ticketing systems. And your developers, the people who actually need to ship things, they’re stuck stitching all of this together manually, one step at a time. This friction doesn’t just slow shipping down. It creates inconsistency. One team deploys with proper approvals; another skips them. One team checks SLO impact before scaling; another scales blind. You end up with fragmented practices across your org, and nobody likes that. We need a different model. 
Instead of “read the docs, then run the steps,” we should be asking our systems: “propose a plan, then execute it with guardrails.” Here’s What Changed: The solution rests on two foundational ideas working together, and honestly, the way they complement each other is elegant. First: golden paths. These reduce decision fatigue without sacrificing flexibility. Think of a golden path as an end-to-end workflow with sensible defaults, something like: stateless service → preview environment → automated checks → production. These paths capture your team’s collective knowledge about how to do things safely. But they’re not straightjackets. You build in escape hatches for the 20% of cases that don’t fit the mold. Golden paths let you standardize without fossilizing your processes, and they’re the antidote to “TicketOps.” Second: the Model Context Protocol (MCP). This is the glue that makes it all work. MCP is a standard that exposes your golden paths as tools that an AI assistant can actually call. Think of it as an adapter layer between your assistant and your platform. The assistant can observe your systems: what services exist, what their SLOs are, what incidents are open right now. It can propose a plan (“I’ll deploy this to preview, run tests, then promote to prod”). And it can execute approved actions while logging everything for audit and observability. MCP turns your platform into a set of composable, auditable operations. Here’s what this looks like in the real world. An engineer types: “Spin up a preview environment and load our staging data.” The assistant gathers context. What’s the service’s dependency tree? What are the current SLO baselines? Who owns this service? What recent incidents or postmortems should I know about? It drafts a plan that respects your runbooks and standards. It shows you the diffs and policy checks that apply. Then it either executes immediately (low-risk actions) or routes to the right approver (medium/high-risk actions). Everything gets logged. This isn’t theoretical anymore. It works. Bookmark This Article for Later How You Actually Build This The architecture starts simple: pick one golden path. One high-value workflow, like service deployment, that you want to standardize. Next, you build MCP servers that act as adapters to your existing systems. Your runtime layer exposes Kubernetes operations (deploy, scale, health checks) and feature flags as callable tools. Your delivery layer connects to Git, GitHub Actions, or GitLab CI to handle builds, promotions, and canary deployments. Your reliability layer taps into SLO systems, alerting platforms, and observability tools so the assistant can query metrics, create SLOs, and mute alerts with proper context. Your change control layer enforces approvals and integrates with compliance workflows. Underneath all of this runs retrieval-augmented generation (RAG). Before the assistant proposes any action, it indexes and searches your runbooks, compliance standards, postmortems, and architectural guidelines. The plan that comes back doesn’t just say “deploy the service”, it references your procedures. “Deploy using our standard blue-green pattern [link to runbook]. SLO impact: +2% latency at p99 (within budget). Dependencies: notify the data team [link to postmortem]. Policy check: OK, proceed.” Safety gets enforced as policy-as-code, not friction. You define risk tiers: low-risk actions (like querying logs) auto-execute. Medium-risk actions (like scaling a service) require human approval. 
High-risk actions (like deleting a database) are blocked unless routed through a formal approval path. Standards violations get auto-fixed when confidence is high, or escalated when not. Everything is logged.

The user experience ties it all together. An engineer uses chat or a slash command to describe what they need. The assistant gathers context from your systems and knowledge base, drafts a plan with full diffs and impact estimates, shows what policy checks apply, and executes: staying in the loop for decisions that matter (approvals, tradeoffs) but handling the rote work. Everything that happened gets logged: the initial request, the plan, who approved it, what actually ran, what the outcome was.

The Real Payoff

Over time, this architecture unifies your platform. Your internal developer platform (IDP) becomes more than a portal where people click and read documentation. It becomes an actor that moves with your engineers, understanding context, respecting guardrails, and turning workflows into outcomes.

You measure it differently too. Not by pageviews to your portal, but by lead time to production. By mean time to recovery when things break. By consistency of safety across teams. You tie actions to business impact: AI-assisted scaling and predictive capacity planning can deliver up to 25% cost optimization while keeping SLOs intact.

The shift from ticket-based workflows to action-based workflows with AI assistance and policy-driven safety doesn't happen overnight. Start with one golden path. Build the MCP adapters for one system at a time. Add guardrails incrementally. But the direction is clear. Push intelligence and safety down into your platform so that doing the right thing becomes the easy thing. And the hard work, decision-making, tradeoff analysis, exception handling, stays where it belongs: with your engineers.

That's where we're headed. And honestly? I think it's going to change how we all work.
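The risk-tier gating described in this piece is ultimately just policy-as-code sitting in front of a tool registry. Here's a dependency-free sketch of the idea; the action names, tiers, and approval stub are all invented for illustration, and a real setup would expose these as MCP tools rather than a Python dict:

```python
from enum import Enum

class Risk(Enum):
    LOW = "low"        # auto-execute, e.g. querying logs
    MEDIUM = "medium"  # requires human approval, e.g. scaling a service
    HIGH = "high"      # blocked unless routed through a formal approval path

# Hypothetical platform actions an assistant could call.
ACTIONS = {
    "query_logs":      (Risk.LOW,    lambda **kw: f"logs for {kw['service']}"),
    "scale_service":   (Risk.MEDIUM, lambda **kw: f"scaled {kw['service']} to {kw['replicas']}"),
    "delete_database": (Risk.HIGH,   lambda **kw: "deleted"),
}

def request_approval(action: str, requester: str) -> bool:
    # Stand-in for a real approval workflow (change ticket, chat approval, CAB).
    print(f"[approval] {requester} requested {action}")
    return True

def execute(action: str, requester: str, **kwargs):
    risk, handler = ACTIONS[action]
    if risk is Risk.HIGH:
        raise PermissionError(f"{action} is blocked outside the formal approval path")
    if risk is Risk.MEDIUM and not request_approval(action, requester):
        raise PermissionError(f"{action} was not approved")
    result = handler(**kwargs)
    print(f"[audit] {requester} ran {action} {kwargs} -> {result}")  # everything is logged
    return result

execute("query_logs", "dev@example.com", service="checkout")
execute("scale_service", "dev@example.com", service="checkout", replicas=5)
```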
Shreyans Singh
15 Sep 2025
3 min read

Don’t miss this: AI-Powered Platform Engineering workshop (Sept 27)

Hands-on workshop + expert panel. Special offer for CloudPro readers.

CloudPro #107: Special Issue

I'm interrupting our regular newsletter schedule today because something came up that I genuinely think you need to know about. I want to tell you about our event on September 27th that could be a game-changer for how you think about platform engineering.

MongoDB's Director of Engineering shows you how to build AI-powered developer platforms that actually work at scale.

Exclusive 40% Off for CloudPro Subscribers
Use code CLOUDPRO

Here's why I'm personally excited about this:
We've got George Hantzaras from MongoDB leading a 5-hour intensive on AI-Powered Platform Engineering. And when I say intensive, I mean it – this isn't another surface-level "AI is the future" talk. George is the Director of Engineering at MongoDB, speaks at Kubecon and HashiConf, and he's going deep into the practical stuff that actually matters.

Agenda for the workshop:
Self-Service Golden Paths – build workflows that reduce friction while keeping developer flexibility
Knowledge as a Platform Capability – embed organizational knowledge with AI (RAG, context modeling)
Intelligent Developer Portals – natural language interfaces and scaffolding services that understand developer needs
AI-Driven Operations – anomaly detection, observability, and incident triage beyond traditional monitoring

Why this matters for your daily work:
If you're working with monitoring stacks like Prometheus or Grafana, George's approach to integrating runbooks, standards, and service catalogs into developer workflows will feel directly applicable.

Exclusive 40% Off for CloudPro Subscribers
Use code CLOUDPRO

Our Panelists:
We're not just doing sessions. We've also put together a panel:
Ajay Chankramath – Founder, Platformetrics
Dr. Gautham Pallapa – Principal Director, Cloud, Scotiabank
Max Körbächer – Founder, Liquid Reply
Together, they'll unpack the real-world challenges and production patterns they're seeing across industries.

How You'll Leave Prepared
George is ending the day with something I've never seen at these events – a structured workshop to draft your actual 90-day pilot plan. You're walking out with a personalized roadmap, not just ideas.

Why This Event is Different
Focuses on implementation, not hype
Gives you time to go deep (5 hours, not 50 minutes)
Ends with an actionable plan, not just slide decks
Exclusive 40% off for CloudPro subscribers

Exclusive 40% Off for CloudPro Subscribers
Use code CLOUDPRO

Best,
Shreyans
Editor-in-Chief, CloudPro

Shreyans from Packt
03 Nov 2025
9 min read

AI That Runs Entirely Offline: How to Build an Offline Enterprise Assistant

By Saurabh Shrivastava

CloudPro #110

82% of data breaches happen in the cloud. The reality is you can't stop every single attack, so survival depends on how fast you can recover.

Join us for the Cloud Resilience Summit on December 10th to:
- Build true cyber resilience by shifting to an "assume breach" strategy
- Gain practical, real-world cloud insights
- Ensure rapid business recovery and minimal financial impact with a cloud restoration strategy

Save My Spot

This week's CloudPro Special comes from Saurabh Shrivastava, Global Solutions Architect Leader at AWS and author of the bestselling Solutions Architect's Handbook. With over two decades in the industry, Saurabh has helped shape how enterprises build and secure cloud systems.

And in today's article, he explores a radical idea: AI that runs entirely offline. No APIs, no data leaving your network. Just private, local intelligence built for sensitive environments. Sounds interesting? Read on for the full article.

If you want to learn directly from him, Saurabh is hosting a live AWS Solutions Architect Associate (SAA-C03) Workshop on January 17. It's a hands-on, fast-paced session that strips the exam down to what really matters. CloudPro readers get an exclusive 40% early-bird discount with the code CLOUDPRO. Reserve your seat.

Cheers,
Shreyans Singh
Editor-in-Chief

Early Bird Offer: Get 40% Off
Use code CLOUDPRO

AI That Runs Entirely Offline: How to Build an Offline Enterprise Assistant
By Saurabh Shrivastava

Working in defense, finance, law, or a heavily regulated industry means you can't just plug into ChatGPT and call it a day. Cloud-based AI tools aren't built for environments where data leakage isn't just bad. It's catastrophic.

You can't send classified intel or proprietary financial models to someone else's servers. And if you're operating in an air-gapped network? Forget about it.

That's the problem this Offline Enterprise Assistant solves.

It's a local AI setup that runs entirely on your own hardware. No cloud dependencies. No API keys. No data leaving your perimeter. You choose a model: LlamaCpp, Ollama, whatever fits your needs, and run it directly on your machine. Every prompt, every response, every log file stays inside your infrastructure.

This matters when you're reviewing sensitive legal contracts, running R&D analyses, or automating workflows that involve confidential information. You get the productivity boost of modern AI without opening the door to external risk. It's built for teams that need full control over their tools and can't afford to trust a third party with their data.

Why This Architecture Stands Out
Runs Without Internet: Operates 100% offline, making it ideal for air-gapped networks or classified infrastructure.
Keeps Data on Your Device: Nothing is sent out, nothing is tracked. You stay in control always.
Fast and Responsive: Local inference means no lag, no rate limits, just amazing performance.
Built for Sensitive Workflows: Legal reviews, research, compliance, internal tooling are all handled securely.

Most teams are realizing that AI doesn't always belong in the cloud. When you're dealing with internal systems, sensitive data, or strict compliance rules, you need something that stays inside your walls. That's where a local-first approach makes sense: it gives you the benefits of AI without the exposure.

This Offline Enterprise Assistant is built around that idea. It's your own assistant, running entirely on your hardware, tuned to your environment, and never sending a single request outside your network.
You control how it works, how it's updated, and what data it touches. Let's break down how the architecture fits together.

Architecture Explanation

The offline MCP Client architecture is designed to deliver end-to-end private and local AI capability, without any reliance on cloud APIs or outbound network traffic. Here's how it works:

Developer: Prepares prompts or workflows using a local development environment (such as a secure IDE or terminal). All interactions originate and remain on the local device.
MCP Client: Acts as the interface between the developer's inputs and the AI model. It routes prompts to the embedded LLM, orchestrates the workflow, and handles results.
Offline LLMs (LlamaCpp / Ollama): Powerful large language models are loaded and executed directly on the local hardware. No external API calls; all model inference and response generation happen on the device, fully offline.
Local SQLite Database: Stores chat logs, prompts, and results securely and privately. Provides an audit trail and the ability to revisit past interactions, entirely within the local infrastructure.
Secure UI/API: Presents results to the developer via a local web interface or terminal UI. Enables further integration with internal systems while ensuring data never leaves the trusted environment.

Think about it. You don't want your data, your prompts, or your workflows slipping out into the cloud. With this architecture, nothing leaves your machine. Zero external exposure. No tokens. No API keys. No hidden traffic.

If you're in a regulated industry, whether it's defense, legal, healthcare, or any air-gapped environment, this setup checks every box. It keeps you compliant, private, and secure while still giving you the power of modern AI. And here's the best part: it's extensible by design. Want to add another LLM? Done. Need to customize workflows? Easy. Ready to experiment with agentic AI? Go ahead. You can build without ever breaking the privacy barrier.

Most importantly, this isn't a short-term solution. It's future-proof. As on-device AI models become larger and smarter, this architecture will scale with you, handling more automation, more intelligence, and more complexity.

Now it's time to get our hands dirty and implement it.

Implementation

Using LM Studio, Streamlit, and Python, you'll set up and run local open-source models directly on your machine. Unlike online AI assistants like ChatGPT or Google Bard, which constantly need internet connectivity and send data back to external servers, this approach runs completely offline.

Along the way, you'll gain hands-on experience with the full cycle: you'll understand how local LLMs really work, set up all the required software and dependencies, download and run an open-source model in LM Studio, and then build a simple yet powerful chat interface using Streamlit. From there, you'll integrate your local LLM into the Streamlit app and learn how to store and review chat history using a local database securely.
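To make the shape of that cycle concrete, here's a compressed sketch of the core loop. It assumes LM Studio's local server is running with its OpenAI-compatible API on http://localhost:1234/v1 (its usual default; adjust the URL and model name to your setup), and it appends every chat turn to a local SQLite file. The Streamlit UI from the walkthrough is left out for brevity:

```python
import sqlite3
import requests

LMSTUDIO_URL = "http://localhost:1234/v1/chat/completions"  # adjust if your server differs
MODEL = "local-model"  # LM Studio serves whichever model you have loaded

db = sqlite3.connect("assistant_history.db")
db.execute(
    "CREATE TABLE IF NOT EXISTS chat "
    "(role TEXT, content TEXT, ts DATETIME DEFAULT CURRENT_TIMESTAMP)"
)

def ask(prompt: str) -> str:
    # Everything below talks only to localhost: no external API calls, no keys.
    db.execute("INSERT INTO chat (role, content) VALUES (?, ?)", ("user", prompt))
    resp = requests.post(
        LMSTUDIO_URL,
        json={"model": MODEL, "messages": [{"role": "user", "content": prompt}]},
        timeout=120,
    )
    answer = resp.json()["choices"][0]["message"]["content"]
    db.execute("INSERT INTO chat (role, content) VALUES (?, ?)", ("assistant", answer))
    db.commit()
    return answer

if __name__ == "__main__":
    print(ask("Summarize the key clauses to review in an NDA."))
```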
Before you dive into building your offline Enterprise Assistant, it's important to get familiar with a few key concepts.

At the heart of this setup is the Offline Assistant itself: an AI system that runs entirely on your computer, performing all language model inference locally without ever needing an internet connection. Powering this is an LLM (Large Language Model), a type of AI trained on massive datasets to generate human-like text responses. To make it simple to use, you'll rely on LM Studio, a desktop app that lets you download, run, and serve open-source LLMs on your machine, exposing them through a local API. For the interface, you'll use Streamlit, a Python framework that makes it easy to build interactive web apps and quickly prototype AI-driven tools. And finally, for securely managing chat history, you'll work with SQLite, a lightweight local database that keeps all your interactions private and fully stored on your device.

By the end of this hands-on exercise, you'll have your own local Enterprise Assistant running directly in your browser, powered by an open-source LLM that operates fully offline through LM Studio. You'll interact with it using a simple but effective interface built with Streamlit, making your assistant practical and easy to use. Most importantly, every conversation will be securely stored as local chat logs in your system, never sent to the cloud, never exposed. By the time you're done, you'll walk away with a private, offline AI assistant that runs fast and stays entirely under your control.

Demo Video and Repo
Lab guide

Conclusion

Congratulations! You've just built your very own offline Enterprise Assistant, powered entirely by open-source tools and running fully on your machine. Along the way, you learned how to set up LM Studio to run an LLM locally, how to create a lightweight but effective interface with Streamlit, and how to store all your conversations securely using SQLite. Most importantly, you now understand how to put privacy first, keeping every prompt, response, and workflow under your complete control, with no reliance on external servers or cloud APIs.

This hands-on exercise gave you more than just a working prototype. You gained insight into how local LLMs work, how to integrate them into real-world applications, and how to design AI tools that balance functionality with security. You've also seen the bigger picture: how on-device AI can reshape the way enterprises approach sensitive tasks, from R&D to legal reviews to compliance-heavy workflows.

But this is only the beginning. You can now extend your Enterprise Assistant with advanced features:
Add a smarter UI with more interactive elements.
Try out different open-source models to experiment with speed, accuracy, and capabilities.
Layer in analytics and insights to track and optimize your usage.
Even push towards agentic AI, giving your assistant the ability to automate tasks and workflows while still running securely offline.

With what you've built, you've proven that you can harness the power of Generative AI without compromise: no data leaks, no internet dependency, no loss of control. Your private AI journey starts here.

- Saurabh
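And because the history lives in SQLite, reviewing or auditing past conversations stays just as local. A tiny sketch that reads back the same assistant_history.db table assumed in the earlier snippet:

```python
import sqlite3

db = sqlite3.connect("assistant_history.db")
# Same schema as the chat sketch above; harmless if the table already exists.
db.execute(
    "CREATE TABLE IF NOT EXISTS chat "
    "(role TEXT, content TEXT, ts DATETIME DEFAULT CURRENT_TIMESTAMP)"
)

for role, content, ts in db.execute("SELECT role, content, ts FROM chat ORDER BY ts"):
    print(f"[{ts}] {role}: {content[:120]}")
```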
Sponsored: Build your next app on HubSpot with the flexibility of an all-new Developer Platform. The HubSpot Developer Platform gives you the tools to build, extend, and scale with confidence. Create AI-ready apps, integrations, and workflows faster with a unified platform designed to grow alongside your business.

Start Building Today
