CloudPro

30 Jun 2025

8 min read

Migrating Uber’s Compute Platform to Kubernetes

30 Jun 2025

How to Break Up a Terraform Terralith Without Breaking EverythingCloudPro #97All Books $9.99 | 8 Hours RemainingSHOP NOW1. AWS’s own security tool introduced a privilege escalation risk2. Terraliths slowing you down? Here's how to break them up safely3. Uber’s 3M-core migration to Kubernetes: what it really tookPlus: BitM attacks that bypass MFA, schema migration via CI/CD, and a no-fluff guide to how Kubernetes CRDs actually work.Cheers,Shreyans SinghEditor-in-Chief🔐 Cloud SecurityAWS Launches Threat Technique Catalog to Share Real-World Attack DataAWS has released the Threat Technique Catalog, a resource mapping real-world attack techniques seen in customer incidents to the MITRE ATT&CK framework. Built from AWS CIRT investigations, it includes detection and mitigation advice for tactics like token abuse and misconfigured encryption. This gives cloud defenders a practical way to strengthen their AWS environments using adversary-informed data.AWS Launches Preview of Upgraded Security HubAWS has released a preview of its revamped Security Hub, now offering integrated dashboards, exposure mapping, and attack path visualizations to better prioritize and respond to security threats. It correlates findings across GuardDuty, Inspector, Macie, and CSPM to highlight critical gaps and risks.AWS Built a Security Tool. It Introduced a Security Risk.AWS’s “Account Assessment for AWS Organizations” tool unintentionally introduced a cross-account privilege escalation risk due to insecure deployment instructions. It advised users to avoid the management account without clarifying that deploying the hub role in a less secure account could expose high-sensitivity environments. AWS has since updated its documentation to recommend using a secure account.Forgotten DNS Records Enable CybercrimeA threat actor dubbed Hazy Hawk is hijacking abandoned cloud resources, like AWS S3 buckets and Azure endpoints, through dangling DNS records. By taking over subdomains of major organizations, including CDC, Deloitte, and universities, they reroute users to scams and malware via complex traffic distribution systems. The attacks exploit subtle DNS misconfigurations and show how unmanaged cloud resources can silently expose enterprise users to persistent threats.Browser-in-the-Middle Attacks Bypass MFA to Steal Sessions in Real TimeMandiant warns of a growing threat called Browser-in-the-Middle (BitM), where attackers proxy real login pages through their own browsers to steal fully authenticated sessions, even after MFA. BitM tools like Mandiant's internal “Delusion” make this scalable and fast, bypassing traditional phishing protections. Only hardware-backed MFA like FIDO2 or client certificates can reliably block these attacks.Workshop: Unpack OWASP Top 10 LLMs with SnykJoin Snyk and OWASP Leader Vandana Verma Sehgal on Tuesday, July 15 at 11:00AM ET for a live session covering:-The top LLM vulnerabilities-Proven best practices for securing AI-generated code-Snyk’s AI-powered tools automate and scale secure dev.See live demos plus earn 1 CPE credit!Register today⚙️ Infrastructure & DevOpsAWS CloudTrail Adds Detailed Logging for S3 Bulk DeletesAWS CloudTrail now logs individual object deletions made via the S3 DeleteObjects API, not just the bulk operation. This gives teams clearer visibility into which files were removed, improving audit trails and helping meet compliance and security needs. Granular logs also allow finer control via event selectors.AWS Backup adds new Multi-party approval for logically air-gapped vaultsAWS Backup now supports multi-party approval for logically air-gapped vaults, allowing secure recovery even if your AWS account is compromised. Admins can assign trusted approval teams to authorize vault access from outside accounts. This provides an independent, auditable recovery path, strengthening ransomware resilience and governance for critical backups.Inside AWS’s Strategy for Building Bug-Free, High-Performance SystemsAWS shared how it integrates formal and semi-formal methods, like TLA+, model checking, fuzzing, and deterministic simulation, into everyday development to eliminate bugs, boost developer speed, and enable aggressive optimizations. Tools like the P language and PObserve are used across S3, DynamoDB, EC2, and Aurora to model distributed systems, validate runtime behavior, and prove correctness of critical code paths.How to Break Up a Terraform Terralith Without Breaking EverythingLarge monolithic Terraform setups (“Terraliths”) can slow down deploys and increase risk. This guide lays out a clean migration path, starting with dependency mapping and backups, then moving to new root modules using import and removed blocks (in TF 1.7+), or scripted state mv operations. It also covers real-world lessons on inter-module communication, safe rollouts, automation, and state isolation, helping teams modernize IaC safely and modularly.Why It’s Time to Automate Your Database Schema MigrationsMany teams automate their app deployments but still manage database changes manually, leaving room for human error, schema drift, and security risks. This guide explains how tools like Atlas bring schema migrations into your CI/CD pipelines using declarative definitions, automatic diffs, and linting. The result: safer deployments, fewer production credentials, and consistent environments.📦 Kubernetes & Cloud NativeAmazon EKS Pod Identity adds cross-account access supportAmazon EKS Pod Identity now supports cross-account resource access without code changes. You can assign a second IAM role from another AWS account when creating a pod identity, enabling secure access to resources like S3 or DynamoDB via IAM role chaining. This simplifies multi-account architectures in EKS and reduces the complexity of credential management.Amazon GuardDuty expands Extended Threat Detection coverage to Amazon EKS clustersAmazon GuardDuty now detects advanced attack sequences in EKS clusters by correlating signals across audit logs, runtime activity, and API usage. This helps uncover threats like privilege escalation and secret exfiltration that might be missed by isolated alerts. It gives security teams a complete view of Kubernetes compromises and reduces time to investigate and respond.How CRDs Extend and Hook into the Kubernetes APIThis deep dive explains how Kubernetes Custom Resource Definitions (CRDs) work behind the scenes. It walks through how CRDs register with the Kubernetes API, how schemas validate custom objects, and how controllers fetch and handle them via client-go. You’ll learn how CRDs are serialized, discovered, and routed through the aggregation layer, giving you a detailed mental model for building robust Kubernetes extensions.Migrating Uber’s Compute Platform to KubernetesUber migrated all stateless services, powering 3M+ cores and 100K daily deployments, from Mesos to Kubernetes to standardize infrastructure and tap into the cloud-native ecosystem. They tackled extreme scale (7,500-node clusters), rebuilt integrations, and automated the shift using their internal “Up” platform. Custom solutions like artifact preservation, gradual scaling, and rollout heuristics ensured reliability, while Kubernetes UI and scheduler tweaks enabled smooth operations.Stop Building Platforms Nobody Uses: Pick the Right Kubernetes Abstraction with GitOpsThis post calls out a common pitfall: over-engineering internal platforms that developers don’t adopt. It argues that real developer pain: context switching, CI/CD complexity, insecure YAML sprawl, must shape the abstraction layer. Tools like Kro and Score can simplify Kubernetes via GitOps, but only when they reduce complexity without hiding critical decisions. The message: build abstractions that solve real problems, not just tick architectural boxes.🔍 Observability & SREAmazon VPC Route Server announces logging enhancementsAWS has added new monitoring features to VPC Route Server, including real-time logs for BGP and BFD sessions, historical data tracking, and flexible delivery via CloudWatch, S3, and Firehose. This helps engineers troubleshoot connectivity issues faster without needing AWS Support.Amazon Athena adds managed query results with built-in storage and cleanupAmazon Athena now supports managed query results, eliminating the need to preconfigure S3 buckets or manually clean up old results. This simplifies analysis workflows, especially for teams using automated workgroup creation.Grepr - Dynamic ObservabilityGrepr launched an ML-powered observability pipeline that filters, aggregates, and routes telemetry data before it hits your tools, reducing log volumes and storage costs significantly. It can scale automatically, backfill data during incidents, and runs alongside existing setups with minimal config. Ideal for teams seeking cost control without losing visibility.Chip auto-detects root causes without manual alerting or dashboardsChip is a zero-config monitoring agent that auto-instruments apps and alerts only on real customer-impacting issues. It tracks everything from code commits to Kubernetes events to find root causes fast, using real-time outlier and cohort detection. Built for fast-moving teams who want signal without the noise.Parseable offers fast, open-source observability on S3 with low resource useParseable is a lightweight, S3-first observability platform designed for speed and cost-efficiency. It delivers 90% faster queries than Elastic, uses up to 70% less CPU/memory, and integrates easily with AI and observability tools. Fully open source with no vendor lock-in.Forward to a Friend📢 If your company is interested in reaching an audience of developers and, technical professionals, and decision makers, you may want toadvertise with us.If you have any comments or feedback, just reply back to this email.Thanks for reading and have a great day!Disclaimer: Some eBooks and videos are excluded from the $9.99 offer. For selected countries, tiered discount pricing may vary.*{box-sizing:border-box}body{margin:0;padding:0}a[x-apple-data-detectors]{color:inherit!important;text-decoration:inherit!important}#MessageViewBody a{color:inherit;text-decoration:none}p{line-height:inherit}.desktop_hide,.desktop_hide table{mso-hide:all;display:none;max-height:0;overflow:hidden}.image_block img+div{display:none}sub,sup{font-size:75%;line-height:0} @media (max-width: 100%;display:block}.mobile_hide{min-height:0;max-height:0;max-width: 100%;overflow:hidden;font-size:0}.desktop_hide,.desktop_hide table{display:table!important;max-height:none!important}}

0
0

CloudPro

Shreyans from Packt

23 Jun 2025

11 min read

Which call paths dominate at runtime: using Flame Graphs to visualize it!

Shreyans from Packt

23 Jun 2025

11 min read

By Kaiwan N BillimoriaCloudPro #97This week’s CloudPro is a guest special from Kaiwan N Billimoria, the author of Linux Kernel Programming. Kaiwan runs world-class, seriously-valuable, high on returns, technical Linux OS (Corporate and Individual-Online) training programs at https://kaiwantech.com.In today’s issue, Kaiwan walks us through Flame Graphs: a powerful tool to visualize which call paths dominate at runtime and uncover performance bottlenecks.If you want to go deeper, his book Linux Kernel Programming is available for just $9.99 as part of Packt’s Summer Sale.Cheers,Shreyans SinghEditor-in-ChiefGET eBOOK at $9.99P.S. If you’re into platform engineering, check out Platform Weekly: the world’s largest newsletter for platform engineers with 100,000+ readers. Subscribe here.P.P.S. DeployCon is happening June 25. An engineer-first GenAI summit featuring teams from Meta, Tinder, DoorDash, and more. Join in person at the AWS Loft SF or online. Register now.Which Call Paths Dominate at Runtime: Using Flame Graphs to Visualize it!By Kaiwan N BillimoriaAnalyzing workloads is something all engineers end up doing at some point or another (or it’s their job description!). An obvious reason is performance analysis; for example, CPU usage may spike at times, causing issues or even outages.The need of the hour: observe, analyze, and figure out the root cause of the performance issue! Of course, that’s often easier said than done; this kind of work can bog down even experienced professionals...Borrowing from Brendan Gregg’s wonderful presentation (though old, it’s still relevant):In general, answering the ‘Who’ and the ‘How’ are simple(r):‘Who?’: well-known tools like top (and its numerous variants – htop, atop, etc) help answer this question.‘How?’: lots of system monitoring tools are available (vmstat, dstat, sar, nagios, cacti, nmon, iostat, nethogs, sysmon, etc.).The harder questions tend to be the ‘Why?’ and ‘What?’:‘Why?’: by generating a Flame Graph! (the topic of this short article)‘What?’: Flame Graphs as well as plain old perf!The following slide illustrates this (again, from Brendan Gregg):Right. So what the heck’s this Flame Graph thingy? Let’s explore!We’ll abbreviate Flame Graphs as FG.There are several types of FGs (CPU, GPU, memory, off-cpu, etc.); here we keep the focus on just one: CPU FGs via Linux’s powerful perf CPU profiler.The moment a tool can generate profiling data that includes stack traces, it implies that FGs can be generated! Thus, there are several tools besides perf that generate FGs:Windows: WPA, PerfView, Xperf.exeLinux: perf, eBPF, SystemTap, ktapFreeBSD: DTraceMac OS X: InstrumentsWe’ll focus only on using Linux perf; it’s considered one of the best modern CPU profiling tools on the platformMotivation for FGsWith perf, you can indeed profile your workload and see where exactly CPU usage shoots up. It’s easy: record something, get the report, and analyze it (well… it sounds easy at least).Example:Record a system-wide profiling (-a option switch) session with stack chain / backtrace (--call-graph dwarf, old option was -g), frequency of 99 Hz, for 10 seconds:sudo perf record -F 99 -a --call-graph dwarf -- sleep 10(Instead of the -a option switch, you can use the -p PID option to profile a particular process. The generated perf.data file’s owned by root; do a chown to place its ownership under your account if you wish.)Get the perf report:sudo perf report --stdio # or --tui…(Try it!).This begs the question – so why not just use perf? Ah, that’s the thing: on non-trivial workloads, the report can be simply humongous, even going into dozens of (printed) pages! Are you really going to read through all of it, trying to spot the outliers?Visualization with the CPU Flame GraphIt’s why we use the so-called Flame Graph (FG) – to visualize dense textual data and make sense of it; it’s so much clearer (so much more humane, literally).InstallationFirst off, ensure both the perf utility and the FlameGraph scripts are installed.Quick note: to install perf on Ubuntu/Debian, you typically need to be on a distro kernel (not a custom one).Why? Because – unusually for an app – it’s tightly coupled to the kernel it runs on! Assuming you’re on an Ubuntu/Debian distro, do this: sudo apt install linux-perf-$(uname -r) linux-tools-generic (even the linux-tools-generic package might be sufficient).If you’re on a custom-built kernel, build perf (it’s easy): cd <kernel-src-tree>/tools/perf ; make .Install FG from here or do (in an empty folder):git clone --depth 1 https://github.com/brendangregg/FlameGraph.gitSteps to generate a Flame GraphProfile the workload using perf:perf record -F 99 --call-graph dwarf [-a]|[-p pid]-a: all cpus; in effect, if specified, the sample is system-wide-p: sample a particular process.Generates the perf.data binary file.Read from perf.data (default, else use -i <fname>) to convert the binary data to human-readable stack traces via perf script:perf script > perfscript_out.datGenerate the FG, a Scalable Vector Graphic (SVG) file:The FG repo includes several stackcollapse-* scripts; we use the stackcollapse-perf.pl one:cat perfscript_out.dat | FlameGraph/stackcollapse-perf.pl \ | FlameGraph/flamegraph.pl > out.svgOpen the SVG in a web browser, move the mouse over stack frames.A Quick Test RunWe’ll assume you’ve installed both perf and the Flame Graph GitHub repo (the latter under your home dir).Profile: record everything for 10ssudo perf record -F 99 -a --call-graph dwarf -- sleep 10sudo chown ${LOGNAME}:${LOGNAME} perf.dataperf script > perfscript_out.datcat perfscript_out.dat | ~/FlameGraph/stackcollapse-perf.pl |~/FlameGraph/flamegraph.pl > out.svgOpen the SVG file in a web browser. Here’s a screenshot of the Flame GraphHmm, better if we zoom in… so I click on one of the rectangles on the lower-left (say on the gnome-shell one):Ah, better.Interpreting the Flame GraphSome really key points regarding how to interpret the Flame Graph:Each rectangle represents a single stack frame; read it bottom-up.The width is representative of the frequency of the function call.The height is representative of the depth of the stackThe order of rectangles from left-to-right is just alphabetical; it's not a timeline.The colors don’t signify anything special.You can (typically) use the browser Search (Ctrl-F) to search for a function by name.Click on a stack frame (a rectangle) to zoom into that tower. Click Reset Zoom (upper-left corner) to zoom back out.In effect: the hottest code-paths – the ones that dominate - are the widest rectangles!The top-edge – the rectangle at the very top - is the function on-CPU; beneath is ancestry (how it was invoked).Here’s another FG I captured while SSH was running (truncated screenshot showing the interesting portion):Interesting; the “towers” seem to be inverted! Yes, they’ve becomes top-down (downward-growing stacks) instead of bottom-up… they’re called icicles!An option to the perf script command sets this up.A fantastic thing about the FG is that both userspace and kernel-space functions are captured! It’s thus called a mixed-mode FG. For e.g., with the ‘ssh’ FG, you can clearly see the call path leading down to the kernel network protocol stack code – functions from the socket/INET layer sock_*(), followed by L4 tcp_*(), followed by the L3 ip_*() functions; even the invocation of the (network) device transmit – the dev_hard_start_xmit() and others – are visible!My flamegrapher.sh wrapper scriptsNext, to make this a bit easier to use (no need to remember the syntax, easier options), I wrote a wrapper over the original Flame Graph scripts; the top-level one’s named flamegrapher.sh: https://github.com/kaiwan/L5_user_debug/tree/main/flamegraph (it forms a portion of my ‘Linux Userspace Debugging – Tools & Techniques’ training repo).It’s Help screen reveals how you can – very easily! – use it to generate FGs:$ ./flame_grapher.shUsage:flame_grapher.sh -o svg-out-filename(without .svg) [options ...]-o svg-out-filename(without .svg): name of SVG file to generate (saved under /tmp/flamegraphs/)Optional switches:[-p PID]: PID = generate a FlameGraph for ONLY this process or threadIf not passed, the *entire system* is sampled...[-s <style>]: normal = draw the stack frames growing upward [default]icicle = draw the stack frames growing downward[-t <type>]: graph= produce a flame graph (X axis is NOT time, merges stacks) [default]Good for performance outliers (who's eating CPU? using max stack?); works well for multi-threaded appschart= produce a flame chart (sort by time, do not merge stacks)Good for seeing all calls; works well for single-threaded apps[-f <freq>]: frequency (HZ) to have perf sample the system/process at [default=99]Too high a value here can cause issues-h|-?: show this help screen.Note:After pressing ^C to stop, please be patient... it can take a while to process.The FlameGraph SVG (and perf.data file) are stored in the volatile /tmp/flamegraphs dir; copy them to a non-volatile location to save them.Notice a few points:The only mandatory option switch is -o fname; it generates an SVG file named fname.svg.There are two ‘types’ of FG’s we can generate:graph [default]: Produce an FG (X axis is NOT time, merges stacks). This type’s good for performance outliers (who's eating CPU? using max stack?); works well for multi-threaded apps.chart : Produce a flame chart – it’s sorted by time, do not merge stacks. Good for seeing all calls; works well for single-threaded apps.You can optionally specify a particular process (by -p PID) to profile, change the style to icicle, and set the profiling frequency.The metadata and the SVG is stored under /tmp; copy it to a non-volatile location if you want it saved!(Do read README.md as well. Hey, this wrapper’s lightly tested; please help me (and everyone!) out by raising Issues, as and when you come across them!)Tip: Try the speedscope.app site to interact with your FlameGraph!Flame Graphs: Caveats/IssuesFrame Pointers being present helps get good stack traces, BUT the -fomit-frame-pointer is the typical GCC flag passed!Possible exception case is the Linux kernel itself; it has intelligent algorithms to emit accurate stack trace even in the absence of frame pointers.Symbols are required (can use a separate symbol file). A side effect of no symbols may be ill-formed (or close to zero) stack traces.VMs may not support the PMCs (performance measurement counters) that perf requires; in that case, FGs (or perf) don’t really work well.Bonus materialB Gregg’s Linux Performance Observability Tools diagram across the stack!TipsWith [e]BPF becoming a powerhouse for many things, including observability, do look up equivalent eBPF tooling as well: https://www.brendangregg.com/ebpf.html (a similar diagram’s here!).Also be sure to check out B Gregg’s (and others) utility package wrappers: perf-tools and bpfcc-tools.Don’t ignore systemd’s systemd-analyze tool (boot-time).Perf: simply running sudo perf top is itself useful to find outliers; I keep a couple of aliases as well:alias ptop='sudo perf top --sort pid,comm,dso,symbol 2>/dev/null'alias ptopv='sudo perf top -r 80 -f 99 --sort pid,comm,dso,symbol \--demangle-kernel -v --call-graph dwarf,fractal 2>/dev/null'GET Linux Kernel Programming at $9.99What did you think of this special issue📢 If your company is interested in reaching an audience of developers and, technical professionals, and decision makers, you may want toadvertise with us.If you have any comments or feedback, just reply back to this email.Thanks for reading and have a great day!*{box-sizing:border-box}body{margin:0;padding:0}a[x-apple-data-detectors]{color:inherit!important;text-decoration:inherit!important}#MessageViewBody a{color:inherit;text-decoration:none}p{line-height:inherit}.desktop_hide,.desktop_hide table{mso-hide:all;display:none;max-height:0;overflow:hidden}.image_block img+div{display:none}sub,sup{font-size:75%;line-height:0}#converted-body .list_block ol,#converted-body .list_block ul,.body [class~=x_list_block] ol,.body [class~=x_list_block] ul,u+.body .list_block ol,u+.body .list_block ul{padding-left:20px} @media (max-width: 100%;display:block}.mobile_hide{min-height:0;max-height:0;max-width: 100%;overflow:hidden;font-size:0}.desktop_hide,.desktop_hide table{display:table!important;max-height:none!important}}

0
0

CloudPro

Shreyans from Packt

16 Jun 2025

9 min read

How to Make Sure Your Kubernetes Sidecar Starts Before the Main App

Shreyans from Packt

16 Jun 2025

9 min read

Why Automatic Rollbacks Are Risky and Outdated in Modern DevOpsCloudPro #96Platform Weekly - the world’s largest platform engineering newsletterWith over 100,000 weekly readers Platform Weekly dives into platform engineering best practices, platform engineering news, and highlights, lessons and initiatives from the platform engineering community.Subscribe Now📌 A hidden prompt injection flaw in GitLab Duo that quietly leaked source code📌 Just-in-time AWS access using Entra PIM (yes, that’s possible now)📌 Cloud SQL charging 2TB storage for 6GB of data, because of WAL logs📌 Why automatic rollbacks in DevOps might be doing more harm than goodYou’ll also find sharp reads on scaling Terraform teams, new volume tools for AI/ML in GKE, and a brutally honest take on Kubernetes complexity. On the observability side, AWS added visual dashboards to Network Firewall, and OpenTelemetry clarified how to treat logs vs. events.Hope you find something that helps you ship safer, smarter, or faster.Cheers,Shreyans SinghEditor-in-ChiefPS: If you’re not already reading Platform Weekly, I’d recommend it.It’s one of the few newsletters I make time for every week: focused on platform engineering, cloud native, and the kind of problems teams actually face. 100,000+ people read it, but it still feels like it’s written by someone who gets it.Here’s the link if you want to check it outSubscribe Now🔐 Cloud SecurityJust-in-time AWS Access to AWS with Entra PIMJust‑in‑time privileged access can be implemented by integrating Microsoft Entra PIM with AWS IAM Identity Center using SCIM/SAML, enabling temporary group-based access tied to approval workflows and time limits. By mapping Entra security groups to AWS permission sets (e.g. EC2AdminAccess) and enabling eligibility/activation in PIM, users gain access only when approved, and only for a set duration.On‑Demand Rotation Now Available for KMS Imported KeysAWS KMS now lets you rotate imported symmetric key material on‑demand without needing to create a new key or change its ARN, simplifying compliance and security by avoiding workload disruptions. New API operations, including RotateKeyOnDemand and KeyMaterialId tracking, let you import, rotate, audit, expire, or delete individual key versions while retaining decryption access to older ciphertext.CloudRec: multi-cloud security posture management (CSPM) platformCloudRec is an open‑source, scalable CSPM platform that continuously discovers 30+ cloud services across AWS, GCP, Alibaba, and more, offering real‑time risk detection and remediation.It uses OPA‑based declarative policy management, enabling dynamic, flexible rule definitions without code changes or redeployment.How to use the new AWS Secrets Manager Cost Allocation Tags featureAWS Secrets Manager now supports cost allocation tags, letting you tag each secret (e.g., with CostCenter) and track its costs in Cost Explorer or cost-and-usage reports.Enable tags in Billing → Cost Allocation Tags, then filter or group secrets costs by tag to see spend per department or project.GitLab Duo Prompt Injection Leads to Code and Data ExposureA hidden prompt injection flaw in GitLab Duo allowed attackers to embed secret instructions, camouflaged in comments, code, or MR descriptions, triggering the AI assistant to reveal private source code. The attacker leveraged streaming markdown rendering and HTML injection (like <img> tags) to exfiltrate stolen code via base64-encoded payloads. GitLab patched the vulnerability in February 2025, blocking unsafe HTML elements and tightening input handling.⚙️ Infrastructure & DevOpsAmazon API Gateway introduces routing rules for REST APIsAmazon API Gateway now supports routing rules for REST APIs on custom domains, allowing dynamic routing based on HTTP headers, URL paths, or both. This enables direct A/B testing, API versioning, and backend selection, removing the need for proxies or complex URL structures.Amazon EC2 now enables you to delete underlying EBS snapshots when deregistering AMIsEarlier, snapshots had to be removed separately, often leading to orphaned volumes and wasted spend. Now. AWS EC2 will let users automatically delete EBS snapshots when deregistering AMIs, cutting down on manual cleanup and storage costs. This update streamlines resource management with no extra cost and is available across all AWS regions.Why is your Google Cloud SQL bill so high?A developer discovered that their Cloud SQL instance showed 2 TB of usage for only 6 GB of actual data, due to retained Write-Ahead Logs (WAL) from Point-in-Time Recovery. These logs can silently bloat storage costs when frequent transactions occur. To control costs, users should reduce WAL retention or re-provision instances with right-sized storage.Why Automatic Rollbacks Are Risky and Outdated in Modern DevOpsAutomatic rollbacks seem helpful but often fail due to the same issues that break deployments, like expired credentials or partial database changes. Modern practices like Continuous Delivery and progressive deployment (canary, blue/green, feature flags) offer safer, faster recovery paths. Human oversight adds resilience and learning, making manual intervention more effective than rollback automation.How to structure Terraform deployments at scaleAt scale, Terraform deployments require a clear structure that balances control and team autonomy. Scalr’s two-level hierarchy: Account and Environment scopes, lets central DevOps manage policies and modules, while engineers deploy independently within isolated workspaces. This setup encourages reusable code and standardization through a shared module registry.📦 Kubernetes & Cloud NativeMaking Kubernetes Event Management Easier with Custom AggregationAs Kubernetes clusters grow, managing events becomes harder due to high volume, short retention, and poor correlation. This article shows how to build a custom event system that groups related events, stores them longer, and spots patterns: helping teams debug issues faster. It uses Go to watch, process, and store events, and includes options for alerts and pattern detection.GKE Volume Populator Simplifies AI/ML Data Transfers in KubernetesGoogle Cloud’s new GKE Volume Populator helps AI/ML teams automatically move data from Cloud Storage to fast local storage like Hyperdisk ML, no custom workflows needed. It uses Kubernetes-native PVCs and CSI drivers to manage transfers, delays pod scheduling until data is ready, and supports fine-grained access control.How to Make Sure Your Kubernetes Sidecar Starts Before the Main AppIf your app depends on a sidecar, Kubernetes doesn’t guarantee the sidecar is fully ready before the main container starts, even with the new native support. This article shows how to delay the app start using startupProbe or postStart hooks in the sidecar. These methods let the app wait until the sidecar is actually ready, avoiding startup errors without needing code changes.Not every problem needs KubernetesKubernetes promises scalability and flexibility, but for most teams, it adds unnecessary complexity. Many workloads can be handled more easily with VMs, managed cloud services, or simpler container platforms like AWS Fargate or Google Cloud Run. Unless you truly need hybrid cloud, global scale, or run hundreds of services, Kubernetes may just slow you down and drain resources.What You Actually Need for Kubernetes in ProductionProduction Kubernetes setups need more than just working clusters. Use readiness, liveness, and startup probes correctly to avoid early traffic issues or restarts. Always define CPU and memory limits, isolate secrets using volumes, and enforce RBAC with least privilege. Use HPA for scaling, avoid local storage, and apply network policies to control traffic. Tools like kube-bench, Trivy, and FluentBit help monitor security, cost, and logs effectively.Book Now🔍 Observability & SREAWS Network Firewall launches new monitoring dashboardAWS Network Firewall now includes a monitoring dashboard that shows key traffic patterns like top flows, TLS SNI, HTTP host headers, long-lived TCP flows, and failed handshakes. This helps teams troubleshoot issues and spot security concerns faster. It’s available in all supported regions at no extra firewall cost, but requires Flow and Alert logs to be configured.Official RCA for SentinelOne Global Service InterruptionSentinelOne’s May 29 global service outage was caused by a software flaw in a deprecated infrastructure control system, which accidentally deleted critical network routes. This broke internal connectivity, taking down management consoles and related services. While customer endpoints stayed protected, teams lost visibility and control during the incident.There's a Lot of Bad Telemetry Out ThereMuch of today’s telemetry is noisy, irrelevant, or misleading: causing higher costs, slow troubleshooting, and poor decisions. Common problems include incomplete traces, outdated metrics, irrelevant logs, and data overload. Engineers often lack clear standards or guidance on good telemetry, especially for newer systems like LLMs. To fix this, teams should define what's useful, apply consistent conventions (e.g. OpenTelemetry), and work closely with devs to improve instrumentation at the source.OpenTelemetry Clarifies Its Approach to Logs and EventsOpenTelemetry treats logs as structured records sent through its Logs API, with a special focus on events: logs with a defined schema and guaranteed structure. Events are preferred for new instrumentation, as they integrate with context and can correlate with traces and metrics. Unlike spans, events have no duration or hierarchy. OpenTelemetry recommends using logs mainly for bridging existing systems, while semantic instrumentation should rely on events for consistency and context sharing.Storing all of your observability signals in one place matters!Treating traces, logs, and metrics as separate “pillars” creates silos and hinders correlation. Many teams still split signals across tools or vendors, leading to fragmented insights and painful debugging. A centralized “single pane of glass” setup helps correlate signals in one place, making it easier to understand system behavior.Forward to a Friend📢 If your company is interested in reaching an audience of developers and, technical professionals, and decision makers, you may want toadvertise with us.If you have any comments or feedback, just reply back to this email.Thanks for reading and have a great day!*{box-sizing:border-box}body{margin:0;padding:0}a[x-apple-data-detectors]{color:inherit!important;text-decoration:inherit!important}#MessageViewBody a{color:inherit;text-decoration:none}p{line-height:inherit}.desktop_hide,.desktop_hide table{mso-hide:all;display:none;max-height:0;overflow:hidden}.image_block img+div{display:none}sub,sup{font-size:75%;line-height:0} @media (max-width: 100%;display:block}.mobile_hide{min-height:0;max-height:0;max-width: 100%;overflow:hidden;font-size:0}.desktop_hide,.desktop_hide table{display:table!important;max-height:none!important}}

0
0

CloudPro

Shreyans from Packt

06 Feb 2026

5 min read

A blueprint for cyber resilience...

Shreyans from Packt

06 Feb 2026

5 min read

.CloudPro #118Attackers are actively trying to keep you from recoveringIn the event of a cyberattack, the cost of downtime is measured not just in financial terms, but in operational disruption and reputational damage. While prevention strategies are crucial, they are not a substitute for a robust recovery plan.Backups alone do not guarantee a clean restoration.We invite you to our virtual event,Foundations of Cyber Resilience, on 11 February, where we will provide a practical framework for what happens after a breach.You will learn:Whytraditional recovery strategies can fail when they are needed most.Howto detect and eliminate threats within your backups to prevent reinfection.Key componentsof a modern, orchestrated, and clean recovery process.REGISTER NOWNext week in CloudPro, we're dropping something special: a deep-dive from Microsoft Azure MVP Stéphane Eyskens that every cloud architect needs to read.If you've ever wondered why your multi-region Azure setup feels more complex than it should, or if you're still figuring out what actually happens when a region goes down, this one's for you.Stéphane, author of the 5-star rated Azure Cloud Native Architecture Mapbook, is sharing battle-tested patterns for building truly resilient systems using Azure SQL, Cosmos DB, and Storage. We're talking real code, Terraform scripts, and the kind of insights you only get from years in the trenches.Here's a sneak peek into what's coming...Cheers,Shreyans SinghEditor-in-ChiefHow to Build Always-On Applications on AzureBy Stephane EyskensA Sneak PeekBefore getting into the details, let's briefly revisit the difference between high availability (HA) and disaster recovery (DR).HA and DR exist on a spectrum, with increasing levels of resilience depending on the type of failure you want to withstand:Application-level failures: In some cases, you may simply want to tolerate application bugs—for example, a memory leak introduced by developers. Running multiple instances of the application on separate virtual machines, even on the same physical host, can already prevent a full outage when one instance exhausts its allocated memory. That is for instance, what you would get if you spin up 2 instances of an Azure App Service within the same zone (no zone redundancy).Hardware failures: To handle hardware failures, workloads should be distributed across multiple racks. That is what you would get if you'd host virtual machines on availability sets.Data centre–level outages: To withstand more severe incidents, workloads should be spread across multiple data centers, such as by deploying them across multiple availability zones. You can achieve this by turning on zone-redundancy on Azure App Service or use zone-redundant node pools in AKS. With such a setup, you should survive a local disaster such as fire, flooding, etc.Regional outages: Finally, to survive major outages, such as a major earthquake, a country-level power supply issue, etc., workloads must be deployed across multiple Azure regions in active/active or active/passive mode.Next week, Stéphane walks through exactly how to architect for each scenario, with diagrams, code, and real failover examples you can test yourself. Don't miss it.Early Bird closes in 72 hours. Last Few Seats At This Price.Book Your Seat NowUse code EARLYBIRD40 to get 40% OffWe're running a 5-hour workshop on architecting production-grade GenAI systems on AWS. Hands-on, practical, built for cloud architects and engineers.Here's the deal:Most GenAI content is either toy demos that work once or vendor pitches. This isn't that.We took real production problems: models breaking after launch, RAG pipelines failing silently, agents that cost too much or hallucinate in production, and turned them into architectural patterns using AWS services and real-world trade-offs.You'll learn how to pick the right AWS model for quality, cost, and latency.You'll build and tune RAG pipelines that don't break when data changes.And you'll understand when to use agents versus when they'll create more problems than they solve.Early Bird closes in 72 hours. Last Few Seats At This Price.Book Your Seat NowUse code EARLYBIRD40 to get 40% OffEarly Bird Offer LIVE Now: Get 40% Off TicketsBook Your Seat NowUse code EARLY40 to get 40% Off📢 If your company is interested in reaching an audience of developers and, technical professionals, and decision makers, you may want toadvertise with us.If you have any comments or feedback, just reply back to this email.Thanks for reading and have a great day! *{box-sizing:border-box}body{margin:0;padding:0}a[x-apple-data-detectors]{color:inherit!important;text-decoration:inherit!important}#MessageViewBody a{color:inherit;text-decoration:none}p{line-height:inherit}.desktop_hide,.desktop_hide table{mso-hide:all;display:none;max-height:0;overflow:hidden}.image_block img+div{display:none}sub,sup{font-size:75%;line-height:0}#converted-body .list_block ol,#converted-body .list_block ul,.body [class~=x_list_block] ol,.body [class~=x_list_block] ul,u+.body .list_block ol,u+.body .list_block ul{padding-left:20px} @media (max-width: 100%;display:block}.mobile_hide{min-height:0;max-height:0;max-width: 100%;overflow:hidden;font-size:0}.desktop_hide,.desktop_hide table{display:table!important;max-height:none!important}}

0
0

CloudPro

Shreyans from Packt

09 Jun 2025

9 min read

Uber built a multi-cloud secrets platform to prevent leaks and automate security at scale

Shreyans from Packt

09 Jun 2025

9 min read

How to Block Up to 95% of Attacks Using AWS WAFCloudPro #95A better way to handle vendor security reviews?If you've ever dealt with vendor onboarding or third-party cloud audits, you know how painful it can be: long email chains, stale spreadsheets, and questionnaires that don’t reflect what’s actually happening in the cloud.We recently came across CloudVRM, and it’s a refreshingly modern take on the problem.Instead of asking vendors to fill out forms or send evidence, CloudVRM connects directly to their AWS, Azure, or GCP environments. It pulls real-time telemetry every 24 hours, flags misconfigs, and maps everything to compliance frameworks like SOC 2, ISO 27001, and DORA.It’s already being used by banks and infra-heavy orgs to speed up vendor approvals by 85% and reduce audit overhead by 90%.Worth checking out if you're building or maintaining systems in regulated environments, or just tired of spreadsheet security.Watch the demoThis week’s CloudPro kicks off with something genuinely useful: a tool that replaces vendor security questionnaires with real-time cloud evidence.📌CloudVRM connects directly to AWS, Azure, or GCP and auto-checks compliance, no spreadsheets, no guesswork📌AWS CloudTrail silently skipping logs if IAM policies get too large (and attackers know it)📌PumaBot is now brute-forcing IoT cameras and stealing SSH credsWe’ve also got sharp engineering writeups: from how Uber rotates 20K secrets a month, to how Netflix handles 140 million hours of viewing data daily, to one team’s story of slicing a $10K Glue bill down to $400 with Airflow.Hope you find something in here that saves you time, money, or migraines.Cheers,Shreyans SinghEditor-in-Chief🔐 Cloud SecurityAWS CloudTrail logging can be bypassed using oversized IAM policiesResearchers at Permiso Security found that AWS CloudTrail fails to log IAM policies between 102,401 and 131,072 characters if they're inflated using whitespace. This gap allows attackers to hide malicious changes from audit logs. The issue stems from undocumented size limits and inconsistent handling of policy data. AWS has acknowledged the problem and plans a fix in Q3 2025.PumaBot targets Linux-based IoT surveillance devices via SSH brute forceA new botnet called PumaBot is targeting IoT surveillance systems by brute-forcing SSH access using IP lists from its command-and-control server. Written in Go, the malware disguises itself as system files, adds persistence through systemd, and installs custom PAM modules to steal credentials. Related binaries in the campaign also auto-update, spread across Linux systems, and exfiltrate login data.How to Block Up to 95% of Attacks Using AWS WAFThis guide explains how to configure AWS Web Application Firewall (WAF) to block threats like SQL injection, XSS, bots, and DDoS attacks with minimal effort. By leveraging pre-built managed rules and setting up a Web ACL, users can protect apps behind ALB, CloudFront, or API Gateway without custom code.CloudPEASS: Toolkit to find and exploit cloud permissions across AWS, Azure, and GCPCloudPEASS helps red teamers and defenders map out permissions in compromised cloud accounts without modifying resources. It supports AWS, Azure, and GCP, detecting privilege escalation paths using API access, brute-force permission testing, and AI-assisted analysis. It also checks Microsoft 365 services in Azure and enables Gmail/Drive token access in GCP.Uber built a multi-cloud secrets platform to prevent leaks and automate security at scaleTo manage over 150,000 secrets across services and vendors, Uber developed a centralized secrets management platform. It blocks leaks in code with Git hooks, scans systems in real time, and consolidates 25 vaults into 6. The platform enables auto-rotation, access tracking, and third-party secret exchange via SSX. It now rotates ~20,000 secrets monthly and is evolving toward secretless auth and workload identity federation.BOOK NOW AT 25% OFF⚙️ Infrastructure & DevOpsAWS Cost Explorer now offers a new Cost Comparison featureAWS launched a new Cost Comparison feature in Cost Explorer that highlights key changes in cloud spend between two months. It automatically identifies top cost drivers, like usage shifts, discounts, or refunds, without needing manual spreadsheets. A new “Top Trends” widget shows the biggest changes at a glance, and deeper insights are now available through the Compare view.Go-based Git Add Interactive tool adds advanced staging and patch filteringThis Go port of git add -i/-p enhances Git’s interactive staging with features like global regex filters, auto-hunk splitting, and multi-mode patch operations (stage, reset, checkout). It supports keyboard shortcuts, color-coded UI, and fine-grained hunk control across all files.GitLab-based monorepo streamlines Terraform module versioning and securityThis setup uses a GitLab CI pipeline to manage Terraform modules in a monorepo, with automated versioning, linting, and security scans via tools like TFLint, tfsec, and Checkov. Git tags handle module versions without extra auth tokens. The workflow enforces changelogs, labels, and approvals, and publishes docs and tags post-merge.A fully automated fix for Terraform’s backend bootstrapping problem on AzureThis guide solves the common issue where Terraform needs a backend to store state, but can’t create it without an existing backend. It automates the creation of an Azure Blob backend using Terraform itself, then seamlessly switches to that backend by generating partial config files and migrating state. The setup includes secure access via managed identity and GitHub OIDC, enabling CI/CD workflows without manual secrets or scripts.Using Terraform to automate disaster recovery infrastructure and failoversThis post explains DR strategies like Pilot Light and Active/Passive, and shows how Terraform enables flexible, cost-efficient deployments using conditionals and modular IaC. A working AWS example demonstrates DNS failover and dynamic EC2 provisioning using a toggle variable. This lets teams switch between production and DR environments with minimal effort, reducing downtime and idle resource costs.📦 Kubernetes & Cloud NativeGateway API v1.3.0 Adds Smart Mirroring and New Experimental ControlsGateway API v1.3.0 is now GA with percentage-based request mirroring, letting teams test blue-green deployments without full traffic duplication. The release also debuts experimental support for CORS filters, retry budgets, and listener merging via new X-prefixed APIs. These features help fine-tune request handling, scale listener configs across namespaces, and manage retry spikes, without upgrading Kubernetes itself.Introducing Gateway API Inference ExtensionThe new Gateway API Inference Extension introduces model-aware routing for GenAI and LLM services running on Kubernetes. It adds InferenceModel and InferencePool resources to better match requests with the right GPU-backed model server based on real-time load. Early benchmarks show reduced latency under heavy traffic compared to standard Services, helping ops teams optimize resource usage and avoid contention.Deep Dive into VPA 1.3.0: Smarter Resource Tuning for Kubernetes PodsThis post explores how the Vertical Pod Autoscaler (VPA) v1.3.0 uses historical and real-time metrics to recommend CPU and memory resource requests. It focuses on the Recommender component, which aggregates usage into decaying histograms to auto-tune workloads and reduce resource waste.Default Helm Charts Leave Kubernetes Clusters at RiskMicrosoft researchers warn that many open-source Helm charts deploy with insecure defaults, exposing services like Apache Pinot, Meshery, and Selenium Grid to the internet without proper authentication. These misconfigurations often include LoadBalancers or NodePorts with no access controls, making them easy targets for attackers. Teams should avoid "plug-and-play" setups and review YAML/Helm configs before deploying to production.Batch Scheduling in Kubernetes: YuniKorn vs Volcano vs KueueKubernetes lacks native support for batch workloads like ML training and ETL jobs, prompting the rise of tools like Apache YuniKorn, Volcano, and Kueue. YuniKorn replaces the default scheduler with strong multi-tenancy support; Volcano focuses on high-performance use cases with gang scheduling; and Kueue integrates natively to manage job queues without altering core scheduling.🔍 Observability & SREWhat's new in Grafana v12.0Grafana v12.0 introduces Git-based dashboard versioning, dynamic layouts, and experimental APIs for managing observability as code. Drilldowns for metrics, logs, and traces are now GA, enabling queryless deep dives across signals. SCIM support simplifies team provisioning, and a new “Recovering” alert state reduces flapping.Sentry Launches Logs in Open Beta to Boost Debugging ContextSentry now supports direct log ingestion in open beta, letting developers view application logs alongside errors and traces in a single interface. This integration adds vital context, like retry attempts or upstream responses, to help identify root causes faster without switching tools.How to use Prometheus to efficiently detect anomalies at scaleGrafana Labs has built and open-sourced an anomaly detection system using only PromQL: no external tools or services required. It computes dynamic bands using rolling averages, standard deviation, and seasonal patterns, with tunable sensitivity and smoothing to reduce false positives. The framework scales across tenants and works with any Prometheus-compatible backend, making it easy to plug into SLO-based alerts for better incident context.Beyond API uptime: Modern metrics that matterTraditional uptime checks fall short in today’s fast-paced environments where even minor API delays can cause major user churn. Catchpoint’s Internet Performance Monitoring (IPM) combines global synthetic tests, percentile-based metrics, and user-centric objectives to detect slowdowns before they escalate. With features like API-as-code, chaos engineering, and CI/CD integration, IPM helps teams catch latency issues early and simulate real-world failures.Microservices Monitoring: Metrics, Challenges, and Tools That MatterMonitoring microservices requires more than just uptime: it demands insight into latency, throughput, error rates, resource use, and inter-service communication. Tools like Middleware, Prometheus-Grafana, and Dynatrace help track these metrics at scale, support alerting, and simplify root cause analysis. Best practices include centralized logging, distributed tracing, automation, and continuous optimization to maintain performance in complex distributed systems.Forward to a Friend📢 If your company is interested in reaching an audience of developers and, technical professionals, and decision makers, you may want toadvertise with us.If you have any comments or feedback, just reply back to this email.Thanks for reading and have a great day!*{box-sizing:border-box}body{margin:0;padding:0}a[x-apple-data-detectors]{color:inherit!important;text-decoration:inherit!important}#MessageViewBody a{color:inherit;text-decoration:none}p{line-height:inherit}.desktop_hide,.desktop_hide table{mso-hide:all;display:none;max-height:0;overflow:hidden}.image_block img+div{display:none}sub,sup{font-size:75%;line-height:0} @media (max-width: 100%;display:block}.mobile_hide{min-height:0;max-height:0;max-width: 100%;overflow:hidden;font-size:0}.desktop_hide,.desktop_hide table{display:table!important;max-height:none!important}}

0
0

CloudPro

Migrating Uber’s Compute Platform to Kubernetes

Which call paths dominate at runtime: using Flame Graphs to visualize it!

How to Make Sure Your Kubernetes Sidecar Starts Before the Main App

A blueprint for cyber resilience...

Uber built a multi-cloud secrets platform to prevent leaks and automate security at scale

Create a Free Account To Continue Reading

Sign in to activate your 7-day free access