CloudPro #95

A better way to handle vendor security reviews?

If you've ever dealt with vendor onboarding or third-party cloud audits, you know how painful it can be: long email chains, stale spreadsheets, and questionnaires that don’t reflect what’s actually happening in the cloud.

We recently came across CloudVRM, and it’s a refreshingly modern take on the problem.

Instead of asking vendors to fill out forms or send evidence, CloudVRM connects directly to their AWS, Azure, or GCP environments. It pulls telemetry every 24 hours, flags misconfigurations, and maps everything to compliance frameworks like SOC 2, ISO 27001, and DORA.

It’s already being used by banks and infra-heavy orgs to speed up vendor approvals by 85% and reduce audit overhead by 90%.

Worth checking out if you're building or maintaining systems in regulated environments, or just tired of spreadsheet security.

Watch the demo

This week’s CloudPro kicks off with something genuinely useful: a tool that replaces vendor security questionnaires with real-time cloud evidence.

📌CloudVRM connects directly to AWS, Azure, or GCP and auto-checks compliance: no spreadsheets, no guesswork

📌AWS CloudTrail silently skips logging when IAM policies get too large (and attackers know it)

📌PumaBot is now brute-forcing IoT cameras and stealing SSH creds

We’ve also got sharp engineering writeups: from how Uber rotates 20K secrets a month, to how Netflix handles 140 million hours of viewing data daily, to one team’s story of slicing a $10K Glue bill down to $400 with Airflow.

Hope you find something in here that saves you time, money, or migraines.

Cheers,

Shreyans Singh

Editor-in-Chief

🔐 Cloud Security

AWS CloudTrail logging can be bypassed using oversized IAM policies

Researchers at Permiso Security found that AWS CloudTrail fails to log IAM policies between 102,401 and 131,072 characters if they're inflated using whitespace. This gap allows attackers to hide malicious changes from audit logs. The issue stems from undocumented size limits and inconsistent handling of policy data. AWS has acknowledged the problem and plans a fix in Q3 2025.
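
Until a fix lands, defenders could flag policies whose size falls in the reported range. The following is a minimal sketch, not Permiso's tooling or an AWS API: the two bounds come from the summary above (treated here as inclusive, which is an assumption), while the function names and the whitespace threshold are hypothetical.

```python
# Bounds are the character counts reported in the summary above.
# Treating them as inclusive is an assumption.
UNLOGGED_MIN_CHARS = 102_401
UNLOGGED_MAX_CHARS = 131_072


def in_reported_gap(policy_document: str) -> bool:
    """Return True if the raw policy length falls in the reported range."""
    return UNLOGGED_MIN_CHARS <= len(policy_document) <= UNLOGGED_MAX_CHARS


def whitespace_ratio(policy_document: str) -> float:
    """Fraction of characters that are whitespace; padded policies score high."""
    if not policy_document:
        return 0.0
    return sum(ch.isspace() for ch in policy_document) / len(policy_document)


def is_suspicious(policy_document: str, ratio_threshold: float = 0.5) -> bool:
    """Flag policies that sit in the gap and are mostly padding.

    The 0.5 threshold is an illustrative default, not a researched value.
    """
    return (
        in_reported_gap(policy_document)
        and whitespace_ratio(policy_document) >= ratio_threshold
    )
```

A check like this would run against policy documents fetched directly from IAM, since the point of the finding is that the audit log itself may be missing them.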

PumaBot targets Linux-based IoT surveillance devices via SSH brute force

A new botnet called PumaBot is targeting IoT surveillance systems by brute-forcing SSH access using IP lists from its command-and-control server. Written in Go, the malware disguises itself as system files, adds persistence through systemd, and installs custom PAM modules to steal credentials. Related binaries in the campaign also auto-update, spread across Linux systems, and exfiltrate login data.

How to Block Up to 95% of Attacks Using AWS WAF

This guide explains how to configure AWS Web Application Firewall (WAF) to block threats like SQL injection, XSS, bots, and DDoS attacks with minimal effort. By leveraging pre-built managed rules and setting up a Web ACL, users can protect apps behind ALB, CloudFront, or API Gateway without custom code.

CloudPEASS: Toolkit to find and exploit cloud permissions across AWS, Azure, and GCP

CloudPEASS helps red teamers and defenders map out permissions in compromised cloud accounts without modifying resources. It supports AWS, Azure, and GCP, detecting privilege escalation paths using API access, brute-force permission testing, and AI-assisted analysis. It also checks Microsoft 365 services in Azure and enables Gmail/Drive token access in GCP.

Uber built a multi-cloud secrets platform to prevent leaks and automate security at scale

To manage over 150,000 secrets across services and vendors, Uber developed a centralized secrets management platform. It blocks leaks in code with Git hooks, scans systems in real time, and consolidates 25 vaults into 6. The platform enables auto-rotation, access tracking, and third-party secret exchange via SSX. It now rotates ~20,000 secrets monthly and is evolving toward secretless auth and workload identity federation.
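
To make the Git-hook idea concrete, here is a minimal sketch of scanning a commit's added lines for secret-like strings. It is not Uber's implementation: the two patterns (the common AWS access key ID prefix format and a PEM private key header) and the `find_secrets` helper are illustrative assumptions, and a real scanner would use a much larger rule set.

```python
import re

# Illustrative rules only; production scanners carry many more patterns.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(
        r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"
    ),
}


def find_secrets(diff_text: str) -> list[tuple[int, str]]:
    """Return (line number within the diff, rule name) for each match."""
    findings = []
    for line_no, line in enumerate(diff_text.splitlines(), start=1):
        # Only inspect lines the commit adds, skipping the "+++" file header.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((line_no, name))
    return findings
```

In a pre-commit hook, the input would typically be the output of `git diff --cached`, and the hook would exit non-zero whenever `find_secrets` returns anything.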

⚙️ Infrastructure & DevOps

AWS Cost Explorer now offers a new Cost Comparison feature

AWS launched a new Cost Comparison feature in Cost Explorer that highlights key changes in cloud spend between two months. It automatically identifies top cost drivers, like usage shifts, discounts, or refunds, without needing manual spreadsheets. A new “Top Trends” widget shows the biggest changes at a glance, and deeper insights are now available through the Compare view.

Go-based Git Add Interactive tool adds advanced staging and patch filtering

This Go port of git add -i/-p enhances Git’s interactive staging with features like global regex filters, auto-hunk splitting, and multi-mode patch operations (stage, reset, checkout). It supports keyboard shortcuts, color-coded UI, and fine-grained hunk control across all files.

GitLab-based monorepo streamlines Terraform module versioning and security

This setup uses a GitLab CI pipeline to manage Terraform modules in a monorepo, with automated versioning, linting, and security scans via tools like TFLint, tfsec, and Checkov. Git tags handle module versions without extra auth tokens. The workflow enforces changelogs, labels, and approvals, and publishes docs and tags post-merge.

A fully automated fix for Terraform’s backend bootstrapping problem on Azure

This guide solves the common issue where Terraform needs a backend to store state, but can’t create it without an existing backend. It automates the creation of an Azure Blob backend using Terraform itself, then seamlessly switches to that backend by generating partial config files and migrating state. The setup includes secure access via managed identity and GitHub OIDC, enabling CI/CD workflows without manual secrets or scripts.

Using Terraform to automate disaster recovery infrastructure and failovers

This post explains DR strategies like Pilot Light and Active/Passive, and shows how Terraform enables flexible, cost-efficient deployments using conditionals and modular IaC. A working AWS example demonstrates DNS failover and dynamic EC2 provisioning using a toggle variable. This lets teams switch between production and DR environments with minimal effort, reducing downtime and idle resource costs.

📦 Kubernetes & Cloud Native

Gateway API v1.3.0 Adds Smart Mirroring and New Experimental Controls

Gateway API v1.3.0 is now GA with percentage-based request mirroring, letting teams test blue-green deployments without full traffic duplication. The release also debuts experimental support for CORS filters, retry budgets, and listener merging via new X-prefixed APIs. These features help fine-tune request handling, scale listener configs across namespaces, and manage retry spikes, without upgrading Kubernetes itself.

Introducing Gateway API Inference Extension

The new Gateway API Inference Extension introduces model-aware routing for GenAI and LLM services running on Kubernetes. It adds InferenceModel and InferencePool resources to better match requests with the right GPU-backed model server based on real-time load. Early benchmarks show reduced latency under heavy traffic compared to standard Services, helping ops teams optimize resource usage and avoid contention.

Deep Dive into VPA 1.3.0: Smarter Resource Tuning for Kubernetes Pods

This post explores how the Vertical Pod Autoscaler (VPA) v1.3.0 uses historical and real-time metrics to recommend CPU and memory resource requests. It focuses on the Recommender component, which aggregates usage into decaying histograms to auto-tune workloads and reduce resource waste.
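
The decaying-histogram idea can be sketched in a few lines. This is a simplified illustration, not VPA's actual code: the class name, fixed-width buckets, and the half-life parameter are assumptions chosen for readability. Each sample's weight grows with its timestamp, which is equivalent to letting older samples fade.

```python
class DecayingHistogram:
    """Weighted histogram in which newer samples count more than older ones."""

    def __init__(self, bucket_width: float, half_life_s: float):
        self.bucket_width = bucket_width
        self.half_life_s = half_life_s
        self.weights: dict[int, float] = {}

    def add(self, value: float, timestamp_s: float) -> None:
        # A sample one half-life newer than another gets twice the weight.
        weight = 2.0 ** (timestamp_s / self.half_life_s)
        bucket = int(value // self.bucket_width)
        self.weights[bucket] = self.weights.get(bucket, 0.0) + weight

    def percentile(self, q: float) -> float:
        """Upper bound of the bucket where cumulative weight reaches q (0..1)."""
        total = sum(self.weights.values())
        if total == 0:
            return 0.0
        threshold = q * total
        running = 0.0
        for bucket in sorted(self.weights):
            running += self.weights[bucket]
            if running >= threshold:
                return (bucket + 1) * self.bucket_width
        return (max(self.weights) + 1) * self.bucket_width
```

With a one-hour half-life, a sample from two hours ago carries a quarter of the weight of a current one, so a recommendation read from a high percentile drifts toward recent usage without discarding past peaks outright.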

Default Helm Charts Leave Kubernetes Clusters at Risk

Microsoft researchers warn that many open-source Helm charts deploy with insecure defaults, exposing services like Apache Pinot, Meshery, and Selenium Grid to the internet without proper authentication. These misconfigurations often include LoadBalancers or NodePorts with no access controls, making them easy targets for attackers. Teams should avoid "plug-and-play" setups and review YAML/Helm configs before deploying to production.

Batch Scheduling in Kubernetes: YuniKorn vs Volcano vs Kueue

Kubernetes lacks native support for batch workloads like ML training and ETL jobs, prompting the rise of tools like Apache YuniKorn, Volcano, and Kueue. YuniKorn replaces the default scheduler with strong multi-tenancy support; Volcano focuses on high-performance use cases with gang scheduling; and Kueue integrates natively to manage job queues without altering core scheduling.

🔍 Observability & SRE

What's new in Grafana v12.0

Grafana v12.0 introduces Git-based dashboard versioning, dynamic layouts, and experimental APIs for managing observability as code. Drilldowns for metrics, logs, and traces are now GA, enabling queryless deep dives across signals. SCIM support simplifies team provisioning, and a new “Recovering” alert state reduces flapping.

Sentry Launches Logs in Open Beta to Boost Debugging Context

Sentry now supports direct log ingestion in open beta, letting developers view application logs alongside errors and traces in a single interface. This integration adds vital context, like retry attempts or upstream responses, to help identify root causes faster without switching tools.

How to use Prometheus to efficiently detect anomalies at scale

Grafana Labs has built and open-sourced an anomaly detection system using only PromQL: no external tools or services required. It computes dynamic bands using rolling averages, standard deviation, and seasonal patterns, with tunable sensitivity and smoothing to reduce false positives. The framework scales across tenants and works with any Prometheus-compatible backend, making it easy to plug into SLO-based alerts for better incident context.
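
The core band calculation is simple to show outside PromQL. The sketch below is plain Python, not the Grafana Labs framework, and it leaves out the seasonal and smoothing parts; the function names and the default sensitivity of 2.0 are illustrative assumptions.

```python
from statistics import mean, pstdev


def anomaly_band(history: list[float], sensitivity: float = 2.0) -> tuple[float, float]:
    """Return (lower, upper) as mean +/- sensitivity * standard deviation."""
    center = mean(history)
    spread = pstdev(history)
    return center - sensitivity * spread, center + sensitivity * spread


def is_anomalous(history: list[float], current: float, sensitivity: float = 2.0) -> bool:
    """True when the current value falls outside the band built from history."""
    lower, upper = anomaly_band(history, sensitivity)
    return current < lower or current > upper
```

Raising `sensitivity` widens the band and trades missed anomalies for fewer false positives, which is the same tuning knob the summary describes.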

Beyond API uptime: Modern metrics that matter

Traditional uptime checks fall short in today’s fast-paced environments where even minor API delays can cause major user churn. Catchpoint’s Internet Performance Monitoring (IPM) combines global synthetic tests, percentile-based metrics, and user-centric objectives to detect slowdowns before they escalate. With features like API-as-code, chaos engineering, and CI/CD integration, IPM helps teams catch latency issues early and simulate real-world failures.
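
A small example shows why percentile-based metrics catch what averages hide. This is a generic nearest-rank percentile sketch with made-up latency numbers, not Catchpoint's API or data.

```python
import math


def percentile(samples: list[float], q: int) -> float:
    """Nearest-rank percentile: smallest sample with at least q% at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: most requests are fast, a few are very slow.
latencies_ms = [100] * 95 + [2000] * 5
average_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)
```

Here the average and the median both look healthy, while the 99th percentile exposes the slow tail that a subset of users actually experiences.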

Microservices Monitoring: Metrics, Challenges, and Tools That Matter

Monitoring microservices requires more than just uptime: it demands insight into latency, throughput, error rates, resource use, and inter-service communication. Tools like Middleware, Prometheus-Grafana, and Dynatrace help track these metrics at scale, support alerting, and simplify root cause analysis. Best practices include centralized logging, distributed tracing, automation, and continuous optimization to maintain performance in complex distributed systems.

📢 If your company is interested in reaching an audience of developers, technical professionals, and decision makers, you may want to advertise with us.

If you have any comments or feedback, just reply to this email.

Thanks for reading and have a great day!