Controlling Kubernetes Network Traffic
Ingress NGINX is retiring, and it got me thinking about how convoluted network traffic control has become in Kubernetes. You've got your CNI for connectivity, network policies for security, ingress controllers or Gateway API for north-south routing, maybe a service mesh for east-west traffic, and honestly most apps don't need all of this. The real decision most people face is simpler: ingress controller vs Gateway API.
Here's the thing: if you just need basic HTTP/HTTPS routing and you're already comfortable with nginx or Traefik, stick with ingress controllers. They work, they're stable, and the tooling is mature. Gateway API makes sense if you need advanced stuff like protocol-agnostic routing or cross-namespace setups, or if you're running multi-team environments where role separation matters. All three major clouds (AWS ALB Controller, Azure AGIC, GKE Ingress) now have solid managed options for both approaches. Gateway API is clearly the future, but "future-proof" doesn't mean you need to migrate today.
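To ground the "basic routing" case, here's a minimal sketch using the official kubernetes Python client (keeping with this issue's Python bent) to create a single-rule HTTP Ingress. The host, service name, and ingress class are placeholders, not anything from the article.

```python
from kubernetes import client, config


def create_basic_ingress() -> None:
    # Load credentials from ~/.kube/config (use config.load_incluster_config() in a pod).
    config.load_kube_config()

    # One rule: send app.example.com/* to the web-svc Service on port 80.
    # Host, service name, and ingress class below are placeholders.
    ingress = client.V1Ingress(
        metadata=client.V1ObjectMeta(name="web"),
        spec=client.V1IngressSpec(
            ingress_class_name="nginx",
            rules=[
                client.V1IngressRule(
                    host="app.example.com",
                    http=client.V1HTTPIngressRuleValue(
                        paths=[
                            client.V1HTTPIngressPath(
                                path="/",
                                path_type="Prefix",
                                backend=client.V1IngressBackend(
                                    service=client.V1IngressServiceBackend(
                                        name="web-svc",
                                        port=client.V1ServiceBackendPort(number=80),
                                    )
                                ),
                            )
                        ]
                    ),
                )
            ],
        ),
    )
    client.NetworkingV1Api().create_namespaced_ingress(namespace="default", body=ingress)


if __name__ == "__main__":
    create_basic_ingress()
```

The Gateway API equivalent, an HTTPRoute, is a custom resource, so with this same client you'd go through CustomObjectsApi rather than a typed model; that extra ceremony is part of why "stay on Ingress until you actually need more" holds up.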
Network jobs roundup: AI certs pay, skills gap persists, mixed employment signals
The network jobs market is weird right now. AI certifications are commanding 12% higher pay year-over-year while overall IT skills premiums dropped 0.7%. CompTIA just launched AI Infrastructure and AITECH certs, and Cisco added wireless-only tracks (CCNP/CCIE Wireless launching March 2026). Meanwhile, unemployment for tech workers sits at 2.5-3% depending on who's counting, but large enterprises keep announcing layoffs while small and midsize companies are actually hiring.
The skills gap is real, though: 68% of orgs say they're understaffed in AI/ML ops and 65% in cybersecurity. Telecom lost 59% of positions to automation, and survey data shows 18-22% of the IT workforce could be eliminated by AI in the next five years. But demand for AI/ML, cloud architecture, and security skills keeps growing. The takeaway: upskill in AI and automation or get left behind, especially if you're in support, help desk, or legacy infrastructure roles.
Three Lessons from the Recent AWS and Cloudflare Outages
AWS US-EAST-1 went down for 15 hours in October (a DNS race condition in DynamoDB), and Cloudflare ate it in November (an oversized Bot Management config file crashed proxies globally). Both followed the same pattern: a small defect in one subsystem cascaded everywhere. The lessons are obvious but worth repeating: design out single points of failure with multi-region/multi-cloud by default; use AI-powered monitoring to correlate signals and automate rollback (monitoring without automated response is just expensive alerting); and actually practice your DR plan regularly, because you fall to the level of your practice, not rise to your runbook.
The deeper point: complexity keeps growing with every new region and service, multiplying the ways a small change can blow up globally. The answer is designing for failure: limit blast radius, decouple control and data planes, automate validation. No provider is immune, so your architecture needs to assume failures will happen and route around them automatically.
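To make "route around failures automatically" concrete, here's a minimal client-side sketch in Python: probe the primary region's health endpoint and fall back to a secondary when it stops answering. The endpoints and timeout are invented for illustration; in practice this logic usually lives in health-checked DNS records or a global load balancer rather than in application code.

```python
import urllib.error
import urllib.request

# Placeholder endpoints for illustration: per-region deployments of the same service.
REGIONS = {
    "primary": "https://us-east-1.api.example.com",
    "secondary": "https://eu-west-1.api.example.com",
}


def healthy(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if the region answers its health check within the timeout."""
    try:
        with urllib.request.urlopen(f"{base_url}/healthz", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, TimeoutError):
        return False


def pick_region() -> str:
    """Prefer the primary region, but fail over automatically instead of paging a human."""
    for name, base_url in REGIONS.items():
        if healthy(base_url):
            print(f"routing traffic to {name}")
            return base_url
    # Every region failed its check: fail loudly rather than guessing.
    raise RuntimeError("no healthy region available")


if __name__ == "__main__":
    print(pick_region())
```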
Test your DR plan with chaos engineering, not hope (Google SRE Practice Lead)
Google's SRE team wrote a piece on why your disaster recovery plan probably doesn't work and how chaos engineering can prove it. The premise: systems change constantly (microservices, config updates, API dependencies), so the DR doc you wrote last quarter is already outdated. Chaos engineering lets you run controlled experiments (simulated database failovers, regional outages, resource exhaustion) and measure whether you actually meet your SLOs during the disaster.
It's not about breaking things randomly. You define a steady state, form a hypothesis (like "traffic will fail over to the secondary region within 3 minutes with <1% errors"), inject a specific failure, and measure what happens. The key insight is connecting chaos to SLOs: traditional DR drills might "pass" because the backup systems came online, but if failover took 20 minutes and burned your entire error budget, customers saw you as down. Start small with one timeout or retry test, build confidence, and scale from there.
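Here's a rough sketch of that loop in Python. The fault injection and metrics queries are stubs (a real experiment would drive a chaos tool and your monitoring system), and the thresholds are just the example hypothesis from above; the point is the shape: baseline, hypothesis, inject, measure, compare.

```python
import random
import time
from dataclasses import dataclass


@dataclass
class ExperimentResult:
    failover_seconds: float
    error_rate: float


def measure_error_rate() -> float:
    """Stub: in a real experiment, query your monitoring system for the error rate."""
    return random.uniform(0.0, 0.02)


def inject_failure() -> None:
    """Stub: a real experiment would fail over a database, drop a zone, etc."""
    print("injecting failure: primary database unavailable")


def wait_for_recovery() -> float:
    """Stub: poll health checks until traffic flows again; return seconds elapsed."""
    start = time.monotonic()
    time.sleep(1)  # placeholder for the actual recovery wait
    return time.monotonic() - start


def run_experiment() -> ExperimentResult:
    baseline = measure_error_rate()
    print(f"steady state error rate: {baseline:.3%}")

    inject_failure()
    failover_seconds = wait_for_recovery()
    return ExperimentResult(failover_seconds, measure_error_rate())


if __name__ == "__main__":
    # Hypothesis from the example above: failover within 3 minutes, errors under 1%.
    SLO_MAX_FAILOVER_SECONDS = 180.0
    SLO_MAX_ERROR_RATE = 0.01

    result = run_experiment()
    held = (result.failover_seconds <= SLO_MAX_FAILOVER_SECONDS
            and result.error_rate <= SLO_MAX_ERROR_RATE)
    print(f"failover took {result.failover_seconds:.1f}s, "
          f"error rate {result.error_rate:.3%}, "
          f"hypothesis {'held' if held else 'failed'}")
```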
Stelvio: AWS for Python devs
Stelvio is a Python framework that lets you define AWS infrastructure in pure Python, with smart defaults handling the annoying bits. Run stlv init, write your infra in Python (DynamoDB tables, Lambda functions, API Gateway routes), hit stlv deploy, and you're done. No Terraform, no CDK YAML hell, no mixing infrastructure code with application code.
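For flavor, here's a toy model of the "infrastructure as plain Python objects with smart defaults" pattern Stelvio follows. To be clear, this is not Stelvio's actual API: the class names and the deploy() stub below are invented for illustration, so check the project's docs for the real imports.

```python
# Toy illustration of the pattern, NOT Stelvio's real API: resources are plain
# Python objects with sensible defaults, and one call turns them into a deploy plan.
from dataclasses import dataclass, field


@dataclass
class DynamoTable:
    name: str
    partition_key: str = "id"      # a "smart default"


@dataclass
class LambdaFunction:
    handler: str
    memory_mb: int = 128           # another default you rarely touch


@dataclass
class ApiRoute:
    method: str
    path: str
    function: LambdaFunction


@dataclass
class App:
    name: str
    resources: list = field(default_factory=list)

    def deploy(self) -> None:
        # Stelvio's stlv deploy does real provisioning; this stub just prints the plan.
        for resource in self.resources:
            print(f"[{self.name}] would create {type(resource).__name__}: {resource}")


if __name__ == "__main__":
    app = App("todos")
    handler = LambdaFunction(handler="functions/todos.handler")
    app.resources += [DynamoTable(name="todos"), handler, ApiRoute("GET", "/todos", handler)]
    app.deploy()
```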