Network automation keeps failing because your data is a mess
Network teams keep kicking off "source of truth" projects to consolidate scattered data, but EMA found these are "long and painful endeavors." The blockers: execs don't see why you need $60k for a database when the apps are running fine, network data lives in spreadsheets and assorted IPAMs with every team doing its own thing, and even after you build it, engineers keep making ad hoc CLI changes that drift everything out of sync. The fixes are obvious but hard: get exec buy-in, use discovery tools, integrate with everything, and lock down CLI access until people actually trust the database more than their spreadsheets.
Kubernetes 1.35 adds structured debugging endpoints
K8s 1.35 enhances z-pages debugging endpoints like /statusz and /flagz with structured JSON responses instead of just plain text. Now you can programmatically query component state for automated health checks and better debugging tools without parsing text output. Still alpha and requires feature gates, but if you're building internal tooling or want to automate component validation, worth experimenting with in test environments.
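What "programmatically query" looks like in practice is just parsing JSON instead of scraping text. A minimal sketch of consuming a structured /statusz response, assuming a hypothetical response shape (the field names below are invented for illustration; since the feature is alpha and behind feature gates, check the actual schema on your cluster):

```python
# Sketch: turning a structured /statusz response into a health summary.
# The JSON field names ("componentName", "status", "uptimeSeconds") are
# assumptions for illustration, not the confirmed 1.35 schema.
import json

def parse_statusz(payload: str) -> dict:
    """Extract a simple health summary from a structured /statusz body."""
    data = json.loads(payload)
    return {
        "component": data.get("componentName"),
        "healthy": data.get("status") == "ok",
        "uptime_seconds": data.get("uptimeSeconds", 0),
    }

# Hypothetical payload a component might return:
sample = '{"componentName": "kube-apiserver", "status": "ok", "uptimeSeconds": 5400}'
summary = parse_statusz(sample)
print(summary)
```

The point is the contract: once the endpoint guarantees structure, a health check is a dict lookup, not a regex over free-form text.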
Google wants gRPC as an official MCP transport
Model Context Protocol uses JSON-RPC, so enterprises running gRPC-based services need transcoding gateways to bridge the two.
So Google's working with the MCP community to support gRPC as a pluggable transport directly in the SDK. gRPC gives you binary encoding (10x smaller messages), full duplex streaming, built-in flow control, mTLS, and method-level authorization.
MCP maintainers agreed to support pluggable transports and Google will contribute a gRPC package soon.
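The "binary encoding means smaller messages" claim is easy to see for yourself. This is not the MCP or gRPC wire format, just an illustrative comparison of the same call encoded as JSON-RPC text versus fixed-width binary fields:

```python
# Illustration of text vs. binary encoding overhead. Not the real MCP/gRPC
# framing -- just the same logical call in both styles.
import json
import struct

# A JSON-RPC style call: request id 42, one float parameter.
json_msg = json.dumps({
    "jsonrpc": "2.0",
    "id": 42,
    "method": "tools/call",
    "params": {"temperature": 0.7},
}).encode()

# The same information as packed binary fields:
# uint16 method id, uint32 request id, float64 parameter.
binary_msg = struct.pack("!HId", 7, 42, 0.7)

print(len(json_msg), len(binary_msg))  # binary is several times smaller
```

Protobuf adds field tags and varint encoding on top of this idea, but the win comes from the same place: no key names, no quoting, no whitespace on the wire.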
Grafana built AI agents that investigate incidents for you
Grafana's Assistant Investigations deploys specialized AI agents in parallel during incidents. They analyze metrics, logs, traces, and profiles simultaneously to build a comprehensive picture in 13 minutes instead of the 2-4 hours a human investigation takes. Real example: for a payment service latency issue, the agents detected connection pool exhaustion and traced it to a recent deployment in minutes, with zero PromQL knowledge needed.
A conservative estimate: 50 hours/month of senior engineering time reclaimed, roughly $90k/year in recovered expertise. Free during public preview, and worth a three-week trial to prove ROI.
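The fan-out pattern behind running analyzers in parallel is simple to sketch. This is a conceptual illustration, not Grafana's implementation; the analyzer names and findings below are invented:

```python
# Conceptual sketch: one analyzer per signal type runs concurrently,
# and findings are merged into one report. Real analyzers would query
# Prometheus/Loki/Tempo/Pyroscope; these findings are canned examples.
from concurrent.futures import ThreadPoolExecutor

def analyze(signal: str) -> dict:
    findings = {
        "metrics": "p99 latency up 4x on payment-service",
        "logs": "connection-pool-exhausted errors since 14:02",
        "traces": "slow spans concentrated in db.acquire",
        "profiles": "no CPU anomaly",
    }
    return {"signal": signal, "finding": findings[signal]}

signals = ["metrics", "logs", "traces", "profiles"]
with ThreadPoolExecutor(max_workers=len(signals)) as pool:
    report = list(pool.map(analyze, signals))

for entry in report:
    print(f"{entry['signal']}: {entry['finding']}")
```

The time savings come from exactly this shape: four investigations that a human would do serially happen in one wall-clock pass.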
Self-healing infrastructure is here and it's not about replacing SREs
Autonomous healing infrastructure is running in production serving millions of users, and the difference from past attempts is reasoning capability: systems can finally understand context and make decisions that used to need human judgment. Most orgs are stuck at Level 2 (automated detection, human fixes), but Level 5 (predictive prevention) has been deployed for specific failure classes. Real results: memory leak auto-remediation in 7 minutes vs 35 minutes with humans, a 73% autonomous resolution rate, and an 81% reduction in after-hours pages.
The architecture needs four pieces: a decision engine, a safety sandbox, an action library, and a learning loop. The future of infrastructure is autonomous; the question is whether you can afford not to build it.
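A minimal sketch of how those four pieces fit together, with all names, actions, and thresholds invented for illustration:

```python
# Toy self-healing loop: a decision engine picks a remediation from an
# action library, a safety sandbox gates it, and a learning loop records
# the outcome. Every identifier here is hypothetical.

ACTION_LIBRARY = {
    "memory_leak": "restart_pod",
    "disk_full": "rotate_logs",
}

history: list[dict] = []  # learning loop: outcomes feed future decisions

def safe_to_act(action: str, blast_radius: int) -> bool:
    # Safety sandbox: refuse anything that touches too much at once
    # or isn't a known, vetted action.
    return blast_radius <= 1 and action in ACTION_LIBRARY.values()

def decide_and_heal(incident: dict) -> str:
    action = ACTION_LIBRARY.get(incident["class"])
    if action is None:
        return "escalate_to_human"   # unknown failure class
    if not safe_to_act(action, incident["blast_radius"]):
        return "escalate_to_human"   # sandbox veto
    history.append({"incident": incident["class"], "action": action})
    return action

print(decide_and_heal({"class": "memory_leak", "blast_radius": 1}))
print(decide_and_heal({"class": "network_partition", "blast_radius": 5}))
```

Note what the sandbox buys you: the system only acts autonomously inside a vetted envelope, and everything else still pages a human. That's why this is augmentation, not SRE replacement.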