LLM Expert Insights, Packt
30 Jun 2026
10 min read
Why SUSE rejects lines of code, token counts, and developer leaderboards

AI_Distilled #142: What’s New in AI This Week

Instead of our usual mix of AI news and analysis, today’s issue is dedicated to a single conversation. As AI coding agents become part of everyday engineering, a new wave of dashboards is emerging to measure their impact. Lines of code, token consumption, pull requests, utilization scores, and developer rankings are quickly becoming the default language of AI productivity.

Rick Spencer, General Manager for Technology and Product at SUSE, argues that much of it is measuring the wrong thing. In today’s special issue, he explains why output is a poor proxy for value, why engineering leaderboards create the wrong incentives, and how SUSE measures AI through customer outcomes instead of developer activity. If you’re thinking about how to evaluate AI inside an engineering organization, this is a conversation worth spending time with.

LLM Expert Insights, Packt

P.S. If the following article ends up being your kind of read, you’ll probably enjoy Agentic Engineering, our new publication for builders navigating AI beyond the demo. Early subscribers are still receiving a free copy of AI Agents in Practice by Valentina Alto.

SUSE refuses to measure its engineers by how much code their agents write

Rick Spencer on why output, tokens, and lines of code tell you nothing, and what an open-source enterprise tracks instead.

As AI agents move into engineering workflows, new leaderboard metrics are tracking lines of code submitted, tokens consumed, and per-developer utilization.
The reasoning behind them seems straightforward: if agents are generating output, then output should be measured, compared across engineers, and ranked. Rick Spencer, General Manager for Technology and Product at SUSE, has looked hard at how the industry is measuring AI’s effect on engineering. “I consider that garbage vanity metrics,” he says, calling them unhelpful. His argument for what to track instead is one of the more clarifying things an engineering leader can hear right now, because it separates the numbers that look like progress from the numbers that actually represent it.

Output is cheap; impact is what counts

The core of Spencer’s position is a distinction between output and impact, and it matters because the two come apart precisely when AI enters the picture. AI makes output cheap: lines of code, pull requests, and token counts all mount up when agents are doing the writing, which makes them exactly the wrong thing to measure if what you care about is value delivered. “We’re really tending away from measurements that measure output and utilization, and we’re trying to focus on impact,” he says. A leaderboard that ranks engineers by how much their agents produced does not tell you who is solving the hardest problems or keeping customers safe. It tells you who is generating the most volume, and in an AI-assisted world, that number is close to meaningless.

There is also a structural reason the standard tooling does not fit SUSE, and it applies to more organizations than it might first appear. Much of the available measurement tooling assumes a particular shape of company. “They really assume you’re a proprietary software company where everyone’s working on a single code base,” Spencer explains, “which is just not how an open-source enterprise works.” His engineers work across hundreds, sometimes thousands, of repositories, where the maintenance work on each one differs enormously.
A per-developer comparison across that landscape measures the shape of the work far more than it measures the contribution of the engineer, which is why he treats developer-to-developer comparison as fundamentally low value rather than merely imperfect.

The reporting burden itself is part of his objection, and it is a point leaders setting up AI dashboards should sit with carefully. A measurement regime that requires engineers to generate weekly utilization reports spends the very time it claims to be optimizing. “I’d rather have them working than reporting,” Spencer says. The instrument meant to measure productivity eventually becomes a tax on it.

What SUSE tracks instead

Rejecting vanity metrics only helps if there is something better to put in their place. Spencer describes how SUSE measures business impact in terms that connect directly to what customers actually receive. “How fast are CVEs being addressed, how fast are patches being backported, how fast are our L3 responses getting closed while maintaining the same NPS score,” he says, listing what his teams track. The common thread is that each one is an outcome the customer feels, not an activity the engineer performs. AI has been applied to exactly these areas, so measuring the speed and quality of those outcomes tells you whether the AI is doing anything worth its cost, which is the actual question worth asking.

This shift from output to outcome reframes what a metric is for in the first place. CVE response time captures whether the organization is keeping customers safe faster than it used to. Backport speed captures whether stable releases are getting their fixes without the manual grind that used to gate them. These numbers move because the underlying work got genuinely better, not because more text was generated, and that is the property that makes them trustworthy.
They are also far harder to game, because the only way to improve them is to actually improve the thing the customer depends on.

Give managers visibility, not a leaderboard

None of this means SUSE ignores cost or utilization entirely, and the distinction Spencer draws here is the one that keeps the approach from collapsing into either negligence or surveillance. The company is building dashboards that give engineering managers visibility into their team’s cost and utilization, but the purpose is coaching rather than ranking. The unit of analysis is the team, and the question it answers is diagnostic.

Spencer gives the example of a manager with an eight-person team noticing the numbers and asking the right kind of question. “We’re burning a lot of tokens. What are we actually doing that’s burning that many tokens? I’m not sure we’re getting value out of that.” The inverse matters just as much: when purchased seats for a code assistant sit unused, the manager can ask whether there are places the team should be drawing value that it is currently leaving on the table. The governance side of that picture, including how SUSE keeps agents and their costs inside a boundary it can stand behind, is covered in a companion piece, How SUSE Runs AI Without Losing Control.

The difference between this and a leaderboard is not subtle, and it is the heart of the leadership lesson. A leaderboard exposes individuals and turns measurement into a game engineers play against each other, a game that Spencer is explicit has nothing to do with customer value. Team-level cost visibility used for coaching does the opposite. It gives a manager the information to guide the team toward better use of the tools without making any individual engineer feel watched.
“We’re really trying to decentralize and allow engineering managers to guide their teams on getting the most value out of the AI,” he says, “without it becoming like a leaderboard game where developers feel like they’re exposed.” The data exists to help the manager help the team, not to rank the team against itself.

The principle holding it together

What makes Spencer’s approach more than a list of preferred numbers is the principle holding it together, which is that measurement should serve the work rather than distort it. Every choice he describes follows from that one idea. Impact comes before output because output is the thing AI inflates. Team-level diagnostics come before individual leaderboards because the goal is coaching rather than competition. Business outcomes come before activity counts because outcomes are what customers actually receive. The decentralization to engineering managers reflects the same conviction: the people closest to the work are best placed to judge whether the AI is helping, given the right information and trusted to use it well.

The deeper point for any leader standing up AI measurement is that the easy numbers and the useful numbers are not the same, and AI has widened the gap between them. The figures that are simplest to collect (lines of code, tokens, and per-head utilization) are the ones AI has made least meaningful. The figures that matter, the speed and quality of the outcomes customers depend on, take more thought to define and more care to track. Spencer’s argument is that the effort is the job. “Let’s focus on the impact,” he says, “the business impact, not on the utilization.” For engineering leaders deciding what belongs on a dashboard as agents reshape their teams, that is the distinction worth getting right before the vanity metrics calcify into the way the organization sees itself.

If this article made you think, you’ll probably enjoy our new publication, Agentic Engineering.
We’re less interested in asking whether agents can do something, and more interested in what it takes to make them work reliably in production.

Built something cool? Tell us. Whether it's a scrappy prototype or a production-grade agent, we want to hear how you're putting generative AI to work. Drop us your story at nimishad@packtpub.com or reply to this email, and you could get featured in an upcoming issue of AI_Distilled.

📢 If your company is interested in reaching an audience of developers, technical professionals, and decision makers, you may want to advertise with us.

If you have any comments or feedback, just reply to this email. Thanks for reading and have a great day!

That’s a wrap for this week’s edition of AI_Distilled 🧠⚙️ We would love to know what you thought; your feedback helps us keep leveling up.

Thanks for reading,
The AI_Distilled Team
(Curated by humans. Powered by curiosity.)