Amazon shut down KiroRank on May 29. If you missed it: KiroRank was the internal leaderboard inside Amazon’s Kiro developer platform that ranked employees by AI tool usage. The idea was to gamify AI adoption. Employees earned PhoneTool badges that showed up next to their names in the internal directory, like video game achievements. People competed.
One employee described it as “the most fun I have had at work.”
Then Amazon shut it down.
Senior VP Dave Treadwell acknowledged the system had been built with “good intentions” but told staff: “Please don’t use AI just for the sake of using AI.” Amazon’s official statement said the dashboard had accomplished its goal of driving AI adoption awareness. Multiple employees told reporters the real reason was different: the leaderboard was easily gamed, and it was burning expensive compute.
Employees figured out that every AI agent task generated tokens, and tokens fed the leaderboard. So they assigned agents to run needless tasks in loops through Kiro and MeshClaw, Amazon’s internal agentic tool. None of it was real work. Compute costs spiked. Amazon killed the scoreboard.
The detail that sticks with me: one employee said they gamed the leaderboard after being told in a performance review they were not using AI enough. Gaming the metric was not rebellion. It was compliance.
I was the GM for Amazon CodeWhisperer and Q Developer, Amazon’s AI coding assistant, which puts me closer to this story than I would like to admit. When you are building an AI coding tool inside Amazon, the question you cannot escape is: how do you show it is working? Adoption numbers are easy to produce and easy to explain in an exec review. Proving the tool actually changed how software gets built is much harder. KiroRank was an attempt to shortcut that hard problem. Every shortcut ends the same way eventually: the hard problem does not go away.
We have been here before
Lines of code. Commits per week. Story points burned. Tickets closed. Engineering has a long history of measuring the wrong thing and then acting surprised when the org optimizes for it. Every metric that becomes a target stops measuring what it was supposed to measure. Goodhart’s Law is not a bug in how humans think. It is a feature.
What changed with agentic AI is the friction. A developer gaming a lines-of-code metric still has to write the code. There is a human in the loop who gets tired. An AI agent gaming a token metric runs all night and never complains. The automation that makes agentic AI powerful is exactly what makes any activity-based metric catastrophically gameable.
KiroRank was always going to fail. It just failed faster than expected.
And Amazon is not alone. At Meta, an employee-built dashboard called “Claudeonomics” ranked roughly 85,000 workers by token consumption. Meta reportedly burned through 60 trillion tokens in a single 30-day period. CTO Andrew Bosworth called his best engineer “5x to 10x more productive” based in part on token spend equivalent to an annual salary. Whether that is signal or noise is left as an exercise for the reader.
The frameworks we have were not built for this
DORA. SPACE. DX Core 4. These are the tools engineering leaders reach for when they want to measure developer productivity. They are not bad frameworks. But they were designed around human developers doing human-paced work, and agentic AI has broken the assumptions underneath them.
Here is what is actually happening. AI tools have collapsed the inner loop of software development: writing code, generating tests, drafting PRs. What used to take days now takes hours. But the outer loop — review, integration testing, validation, deployment — remains constrained by human speed. The result is what researchers are calling “Acceleration Whiplash”: a massive, growing mismatch between how fast code gets written and how fast it gets verified. Telemetry from 22,000 developers across more than 4,000 teams shows PR review times have increased 441% year-over-year. Bugs per developer are up 54%. Production incidents per PR have more than tripled.
DORA is supposed to catch this. It does not, at least not cleanly. The 2025 DORA State of DevOps found that 38% of teams using AI coding tools increased their deployment frequency and simultaneously experienced rising change failure rates. The metric went up. Quality went down. That is not signal. That is noise wearing signal’s clothes.
SPACE adds dimensions around satisfaction, communication, and flow efficiency. Better. But it still assumes the developer is the primary unit of production. In an agentic system, the developer is increasingly the director, not the producer. When deployment frequency, PR volume, and commit counts are driven by agents rather than humans, these frameworks stop telling you whether any of it matters.
The measurement problem is actually a job description problem
In an agentic coding world, the developer’s job is changing faster than our metrics are.
A senior engineer who spent 80% of their time writing code now spends a growing fraction directing agents, reviewing agent output, catching agent errors, and deciding which problems are worth throwing compute at. The coding is still there, but the ratio is shifting fast.
That job does not fit neatly into any existing productivity framework. You cannot measure judgment in commits. You cannot measure taste in deployment frequency. You cannot measure the quality of direction given to an agent in tokens consumed.
A January 2026 Anthropic study found that developers who fully delegated coding to AI scored 50% on comprehension quizzes, compared to 67% for developers who wrote code by hand. The largest gap was on debugging questions. The more you offload to the agent, the worse you get at catching what the agent gets wrong. In an agentic world, the most important thing a developer does might be the hardest thing to count: knowing when the agent is wrong.
What should we actually measure?
Amazon’s replacement for KiroRank is something they call “normalized deployments,” designed to capture whether AI-generated code produces useful output and actually ships. Not raw token consumption. That is directionally right. It is also one organization’s first guess, and it will be iterated on.
Here is how I think about what belongs in a metrics model for agentic development:
Deployment quality over deployment frequency. Not just did it ship, but did it stay shipped. Change failure rates weighted by severity, not just counted.
Code Turnover Rate over vibes-based rework. Analysis of 211 million lines of code shows AI-generated code turns over at 1.8x to 2.5x the rate of human-written code. The 30-day Code Turnover Rate measures how much merged code gets rewritten or deleted within a month. An AI-to-human turnover ratio above 1.5x is a specific, measurable signal that agent output is being accepted uncritically and nobody is really checking it. Use it.
Human decision density. How many meaningful human judgments went into what was shipped? Hard to measure directly, possible to proxy through review patterns and intervention rates. The org that cracks a good proxy here will have real signal on whether their engineers are directing or rubber-stamping.
Product outcomes. What moved as a result of what shipped? This is the hard one. It requires connecting engineering metrics to product metrics, and most orgs have not built that bridge. But it is the only measurement genuinely resistant to gaming, because the product does not care how many tokens you spent getting there.
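To make the first two items concrete, here is a minimal Python sketch of how they might be computed. Treat it as an illustration under stated assumptions, not anyone's production metric: the severity weights, the `Deployment` record, and every function name are hypothetical, and the only number taken from the discussion above is the 1.5x turnover threshold. It also assumes you can already attribute merged lines to AI or human authorship, which is its own hard problem.

```python
from dataclasses import dataclass

# Hypothetical severity weights; no standard scale is implied here.
SEVERITY_WEIGHTS = {"low": 1.0, "medium": 3.0, "high": 10.0}

# Threshold for the AI-to-human turnover ratio, as discussed above.
TURNOVER_RATIO_THRESHOLD = 1.5


@dataclass
class Deployment:
    failed: bool
    severity: str = "low"  # only used when failed is True


def weighted_change_failure_rate(deployments: list[Deployment]) -> float:
    """Severity-weighted failure score per deployment.

    Unlike a plain change failure rate, this can exceed 1.0, because
    one severe failure counts for more than one minor failure.
    """
    if not deployments:
        return 0.0
    weighted = sum(SEVERITY_WEIGHTS[d.severity] for d in deployments if d.failed)
    return weighted / len(deployments)


def turnover_rate(lines_merged: int, lines_changed_within_30d: int) -> float:
    """Share of merged lines rewritten or deleted within 30 days."""
    if lines_merged == 0:
        return 0.0
    return lines_changed_within_30d / lines_merged


def ai_to_human_turnover_ratio(ai_rate: float, human_rate: float) -> float:
    """How much faster AI-authored code turns over than human-authored code."""
    if human_rate == 0:
        return float("inf") if ai_rate > 0 else 0.0
    return ai_rate / human_rate


def needs_review(ai_rate: float, human_rate: float) -> bool:
    """True when the turnover ratio crosses the threshold."""
    return ai_to_human_turnover_ratio(ai_rate, human_rate) > TURNOVER_RATIO_THRESHOLD


if __name__ == "__main__":
    # Made-up numbers, purely for illustration.
    deployments = [
        Deployment(failed=False),
        Deployment(failed=False),
        Deployment(failed=False),
        Deployment(failed=True, severity="high"),
    ]
    print(weighted_change_failure_rate(deployments))

    ai = turnover_rate(lines_merged=10_000, lines_changed_within_30d=2_400)
    human = turnover_rate(lines_merged=10_000, lines_changed_within_30d=1_200)
    print(needs_review(ai, human))
```

The design choice worth noting is that neither function rewards volume. Shipping more deployments or merging more lines does not improve either number on its own, which is the property an activity leaderboard lacks.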
The honest position
We do not have a good measurement framework for agentic software development. The people who will figure it out are inside the organizations building and deploying these systems right now. Amazon’s normalized deployments is one guess. There will be others.
What we know is that any metric built on activity volume will get gamed. Not by bad actors. By rational actors doing exactly what you incentivized. One Amazon employee cheated KiroRank because their manager flagged them in a performance review for not using AI enough. They were not being devious. They were solving the problem in front of them.
That is the real lesson. The problem was not the employees. The problem was a target with no connection to outcomes, attached to career pressure, inside a company spending $200 billion this year on AI infrastructure. Of course it got gamed.
Treadwell was right to tell staff: stop using AI for the sake of using AI. But advice does not fix incentive structures. Metrics do, when they are pointed at the right things.
KiroRank is gone. The question is whether its replacement is actually measuring something real, or whether the industry just learned to build a slightly harder leaderboard to game.







