The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures

Ma B, Afzal A, Eitzinger J, Wellein G (2026)


Publication Language: English

Publication Type: Conference contribution

Publication year: 2026

Conference Proceedings Title: Lecture Notes in Computer Science

Event location: Poznań, Poland PL

DOI: 10.48550/arXiv.2605.11999

Abstract

Power capping is the standard GPU energy lever in LLM serving, and it appears to work: throughput drops, power readings fall, and energy budgets are met. We show the appearance is illusory for the phase that dominates production serving: autoregressive decode. Across four attention paradigms (GQA, MLA, Gated DeltaNet, and Mamba2) on NVIDIA H200, decode draws only 137 to 300 W on a 700 W GPU; no cap ever triggers, because memory-bound decode saturates HBM bandwidth rather than compute and leaves power headroom untouched. Firmware-initiated clock throttling compounds the illusion: these deviations can corrupt any throughput measurement that attributes them to the cap. SM clock locking dissolves both confounds. By targeting the lever that is actually on the critical path, clock locking Pareto-dominates power capping universally, recovering up to 32% of decode energy at minimal throughput loss. We identify three architecture-dependent DVFS behavioural classes and characterise a common energy pattern across novel attention replacements: a heavy prefill cost recouped by efficient decode, eventually halving total request energy relative to GQA at production batch sizes.
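The central claim, that a cap cannot bind when the workload already draws less than the cap, can be illustrated with a minimal Python sketch. This is an illustration only, not the authors' measurement code: the function name `cap_binds` and the list of cap levels are hypothetical, and the only figures taken from the abstract are the 137 to 300 W decode draw and the 700 W GPU limit.

```python
def cap_binds(draw_watts: float, cap_watts: float) -> bool:
    """Return True if a power cap would constrain a workload.

    A cap can only force the GPU to slow down when the uncapped
    workload would draw more than the cap allows.
    """
    return draw_watts > cap_watts


# Figures quoted in the abstract: decode draws 137 to 300 W on a 700 W GPU.
DECODE_DRAW_RANGE_W = (137.0, 300.0)
GPU_LIMIT_W = 700.0

# Hypothetical cap levels chosen for illustration; not taken from the paper.
candidate_caps_w = [700.0, 600.0, 500.0, 400.0, 300.0]

# For each cap, check whether it binds anywhere in the observed decode range.
binding = {
    cap: any(cap_binds(draw, cap) for draw in DECODE_DRAW_RANGE_W)
    for cap in candidate_caps_w
}

for cap, binds in binding.items():
    status = "binds" if binds else "never triggers"
    print(f"cap {cap:.0f} W: {status}")
```

Under these assumed cap levels, no cap at or above the peak decode draw constrains the workload, which is why lowering the cap leaves decode energy untouched while a clock limit acts directly on the running kernel.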

How to cite

APA:

Ma, B., Afzal, A., Eitzinger, J., & Wellein, G. (2026). The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures. In Wyrzykowski, R., Deelman, E. (Eds.), Lecture Notes in Computer Science. Poznań, Poland, PL.

MLA:

Ma, Bole, et al. "The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures." Proceedings of the 16th International Conference on Parallel Processing and Applied Mathematics, PPAM 2026, Poznań, Poland. Ed. Wyrzykowski, R., Deelman, E., 2026.