Quantization Methods for Energy Efficient LLM Deployments

Authors

Tomislav Šubić, University of Trieste and Arctur

DOI:

https://doi.org/10.34190/icair.5.1.4367

Keywords:

LLMs, Quantization, Energy efficiency, Model compression, Inference optimization

Abstract

The deployment of large language models (LLMs) in production environments faces significant challenges due to computational and energy requirements during inference. This paper presents a comprehensive empirical analysis of quantization methods applied to the Qwen3 model family, ranging from 0.6B to 32B parameters. We evaluate five quantization approaches (GPTQ 4-bit, GPTQ 8-bit, AWQ, FP8 W8A8, and INT8 W8A8) against the original FP16 baseline across six established benchmarks (MMLU, HumanEval, TruthfulQA, MetaBench, GSM8K, ARC Challenge). Our analysis examines the relationship between model size, quantization method, accuracy preservation, energy consumption, and inference performance across various context lengths. We demonstrate that larger Qwen3 models exhibit increased resilience to quantization-induced accuracy degradation, while aggressive quantization methods provide substantial energy savings with acceptable trade-offs in model performance. These findings provide crucial insights for optimizing LLM deployments in resource-constrained environments.
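
As a concrete illustration of the kind of configuration compared in the paper, the sketch below quantizes a Qwen3 checkpoint to GPTQ 4-bit via the Hugging Face Transformers integration (which relies on optimum and auto-gptq) and estimates generation energy with NVML's cumulative energy counter. The model ID, calibration dataset, prompt, and measurement approach are illustrative assumptions, not the authors' actual experimental pipeline.

```python
# Minimal sketch: GPTQ 4-bit quantization of a Qwen3 checkpoint plus a coarse
# per-token energy estimate. Requires transformers, optimum, auto-gptq, torch,
# and pynvml; the NVML energy counter needs a Volta-class GPU or newer.
import pynvml
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

MODEL_ID = "Qwen/Qwen3-0.6B"  # assumed Hub ID for the smallest Qwen3 model

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# GPTQ 4-bit: calibrate on a small dataset, then pack weights to 4 bits.
quant_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    quantization_config=quant_config,  # drop this argument for the FP16 baseline
)

# NVML reports cumulative GPU energy in millijoules since driver load.
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

prompt = "Summarize the trade-offs of 4-bit weight quantization."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

e_start_mj = pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=256)
e_end_mj = pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)

new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
energy_j = (e_end_mj - e_start_mj) / 1000.0
print(f"{new_tokens} tokens, {energy_j:.1f} J total, {energy_j / new_tokens:.2f} J/token")

pynvml.nvmlShutdown()
```

Repeating the same measurement with the unquantized FP16 checkpoint (by omitting quantization_config) gives the kind of energy-per-token comparison the paper reports across methods and model sizes.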

Author Biography

Tomislav Šubić, University of Trieste and Arctur

Tomislav Šubić is Head of AI at Arctur, where he leads cross-functional AI teams combining HPC with AI across healthcare, tourism, industrial optimization, and manufacturing. He is also pursuing a PhD in Applied AI at the University of Trieste, specializing in LLM optimization, quantization, and efficient inference.

Published

2025-12-04