Quantization Methods for Energy-Efficient LLM Deployments
DOI:
https://doi.org/10.34190/icair.5.1.4367

Keywords:
LLMs, Quantization, Energy efficiency, Model compression, Inference optimization

Abstract
The deployment of large language models (LLMs) in production environments faces significant challenges due to the computational and energy requirements of inference. This paper presents a comprehensive empirical analysis of quantization methods applied to the Qwen3 model family, ranging from 0.6B to 32B parameters. We evaluate six model configurations: five quantization approaches (GPTQ 4-bit, GPTQ 8-bit, AWQ, FP8 W8A8, and INT8 W8A8) and the original FP16 baseline, across six established benchmarks (MMLU, HumanEval, TruthfulQA, MetaBench, GSM8K, ARC Challenge). Our analysis examines the relationship between model size, quantization method, accuracy preservation, energy consumption, and inference performance across various context lengths. We demonstrate that larger Qwen3 models exhibit increased resilience to quantization-induced accuracy degradation, while aggressive quantization methods provide substantial energy savings with acceptable trade-offs in model performance. These findings provide crucial insights for optimizing LLM deployments in resource-constrained environments.
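As a concrete illustration of the kind of post-training quantization evaluated in this study, the sketch below applies 4-bit GPTQ to a Qwen3 checkpoint via the Hugging Face Transformers GPTQ integration. The checkpoint id, calibration dataset, and output directory are illustrative assumptions, not the paper's exact experimental configuration.

    # Minimal sketch: 4-bit GPTQ post-training quantization of a Qwen3 checkpoint.
    # Assumes transformers with the optimum and auto-gptq (or gptqmodel) backends installed;
    # the model id and calibration dataset are illustrative, not the paper's setup.
    from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

    model_id = "Qwen/Qwen3-0.6B"  # hypothetical checkpoint; the paper spans 0.6B to 32B
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # Calibrate weight quantization on a small text corpus ("c4" is a built-in option).
    quant_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

    # Passing a quantization_config triggers the GPTQ calibration pass at load time.
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,
        device_map="auto",
    )

    # Persist the quantized weights for later inference and benchmarking.
    model.save_pretrained("qwen3-0.6b-gptq-4bit")
    tokenizer.save_pretrained("qwen3-0.6b-gptq-4bit")

The saved directory can then be loaded like any other Transformers checkpoint, or served with an inference engine that supports GPTQ weights, for accuracy and energy measurements of the kind reported here.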