Publications
IrEne: Interpretable Energy Prediction for Transformers
Transformer Models Energy prediction
Authors: Qingqing Cao, Yash Kumar Lal, Harsh Trivedi, Aruna Balasubramanian, Niranjan Balasubramanian
Year: 2021
Published in: ACL-IJCNLP.
Read me: DOI: 10.18653/v1/2021.acl-long.167. Replication package.
Abstract: Existing software-based energy measurements of NLP models are not accurate because they do not consider the complex interactions between energy consumption and model execution. We present IrEne, an interpretable and extensible energy prediction system that accurately predicts the inference energy consumption of a wide range of Transformer-based NLP models. IrEne constructs a model tree graph that breaks down the NLP model into modules that are further broken down into low-level machine learning (ML) primitives. IrEne predicts the inference energy consumption of the ML primitives as a function of generalizable features and fine-grained runtime resource usage. IrEne then aggregates these low-level predictions recursively to predict the energy of each module and finally of the entire model. Experiments across multiple Transformer models show IrEne predicts inference energy consumption of transformer models with an error of under 7% compared to the ground truth. In contrast, existing energy models see an error of over 50%. We also show how IrEne can be used to conduct energy bottleneck analysis and to easily evaluate the energy impact of different architectural choices. We release the code and data at https://github.com/StonyBrookNLP/irene.
Annotation
The paper proposes a system for estimating the inference energy consumption of Transformer-based NLP models. It abstracts the model as a tree with three levels: modules (e.g. BertSelfAttention), which may themselves be composed of submodules; machine learning primitives (e.g. LayerNorm), which make up the modules; and the math operations (e.g. matrix multiplication) that make up the primitives.
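The tree abstraction described above can be sketched as a small data structure. This is only an illustration, not the IrEne code: the `TreeNode` class, the `BertLayer`, `Linear`, and `Softmax` node names, and the exact nesting are assumptions made for the example, and math operations are left out so that ML primitives are the leaves.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TreeNode:
    """One node of a hypothetical model tree."""
    name: str
    level: str  # "module" or "ml_primitive"
    children: List["TreeNode"] = field(default_factory=list)

    def leaves(self) -> List["TreeNode"]:
        """Return the leaf nodes under this node, left to right."""
        if not self.children:
            return [self]
        found: List["TreeNode"] = []
        for child in self.children:
            found.extend(child.leaves())
        return found


# Illustrative layout only; the real tree depends on the model being profiled.
attention = TreeNode("BertSelfAttention", "module", [
    TreeNode("Linear", "ml_primitive"),
    TreeNode("Softmax", "ml_primitive"),
])
layer = TreeNode("BertLayer", "module", [
    attention,
    TreeNode("LayerNorm", "ml_primitive"),
])

# The ML primitives are the leaves of this sketch.
leaf_names = [node.name for node in layer.leaves()]
print(leaf_names)
```

Modules can nest to any depth in this representation, which matches the idea that a module may be composed of other submodules.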
The model is run once with just-in-time (JIT) instrumentation, which uses the PyTorch API to build the tree and extract relevant features from each node. Some of these features are hardware-independent, such as batch size or floating-point operations, and others are hardware-dependent, such as GPU clock speed or GPU driver energy. A regression model is trained for each leaf of the tree using these features, and the estimate for each parent node is computed bottom-up as a weighted sum of its children's estimates, with weights learned from node features.
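A minimal sketch of the bottom-up aggregation, under stated assumptions: the `FeatureNode` class, the feature names `gflops` and `gpu_util`, and both functions `leaf_regressor` and `child_weight` are hypothetical stand-ins with made-up coefficients. In the paper these would be learned models; here they are fixed formulas so the recursion is easy to follow.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Features = Dict[str, float]


@dataclass
class FeatureNode:
    """A tree node carrying the features extracted for it."""
    name: str
    features: Features
    children: List["FeatureNode"] = field(default_factory=list)


def leaf_regressor(features: Features) -> float:
    """Stand-in for a trained per-leaf regression model (made-up coefficients)."""
    return 2.0 * features["gflops"] + 0.5


def child_weight(features: Features) -> float:
    """Stand-in for a learned weight computed from a node's features."""
    return 1.0 + 0.5 * features["gpu_util"]


def predict_energy(
    node: FeatureNode,
    leaf_model: Callable[[Features], float],
    weight_model: Callable[[Features], float],
) -> float:
    """Predict energy bottom-up: leaves use the regressor,
    parents take a weighted sum of their children's predictions."""
    if not node.children:
        return leaf_model(node.features)
    return sum(
        weight_model(child.features) * predict_energy(child, leaf_model, weight_model)
        for child in node.children
    )


# Hypothetical feature values chosen only to keep the arithmetic simple.
leaf_a = FeatureNode("LayerNorm", {"gflops": 1.0, "gpu_util": 0.0})
leaf_c = FeatureNode("Linear", {"gflops": 2.0, "gpu_util": 1.0})
module_b = FeatureNode("BertSelfAttention", {"gflops": 2.0, "gpu_util": 0.0}, [leaf_c])
root = FeatureNode("Model", {"gflops": 3.0, "gpu_util": 0.0}, [leaf_a, module_b])

total = predict_energy(root, leaf_regressor, child_weight)
print(total)
```

Because each parent only needs its children's predictions and features, the same recursion yields an estimate for every module along the way, which is what makes per-module bottleneck analysis possible.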