How attention offloading reduces the costs of LLM inference at scale

May 14, 2024 | Technology

Rearranging the computations and hardware used to serve large language models (LLMs) can considerably reduce the costs of inference, according to a new study by researchers at Tsinghua University. The study introduces “attention offloading,” a technique that uses lower-priced GPUs to handle memory-intensive operations while reserving the more expensive, compute-optimized accelerators for other tasks.
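To make the idea concrete, here is a minimal sketch of how such a split might look in PyTorch. It is not the paper's implementation; the device assignment, model shapes, and class names are illustrative assumptions. The compute-bound projections run on a compute-optimized accelerator, while the KV cache and the decode-time attention over it live on a cheaper, memory-rich GPU.

```python
import torch

# Hypothetical two-GPU split (illustrative, not the paper's system):
# cuda:0 = compute-optimized accelerator, cuda:1 = cheaper memory-rich GPU.
COMPUTE_DEV = torch.device("cuda:0")
MEMORY_DEV = torch.device("cuda:1")

class OffloadedAttention(torch.nn.Module):
    """One attention layer with decode-time attention offloaded.

    The QKV and output projections (compute-bound matrix multiplies) stay on
    COMPUTE_DEV; the attention over the growing KV cache (memory-bound) runs
    on MEMORY_DEV, where the cache is stored.
    """
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = torch.nn.Linear(d_model, 3 * d_model).to(COMPUTE_DEV)
        self.out = torch.nn.Linear(d_model, d_model).to(COMPUTE_DEV)
        # KV cache lives on the memory GPU.
        self.k_cache = torch.empty(0, n_heads, self.d_head, device=MEMORY_DEV)
        self.v_cache = torch.empty(0, n_heads, self.d_head, device=MEMORY_DEV)

    def decode_step(self, x: torch.Tensor) -> torch.Tensor:
        # x: (1, d_model) hidden state of the newly generated token.
        q, k, v = self.qkv(x.to(COMPUTE_DEV)).chunk(3, dim=-1)
        k = k.view(1, self.n_heads, self.d_head).to(MEMORY_DEV)
        v = v.view(1, self.n_heads, self.d_head).to(MEMORY_DEV)
        self.k_cache = torch.cat([self.k_cache, k], dim=0)
        self.v_cache = torch.cat([self.v_cache, v], dim=0)

        # Memory-bound part: attention over the whole cache, on the cheap GPU.
        q_m = q.view(1, self.n_heads, self.d_head).to(MEMORY_DEV)
        scores = torch.einsum("qhd,thd->hqt", q_m, self.k_cache) / self.d_head**0.5
        ctx = torch.einsum("hqt,thd->qhd", scores.softmax(-1), self.v_cache)

        # Back to the compute GPU for the output projection.
        return self.out(ctx.reshape(1, -1).to(COMPUTE_DEV))
```

In a sketch like this, only the per-token query, key, and value vectors cross between devices each step, while the large KV cache never has to move to the expensive accelerator.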

With high-end AI accelerators being expensive, scarce, and in high demand, techniques such as attention offloading can help companies make better use of their available hardware when serving LLMs at scale.

Two types of computations

LLM inference is a complicated process that involves different types of operations. The key to optimizing inference is to arrange these operations in a way that makes the best use of the memory and compute resources of the hardware accelerators.

From a resource perspective, the operations that take place during inference fall into two main categories. Some are compute-bound and benefit from faster accelerators such as the A100 and H100. Others are memory-bound: their speed depends on video RAM (VRAM) capacity and bandwidth rather than raw compute. This is particularly true of the self-attention operation performed for each new token the model generates.
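A rough back-of-the-envelope check (with hypothetical model sizes, not figures from the paper) shows why decode-time attention is memory-bound: for each new token it must read the entire KV cache from VRAM, yet performs only about two floating-point operations per value read, far below the FLOP-per-byte ratio a compute-optimized accelerator needs to stay busy.

```python
# Illustrative arithmetic-intensity estimate for one decode step of self-attention.
# Model sizes are hypothetical; FP16 values take 2 bytes each.
seq_len  = 4096                  # tokens already in the KV cache
n_layers = 32
n_heads  = 32
d_head   = 128
d_model  = n_heads * d_head

kv_bytes   = 2 * seq_len * n_layers * d_model * 2   # K and V tensors, 2 bytes per value
attn_flops = 2 * 2 * seq_len * n_layers * d_model   # QK^T plus attention*V, ~2 FLOPs per value

intensity = attn_flops / kv_bytes                   # FLOPs per byte read from VRAM
print(f"KV cache read per token: {kv_bytes / 1e9:.2f} GB")
print(f"Arithmetic intensity: {intensity:.1f} FLOP/byte")   # ~1 FLOP/byte

# An H100 offers on the order of 1,000 TFLOPS of FP16 compute against ~3 TB/s of
# HBM bandwidth, i.e. it needs roughly 300 FLOPs per byte to be compute-bound.
# At ~1 FLOP/byte, decode-time attention leaves its compute units mostly idle,
# which is why a cheaper GPU with ample memory bandwidth can handle it instead.
```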

