Microsoft drops ‘MInference’ demo, challenges status quo of AI processing

Jul 8, 2024 | Technology


Microsoft unveiled an interactive demonstration of its new MInference technology on the AI platform Hugging Face on Sunday, showcasing a potential breakthrough in processing speed for large language models. The demo, powered by Gradio, allows developers and researchers to test Microsoft’s latest advancement in handling lengthy text inputs for artificial intelligence systems directly in their web browsers.
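
For context, Gradio is a Python library that wraps a function in a shareable web UI, which is how Hugging Face-hosted demos like this one typically run in the browser. The sketch below only illustrates that general pattern; the function name, labels, and benchmark stub are hypothetical, not Microsoft's actual demo code.

```python
# Illustrative sketch of a Gradio-hosted latency demo -- not
# Microsoft's demo code. Gradio turns a Python function into a
# browser UI that Hugging Face Spaces can serve.
import time
import gradio as gr

def compare_prefill(prompt: str) -> str:
    """Hypothetical stand-in for the demo's benchmark routine."""
    start = time.perf_counter()
    # ... run baseline and MInference-patched models on `prompt` here ...
    elapsed = time.perf_counter() - start
    return f"Measured wall-clock time: {elapsed:.2f}s"

demo = gr.Interface(
    fn=compare_prefill,
    inputs=gr.Textbox(lines=10, label="Long prompt"),
    outputs=gr.Textbox(label="Latency comparison"),
    title="MInference-style latency demo (illustrative)",
)

if __name__ == "__main__":
    demo.launch()
```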

MInference, which stands for “Million-Tokens Prompt Inference,” aims to dramatically accelerate the “pre-filling” stage of language model processing — a step that typically becomes a bottleneck when dealing with very long text inputs. Microsoft researchers report that MInference can slash processing time by up to 90% for inputs of one million tokens (equivalent to about 700 pages of text) while maintaining accuracy.
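
"Pre-filling" is the first of the two phases of LLM inference: the model ingests the entire prompt in a single forward pass to populate its key/value cache, and only then begins generating tokens one at a time. A minimal sketch of the two phases using the Hugging Face transformers API (the small model name is a stand-in for demonstration, not the demo's LLaMA-3-8B-1M):

```python
# Minimal sketch of the two phases of LLM inference using the
# Hugging Face transformers API. "gpt2" is a small stand-in model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

inputs = tokenizer("A very long prompt ...", return_tensors="pt")

# Phase 1: pre-filling. The whole prompt is processed in one forward
# pass to build the key/value cache. Its cost grows quadratically
# with prompt length -- the bottleneck MInference targets.
with torch.no_grad():
    out = model(**inputs, use_cache=True)
past_key_values = out.past_key_values

# Phase 2: decoding. Each new token attends to the cached keys and
# values, so per-token cost is far lower than the prefill step.
next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)
with torch.no_grad():
    out = model(next_token, past_key_values=past_key_values, use_cache=True)
```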

“The computational challenges of LLM inference remain a significant barrier to their widespread deployment, especially as prompt lengths continue to increase. Due to the quadratic complexity of the attention computation, it takes 30 minutes for an 8B LLM to process a prompt of 1M tokens on a single [Nvidia] A100 GPU,” the research team noted in their paper published on arXiv. “MInference effectively reduces inference latency by up to 10x for pre-filling on an A100, while maintaining accuracy.”
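
The 30-minute figure is roughly consistent with back-of-the-envelope arithmetic. The sketch below estimates the attention cost of a 1M-token prefill; the layer count, hidden size, and A100 throughput are assumed round numbers resembling an ~8B model, not values taken from the paper.

```python
# Back-of-the-envelope estimate of why attention over a 1M-token
# prompt is slow. All constants are rough assumptions chosen to
# resemble an ~8B model; none are taken from the paper.
n = 1_000_000            # prompt length in tokens
layers = 32              # typical layer count for an ~8B model
d = 4096                 # hidden size (assumed)

# Per layer, attention forms an n x n score matrix: roughly 2*n^2*d
# FLOPs for Q @ K^T and another 2*n^2*d for mixing the values.
attn_flops = layers * 4 * n**2 * d
print(f"attention FLOPs: ~{attn_flops:.1e}")          # ~5.2e+17

# Against an A100's ~312 TFLOPS peak (BF16 tensor cores), that term
# alone takes roughly half an hour even at perfect utilization --
# the same order as the researchers' reported 30 minutes.
a100_peak = 312e12
print(f"time at peak: ~{attn_flops / a100_peak / 60:.0f} minutes")  # ~28
```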

Microsoft’s MInference demo shows performance comparisons between standard LLaMA-3-8B-1M and the MInference-optimized version. The video highlights an 8.0x latency speedup for processing 776,000 tokens on an Nvidia A100 80GB GPU, with inference times reduced from 142 seconds to 13.9 seconds. (Credit: hqjiang.com)
