#llminference search results

@CerebrasSystems is the fastest AI inference and training platform in the world. Llama3.1-70B now runs at 560 tokens/s on the Cerebras Inference API. #llama3 #llminference #llms #nlproc #deeplearning
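A rough way to sanity-check throughput claims like this is to time a completion against an OpenAI-compatible endpoint and divide completion tokens by elapsed time. The sketch below is assumption-laden: the base URL and model id for the Cerebras Inference API are guesses, and the measurement includes time-to-first-token, so it understates pure decode speed.

```python
# Rough tokens/s measurement against an OpenAI-compatible endpoint.
# base_url and model are assumed values for the Cerebras Inference API,
# not confirmed ones; substitute whatever provider you actually use.
import os
import time

from openai import OpenAI

client = OpenAI(
    base_url="https://api.cerebras.ai/v1",      # assumed endpoint
    api_key=os.environ["CEREBRAS_API_KEY"],
)

start = time.perf_counter()
resp = client.chat.completions.create(
    model="llama3.1-70b",                        # assumed model id
    messages=[{"role": "user", "content": "Explain KV caching in 200 words."}],
    max_tokens=512,
)
elapsed = time.perf_counter() - start

tokens = resp.usage.completion_tokens
print(f"{tokens} completion tokens in {elapsed:.2f}s -> {tokens / elapsed:.0f} tok/s")
```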


On device ML using Gemma with WebGPU #LLMInference It's working finally 🫣


With support for a variety of PCIe GPUs and configurations, #QuantaGrid D75E-4U is ideal for enterprises seeking to fulfill #GenAI, #LLMInference & #ComputerVision workloads with a cost-performance balanced AI cluster. Visit @QuantaQCT at #SC25 to learn more. #SuperComputing2025


Nebius-Fast: ~400 t/s 🚀 artificialanalysis.ai/models/deepsee… #DeepSeekR1 #LLMinference

Speed isn’t luck, it’s good engineering. Nebius AI Studio is the fastest place to run DeepSeek R1 0528 (per @ArtificialAnlys). Blackwell B200 + speculative decoding + smart FP4 on our vertically integrated stack. Sub-second first token. Real throughput. 👇



4/5 ⚙️ Cold Start Problem in AI Inference: @charles_irl explains: Serverless = great for bursty use cases, but cold starts add latency. Modal’s stack minimizes cold start times—ideal for production AI. #LLMInference #AIOptimization

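The standard mitigation is to make the expensive model load happen once per worker and reuse it on warm invocations. The snippet below is a generic sketch of that pattern, not Modal's actual API; load_model and handler are hypothetical placeholders for your own setup and entry point.

```python
# Warm-model caching in a serverless-style handler (generic sketch).
# load_model() stands in for whatever dominates the cold start:
# downloading weights, initializing CUDA, compiling graphs, etc.
import time

_MODEL = None  # survives for the lifetime of the worker process


def load_model():
    time.sleep(5)  # pretend this is a multi-gigabyte weight load
    return lambda prompt: f"echo: {prompt}"


def handler(request: dict) -> dict:
    global _MODEL
    if _MODEL is None:          # only the first (cold) request pays this
        _MODEL = load_model()
    return {"output": _MODEL(request["prompt"])}


print(handler({"prompt": "hello"}))   # cold: ~5s
print(handler({"prompt": "again"}))   # warm: milliseconds
```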

We’re entering a new era for #LLMInference: moving beyond single-node optimizations to distributed serving strategies that unlock better performance, smarter resource utilization, and real cost savings. In this blog post, we break down the latest techniques for distributed LLM…

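For readers who want a concrete starting point, the sketch below shards one model across several GPUs with vLLM's tensor parallelism. The model id and GPU count are placeholders, and this single-node example is only the baseline that multi-node serving strategies build on.

```python
# Minimal sketch: shard one model across 4 GPUs with vLLM tensor parallelism.
# Model id and tensor_parallel_size are illustrative placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder model id
    tensor_parallel_size=4,                     # split every layer across 4 GPUs
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Why does KV-cache size limit batch size?"], params)
print(outputs[0].outputs[0].text)
```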

Blessing: set up a blended 3:1 model mix!! Model: llama-3.2-1b vs llama-3.2-3b (1) llama-3.2-1b costs $0.12 per 1M tokens (2) llama-3.2-3b costs $0.30 per 1M tokens meta-llama/llama-3.2-3b-instruct/fp-16-fast-ollama-1 Worth it or not? #llminference #llama #ai #swarm

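Taking the 3:1 blend to mean three llama-3.2-1b requests for every llama-3.2-3b request (an assumption about the tweet's setup), the blended rate is about $0.165 per 1M tokens, roughly 45% below the 3b-only price:

```python
# Blended per-token cost, assuming "3:1" means 3 parts 1b traffic to 1 part 3b.
price_1b = 0.12  # USD per 1M tokens, from the tweet
price_3b = 0.30  # USD per 1M tokens, from the tweet

blended = (3 * price_1b + 1 * price_3b) / 4
savings_vs_3b = 1 - blended / price_3b
print(f"blended ${blended:.3f}/1M tokens, {savings_vs_3b:.0%} cheaper than 3b alone")
# -> blended $0.165/1M tokens, 45% cheaper than 3b alone
```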

This week is the #OCPGlobalSummit & we're there to discuss #LLMinference acceleration! Demos show >2X performance over vLLM, plus:
80-90% higher efficiency
50% TCO improvements
50% lower DC cooling costs
50% less CO2 emissions
Reach us at [email protected] if you'd like to meet.


Join us next week at FMS to see how we’re tackling data challenges for #GenAI Apps w/ a preview of our upcoming product, XDP LightningAI, which enhances AI #LLMInference efficiency by 80-90%! Contact us at [email protected] to schedule a time or visit our booth 1045. See you there!


#FMS2024 (now the Future of Memory and Storage) is coming up next month … and we'll be there! Many great things in store – including a preview of our forthcoming product that enhances AI #LLMinference efficiency by 80-90%, making it the fastest inference solution on the market.


Everyone’s talking about speculative decoding for faster #LLMinference. But did you know the actual speedup depends heavily on the DRAFT model you use? Three metrics can be used to evaluate the performance of the draft model: - Acceptance rate…

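Acceptance rate is the fraction of drafted tokens the target model keeps. A minimal sketch under greedy verification, with made-up token ids:

```python
# Acceptance rate for one speculation step under greedy verification:
# the target model keeps draft tokens until the first position where its
# own greedy choice disagrees. Token ids below are made up.

def accepted_prefix(draft_tokens, target_tokens):
    """Length of the drafted prefix the target model would accept."""
    n = 0
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            break
        n += 1
    return n

draft = [101, 7, 42, 13]    # 4 tokens proposed by the small draft model
target = [101, 7, 42, 99]   # what the big model produces at those positions
k = accepted_prefix(draft, target)   # -> 3
print(k, k / len(draft))             # -> 3 0.75
```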

🚀 "Want to ensure your LLM-powered application runs smoothly in real-time? This guide covers everything from hardware choices to software optimizations that can make a huge difference. buff.ly/3zojOrJ #LLMinference #AI"


Tired of hitting your ChatGPT limit? bentoml.com/blog/chatgpt-u… Even paid users face hidden message caps, downgrades, and throttling. Here’s why those limits exist and how to remove them. Read the full guide 👇 #ChatGPT #LLMInference #OpenSourceAI #BentoML


🎯🌱 • If your evals wander across runs, check batch invariance before hunting seeds. #LLMInference #AIEngineering #Reproducibility #AIArchitecture #GenerativeAI
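A quick batch-invariance probe: run the same input alone and inside a larger batch, then compare the outputs. The sketch below uses a toy torch model as a stand-in; for a real LLM you would compare logits for a fixed prompt served at different batch sizes.

```python
# Toy batch-invariance check: same input, batch size 1 vs batch size 8.
import torch

torch.manual_seed(0)
model = torch.nn.Sequential(
    torch.nn.Linear(64, 64), torch.nn.GELU(), torch.nn.Linear(64, 64)
).eval()

x = torch.randn(1, 64)        # the probe input
filler = torch.randn(7, 64)   # other requests sharing the batch

with torch.no_grad():
    solo = model(x)                              # batch size 1
    batched = model(torch.cat([x, filler]))[:1]  # same input, batch size 8

print(f"max |solo - batched| = {(solo - batched).abs().max().item():.2e}")
# If this drifts beyond float noise, eval variance may come from batching,
# not from your sampling seeds.
```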


We extend our sincere thanks to Prof. Sengupta for an engaging and enriching session! #ResearchSeminar #LLMInference #DoMSResearch #Academia #KnowledgeSharing #IITK @IITKanpur


So if you're loading all those weights anyway, you might as well process multiple tokens at once! Draft model predictions are often correct for "easy" tokens, so we skip ahead. Can't believe I'm so outdated tf... Karpathy has been explaining this since 2023! #AIOptimization #LLMInference
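The underlying point is that decoding is memory-bandwidth bound: each new token re-reads essentially all the weights, so verifying several drafted tokens in one forward pass costs about the same as generating one. A back-of-the-envelope illustration, with all numbers as rough assumptions:

```python
# Why decode is bandwidth bound (illustrative numbers, not measurements).
weight_bytes = 70e9 * 2       # ~70B parameters stored in fp16
hbm_bandwidth = 3.35e12       # ~3.35 TB/s, roughly an H100-class GPU

tokens_per_s = hbm_bandwidth / weight_bytes   # one sequence, one token per pass
print(f"~{tokens_per_s:.0f} tokens/s if every token re-reads all weights")
# Verifying k drafted tokens reuses that same weight read, which is the
# bandwidth argument behind speculative decoding.
```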


This analysis breaks down on-device LLM inference challenges, from compute stages to the unique performance quirks of smartphone storage. - hackernoon.com/why-your-phone… #ondeviceai #llminference


Boost Your LLM Performance: How Stanford’s Optimistic Algorithm Cuts Latency by 5x #LLMInference #AminAlgorithm #ArtificialIntelligence #LatencyReduction #AIOptimization itinai.com/boost-your-llm… The Hidden Bottleneck in LLM Inference In the rapidly evolving landscape of artifi…



Pliops is pleased to announce that our XDP LightningAI has received a “Best of Show” Award from FMS in the #LLMInference Acceleration Category! Additionally, the newest iteration of our Extreme Data Processor, the XDP PRO, is our first ASIC-based chip. bit.ly/3LVftyD



#GenerativeAI is gaining traction. However, deploying AI, especially #LLMinference, needs substantial #GPU and memory resources, increasing power usage & emissions. Our Extreme Data Processor uses GPU key-value I/O to cut power usage & emissions.



Check out this video narrated by our chief AI Scientist Eshcar Hillel, in which she explains how our XDP LightningAI solution accelerates LLM inferencing to drive data center efficiency. Watch here: bit.ly/3YX4b31 #SC24 #LLMInference #DataCenterEfficiency #GenAI



Looking for game-changing efficiency for #LLMinference? Check out XDP LightningAI! This #KeyValue distributed #smartstorage node boosts LLM inference efficiency by 80-90%, cuts TCO by 50%, reduces CO2 emissions by 50%, & slashes DC cooling costs by 50%. bit.ly/3zmW0Va

