#inferenceoptimization search results
Talk: From Human Agents to GPU-Powered GenAI – A Data-Driven Transformation in Customer Service 🔗 hubs.li/Q03LTv9v0 Register Now: hubs.li/Q03LTrqr0 #GenAIinSupport #CustomerServiceAI #InferenceOptimization #EnterpriseAI #AppliedAISummit
1/3 In this blog article, learn about the key techniques, like pruning, model quantization, and hardware acceleration, that make inference more efficient. #MultimodalAI #LLMs #InferenceOptimization #AnkursNewsletter
Supercharge your AI with lightning-fast inference! 🚀 Post-training quantization techniques like AWQ and GPTQ trim down your models without sacrificing smarts—boosting speed and slashing compute costs. Ready to optimize your LLMs for real-world performance? #InferenceOptimization…
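A minimal sketch of what serving one of those post-training-quantized models looks like, assuming Hugging Face transformers with the autoawq extra installed; the checkpoint name is an example AWQ repo, not a recommendation.

```python
# Hedged sketch: loading an already-AWQ-quantized 4-bit checkpoint with
# Hugging Face transformers (needs the autoawq package installed).
# The model id below is an example repo name, not a recommendation.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Mistral-7B-Instruct-v0.2-AWQ"  # example 4-bit AWQ repo

tokenizer = AutoTokenizer.from_pretrained(model_id)
# The quantization config ships inside the checkpoint, so a plain
# from_pretrained call yields 4-bit weights and a smaller memory footprint.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Summarize AWQ in one sentence.", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=48)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```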
Stanford Researchers Explore Inference Compute Scaling in Language Models: Achieving Enhanced Performance and Cost Efficiency through Repeated Sampling itinai.com/stanford-resea… #AIAdvancements #InferenceOptimization #RepeatedSampling #AIApplications #EvolveWithAI #ai #news #llm…
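The core of the repeated-sampling idea fits in a few lines. A minimal sketch follows, where `generate_candidate` and `verifier_score` are hypothetical stand-ins for a sampled model call and a checker (unit tests, a reward model); it is not the paper's exact setup.

```python
# Hedged sketch of repeated sampling (best-of-N): draw several candidate
# answers from a cheap model, then keep the one a verifier scores highest.
# `generate_candidate` and `verifier_score` are hypothetical stand-ins for
# your actual model call and checker (unit tests, a reward model, etc.).
import random

def generate_candidate(prompt: str) -> str:
    # Placeholder for one sampled LLM completion (temperature > 0).
    return f"candidate-{random.randint(0, 9)} for {prompt!r}"

def verifier_score(answer: str) -> float:
    # Placeholder for a verifier, e.g. passing unit tests or a reward model.
    return random.random()

def best_of_n(prompt: str, n: int = 16) -> str:
    candidates = [generate_candidate(prompt) for _ in range(n)]
    return max(candidates, key=verifier_score)

print(best_of_n("Write a function that reverses a linked list."))
```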
3/3 Explore our comprehensive guide on inference optimization strategies for LLMs here: buff.ly/3R7QFqK 🔁 Spread this thread with your audience by Retweeting this tweet #MultimodalAI #LLMs #InferenceOptimization #AnkursNewsletter
What we’re building 🏗️, shipping 🚢 and sharing 🚀 tomorrow: Inference Optimization with GPTQ Learn how GPTQ’s “one-shot weight quantization” compares to other leading techniques like AWQ Start optimizing: bit.ly/InferenceGPTQ?… #LLMs #GPTQ #InferenceOptimization
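For a concrete picture before the session: one-shot weight quantization means converting pretrained weights in a single calibration pass, with no retraining. Below is a minimal sketch using transformers' GPTQConfig (requires the optimum and auto-gptq extras); the model id and calibration corpus are illustrative choices, not the event's material.

```python
# Hedged sketch of GPTQ's one-shot calibration: quantize pretrained weights
# to 4 bits in a single pass over a small calibration set, no retraining.
# Uses transformers' GPTQConfig (needs optimum/auto-gptq installed); the
# model id is an example.
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-1.3b"  # example model to quantize
tokenizer = AutoTokenizer.from_pretrained(model_id)

# "c4" tells the quantizer to draw calibration samples from the C4 corpus.
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=gptq_config, device_map="auto"
)
model.save_pretrained("opt-1.3b-gptq-4bit")  # reusable quantized checkpoint
```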
L40S GPUs optimize Llama 3 8B inference at $0.00037/request. Achieve extreme throughput for small LLMs. Benchmark your model. get.runpod.io/oyksj6fqn1b4 #LLM #Llama3 #InferenceOptimization #CostPerRequest
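As a sanity check on figures like that, cost per request is just GPU price divided by sustained throughput. The numbers below are invented to show the arithmetic, not RunPod's actual pricing or benchmark results.

```python
# Back-of-envelope check: how a cost-per-request figure falls out of a GPU's
# hourly price and sustained throughput. All numbers here are illustrative
# assumptions, not RunPod's actual pricing or benchmark results.
gpu_cost_per_hour = 1.00        # assumed L40S rental price, USD/hour
requests_per_second = 0.75      # assumed sustained throughput per GPU

requests_per_hour = requests_per_second * 3600
cost_per_request = gpu_cost_per_hour / requests_per_hour
print(f"${cost_per_request:.5f}/request")  # ~ $0.00037 with these assumptions
```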
DeepMind and UC Berkeley show how to make the most of LLM inference-time compute | VentureBeat #AIcoverage #InferenceOptimization #TestTimeCompute #LLMperformance prompthub.info/40031/
prompthub.info
DeepMind and UC Berkeley show how to make the most of LLM inference-time compute | VentureBeat - PromptHub
Because large language models (LLMs) are costly to train and slow to run, more compute cycles are being spent at inference to improve performance…
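One simple way to turn extra inference-time compute into accuracy, in the spirit of that article (though not DeepMind's exact method), is self-consistency: sample several answers and majority-vote. In the sketch below, `sample_answer` is a hypothetical stand-in for a real sampled model call.

```python
# Hedged sketch of one way to spend extra inference-time compute:
# self-consistency, i.e. sample several reasoning paths and majority-vote
# the final answer. `sample_answer` is a hypothetical stand-in for a real
# sampled LLM call; this is not DeepMind's exact method.
import random
from collections import Counter

def sample_answer(question: str) -> str:
    # Placeholder for one sampled chain-of-thought run ending in an answer.
    return random.choice(["42", "42", "41"])

def self_consistency(question: str, n: int = 11) -> str:
    votes = Counter(sample_answer(question) for _ in range(n))
    answer, _count = votes.most_common(1)[0]
    return answer

print(self_consistency("What is 6 * 7?"))
```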
Check out how TensorRT-LLM Speculative Decoding can boost inference throughput by up to 3.6x! #TensorRT #InferenceOptimization #AI #NVIDIA #LLM #DeepLearning 🚀🔥 developer.nvidia.com/blog/tensorrt-…
developer.nvidia.com
TensorRT-LLM Speculative Decoding Boosts Inference Throughput by up to 3.6x | NVIDIA Technical Blog
NVIDIA TensorRT-LLM support for speculative decoding now provides over 3x the speedup in total token throughput. TensorRT-LLM is an open-source library that provides blazing-fast inference support...
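The underlying draft-and-verify mechanism is easy to demo outside TensorRT-LLM. Below is a minimal sketch using Hugging Face transformers' assisted generation, where a small draft model proposes tokens and the large target verifies them in one pass; the model ids are examples, and the draft must share the target's tokenizer.

```python
# Not TensorRT-LLM itself: a minimal sketch of the same draft-and-verify idea
# using Hugging Face transformers' assisted generation. A small draft model
# proposes tokens; the large target model accepts or rejects them in one pass.
# Model ids are examples; pick a draft model sharing the target's tokenizer.
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "facebook/opt-6.7b"   # example target model
draft_id = "facebook/opt-125m"    # example draft model, same tokenizer

tokenizer = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(target_id, device_map="auto")
draft = AutoModelForCausalLM.from_pretrained(draft_id, device_map="auto")

inputs = tokenizer("Speculative decoding works by", return_tensors="pt").to(target.device)
# assistant_model enables speculative (assisted) decoding in transformers.
output = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```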
Inference is where great AI products either scale—or burn out. 2025’s best AI infra teams aren’t just using better models… They’re running smarter pipelines. This thread: how quantization, batching & caching supercharge LLM apps. #LLMOps #InferenceOptimization #AIInfra
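Of the three levers in that thread, caching is the quickest win. Here is a minimal sketch of response caching, assuming deterministic (temperature 0) requests; `call_model` is a hypothetical stand-in for the real inference call.

```python
# Hedged sketch of the caching leg of that pipeline: memoize deterministic
# completions keyed by a hash of (model, prompt, decoding params), so repeat
# requests skip the GPU entirely. `call_model` is a hypothetical stand-in.
import hashlib
import json

_cache: dict[str, str] = {}

def call_model(model: str, prompt: str, temperature: float) -> str:
    return f"completion from {model}"  # placeholder for the real inference call

def cached_complete(model: str, prompt: str, temperature: float = 0.0) -> str:
    key = hashlib.sha256(
        json.dumps([model, prompt, temperature]).encode()
    ).hexdigest()
    # Only reuse results for deterministic (temperature == 0) requests.
    if temperature == 0.0 and key in _cache:
        return _cache[key]
    result = call_model(model, prompt, temperature)
    if temperature == 0.0:
        _cache[key] = result
    return result
```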
Capital is also chasing compute arbitrage. Startups using: – Smart model quantization – Faster inference on CPUs – Sovereign training infra Own the stack, own the scale. #computeedge #inferenceoptimization #quantization
Model Layer = Swappable Core GPT-4, Claude, Gemini, Mixtral—pick your poison. Top teams don’t pick one. They route by: → Task → Latency → Cost → Accuracy Models are pipes. Routing is strategy. #LLMs #ModelOps #InferenceOptimization
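A minimal sketch of that routing idea: keep a table of per-model cost, latency, and quality, and pick the cheapest model that clears the request's floors. The entries and numbers below are invented placeholders, not benchmark data.

```python
# Hedged sketch of task-aware model routing. The table values are invented
# placeholders, not real pricing, latency, or quality measurements.
MODELS = {
    "gpt-4":   {"cost": 30.0, "latency_ms": 1200, "quality": 0.95},
    "claude":  {"cost": 15.0, "latency_ms": 900,  "quality": 0.93},
    "mixtral": {"cost": 0.6,  "latency_ms": 400,  "quality": 0.85},
}

def route(task: str, max_latency_ms: int, min_quality: float) -> str:
    """Return the cheapest model meeting the latency and quality floors."""
    eligible = [
        name for name, m in MODELS.items()
        if m["latency_ms"] <= max_latency_ms and m["quality"] >= min_quality
    ]
    if not eligible:
        raise ValueError(f"no model satisfies constraints for task {task!r}")
    return min(eligible, key=lambda name: MODELS[name]["cost"])

print(route("summarize ticket", max_latency_ms=1000, min_quality=0.9))  # claude
```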
Excited to read about the Large Transformer Model Inference Optimization by @lilianweng! This article provides valuable insights on improving Transformers. Don't miss it! 👉🔍 #InferenceOptimization Check out the article here: lilianweng.github.io/posts/2023-01-…
Speed is the Surprise Benefit Quantized models = smaller memory Local batching = faster inference No internet = instant UX Local Mistral can outperform GPT-4 on simple tasks—because latency wins #fastLLMs #inferenceoptimization
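A minimal sketch of the local setup that post describes, assuming llama-cpp-python and a 4-bit GGUF Mistral checkpoint already on disk; the file path and quantization level (Q4_K_M) are assumptions, not requirements.

```python
# Hedged sketch: a 4-bit GGUF Mistral served on-device via llama-cpp-python.
# The model path is an assumed local file; Q4_K_M is one common quantization
# level, not a requirement.
from llama_cpp import Llama

llm = Llama(
    model_path="./mistral-7b-instruct.Q4_K_M.gguf",  # assumed local checkpoint
    n_ctx=4096,     # context window
    n_threads=8,    # CPU threads; tune to your machine
)

out = llm("Q: Reverse the string 'hello'. A:", max_tokens=32)
print(out["choices"][0]["text"])
```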
Ten Effective Strategies to Lower Large Language Model (LLM) Inference Costs itinai.com/ten-effective-… #AIcosts #InferenceOptimization #ModelEfficiency #AIperformance #AIstrategy #ai #news #llm #ml #research #ainews #innovation #artificialintelligence #machinelearning #technolo…