
Anoop Saha

@asyncanoop

I correlate; therefore, I cause! Building and deploying RL post training Infrastructure @awscloud. #hyperpod #100kGPU

Pinned

When in doubt, write CUDA kernels.

there are probably fewer than ~100 people alive who can write performant CUDA kernels for training specifically. definitely quite a few more if we include inference optimizations; backward kernels, not so much
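The "backward kernels" point has a concrete shape. For a single matmul `Y = X @ W`, the backward pass needs two additional matmuls with transposed operands, so a kernel tuned for the forward layout doesn't automatically serve the backward pass. A minimal NumPy sketch of the math (notation mine, not from the tweet):

```python
import numpy as np

# Forward: Y = X @ W. Backward needs two extra matmuls with transposed
# operands; the transposes change the memory-access pattern that a tuned
# forward kernel was optimized for, which is part of why fast backward
# kernels are rarer.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
W = rng.standard_normal((8, 3))
Y = X @ W

dY = np.ones_like(Y)   # upstream gradient
dX = dY @ W.T          # gradient w.r.t. activations
dW = X.T @ dY          # gradient w.r.t. weights

assert dX.shape == X.shape and dW.shape == W.shape
```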



Anoop Saha reposted

Meta just dropped this paper that spills the secret sauce of reinforcement learning (RL) on LLMs. It lays out an RL recipe, uses 400,000 GPU hours, and posits a scaling law for performance with more compute in RL, like the classic pretraining scaling laws. Must read for AI nerds.


Don’t do that. I REPEAT. Do not work on an FPGA.

OpenAI is building its own chips and you're still trying to learn CUDA? Pick up an FPGA and learn.



👀👀👀

SCOOP: TSMC IS LOSING MONEY ON NEW 2NM CHIP RAMP!!!!!! THERE ARE NEGATIVE GROSS MARGINS ON A NEW PRODUCT, THERE ARE NO CLEAR PLANS ON HOW TO IMPROVE MARGIN ACCORDING TO DOCUMENTS ACQUIRED BY ME (SEC FILINGS)



Huge news for both AMD and OpenAI!
- 6 GW of AMD MI450, Helios, and next-gen deployment for OpenAI (NVIDIA's deal with OpenAI is at 10 GW)
- 160 million AMD shares to OpenAI in tranches
- Incremental revenue of over $100B over the next few years
wsj.com/tech/ai/openai…


There are two main problems in AI worth solving. How to build an effective model cost-effectively, one that can learn continuously from user interactions. And how to serve these models cost-effectively, maximizing the tokens per $$. And we are still a year or two out…
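The "tokens per $$" side reduces to simple arithmetic: dollars per GPU-hour divided by tokens served per GPU-hour. A back-of-the-envelope sketch, where both input numbers are illustrative assumptions rather than vendor figures:

```python
# Back-of-the-envelope serving cost in $ per 1M output tokens.
# Both numbers below are assumed for illustration, not real quotes.
gpu_cost_per_hour = 2.50    # $/GPU-hour (assumed)
tokens_per_second = 1500    # aggregate decode throughput per GPU (assumed)

tokens_per_hour = tokens_per_second * 3600
cost_per_million_tokens = gpu_cost_per_hour / tokens_per_hour * 1_000_000
print(f"${cost_per_million_tokens:.3f} per 1M tokens")
```

Doubling throughput at the same GPU price halves the number, which is why serving-side kernel and scheduling work translates so directly into margin.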


In between all the AI-for-science news, the coolest effort is @CAIAorg, the cancer AI alliance, bringing folks together across the industry. If it can make meaningful progress, the world would be a better place.


Anoop Saha reposted

🚀 This is a call to tackle the hard-tier problems where RLVR pipelines stall with pass@k=0 (no reward, no gradient). To push further, we need to train on hard subsets and leverage signals from tough data — that’s where discovery happens. 👉 github.com/sunblaze-ucb/a… The grand…
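The "pass@k=0 means no reward, no gradient" failure is easy to see in group-relative schemes like GRPO: advantages are rewards normalized within a rollout group, so if every rollout scores 0 the advantages are all zero and the policy gradient vanishes. A toy sketch (function name and reward values are mine, for illustration):

```python
import numpy as np

def group_advantages(rewards):
    """GRPO-style group-normalized advantages: (r - mean) / std."""
    r = np.asarray(rewards, dtype=float)
    std = r.std()
    if std == 0.0:                # every rollout scored the same
        return np.zeros_like(r)   # zero advantage -> zero policy gradient
    return (r - r.mean()) / std

# Hard problem: all rollouts fail verification (pass@k = 0),
# so the group carries no learning signal at all.
print(group_advantages([0, 0, 0, 0]))

# Mixed outcomes are what actually produce a gradient.
print(group_advantages([1, 0, 0, 0]))
```

This is why training on the hard subset requires extracting some other signal (partial credit, curricula, exploration bonuses) rather than just re-running the verifier.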


My entire timeline is @periodiclabs content. Great job to the team.

Can't overstate the talent level I'm surrounded by at @PeriodicLabs. The team is largely a mix of:
- researchers from OpenAI, GDM, FAIR/MSL
- PhDs from Stanford, MIT, Harvard, etc.
- leaders in open-source ML
- world-class physics professors
Humbled to learn from them every day.



Anoop Saha reposted

New in-depth blog post time: "Inside NVIDIA GPUs: Anatomy of high performance matmul kernels". If you want to deeply understand how one writes state of the art matmul kernels in CUDA read along. (Remember matmul is the single most important operation that transformers execute…

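The core idea behind high-performance matmul kernels is blocking: compute the output in tiles so each operand tile is reused many times before being evicted. A NumPy sketch of that loop structure (on a real GPU, each `(i, j)` tile maps to a thread block and the K-tiles are staged through shared memory; this is an illustration, not the blog's code):

```python
import numpy as np

def tiled_matmul(A, B, tile=32):
    """Blocked matmul: the loop nest a CUDA kernel maps onto thread
    blocks, accumulating each output tile from K-dimension tiles."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, tile):          # output row tiles
        for j in range(0, N, tile):      # output column tiles
            for k in range(0, K, tile):  # accumulate over K tiles
                C[i:i+tile, j:j+tile] += (
                    A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
                )
    return C

rng = np.random.default_rng(1)
A = rng.standard_normal((64, 48))
B = rng.standard_normal((48, 80))
assert np.allclose(tiled_matmul(A, B, tile=16), A @ B)
```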

“The reports of my demise are highly exaggerated” - GRPO, circa 2025.

We may get a GRPO update in the next week. Will @sybilhyz cook again??



This take is accurate: open-weight models are not open source. Nobody is releasing their training code or even the training data. However, the true assets to the ecosystem are open-source technologies, from DeepEP by DeepSeek, to verl by ByteDance, PyTorch and Ray, SGLang and…

Anthropic CEO Dario on open-source models:
- Open weights for large models are not the same as open-source software; there is no reverse contribution from a developer community.
- Open-sourcing is mostly a way to attract attention; users only care whether the model is good. Whether DeepSeek is open or not hardly matters; as a very large model, it is difficult to run inference on.
- Open source does not mean free; running inference servers has real costs.



The cruelty is the point….

Those on an H-1B cannot return to the US from tomorrow (Sunday) unless they pay $100K. This is an out-of-the-blue presidential action. We'll see software engineers stranded abroad. One easy-to-predict outcome: those on US visas will travel less… for work, for conferences, etc.



This is huge and can alone bring down the cost and duration of RL significantly.

Introducing checkpoint-engine: our open-source, lightweight middleware for efficient, in-place weight updates in LLM inference engines, especially effective for RL.
✅ Update a 1T model on thousands of GPUs in ~20s
✅ Supports both broadcast (sync) & P2P (dynamic) updates
✅…

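The reason this cuts RL wall-clock time: after each training step, the inference fleet gets the new weights written directly into its existing buffers instead of reloading a checkpoint from disk. A toy sketch of that broadcast pattern (function and tensor names are mine, not the actual checkpoint-engine API):

```python
import numpy as np

def broadcast_update(trainer_weights, workers):
    """Push the trainer's tensors into each worker's buffers in place."""
    for name, tensor in trainer_weights.items():
        for w in workers:
            # In-place copy: the worker keeps its allocated buffers,
            # only the data changes, so serving never restarts.
            np.copyto(w[name], tensor)

trainer = {"lm_head": np.full((4, 4), 2.0)}
workers = [{"lm_head": np.zeros((4, 4))} for _ in range(3)]
broadcast_update(trainer, workers)
assert all(np.array_equal(w["lm_head"], trainer["lm_head"]) for w in workers)
```

At real scale the copy is a collective over NVLink/RDMA rather than `np.copyto`, but the contract is the same: same buffers, new values, no reload.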


Anoop Saha reposted

My Google colleagues Norm Jouppi & Sridhar Lakshmanamurthy gave a talk today at Hot Chips on TPUv7 ("Ironwood"). The TPUv7 system offers 9216 chips / pod (42.5 exaflops of fp8), but we can scale across many of these pods to provide multiple zettaflops.

Google presents for the first time ever their TPUv7 block diagram at the Hot Chips conference. TPUv7 (formerly known as TPUv6p, internally called ghostfish) has 8 stacks of HBM3e memory and 4 medium-size systolic arrays, and is connected in a 3D torus with a scale-up world size of up to…
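A quick sanity check on the pod-level figures above, assuming throughput is flat across chips (my arithmetic, not from the talk):

```python
# Per-chip math implied by the quoted pod-level numbers:
# 9216 chips per pod delivering 42.5 exaflops of fp8.
chips_per_pod = 9216
pod_fp8_exaflops = 42.5

per_chip_petaflops = pod_fp8_exaflops * 1e18 / chips_per_pod / 1e15
print(f"{per_chip_petaflops:.2f} PFLOP/s fp8 per chip")
```

That works out to roughly 4.6 fp8 PFLOP/s per chip, which is the number to compare against per-GPU fp8 specs.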



It’s Hot Chips day!! Looking forward to the talks this afternoon…


This is an excellent overview. I am surprised, despite all the growth in AI, how little material there is on core GPU architecture. The JAX team is changing that for good.

Today we're putting out an update to the JAX TPU book, this time on GPUs. How do GPUs work, especially compared to TPUs? How are they networked? And how does this affect LLM training? 1/n



Anoop Saha reposted

Amazon’s chip division is working with Taiwan’s Alchip on Trainium3 and Trainium4 to bring the chips to mass production on TSMC 3nm and 2nm processes, respectively, media reports say, citing unnamed industry sources. Trainium3 will be in mass production in the first quarter of 2026.…

