
Anoop Saha

@asyncanoop

I correlate; therefore, I cause! Building and deploying RL post training Infrastructure @awscloud. #hyperpod #100kGPU

Pinned

When in doubt, write CUDA kernels.

there are probably fewer than ~100 people alive who can write performant CUDA kernels for training specifically. definitely quite a few more if we include inference optimizations; backward kernels, not so much
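The "backward kernels" point has a concrete shape. For a single matmul `Y = X @ W`, the backward pass needs two additional matmuls with transposed operands, so a kernel tuned for the forward layout doesn't automatically serve the backward pass. A minimal NumPy sketch of the math (notation mine, not from the tweet):

```python
import numpy as np

# Forward: Y = X @ W. Backward needs two extra matmuls with transposed
# operands; the transposes change the memory-access pattern that a tuned
# forward kernel was optimized for, which is part of why fast backward
# kernels are rarer.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
W = rng.standard_normal((8, 3))
Y = X @ W

dY = np.ones_like(Y)   # upstream gradient
dX = dY @ W.T          # gradient w.r.t. activations
dW = X.T @ dY          # gradient w.r.t. weights

assert dX.shape == X.shape and dW.shape == W.shape
```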



Anoop Saha reposted

Meta just dropped this paper that spills the secret sauce of reinforcement learning (RL) on LLMs. It lays out an RL recipe, uses 400,000 GPU hours, and posits a scaling law for performance with more compute in RL, like the classic pretraining scaling laws. Must read for AI nerds.


Don’t do that. I REPEAT. Do not work on an FPGA.

OpenAI is building its own chips and you're still trying to learn CUDA? Pick up an FPGA and learn.



👀👀👀

SCOOP: TSMC IS LOSING MONEY ON NEW 2NM CHIP RAMP!!!!!! THERE ARE NEGATIVE GROSS MARGINS ON A NEW PRODUCT, THERE ARE NO CLEAR PLANS ON HOW TO IMPROVE MARGIN ACCORDING TO DOCUMENTS ACQUIRED BY ME (SEC FILINGS)



Huge news for both AMD and OpenAI!
- 6 GW of AMD MI450, Helios, and next-gen deployment for OpenAI (NVIDIA's deal with OpenAI is at 10 GW)
- 160 million AMD shares to OpenAI in tranches
- Incremental revenue of over $100B over the next few years
wsj.com/tech/ai/openai…


There are two main problems in AI worth solving. How to build an effective model cost-effectively, one that can learn continuously from user interactions. And how to serve these models cost-effectively, maximizing the tokens per $$. And we are still a year or two out…
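The "tokens per $$" side reduces to simple arithmetic: dollars per GPU-hour divided by tokens served per GPU-hour. A back-of-the-envelope sketch, where both input numbers are illustrative assumptions rather than vendor figures:

```python
# Back-of-the-envelope serving cost in $ per 1M output tokens.
# Both numbers below are assumed for illustration, not real quotes.
gpu_cost_per_hour = 2.50    # $/GPU-hour (assumed)
tokens_per_second = 1500    # aggregate decode throughput per GPU (assumed)

tokens_per_hour = tokens_per_second * 3600
cost_per_million_tokens = gpu_cost_per_hour / tokens_per_hour * 1_000_000
print(f"${cost_per_million_tokens:.3f} per 1M tokens")
```

Doubling throughput at the same GPU price halves the number, which is why serving-side kernel and scheduling work translates so directly into margin.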


In between all the AI-for-science news, the coolest effort is @CAIAorg, the cancer AI alliance, bringing folks together across the industry. If it can make meaningful progress, the world would be a better place.


Anoop Saha reposted

🚀 This is a call to tackle the hard-tier problems where RLVR pipelines stall with pass@k=0 (no reward, no gradient). To push further, we need to train on hard subsets and leverage signals from tough data — that’s where discovery happens. 👉 github.com/sunblaze-ucb/a… The grand…
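The "pass@k=0 means no reward, no gradient" failure is easy to see in group-relative schemes like GRPO: advantages are rewards normalized within a rollout group, so if every rollout scores 0 the advantages are all zero and the policy gradient vanishes. A toy sketch (function name and reward values are mine, for illustration):

```python
import numpy as np

def group_advantages(rewards):
    """GRPO-style group-normalized advantages: (r - mean) / std."""
    r = np.asarray(rewards, dtype=float)
    std = r.std()
    if std == 0.0:                # every rollout scored the same
        return np.zeros_like(r)   # zero advantage -> zero policy gradient
    return (r - r.mean()) / std

# Hard problem: all rollouts fail verification (pass@k = 0),
# so the group carries no learning signal at all.
print(group_advantages([0, 0, 0, 0]))

# Mixed outcomes are what actually produce a gradient.
print(group_advantages([1, 0, 0, 0]))
```

This is why training on the hard subset requires extracting some other signal (partial credit, curricula, exploration bonuses) rather than just re-running the verifier.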


My entire timeline is @periodiclabs content. Great job to the team.

Can't overstate the talent level I'm surrounded by at @PeriodicLabs. The team is largely a mix of:
- researchers from OpenAI, GDM, FAIR/MSL
- PhDs from Stanford, MIT, Harvard, etc.
- leaders in open-source ML
- world-class physics professors
Humbled to learn from them every day.



Anoop Saha reposted

New in-depth blog post time: "Inside NVIDIA GPUs: Anatomy of high performance matmul kernels". If you want to deeply understand how one writes state of the art matmul kernels in CUDA read along. (Remember matmul is the single most important operation that transformers execute…

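The core idea behind high-performance matmul kernels is blocking: compute the output in tiles so each operand tile is reused many times before being evicted. A NumPy sketch of that loop structure (on a real GPU, each `(i, j)` tile maps to a thread block and the K-tiles are staged through shared memory; this is an illustration, not the blog's code):

```python
import numpy as np

def tiled_matmul(A, B, tile=32):
    """Blocked matmul: the loop nest a CUDA kernel maps onto thread
    blocks, accumulating each output tile from K-dimension tiles."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, tile):          # output row tiles
        for j in range(0, N, tile):      # output column tiles
            for k in range(0, K, tile):  # accumulate over K tiles
                C[i:i+tile, j:j+tile] += (
                    A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
                )
    return C

rng = np.random.default_rng(1)
A = rng.standard_normal((64, 48))
B = rng.standard_normal((48, 80))
assert np.allclose(tiled_matmul(A, B, tile=16), A @ B)
```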

“The reports of my demise are highly exaggerated” - GRPO, circa 2025.

We may get a GRPO update in the next week. Will @sybilhyz cook again??



This take is accurate: open-weight models are not open source. Nobody is releasing their training code or even the training data. However, the true assets to the ecosystem are open-source technologies, from DeepEP by DeepSeek, to verl by ByteDance, PyTorch and Ray, SGLang and…

Anthropic CEO Dario on open-source models:
- Open weights for large models are not the same as open-source software; there is no reverse contribution from a developer community.
- Open-sourcing is mostly a way to attract attention; users only care whether the model is good. Whether DeepSeek is open or not hardly matters; as a very large model, it is difficult to run inference on.
- Open source does not mean free; running inference servers has real costs.



The cruelty is the point….

Those on an H-1B cannot return to the US from tomorrow (Sunday) unless they pay $100K. This is an out-of-the-blue presidential action. We'll see software engineers stranded abroad. One easy-to-predict outcome: those on US visas will travel less… for work, for conferences, etc.



This is huge and can alone bring down the cost and duration of RL significantly.

Introducing checkpoint-engine: our open-source, lightweight middleware for efficient, in-place weight updates in LLM inference engines, especially effective for RL.
✅ Update a 1T model on thousands of GPUs in ~20s
✅ Supports both broadcast (sync) & P2P (dynamic) updates
✅…

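The reason this cuts RL wall-clock time: after each training step, the inference fleet gets the new weights written directly into its existing buffers instead of reloading a checkpoint from disk. A toy sketch of that broadcast pattern (function and tensor names are mine, not the actual checkpoint-engine API):

```python
import numpy as np

def broadcast_update(trainer_weights, workers):
    """Push the trainer's tensors into each worker's buffers in place."""
    for name, tensor in trainer_weights.items():
        for w in workers:
            # In-place copy: the worker keeps its allocated buffers,
            # only the data changes, so serving never restarts.
            np.copyto(w[name], tensor)

trainer = {"lm_head": np.full((4, 4), 2.0)}
workers = [{"lm_head": np.zeros((4, 4))} for _ in range(3)]
broadcast_update(trainer, workers)
assert all(np.array_equal(w["lm_head"], trainer["lm_head"]) for w in workers)
```

At real scale the copy is a collective over NVLink/RDMA rather than `np.copyto`, but the contract is the same: same buffers, new values, no reload.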


Anoop Saha reposted

My Google colleagues Norm Jouppi & Sridhar Lakshmanamurthy gave a talk today at Hot Chips on TPUv7 ("Ironwood"). The TPUv7 system offers 9216 chips / pod (42.5 exaflops of fp8), but we can scale across many of these pods to provide multiple zettaflops.

Google presents for the first time ever their TPUv7 block diagram at the Hot Chips conference. TPUv7 (formerly known as TPUv6p, internally called ghostfish) has 8 stacks of HBM3e memory and 4 medium-size systolic arrays, and is connected in a 3D torus with a scale-up world size of up to…
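A quick sanity check on the pod-level figures above, assuming throughput is flat across chips (my arithmetic, not from the talk):

```python
# Per-chip math implied by the quoted pod-level numbers:
# 9216 chips per pod delivering 42.5 exaflops of fp8.
chips_per_pod = 9216
pod_fp8_exaflops = 42.5

per_chip_petaflops = pod_fp8_exaflops * 1e18 / chips_per_pod / 1e15
print(f"{per_chip_petaflops:.2f} PFLOP/s fp8 per chip")
```

That works out to roughly 4.6 fp8 PFLOP/s per chip, which is the number to compare against per-GPU fp8 specs.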



It’s Hot Chips day!! Looking forward to the talks this afternoon…


This is an excellent overview. I am surprised, despite all the growth in AI, how little material there is on core GPU architecture. The JAX team is changing that for good.

Today we're putting out an update to the JAX TPU book, this time on GPUs. How do GPUs work, especially compared to TPUs? How are they networked? And how does this affect LLM training? 1/n



Anoop Saha reposted

Amazon’s chip division is working with Taiwan’s Alchip on Trainium3 and Trainium4 to bring the chips to mass production on TSMC 3nm and 2nm processes, respectively, media reports say, citing unnamed industry sources. Trainium3 will be in mass production in the first quarter of 2026.…

