青山依旧在,几度夕阳红 ("the green hills remain as ever; how many sunsets have burned red"), which means that all heroes and kings will die, no matter how great their achievements. This spirit may have been influenced by Buddhism after the Tang dynasty. Is this type of disillusionment unique to Chinese culture? Are there other cultures with the same spirit?
Reinforcement Learning of Large Language Models, Spring 2025 (UCLA) Great set of new lectures on reinforcement learning of LLMs. Covers a wide range of topics related to RLxLLMs such as basics/foundations, test-time compute, RLHF, and RL with verifiable rewards (RLVR).
We're sharing the insights and the technical report behind Mem-Agent, our 4B model for persistent memory in LLMs. How we built it, the benchmarks, and why it works:
Launching the Chinese translation project for "Agentic Design Patterns". Over the coming week I will translate the book via AI first-pass translation → AI cross-review → human close reading and polishing, and all translated content will be continuously updated in the open-source project: github.com/ginobefun/agen… The book is written by Antonio Gulli, with a foreword by Saurabh Tiwary, VP of Google Cloud AI, and Goldman Sachs CIO Marco Argenti…
Just as Design Patterns was once the bible of software engineering, this "Agentic Design Patterns", shared for free by a senior Google engineering lead, brings the first systematic set of design principles and best practices to the red-hot AI Agent field. 🔽
BTW, they released a deep dive on the FP8 KV cache of the main MLA: github.com/deepseek-ai/Fl… So the ratio is actually ≈1/5 compared to FP8 dense MLA.
As expected, NSA is not compatible with MLA, so DeepSeek chose another method: use a smaller (d=128) attention (without values) as the indexer. Asymptotic cost ratio = 128/576. In addition, the indexer uses FP8 while the main MLA uses 16-bit, so the effective ratio = 64/576 = 1/9.
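A quick sanity check of those ratios, treating "cost" as per-token cache width times bytes per element (the 576 and 128 widths are the ones quoted in these two tweets; everything else is back-of-the-envelope):

```python
# Back-of-the-envelope check of the indexer-vs-MLA ratios quoted above.
# Assumption: "cost" here means per-token cache width * bytes per element.

MLA_DIM = 576       # per-token main MLA cache width (as quoted)
INDEXER_DIM = 128   # per-token indexer key width (as quoted)
FP8, BF16 = 1, 2    # bytes per element

# Indexer in FP8 vs. main MLA in 16-bit:
print(INDEXER_DIM * FP8 / (MLA_DIM * BF16))   # 0.111... = 64/576 = 1/9

# Indexer in FP8 vs. main MLA also in FP8 (the follow-up above):
print(INDEXER_DIM * FP8 / (MLA_DIM * FP8))    # 0.222... ≈ 1/5
```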
Cool new blog post by Thinking Machines: LoRA is all you need for SFT and RL, even for medium-sized post-training runs. Some highlights: rank 64 or 128 seems to be very close to full FT in performance. Also interesting findings on how to set learning rates: the optimal FullFT…
LoRA makes fine-tuning more accessible, but it's unclear how it compares to full fine-tuning. We find that the performance often matches closely---more often than you might expect. In our latest Connectionism post, we share our experimental results and recommendations for LoRA.…
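For readers who have not used it, here is a minimal sketch of what a rank-64 LoRA adapter looks like; the dimensions and the alpha scaling are illustrative assumptions, not the setup from the Thinking Machines post:

```python
import numpy as np

# Minimal LoRA sketch: the frozen weight W is augmented with a low-rank update
# B @ A, and only A and B are trained. All dimensions here are illustrative.
d_in, d_out, rank, alpha = 2048, 2048, 64, 32

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in)) * 0.02   # frozen pretrained weight
A = rng.standard_normal((rank, d_in)) * 0.01    # trainable, small random init
B = np.zeros((d_out, rank))                     # trainable, zero init

def lora_forward(x):
    # y = W x + (alpha / rank) * B (A x); at init this equals the base model.
    return W @ x + (alpha / rank) * (B @ (A @ x))

x = rng.standard_normal(d_in)
print(lora_forward(x).shape)  # (2048,)
```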
How does @deepseek_ai Sparse Attention (DSA) work? It has 2 components: the Lightning Indexer and Sparse Multi-Latent Attention (MLA). The indexer keeps a small key cache of 128 per token (vs. 512 for MLA), scores the cached tokens against incoming queries, and passes the top-2048 tokens to Sparse MLA.
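A toy sketch of that selection step: score every cached token with the small indexer, then keep only the top-2048 positions for the sparse attention pass. The dot-product scoring and the dimensions are simplifications for illustration, not DeepSeek's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, idx_dim, k = 8192, 128, 2048   # 128-wide indexer keys, top-2048 kept

indexer_keys = rng.standard_normal((seq_len, idx_dim)).astype(np.float32)
query = rng.standard_normal(idx_dim).astype(np.float32)

# Score every cached token with the lightweight indexer (here: a dot product).
scores = indexer_keys @ query

# Keep only the top-k token positions; the sparse MLA attends to just these.
topk = np.argpartition(scores, -k)[-k:]
print(topk.shape)  # (2048,) token indices handed to Sparse MLA
```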
🚀 Introducing DeepSeek-V3.2-Exp — our latest experimental model! ✨ Built on V3.1-Terminus, it debuts DeepSeek Sparse Attention (DSA) for faster, more efficient training & inference on long context. 👉 Now live on App, Web, and API. 💰 API prices cut by 50%+! 1/n
ReasoningBank: memory for self-evolving LLM agents • Distills strategies from both successes & failures • Enables agents to learn, reuse, and improve over time • Outperforms prior memory methods on web & SWE tasks (+34.2% eff., –16% steps)
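A very rough sketch of the idea behind such a memory: distill a short strategy from each trajectory (success or failure) and retrieve relevant ones for new tasks. Everything here (the class names, the word-overlap retrieval) is a hypothetical stand-in, not the paper's implementation:

```python
from dataclasses import dataclass, field

@dataclass
class MemoryItem:
    task: str
    strategy: str     # distilled lesson, from a success OR a failure
    succeeded: bool

@dataclass
class ReasoningBank:
    items: list = field(default_factory=list)

    def distill(self, task, lesson, succeeded):
        # In the paper the distillation is done by an LLM; here it is a stub.
        self.items.append(MemoryItem(task, lesson, succeeded))

    def retrieve(self, task, k=3):
        # Toy retrieval: naive word overlap instead of embedding similarity.
        overlap = lambda m: len(set(m.task.split()) & set(task.split()))
        return sorted(self.items, key=overlap, reverse=True)[:k]

bank = ReasoningBank()
bank.distill("book a flight on the airline site", "confirm the date format first", True)
bank.distill("book a hotel room", "check the currency before submitting", False)
print([m.strategy for m in bank.retrieve("book a flight to SFO")])
```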
I quite enjoyed this, and it covers a bunch of topics that lack good introductory resources! 1. A bunch of GPU hardware details in one place (warp schedulers, shared memory, etc.) 2. A breakdown/walkthrough of reading PTX and SASS. 3. Some details/walkthroughs of a number of other…
New in-depth blog post time: "Inside NVIDIA GPUs: Anatomy of high performance matmul kernels". If you want to deeply understand how one writes state-of-the-art matmul kernels in CUDA, read along. (Remember, matmul is the single most important operation that transformers execute…
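As a tiny illustration of the tiling idea those kernels are built around, here is a NumPy stand-in for how a CUDA kernel blocks the computation into shared-memory tiles; this is a conceptual sketch, not the blog's actual kernels:

```python
import numpy as np

def tiled_matmul(A, B, tile=64):
    # Block the computation the way a CUDA kernel does with shared-memory
    # tiles: each (i, j) output tile accumulates partial products over k-tiles.
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=np.float32)
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            for k in range(0, K, tile):
                C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
    return C

A = np.random.rand(256, 256).astype(np.float32)
B = np.random.rand(256, 256).astype(np.float32)
print(np.allclose(tiled_matmul(A, B), A @ B, atol=1e-3))  # True
```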
Introducing LLM.Q: Quantized LLM training in pure CUDA/C++! With LLM.Q, you can train your own LLM on consumer GPUs with natively quantized matmuls, on single workstations. No datacenter required. Inspired by @karpathy's llm.c, but natively quantized.
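For intuition on what a natively quantized matmul means, here is a minimal symmetric-int8 sketch with per-tensor scales and int32 accumulation; this illustrates the general technique, not LLM.Q's actual quantization scheme:

```python
import numpy as np

def quantize_int8(x):
    # Symmetric per-tensor quantization: map the largest magnitude to 127.
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_matmul(A, B):
    qa, sa = quantize_int8(A)
    qb, sb = quantize_int8(B)
    # Accumulate in int32, then dequantize with the product of the two scales.
    acc = qa.astype(np.int32) @ qb.astype(np.int32)
    return acc.astype(np.float32) * (sa * sb)

A = np.random.randn(64, 128).astype(np.float32)
B = np.random.randn(128, 32).astype(np.float32)
print(np.abs(int8_matmul(A, B) - A @ B).max())  # small error vs fp32 matmul
```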
Today Thinking Machines Lab is launching our research blog, Connectionism. Our first blog post is “Defeating Nondeterminism in LLM Inference” We believe that science is better when shared. Connectionism will cover topics as varied as our research is: from kernel numerics to…
Some papers are timeless and "essential reading"! Understanding Integer Overflow in C/C++ users.cs.utah.edu/~regehr/papers…
Recommendations for free GPUs or cheap compute, suitable for practicing fine-tuning and training of models under 10B. I recommend these environments to my students for hands-on model-training practice. github.com/ninehills/blog…
Good morning with a banger resource on vLLM
The Ultimate Guide to Fine-Tuning LLMs. Here's a 7-step roadmap that makes it simple. This technical report takes a deep look at the fine-tuning process for LLMs, combining both theory and practice. 1. Introduction 2. Seven Stage Fine-Tuning Pipeline ↳ Stage-1: Data…
btw it's possible to use mantissa-free weights. The Chief Scientist of NVIDIA gave a keynote covering this. nvidia.com/en-us/on-deman…
Not sure if this is the case now, but some miss that UE8M0 (Unsigned, Exponent 8, Mantissa 0) used for V3.1 is a microscaling data format. They're NOT using mantissa-free *weights*. It's just for large-dynamic-range, cheaply applied scale factors.
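To make that concrete: a UE8M0 value is just an 8-bit unsigned exponent with no mantissa, i.e. a pure power-of-two scale, which is where the large dynamic range and the cheap application come from. A sketch of picking such a block scale follows; the bias, block size, and rounding policy are assumptions for illustration, not DeepSeek's exact recipe:

```python
import numpy as np

def ue8m0_scale_for_block(block, bias=127):
    # UE8M0 stores only an 8-bit exponent, so the scale is a power of two.
    # Choose the smallest power-of-two scale that covers the block's max
    # magnitude (bias and clipping here are illustrative assumptions).
    e = int(np.ceil(np.log2(np.abs(block).max() + 1e-30)))
    code = int(np.clip(e + bias, 0, 254))      # value stored in the 8-bit field
    return code, 2.0 ** (code - bias)

block = np.random.randn(32).astype(np.float32) * 1e-3   # one microscaling block
code, scale = ue8m0_scale_for_block(block)
print(code, scale)                  # exponent code and its power-of-two scale
print(np.abs(block / scale).max())  # scaled values now sit in a narrow range
```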
RL Is Neither a Panacea Nor a Mirage: Understanding Supervised vs. Reinforcement Learning Fine-Tuning for LLMs. "RL primarily counteracts SFT-induced directional drift rather than finding new solutions. Our spectrum-aware analysis highlights inexpensive recovery knobs: low-rank…
TogetherAI's Chief Scientist @tri_dao announced Flash Attention v4 at the Hot Chips conference; it is up to 22% faster than the attention kernel implementation from NVIDIA's cuDNN library. Tri Dao was able to achieve this with 2 key algorithmic changes. Firstly, it uses a new online…
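For context on the "online" part: attention kernels compute softmax in a streaming fashion, keeping a running max and a running normalizer so that key blocks can be processed one at a time. Below is a generic online-softmax sketch of that trick, not Flash Attention v4's new variant:

```python
import numpy as np

def online_softmax(scores, block=4):
    # Process scores block by block, maintaining a running max (m) and a
    # running normalizer (l); this is the standard online-softmax recurrence.
    m, l = -np.inf, 0.0
    for i in range(0, len(scores), block):
        s = scores[i:i+block]
        m_new = max(m, s.max())
        l = l * np.exp(m - m_new) + np.exp(s - m_new).sum()
        m = m_new
    return np.exp(scores - m) / l   # final pass uses the accumulated stats

x = np.random.randn(16)
print(np.allclose(online_softmax(x), np.exp(x) / np.exp(x).sum()))  # True
```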
Understanding the Linux Virtual Memory Manager
I've written the full story of Attention Sinks — a technical deep-dive into how the mechanism was developed and how our research ended up being used in OpenAI's new OSS models. For those interested in the details: hanlab.mit.edu/blog/streaming…
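The core trick, for those who do not click through: keep a handful of initial "sink" tokens in the KV cache alongside a sliding window of recent tokens. A toy cache-eviction sketch under that assumption (not the StreamingLLM code; the sink count and window size are made up):

```python
def streaming_kv_positions(seq_len, n_sink=4, window=1020):
    # Keep the first n_sink positions (the attention sinks) plus the most
    # recent `window` positions; everything else is evicted from the KV cache.
    if seq_len <= n_sink + window:
        return list(range(seq_len))
    return list(range(n_sink)) + list(range(seq_len - window, seq_len))

keep = streaming_kv_positions(seq_len=100_000)
print(len(keep), keep[:6], keep[-2:])  # 1024 [0, 1, 2, 3, 98980, 98981] [99998, 99999]
```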
just published a curated list of amazing blogs/articles on ai. highly selective. feel free to comment if you find anything interesting, will update it async. (link in replies)