Jongsu Liam Kim
@sky0bserver
A CFD enthusiast turned ML researcher. Senior Researcher at LG CNS AI Lab. Opinions are solely my own and do not reflect those of my employer.
tfw you find a good cuda blog that you can actually follow along and reproduce the results because it’s optimized for your gpu arch
The most intuitive explanation of floats I've ever come across, courtesy of @fabynou fabiensanglard.net/floating_point…
Optimizing Attention on GPUs by Exploiting GPU Architectural NUMA Effects Swizzled Head-first Mapping cuts attention latency on chiplet GPUs by making scheduling NUMA-aware. It maps all row-blocks of a head (or KV-group in GQA) to the same XCD, so K/V first-touch stays hot in…
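A toy sketch of the mapping idea as I read it (hypothetical constants and names, not the paper's code):

```python
NUM_XCDS = 8  # chiplet dies on the GPU (illustrative)

def round_robin_xcd(head, row_block, blocks_per_head):
    # Default launch order: consecutive row-blocks of the same head land on
    # different XCDs, so every die ends up pulling that head's K/V over the fabric.
    return (head * blocks_per_head + row_block) % NUM_XCDS

def head_first_xcd(head, row_block, blocks_per_head):
    # Swizzled head-first mapping: the XCD depends only on the head (or KV group
    # under GQA), so the K/V that head touches first stays hot in one XCD's cache
    # for all of its later row-blocks.
    return head % NUM_XCDS
```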
🔥 New Blog: “Disaggregated Inference: 18 Months Later” 18 months in LLM inference feels like a new Moore’s Law cycle – but this time not just 2x per year: 💸 Serving cost ↓10–100x 🚀 Throughput ↑10x ⚡ Latency ↓5x A big reason? Disaggregated Inference. From DistServe, our…
Please register here: luma.com/9n27uem4 Like previous competitions, this competition will take place on the GPU MODE Discord server. More information can be found on the registration link. Good luck to all competitors!
Inspired by @thinkymachines 's "#LoRA Without Regret" post, I formalized their insight that policy gradient learns ~1 bit per episode via a Bayesian #RL formulation. I prove this is a hard information-theoretic ceiling and extend the analysis to actor-critic methods. Full writeup…
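The flavor of the ceiling, as a rough sketch in my own notation (not the writeup's derivation): with a binary success/failure reward per episode, K episodes can reveal at most K bits about the task, regardless of the policy-gradient estimator.

```latex
\[
  I\!\left(\theta^{\star};\, R_{1:K}\right)
  \;\le\; \sum_{k=1}^{K} H(R_k)
  \;\le\; K \ \text{bits},
  \qquad R_k \in \{0,1\}.
\]
```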
LoRA makes fine-tuning more accessible, but it's unclear how it compares to full fine-tuning. We find that the performance often matches closely---more often than you might expect. In our latest Connectionism post, we share our experimental results and recommendations for LoRA.…
This remote memory (NVMe over NVLink at full MBU) may also be of interest weka.io/blog/ai-ml/unl…
I think programming GPUs is too hard. Part of the problem is sprawling, scattered documentation & best practices. Over the past few months, we’ve been working to solve that problem, putting together a “Rosetta Stone” GPU Glossary. And now it’s live! My take-aways in thread.
We’ve cooked another one of these 200+ page practical books on model training that we love to write. This time it’s on all pretraining and post-training recipes and how to run hyperparameter exploration for a training project. Closing the trilogy of: 1. Building a pretraining…
Training LLMs end to end is hard. Very excited to share our new blog (book?) that covers the full pipeline: pre-training, post-training, and infra. 200+ pages of what worked, what didn’t, and how to make it run reliably huggingface.co/spaces/Hugging…
Very nice blog post from Thinky (@_kevinlu et al) about on-policy distillation for LLMs -- we published this idea back in 2023 and it is *publicly* known to have been successfully applied to Gemma 2 & 3 and Qwen3-Thinking (and probably many closed frontier models)! The idea behind…
Our latest post explores on-policy distillation, a training approach that unites the error-correcting relevance of RL with the reward density of SFT. Applying it to math reasoning and to an internal chat assistant, we find that on-policy distillation can outperform other…
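A minimal sketch of the recipe as described (assuming HuggingFace-style causal LMs; names and loss details are illustrative, not the post's exact code):

```python
import torch
import torch.nn.functional as F

def on_policy_distillation_step(student, teacher, prompt_ids, optimizer, max_new_tokens=128):
    # 1) On-policy: the student samples the continuations it will be trained on.
    with torch.no_grad():
        seqs = student.generate(prompt_ids, max_new_tokens=max_new_tokens, do_sample=True)

    # 2) Score every sampled position with both models (prompt-token masking omitted for brevity).
    student_logits = student(seqs).logits[:, :-1]
    with torch.no_grad():
        teacher_logits = teacher(seqs).logits[:, :-1]

    # 3) Dense per-token reverse KL(student || teacher): a signal at every token
    #    (like SFT), computed on trajectories the student actually produces (like RL).
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    loss = (s_logp.exp() * (s_logp - t_logp)).sum(dim=-1).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```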
I remember discussing this with @SeunghyunSEO7 literally 2 years ago. Time flies!!!
Please pay me 100m to convert papers like openreview.net/pdf?id=3zKtaqx… to blogposts! @agarwl_
🧵 LoRA vs full fine-tuning: same performance ≠ same solution. Our NeurIPS ‘25 paper 🎉shows that LoRA and full fine-tuning, even when equally well fit, learn structurally different solutions, and that LoRA forgets less and can be made even better (even less forgetting) by a simple…
A small thread about how you should be drawing the contents of higher dimensional tensors
Reading through the Torchcomms paper, there are a few nice features it introduces. It defaults to zero-copy transfer for comms. With copy-based transfers, it uses SM / HBM memory on the GPU and maintains a small FIFO queue to transfer data. This can introduce some overhead + contention…
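A generic illustration of the two paths being contrasted (hypothetical names, not the torchcomms API; `transport.send` stands in for the actual comm call):

```python
from collections import deque

class ZeroCopySender:
    def send(self, tensor, transport):
        # Zero-copy: hand the tensor's own device memory to the transport directly;
        # no staging buffer, no extra SM / HBM traffic for a copy.
        transport.send(tensor)

class CopyBasedSender:
    def __init__(self, num_slots=4):
        # Small bounded FIFO of staged buffers, as described above (illustrative only).
        self.slots = deque(maxlen=num_slots)

    def send(self, tensor, transport):
        if len(self.slots) == self.slots.maxlen:
            self.slots.popleft()      # retire the oldest staged buffer first (FIFO)
        staged = tensor.clone()       # the extra copy costs SM cycles and HBM bandwidth
        self.slots.append(staged)     # and can contend with compute when the queue is busy
        transport.send(staged)
```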
Excited to share our new work: StreamingVLM! 🚀 We tackle a major challenge for Vision-Language Models (VLMs): understanding infinite video streams in real-time without latency blowing up or running out of memory. Paper: arxiv.org/abs/2510.09608 Code: github.com/mit-han-lab/st…
Will present automatically derived kernels to @GPU_MODE noon PST Saturday. Got to @MIT in September, been grinding maths w/ @GioeleZardini to ensure universal applicability across models and hardware. Hierarchical kernels, encoding, optimization. This is gonna be good.
6 months after our paper release, I still recall the debates on removing the length normalization term in DrGRPO. And people have gradually come to think DrGRPO is just about removing the std, ignoring the most important and subtle (length) bias we tried to point out to the community. Even…
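A rough numpy sketch of the two terms in question (my simplified reading of GRPO vs DrGRPO, not the paper's code):

```python
import numpy as np

def group_advantages_and_token_weights(rewards, lengths, dr_grpo=False):
    """rewards: per-response rewards within one prompt's group; lengths: response lengths."""
    rewards = np.asarray(rewards, dtype=float)
    lengths = np.asarray(lengths, dtype=float)

    adv = rewards - rewards.mean()
    if not dr_grpo:
        adv = adv / (rewards.std() + 1e-8)       # GRPO: divide by the group's reward std

    # GRPO averages each response's token losses by its own length (a 1/|o_i| factor),
    # which down-weights long responses; DrGRPO drops this length normalization as well,
    # and that (length) bias is the subtler of the two changes.
    token_weight = np.ones_like(lengths) if dr_grpo else 1.0 / lengths
    return adv, token_weight

# Example: one prompt's group of 4 sampled responses.
adv, w = group_advantages_and_token_weights(rewards=[1, 0, 0, 1], lengths=[120, 800, 350, 90])
```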
I have written another blogpost that is so pretty: covering kernels, graphics, profiling, etc. etc. ut21.github.io/blog/triton.ht…
Even with full-batch gradients, DL optimizers defy classical optimization theory, as they operate at the *edge of stability.* With @alex_damian_, we introduce "central flows": a theoretical tool to analyze these dynamics that makes accurate quantitative predictions on real NNs.
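For context, a one-line sketch of the edge-of-stability phenomenon the tweet refers to (standard statement, my notation):

```latex
% Gradient descent: \theta_{t+1} = \theta_t - \eta \nabla L(\theta_t).
% Classical analysis assumes the sharpness stays below 2/\eta so each step descends:
\[
  \lambda_{\max}\!\left(\nabla^2 L(\theta_t)\right) < \frac{2}{\eta}.
\]
% Empirically, training instead drives the sharpness up until it hovers at roughly
% 2/\eta while the loss keeps decreasing non-monotonically: the "edge of stability".
```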