
Gordon Lee🍀

@redoragd

Pinned

CLICK github.com/doragd AND BECOME FRIENDS 🌸🌸

Really Appreciate



Gordon Lee🍀 reposted

Very cool blog by @character_ai diving into how they trained their proprietary model Kaiju (13B, 34B, 110B) before switching to OSS models, and spoiler: it has Noam Shazeer written all over it. Most of the choices for model design (MQA, SWA, KV Cache, Quantization) are not to…

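A minimal sketch of multi-query attention (MQA), one of the design choices named above: all query heads share a single key/value head, which shrinks the KV cache by roughly the head count. The shapes and names below are illustrative, not Character.AI's actual code.

import torch
import torch.nn.functional as F

# Multi-query attention: n_head query heads, but only 1 shared K/V head,
# so the KV cache is n_head times smaller than in standard multi-head attention.
B, T, n_head, d_head = 2, 16, 8, 64
q = torch.randn(B, n_head, T, d_head)   # per-head queries
k = torch.randn(B, 1, T, d_head)        # single shared key head
v = torch.randn(B, 1, T, d_head)        # single shared value head

# Broadcasting expands the shared K/V across all query heads.
scores = q @ k.transpose(-2, -1) / d_head ** 0.5   # (B, n_head, T, T)
out = F.softmax(scores, dim=-1) @ v                # (B, n_head, T, d_head)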

Gordon Lee🍀 reposted

Multi-vector embeddings (e.g., ColBERT) are powerful but expensive to scale. MUVERA cuts their memory costs by 70%. MUVERA (Multi-Vector Retrieval via Fixed Dimensional Encodings) is an interesting approach by Google Research. It transforms multi-vector representations into…

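A rough sketch of the fixed-dimensional-encoding idea as I read it (not Google's implementation): hash each token vector into a bucket with random hyperplanes, sum the vectors per bucket, and concatenate, so a variable-size multi-vector set becomes one fixed-length vector that can be scored with a single dot product.

import numpy as np

def fde(token_vecs, planes):
    # token_vecs: (n_tokens, d); planes: (n_bits, d) random hyperplanes.
    n_bits = planes.shape[0]
    n_buckets = 2 ** n_bits
    d = token_vecs.shape[1]
    out = np.zeros((n_buckets, d))
    # SimHash-style bucketing: sign pattern against the hyperplanes.
    bits = (token_vecs @ planes.T > 0).astype(int)      # (n_tokens, n_bits)
    bucket_ids = bits @ (2 ** np.arange(n_bits))        # (n_tokens,)
    for vec, b in zip(token_vecs, bucket_ids):
        out[b] += vec
    return out.ravel()  # fixed length: n_buckets * d

rng = np.random.default_rng(0)
planes = rng.standard_normal((3, 128))                  # 2^3 = 8 buckets
doc_fde = fde(rng.standard_normal((40, 128)), planes)   # 40 token vectors in
query_fde = fde(rng.standard_normal((5, 128)), planes)  # 5 token vectors in
score = doc_fde @ query_fde                             # one dot product out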

Gordon Lee🍀 reposted

wtf, an 80-layer 321M model???

Synthetic playgrounds enabled a series of controlled experiments that led us to favor an extreme-depth design. We selected an 80-layer architecture for Baguettotron, with improvements across the board on memorization and logical reasoning: huggingface.co/PleIAs/Baguett…

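Some back-of-envelope arithmetic on why 80 layers at 321M parameters is striking, assuming a standard transformer block with about 12·d² parameters per layer (embeddings ignored):

# A standard block has ~4*d^2 params in attention projections
# plus ~8*d^2 in a 4x-expansion MLP, so ~12*d^2 per layer.
layers, target_params = 80, 321e6
d_model = (target_params / (12 * layers)) ** 0.5
print(round(d_model))  # ~578: an unusually narrow width for that many layers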


Gordon Lee🍀 reposted

Zico Kolter, head of the Machine Learning Department at Carnegie Mellon, is launching a new course, "Intro to Modern AI", built around constructing a PyTorch chatbot from scratch. It covers how the AI we all interact with every day actually runs. The entire curriculum at the university I attended in China was purchased from CMU; having experienced it firsthand, CMU is genuinely strong when it comes to teaching.…

I'm teaching a new "Intro to Modern AI" course at CMU this Spring: modernaicourse.org. It's an early-undergrad course on how to build a chatbot from scratch (well, from PyTorch). The course name has bothered some people – "AI" usually means something much broader in academic…



Gordon Lee🍀 reposted

MiniMind: train a small GPT for ¥3 in 2 hours

Train a small LLM from scratch with just 0.02B parameters, no third-party wrapper libraries, everything reimplemented in plain PyTorch. It's both an open-source LLM reproduction and a hands-on beginner tutorial for LLMs.

GitHub: github.com/jingyaogong/mi…
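For flavor, a toy version of what "from scratch, pure PyTorch" looks like (an illustrative next-token loop on random tokens, not MiniMind's actual code):

import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, d, ctx = 256, 64, 32

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(vocab, d)       # token embeddings
        self.pos = nn.Embedding(ctx, d)         # learned positions
        self.block = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.head = nn.Linear(d, vocab)         # logits over the vocabulary

    def forward(self, x):
        h = self.emb(x) + self.pos(torch.arange(x.size(1), device=x.device))
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        return self.head(self.block(h, src_mask=mask))  # causal next-token logits

model = TinyLM()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
data = torch.randint(0, vocab, (8, ctx + 1))    # stand-in for real tokenized text
for step in range(100):
    logits = model(data[:, :-1])                # predict token t+1 from tokens <= t
    loss = F.cross_entropy(logits.reshape(-1, vocab), data[:, 1:].reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()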


Gordon Lee🍀 reposted

7+ main precision formats used in AI

▪️ FP32
▪️ FP16
▪️ BF16
▪️ FP8 (E4M3 / E5M2)
▪️ FP4
▪️ INT8/INT4
▪️ 2-bit (ternary/binary quantization)

General trend: higher precision for training, lower precision for inference.

Save the list and learn more about these formats here:…

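A quick PyTorch illustration of the train-high, infer-low pattern (a sketch with naive symmetric INT8 quantization; real pipelines use dedicated quantization tooling):

import torch

w = torch.randn(4, 4)                  # FP32 master weights, typical for training
print(w.dtype)                         # torch.float32

w_bf16 = w.to(torch.bfloat16)          # BF16: FP32's exponent range, fewer mantissa bits
w_fp16 = w.to(torch.float16)           # FP16: more mantissa, narrower range

# Naive symmetric INT8 quantization for inference (sketch, not production code):
scale = w.abs().max() / 127
w_int8 = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
w_deq = w_int8.float() * scale         # dequantize to inspect the rounding error
print((w - w_deq).abs().max())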

Gordon Lee🍀 reposted

My new field guide to alternatives to standard LLMs: Gated DeltaNet hybrids (Qwen3-Next, Kimi Linear), text diffusion, code world models, and small reasoning transformers. magazine.sebastianraschka.com/p/beyond-stand…


Gordon Lee🍀 reposted

With the release of the Kimi Linear LLM last week, we can definitely see that efficient, linear attention variants have seen a resurgence in recent months. Here's a brief summary of what happened. First, linear attention variants have been around for a long time, and I remember…

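The core trick behind linear attention, sketched in its simplest non-causal form (generic kernelized attention, not Kimi Linear's specific design): replace softmax(QKᵀ)V with φ(Q)(φ(K)ᵀV), so the small (d × d) term φ(K)ᵀV is computed once and cost grows linearly with sequence length rather than quadratically.

import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    # q, k, v: (B, T, d). Feature map phi(x) = elu(x) + 1 keeps values positive.
    q, k = F.elu(q) + 1, F.elu(k) + 1
    kv = k.transpose(1, 2) @ v                           # (B, d, d): O(T*d^2), not O(T^2)
    z = q @ k.sum(dim=1, keepdim=True).transpose(1, 2)   # normalizer, (B, T, 1)
    return (q @ kv) / (z + eps)                          # (B, T, d)

out = linear_attention(torch.randn(2, 128, 64),
                       torch.randn(2, 128, 64),
                       torch.randn(2, 128, 64))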

Gordon Lee🍀 reposted

Want to dig into how AI systems like ChatGPT and Claude are trained, especially the mechanics of how they get smarter through human feedback? Check out "Reinforcement Learning of Large Language Models", a course by Ernest K. Ryu, a mathematics professor at the University of California, with slides and videos free to study. The course starts from deep reinforcement learning fundamentals and works step by step toward Transformer…


Gordon Lee🍀 reposted

HuggingFace published an extremely long technical blog post (200 pages, 2-4 days to read) fully documenting the team's end-to-end process of training SmolLM3. A must-read for any team that wants to train small models!…


Gordon Lee🍀 reposted

Beautiful technical debugging detective longread that starts with a suspicious loss curve and ends all the way down in the Objective-C++ depths of PyTorch's MPS backend, where addcmul_ silently fails on non-contiguous output tensors. I wonder how long before an LLM can do all of this.

New blog post: The bug that taught me more about PyTorch than years of using it started with a simple training loss plateau... ended up digging through optimizer states, memory layouts, kernel dispatch, and finally understanding how PyTorch works!

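A small sketch of the failure mode described above (illustrative; the real bug lives in kernel dispatch inside the MPS backend): a transposed view is non-contiguous, and in-place ops on such views are exactly where layout-sensitive backend bugs hide.

import torch

a = torch.randn(4, 4)
out = torch.zeros(4, 4).t()        # transposed view: same storage, non-contiguous
print(out.is_contiguous())         # False

# addcmul_ computes out += value * tensor1 * tensor2 in place.
out.addcmul_(a, a, value=0.1)      # correct on CPU; per the post, the MPS kernel
                                   # silently left such outputs unchanged

safe = out.contiguous()            # materializing a contiguous copy sidesteps
                                   # layout-sensitive kernel paths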


Gordon Lee🍀 reposted

A tutorial on how to add a new model architecture to the llama.cpp inference engine! It's by pwilkin, yes, the same person who added the Qwen3-Next architecture to llama.cpp a few days ago. The tutorial is so good I think it could even be used as a prompt: hand a new architecture plus this tutorial to a large model and let it implement the architecture you need. Link: github.com/ggml-org/llama…


Gordon Lee🍀 reposted

After spending 60 hours deconstructing Elon Musk's biography, Founders Podcast host David Senra distilled 100 rules from Elon's thirty-year career. Highly recommend this episode! Listed here are 31 extremely valuable company-operating principles: 1. Mission first; 2. Never retreat; 3. A maniacal sense of urgency is our operating principle; 4. Product design should be engineer-driven; 5.…

New episode: "How Elon Works" This episode covers the insanely valuable company-building principles of Elon Musk A few notes from the episode: 1. The mission comes first. 2. Retreat is not an option. 3. A maniacal sense of urgency is our operating principle. 4. Product…



Gordon Lee🍀 reposted

oh god


Gordon Lee🍀 reposted

Tiny Reasoning Language Model (trlm-135) - Technical Blogpost⚡ Three weeks ago, I shared a weekend experiment: trlm-135, a tiny language model taught to think step-by-step. The response was incredible, and now the full technical report is live: shekswess.github.io/tiny-reasoning…


Gordon Lee🍀 reposted

I finally understand GPUs thanks to this


Gordon Lee🍀 reposted

Want to learn advanced reverse engineering, especially malware analysis, but online material is either too scattered or too theoretical with no hands-on practice, and you don't know where to start? Conveniently, I found the complete university course CS7038-Malware-Analysis on GitHub, which provides a systematic learning path from zero to mastery.…


Gordon Lee🍀 reposted

wrote a blog on KV caching. link in comments.

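For context on the topic, a bare-bones KV-cache decoding sketch (illustrative shapes, not the blog's code): keys and values of past tokens are computed once and appended, so each new token attends over the cache instead of re-running the whole prefix.

import torch
import torch.nn.functional as F

d = 64
k_cache, v_cache = [], []        # grow by one entry per generated token

def decode_step(q_new, k_new, v_new):
    k_cache.append(k_new)        # reuse past K/V instead of recomputing them
    v_cache.append(v_new)
    K = torch.cat(k_cache)       # (t, d)
    V = torch.cat(v_cache)       # (t, d)
    attn = F.softmax(q_new @ K.T / d ** 0.5, dim=-1)   # (1, t)
    return attn @ V                                     # (1, d)

for _ in range(5):   # one forward per new token: O(t) attention, not O(t^2)
    out = decode_step(torch.randn(1, d), torch.randn(1, d), torch.randn(1, d))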

Gordon Lee🍀 reposted

How to become expert at thing:
1. iteratively take on concrete projects and accomplish them depth wise, learning "on demand" (ie don't learn bottom up breadth wise)
2. teach/summarize everything you learn in your own words
3. only compare yourself to younger you, never to others

