Bert Maher

@tensorbert

I’m a software engineer building high-performance kernels and compilers at Anthropic! Previously at Facebook/Meta (PyTorch, HHVM, ReDex)

Washington, DC

Joined December 2022

430 Posts · 3K Followers · 380 Following

Bert Maher reposted

Paul Graham

@paulg

Nov 13

One of the most common flaws of math textbooks is that they present only the logic, without the intuition. They give you the later, cleaned up version of the idea, which hides the way it was discovered.

Bert Maher

@tensorbert

Nov 13

This is all true, but Soumith is also one of the most brilliant strategic thinkers in the world. Some of us just fail a lot, dust ourselves off, and keep hacking the next day ☺️

Deedy

@deedydas

Nov 13

If you feel like giving up, you must read this never-before-shared story of the creator of PyTorch and ex-VP at Meta, Soumith Chintala. > from hyderabad public school, but bad at math > goes to a "tier 2" college in India, VIT in Vellore > rejected from all 12 universities for…

Bert Maher

@tensorbert

Nov 11

This got me thinking that both int and FP math are “emulated” via a pretty complex set of transistors. I wonder how many gates/transistors it takes to implement an int8 FMA versus an fp8 (e4m3) FMA

Elon Musk

@elonmusk

Nov 9

As the number of bits drops, the difference between floating point and integer decreases until they are the same thing at 1 bit. “Floating point” is not real. It is emulated with 2 integers and a lot of complexity.
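
To make the "sign, exponent, and mantissa are just integer fields" point concrete, here is a minimal sketch of decoding one fp8 e4m3 byte. It assumes the common OCP-style E4M3 layout (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits, no infinities, one NaN pattern per sign); `decode_e4m3` is a hypothetical helper name, not a library function.

```python
import math

def decode_e4m3(byte: int) -> float:
    """Decode one FP8 E4M3 byte into a Python float (hypothetical helper)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0xF   # 4-bit biased exponent (assumed bias: 7)
    mantissa = byte & 0x7          # 3-bit fraction
    if exponent == 0xF and mantissa == 0x7:
        return math.nan            # assumed layout has no infinities, only NaN
    if exponent == 0:
        # Subnormal: no implicit leading 1, fixed exponent of 1 - bias = -6
        return sign * (mantissa / 8) * 2.0 ** -6
    # Normal: implicit leading 1 plus the 3-bit fraction
    return sign * (1 + mantissa / 8) * 2.0 ** (exponent - 7)

if __name__ == "__main__":
    for b in (0x01, 0x38, 0x7E):
        print(hex(b), decode_e4m3(b))
```

The decode is only shifts, masks, and a scale by a power of two, which says nothing about gate counts but does show how little separates the two representations at 8 bits.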

Bert Maher

@tensorbert

Nov 9

I’ve heard this complaint from a couple of people recently, and I’m surprised because we optimized the launch path like a year ago and got it down to ~10us. There’s a now-closed GitHub issue I filed with a microbenchmark - someone should run it, profile, and bring it down

maharshi

@maharshii

Nov 9

why is triton’s kernel launch cpu overhead so freaking high? the actual kernel takes 10x less execution time than to launch it and i can’t use cuda graphs because the shapes are dynamic.

Bert Maher

@tensorbert

Oct 30

It would be kind of cool if torch.compile could be used as a context manager, like:

```
some_custom_kernels()
with torch.compile():
    # do a bunch of easy pointwise stuff
more_custom_kernels()
```

Bert Maher

@tensorbert

Sep 30

I am really enjoying using Claude Code with Sonnet 4.5! It's super smart, and super fast!

Claude

@claudeai

Sep 29

Introducing Claude Sonnet 4.5—the best coding model in the world. It's the strongest model for building complex agents. It's the best model at using computers. And it shows substantial gains on tests of reasoning and math.

Bert Maher

@tensorbert

Sep 24

😍

Lisan al Gaib

@scaling01

Sep 22

Tri Dao says Claude Code makes him 1.5x more productive and that it's quite helpful at writing Triton kernels

Bert Maher

@tensorbert

Sep 19

Good list! Simon’s and Pranjal’s matmul blogs were foundational to my understanding of GPU performance

Fleetwood

@fleetwood___

Sep 18

Some perf related must-reads: • How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: siboehm.com/articles/22/CU… • Outperforming cuBLAS on H100: a Worklog: cudaforfun.substack.com/p/outperformin… • Defeating Nondeterminism in LLM Inference: thinkingmachines.ai/blog/defeating… • Making Deep…

Defeating Nondeterminism in LLM Inference

Source: thinkingmachines.ai

Bert Maher

@tensorbert

Sep 18

lol, there is quite the explosion of kernel DSLs lately (triton, tilelang, gluon, TLX, cuteDSL, cuTile, …). And honestly, as much as I love TLX and want it to succeed, I think the next big kernel programming language might be… natural, human language

SzymonOzog (NeurIPS)

@SzymonOzog_

Sep 17

Just one more DSL bro. I promise bro just one more DSL and we'll fix hardware adoption. It's just a better DSL bro. Please just one more. One more DSL and we'll port all the kernels. I just need one more DSL

Bert Maher

@tensorbert

Sep 18

A great example of how subtle numerical issues can bite you. Fusing multiply-add into FMA strictly improves precision of that operation (the mul is computed in infinite precision). But it breaks the larger expression if other parts are not computed with the same high precision!
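
A minimal sketch of that effect, assuming nothing about the actual bug being discussed: the same product is computed once with ordinary rounding and once inside an FMA, and the two no longer cancel. The `fma` helper here is an emulation built on exact rational arithmetic (a stand-in for a hardware FMA, not a library call), and the input value is arbitrary.

```python
from fractions import Fraction

def fma(a: float, b: float, c: float) -> float:
    """Emulated fused multiply-add: a*b + c computed exactly, rounded once."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

x = 0.1
product = x * x                # product rounded to double precision

# Unfused: both x*x terms are rounded the same way, so they cancel exactly.
unfused = x * x - product

# Fused: the multiply inside the FMA is exact, but `product` was rounded,
# so the result is the rounding error of `product` instead of zero.
fused = fma(x, x, -product)

if __name__ == "__main__":
    print("unfused:", unfused)
    print("fused:  ", fused)
```

The fused result is the more accurate answer to "exact x*x minus product", yet code that expected the two terms to agree (an equality check, an argmax tie, a subtraction meant to give zero) now sees a mismatch.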

tender

@tenderizzation

Sep 17

where have I seen something like this before "This caused a mismatch: operations that should have agreed on the highest probability token were running at different precision levels. The precision mismatch meant they didn't agree on which token had the highest probability." see…

Bert Maher

@tensorbert

Sep 18

Debugging subtle numerical issues in ML systems - especially in the compiler! - is really freaking hard. Very impressed by the debugging that went into discovering these

Claude

@claudeai

Sep 17

In our investigation, we uncovered three separate bugs. They were partly overlapping, making diagnosis even trickier. We've now resolved all three bugs and written a technical report on what happened, which you can find here: anthropic.com/engineering/a-…

A postmortem of three recent issues

Source: anthropic.com

Bert Maher

@tensorbert

Sep 17

I found myself wondering if we might benefit from a resurgence of Halide-like (or TVM-like) ideas, where you have some math to optimize, and a library of optimizations through which the machine can search to find the best outcome

Jeremy Howard

@jeremyphoward

Sep 17

Bert Maher

@tensorbert

Sep 17

What if instead of a paperclip maximizer we built a hoodie-and-bag maximizer, and what if it turns out we’re already living in its world?? 😱

Bert Maher

@tensorbert

Sep 17

We put some hooks for hoodies and bags above the bench by the front door where we put our shoes on. Of course it’s still only a matter of time before hoodies and bags conquer everything

Bert Maher

@tensorbert

Sep 17

I’ve seen this “FAANG vibe code” post a few times and tbh this workflow sounds… tedious. “Write a design doc, align stakeholders, iterate on design review, etc” How about: have an awesome idea and build it