
Nicholas Wilt

@CUDAHandbook

Nicholas Wilt was on the inception team for CUDA, wrote The CUDA Handbook, and writes at https://parallelprogrammer.substack.com

Well, when you lay it all out like that, yeah, it does seem implausible. But all true.

The deeper you go into the semiconductor supply chain, the less believable it becomes.
> TSMC, a company on a small island, produces over 90% of the world’s most advanced chips
> TSMC relies on Dutch company ASML for EUV lithography machines
> ASML depends on German company…



In a previous life, I optimized some GPU-assisted code that was being driven by an interpreter. A full CUDA port was 60x faster. I think the GPU often was done before control had been returned to the caller.

I wonder how many H100 hours have been burned waiting a handful of milliseconds for Python to "glue" the next task to the previous one



Well, hold on a sec. They also won because even without TensorCores, they were compellingly faster than CPUs. Geoffrey Hinton reportedly was touting GPUs for AI in 2009!

gpus didn’t win because they had magic math units. they won because cuda exposed a control plane that let people experiment packing work close to memory, overlapping loads, hiding latency. specialised ai chips bake these optimizations directly into hardware but that’s a wager on…



Whoever said this has never ported a workload to CUDA for a 100x speedup, and I feel sorry for them

There's a saying, "if you get a 5x speedup you did something smart, if you get a 100x speedup you stopped doing something dumb", but that's just not true in SAT solving! I just got a 3000x speed-up in this search (from 15 days to 6 minutes), from making better SAT instances.



With benefit of hindsight, it may have been a mistake to reorg Windows under Azure. I have been a PC/Microsoft partisan for 40 years, worked on Windows while at Microsoft, and may eschew Windows on my next laptop.

Microsoft is plugging more holes that let you use Windows 11 without an online account. Microsoft really doesn’t want you creating a local account on Windows 11. Full details on the changes 👇 theverge.com/news/793579/mi…



Banning PowerPoint has other salubrious effects. Amazon famously banned PowerPoint decades ago, because it enables companies to lie to themselves. I have little to no respect for Sun on other matters: business, IP, technological. inc.com/justin-bariso/…

In a 1997 interview with BYTE, Scott McNealy, then CEO of Sun Microsystems, arguably the most renowned Unix company of the time, spoke disparagingly about Unix, remarking, “The problem with Unix is that nobody protected the brand to mean something and the brand lost value”. At…



The weather app on my phone is p good

What's a piece of software you use regularly that you can't think of a single complaint about?



I was using the word “insane,” too, but maybe in a different sense. The clock cycles were Intel’s to lose. I wrote an article about this: “Why CUDA Succeeded” open.substack.com/pub/parallelpr…

tbh, Intel's software stack is actually insane. they have tools for everything. they could have been The Nvidia if only...



I wrote a sort of historical take on the evolution of SIMD on x86: open.substack.com/pub/parallelpr…

For those wondering what is AVX? Your CPU normally, on each core, applies one operation on one piece of data. This gets quite slow for bigger data :( So why not use the GPU? That adds latency and overhead. What if we want parallel operations, but it isn’t worth offloading?…



I pushed CMake files into the CUDA Handbook repository to make it easier to build the samples. Among other things, the exercise underscored the need to revisit the code that uses texturing (CUDA 12 no longer supports texture references). Feedback welcome! github.com/ArchaeaSoftwar…


Next stop in this journey: cache lines 👀

The next crushing fact: CPUs don't think in bytes, they think in words (they operate on registers, usually word-sized chunks).



This article describes how to do the insertion step for Insertion Sort into an AVX2 register. A combination of shifts, XOR, and blends after the compare. (link in reply)

