https://www.usc.gal/en/studies/masters/engineering-and-archi...
The caveat, though, is that each new gen of hardware often comes with brand new constraints/features that a given generation of models hasn't seen before (e.g. tcgen05 in Blackwell was OOD at one point). As the models start to generalize better, this might not be a showstopper, but it is still an issue, at least for now.
Also, at least a portion of this you could argue is arbitrary and entirely scoped to the eval itself. The fp8 GEMM score could be low simply because one of the shapes is fairly skinny (i.e. not enough math work to keep the compute engine busy for a meaningful amount of time).
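To put rough numbers on the skinny-shape point, here is a roofline-style back-of-the-envelope sketch. The peak throughput and bandwidth figures are hypothetical placeholders, not the specs of any real part, and the fp8-in / 2-byte-out layout and the function names are my own assumptions, not anything taken from the eval.

```python
# Hypothetical accelerator figures, chosen only for illustration.
PEAK_FP8_FLOPS = 2.0e15   # assumed peak fp8 throughput, FLOP/s
MEM_BANDWIDTH = 4.0e12    # assumed memory bandwidth, bytes/s


def gemm_intensity(m, n, k, in_bytes=1, out_bytes=2):
    """FLOPs per byte for C[m,n] = A[m,k] @ B[k,n], each matrix touched once."""
    flops = 2 * m * n * k
    bytes_moved = in_bytes * (m * k + k * n) + out_bytes * m * n
    return flops / bytes_moved


def roofline_fraction(m, n, k):
    """Upper bound on the fraction of peak compute this shape can reach."""
    ridge = PEAK_FP8_FLOPS / MEM_BANDWIDTH  # intensity needed to be compute-bound
    return min(1.0, gemm_intensity(m, n, k) / ridge)


if __name__ == "__main__":
    # A square shape versus a skinny one (small M, large N and K).
    for shape in [(8192, 8192, 8192), (16, 8192, 8192)]:
        print(shape, round(gemm_intensity(*shape), 1),
              round(roofline_fraction(*shape), 3))
```

Under these assumed numbers the square shape can in principle saturate the math units, while the skinny one is capped at a small fraction of peak no matter how good the kernel is. So a low score on that shape says more about the shape than about whoever wrote the kernel.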
Way back when I wrote the OpenCL driver at Qualcomm, we would frequently get bug reports from customers complaining about our code. During my tenure, every single one of them was root-caused as an application bug. Unsurprisingly, considering that our code was backed by an extensive test suite and their code wasn't.
Not to say that our code was perfect, of course. But people have a tendency to blame GPU drivers when the problem often lies elsewhere.
It is also common in my experience for buggy GPU code to crash displays if the GPU is simultaneously used to drive a monitor. This usually happens with kernels that go into infinite loops or hit out-of-memory conditions.
It is my understanding that modern GPU drivers even have watchdog systems that notice when the GPU gets stuck and forcibly reboot it, which to me is mere symptom treatment.
To be fair: the hardware is enormously complex, and the drivers much less so.
That said, a lot of the user-space "voodoo" is gone if you don't go through CUDA's "runtime API". If you use the driver API, take your kernel source as a string and compile it with NVIDIA's run-time compiler (NVRTC), you'll have better visibility into a lot (not all) of what's going on. For the "raw" version of this, look at:
https://github.com/NVIDIA/cuda-samples/tree/master/cpp/0_Int...
but for a much more readable, and still fully transparent modern-C++ API version of the same, try this:
https://github.com/eyalroz/cuda-api-wrappers/blob/master/exa...
that's a sample program for my CUDA API wrappers (header-only) library.
It is also much more friendly for library authors; and easier to wrap; and actually exposes a bunch of features the "runtime API" doesn't.
The difficulty with it is that there are just so many API calls; dozens of calls just for copying, for example. That was part of my motivation for writing my wrappers - making the supposedly "lower-level" API more accessible and intuitive than the supposedly "higher-level" API; and better integrated with the other libraries: NVTX, NVRTC, PTX compiler, fatbin library etc.
> It's fun to develop while being able to change the code at runtime.
It's also _the_ way to debug your kernels: If you don't load them dynamically, you have to recompile your application or kernel test harness every time you make a change to the kernel.
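A minimal sketch of the host-side bookkeeping for that workflow: keep the kernel source as a string and recompile only when the text actually changes. The `KernelCache` class and the `compile_fn` callback are hypothetical names of mine; in a real harness the callback would wrap whatever run-time compilation and module loading you use, which is not shown here.

```python
import hashlib
from typing import Callable, Dict, Tuple


class KernelCache:
    """Recompile a kernel only when its source text changes."""

    def __init__(self, compile_fn: Callable[[str], object]):
        # compile_fn is a stand-in for the real "source string -> loaded module" step.
        self._compile = compile_fn
        self._cache: Dict[str, Tuple[str, object]] = {}

    def get(self, name: str, source: str) -> object:
        digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
        cached = self._cache.get(name)
        if cached is not None and cached[0] == digest:
            return cached[1]               # source unchanged: reuse the module
        module = self._compile(source)     # source edited: rebuild just this kernel
        self._cache[name] = (digest, module)
        return module
```

The harness process stays up the whole time; you edit the kernel file, re-read it, call `get()` again, and only the kernel is rebuilt.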
(I prefer to read longer articles on my e-ink device via epub or PDF)
What happens when you run a CUDA kernel?
https://fergusfinn.com/blog/what-happens-when-you-run-a-gpu-kernel/