How many FLOPS can we squeeze out of wgpu? The test harness is inspired by Bram Wasti's work here.
The M1 8 core GPU can supposedly hit 2.6 TFLOPS of FP32.
A custom Metal shader from Tinygrad can hit 2000 GFLOPS, or ~75% of theoretical peak. That shader uses SIMD groups, which WebGPU doesn't support yet, though they've been proposed several times, e.g. here.
The best shader we have is an adapted version of TensorFlow.js's, which reaches ~900 GFLOPS on my M1.
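As a quick sanity check on those efficiency figures, a minimal Python sketch using only the throughput numbers quoted above:

```python
# Fraction of the M1 GPU's theoretical FP32 peak (all numbers from the text above).
PEAK_GFLOPS = 2600.0      # 2.6 TFLOPS theoretical

tinygrad_metal = 2000.0   # custom Metal shader using SIMD groups
wgpu_tfjs = 900.0         # adapted TensorFlow.js shader under wgpu

print(f"Tinygrad Metal: {tinygrad_metal / PEAK_GFLOPS:.0%} of peak")
print(f"wgpu (TF.js):   {wgpu_tfjs / PEAK_GFLOPS:.0%} of peak")
```

The gap (~77% vs ~35%) is roughly the cost of not having SIMD-group support in WebGPU today.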
GEMV is a different problem: it is entirely memory-bound, since each matrix element is loaded once and used in only a single multiply-add.
The M1 7 core GPU has a memory bandwidth of 66.7 GB/s. For an m×n GEMV we must move the matrix plus the input and output vectors, so the achieved bandwidth is M (GB/s) = 10⁻⁹ · (m·n + m + n) · sizeof(scalar) / T, where T is the runtime in seconds.
For the problem size [1, 384] @ [384, 51868] (the Whisper logits GEMV), the minimum possible runtime works out to ~1,198,266 ns. The best kernel in here, gemv_2, hits ~1,300,000 ns, i.e. ~92% of peak bandwidth.
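The lower bound can be reproduced in a few lines of Python; the 66.7 GB/s figure and FP32 scalars are the assumptions stated above (a convenient identity: 1 GB/s is exactly 1 byte/ns):

```python
# Bandwidth-bound lower limit for the [1, 384] @ [384, 51868] GEMV on the M1.
BANDWIDTH_GBPS = 66.7   # M1 7-core GPU memory bandwidth; 1 GB/s == 1 byte/ns
SIZEOF_F32 = 4          # bytes per FP32 scalar

m, n = 384, 51868
bytes_moved = (m * n + m + n) * SIZEOF_F32   # matrix + input vector + output vector
t_ns = bytes_moved / BANDWIDTH_GBPS          # bytes / (bytes per ns) -> ns
print(f"minimum runtime: {t_ns:,.0f} ns")
```

This lands within ~0.1% of the figure quoted above (the small difference comes down to rounding in the bandwidth constant).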
Because the kernel is memory-bound, lower precision is extremely important: halving the bytes moved halves the minimum runtime. Our HGEMV performs the same [1, 384] @ [384, 51868] in ~694,500 ns, ~2x faster.
Our QGEMV performs the same [1, 384] @ [384, 51868] in ~342,000 ns, ~2x faster again.
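The precision scaling follows directly from the byte counts. A sketch under one assumption not stated above: the quantized weight width. 1-byte weights (with FP16 activations) would match the observed ~2x step over FP16:

```python
# Memory traffic per precision for the [1, 384] @ [384, 51868] GEMV.
# ASSUMPTION: QGEMV uses 1-byte weights with FP16 activations -- the actual
# quantization scheme isn't specified in the text.
m, n = 384, 51868

def traffic(weight_bytes, vec_bytes):
    """Bytes moved: matrix in weight precision, vectors in activation precision."""
    return m * n * weight_bytes + (m + n) * vec_bytes

f32 = traffic(4, 4)   # FP32 GEMV
f16 = traffic(2, 2)   # HGEMV
q8 = traffic(1, 2)    # assumed QGEMV layout

print(f"FP16 vs FP32: {f32 / f16:.2f}x less traffic")
print(f"Q8 vs FP16:   {f16 / q8:.2f}x less traffic")
```

Since the vectors are tiny next to the matrix, each halving of the weight width gives almost exactly a 2x reduction in traffic, which is what the measured runtimes show.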
- [ ] Flash Attention
- [ ] Fast transposed GEMV