How many FLOPS can we squeeze out of wgpu? The test harness is inspired by Bram Wasti's work here.
The M1 8 core GPU can supposedly hit 2.6 TFLOPS of FP32.
A custom Metal shader from Tinygrad can hit 2000 GFLOPS, or ~75% of theoretical peak. That shader uses SIMD groups, which WebGPU doesn't support yet, though they've been proposed several times, e.g. here.
The best shader we have is an adapted version of TensorFlow.js's, which reaches ~900 GFLOPS on my M1.
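As a quick sanity check on those efficiency figures, a minimal Python sketch using only the throughput numbers quoted above:

```python
# Fraction of the M1 GPU's theoretical FP32 peak (all numbers from the text above).
PEAK_GFLOPS = 2600.0      # 2.6 TFLOPS theoretical

tinygrad_metal = 2000.0   # custom Metal shader using SIMD groups
wgpu_tfjs = 900.0         # adapted TensorFlow.js shader under wgpu

print(f"Tinygrad Metal: {tinygrad_metal / PEAK_GFLOPS:.0%} of peak")
print(f"wgpu (TF.js):   {wgpu_tfjs / PEAK_GFLOPS:.0%} of peak")
```

The gap (~77% vs ~35%) is roughly the cost of not having SIMD-group support in WebGPU today.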
GEMV is a different problem: it is entirely memory-bound, since each matrix element is loaded once and used in only a single multiply-add.
The M1 7 core GPU has a memory bandwidth of 66.7 GB/s. For an m×n GEMV we must move the matrix plus the input and output vectors, so the achieved bandwidth is M (GB/s) = 10⁻⁹ · (m·n + m + n) · sizeof(scalar) / T, where T is the runtime in seconds.
For the problem size [1, 384] @ [384, 51868] (the Whisper logits GEMV), the minimum possible runtime works out to ~1,198,266 ns. The best kernel in here, gemv_2, hits ~1,300,000 ns, i.e. ~92% of peak bandwidth.
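The lower bound can be reproduced in a few lines of Python; the 66.7 GB/s figure and FP32 scalars are the assumptions stated above (a convenient identity: 1 GB/s is exactly 1 byte/ns):

```python
# Bandwidth-bound lower limit for the [1, 384] @ [384, 51868] GEMV on the M1.
BANDWIDTH_GBPS = 66.7   # M1 7-core GPU memory bandwidth; 1 GB/s == 1 byte/ns
SIZEOF_F32 = 4          # bytes per FP32 scalar

m, n = 384, 51868
bytes_moved = (m * n + m + n) * SIZEOF_F32   # matrix + input vector + output vector
t_ns = bytes_moved / BANDWIDTH_GBPS          # bytes / (bytes per ns) -> ns
print(f"minimum runtime: {t_ns:,.0f} ns")
```

This lands within ~0.1% of the figure quoted above (the small difference comes down to rounding in the bandwidth constant).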
Because the kernel is memory-bound, lower precision is extremely important: halving the bytes moved halves the minimum runtime. Our HGEMV performs the same [1, 384] @ [384, 51868] in ~694,500 ns, ~2x faster.
Our QGEMV performs the same [1, 384] @ [384, 51868] in ~342,000 ns, ~2x faster again.
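The precision scaling follows directly from the byte counts. A sketch under one assumption not stated above: the quantized weight width. 1-byte weights (with FP16 activations) would match the observed ~2x step over FP16:

```python
# Memory traffic per precision for the [1, 384] @ [384, 51868] GEMV.
# ASSUMPTION: QGEMV uses 1-byte weights with FP16 activations -- the actual
# quantization scheme isn't specified in the text.
m, n = 384, 51868

def traffic(weight_bytes, vec_bytes):
    """Bytes moved: matrix in weight precision, vectors in activation precision."""
    return m * n * weight_bytes + (m + n) * vec_bytes

f32 = traffic(4, 4)   # FP32 GEMV
f16 = traffic(2, 2)   # HGEMV
q8 = traffic(1, 2)    # assumed QGEMV layout

print(f"FP16 vs FP32: {f32 / f16:.2f}x less traffic")
print(f"Q8 vs FP16:   {f16 / q8:.2f}x less traffic")
```

Since the vectors are tiny next to the matrix, each halving of the weight width gives almost exactly a 2x reduction in traffic, which is what the measured runtimes show.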
- [ ] Flash Attention
- [ ] Fast transposed GEMV