Unfortunately, due to the complexity and specialized nature of AVX-512, such optimizations are typically reserved for performance-critical applications and require expertise in low-level programming and processor microarchitecture.
Unfortunately, due to the complexity and specialized nature of AVX-512, such optimizations are typically reserved for performance-critical applications and require expertise in low-level programming and processor microarchitecture.
Someone else in the comments mentioned it is about 40% faster than the AVX-2 code and slightly more than twice as fast as the SSE3 code. That’s still a nice boost, but hopefully no one was relying on the radically slow unoptimized baseline.
But my question is, how much faster is it that its written in assembly rather than “high” level language like C or Rust. I mean if the AVX-512 code was written in C, would it be 40% faster than AVX-2?