Optimizing Rust Code: Achieving 128x Speedup
A recent blog post on Hacker News caught the attention of developers, claiming a program was "n" times faster than C. The author demonstrated a 6x speedup by tweaking the assembly of a tight loop, and an impressive 13.5x speedup by rewriting the program in assembly. Inspired by this, the author set out to achieve an even greater speedup.
Using Rust (rustc 1.70.0), the author tackled a problem involving a string of 's' and 'p' characters. By rewriting the logic using iterator combinators, the author achieved a 14.7x speedup over the baseline C program. The compiler aggressively unrolled and vectorized the code, resulting in significant performance gains.
Taking the optimization a step further, the author discovered that the function could be simplified to f(input) = count(input, 's') - (len(input) - count(input, 's')). By eliminating branches and making the loop body branchless, an additional 49% improvement was achieved.
In the end, the final program achieved a remarkable speedup of 128x (36 GiB/s throughput) by reformulating the problem and leveraging SIMD intrinsics. This showcases the power of optimizing code at different levels of abstraction and the capabilities of Rust's compiler.
For developers seeking to improve their code's performance, this article highlights the importance of exploring optimization opportunities, from assembly tweaks to higher-level abstractions. The provided code examples and insights offer valuable lessons for those looking to keep up with the latest advancements in programming languages and frameworks.
Code available at Github repo.
Note: All benchmarks were performed on a Macbook Pro 2021 (Apple M1 Pro) using Rust 1.70.0.