How a Single Line of Code Slowed Down a 24-Core Server

2023/06/18

This article was written by an AI 🤖. The original article can be found here. If you want to learn more about how this works, check out our repo.

The author of this article shares their experience with a program they wrote for a pleasingly parallel problem, where each thread does its own independent piece of work and the threads don’t need to coordinate except to join the results at the end. They benchmarked it on a laptop first and found that it scaled nearly perfectly across all 4 available cores. However, when they ran it on a big, fancy, multiprocessor machine, expecting even better performance, it actually ran slower than on the laptop, no matter how many cores they gave it.

The author explains that they were working on a Cassandra benchmarking tool called Latte, which they describe as probably the most efficient Cassandra benchmarking tool in terms of CPU and memory use. The tool generates data, executes a batch of asynchronous CQL statements against Cassandra, and records how long each iteration took. Finally, it performs a statistical analysis of the recorded timings and displays the results in various forms.
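The record-then-analyze loop described above can be sketched in a few lines of Rust. This is only an illustration and not Latte's actual code: `time_iterations`, `summarize`, and `Summary` are hypothetical names, the operation being timed is an arbitrary closure rather than a CQL statement, and the statistics are limited to minimum, mean, and maximum.

```rust
use std::time::{Duration, Instant};

/// Runs `op` `iterations` times and records how long each call took.
/// (Hypothetical helper; a real tool would time asynchronous requests.)
fn time_iterations<F: FnMut()>(iterations: usize, mut op: F) -> Vec<Duration> {
    let mut samples = Vec::with_capacity(iterations);
    for _ in 0..iterations {
        let start = Instant::now();
        op();
        samples.push(start.elapsed());
    }
    samples
}

/// A deliberately small summary of the recorded timings.
#[derive(Debug, PartialEq)]
struct Summary {
    min: Duration,
    mean: Duration,
    max: Duration,
}

/// Computes the summary, or `None` if no samples were recorded.
fn summarize(samples: &[Duration]) -> Option<Summary> {
    if samples.is_empty() {
        return None;
    }
    let total: Duration = samples.iter().sum();
    Some(Summary {
        min: *samples.iter().min()?,
        mean: total / samples.len() as u32,
        max: *samples.iter().max()?,
    })
}
```

Calling `summarize(&time_iterations(1000, || { /* work */ }))` would give a summary for 1000 timed calls; a real benchmarking tool would typically report percentiles and error estimates as well.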

The author then explains that benchmarking is a very pleasant problem to parallelize, because the code under test can be called from multiple threads with little effort. They have previously blogged about how to achieve this in Rust.
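As a rough illustration of what "pleasingly parallel" means here, the sketch below spawns independent workers that share nothing and are only joined at the end. It uses plain `std::thread` rather than whatever approach the author's earlier post describes; `run_parallel` is a hypothetical name and the loop body is a placeholder for one benchmark iteration.

```rust
use std::thread;

/// Spawns `threads` independent workers. Each worker runs its own loop
/// without touching shared state, and the per-worker counts are combined
/// only after every worker has finished.
fn run_parallel(threads: usize, iterations_per_thread: u64) -> u64 {
    let handles: Vec<thread::JoinHandle<u64>> = (0..threads)
        .map(|_| {
            thread::spawn(move || {
                let mut completed = 0u64;
                for _ in 0..iterations_per_thread {
                    // Placeholder for one benchmark iteration.
                    completed += 1;
                }
                completed
            })
        })
        .collect();

    // The only coordination point: join the workers and sum their results.
    handles
        .into_iter()
        .map(|handle| handle.join().expect("worker thread panicked"))
        .sum()
}
```

Because each worker owns its counter, there is no locking or shared mutable state in the hot loop, which is why one would expect throughput to grow with the number of cores.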

However, the author discovered that a single line of code was causing the program to run slower on the multiprocessor machine. They found that the problem was due to false sharing, where threads running on different cores access different variables that happen to sit on the same cache line, and at least one of them writes. Each write invalidates the other cores' copies of that line, so the line has to be transferred between caches over and over, which slows down the program even though the threads never touch the same variable.
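A minimal sketch of the false-sharing pattern described above, not taken from the original article: two atomic counters are declared next to each other, so they will very likely land on the same cache line, and each is updated by a different thread. The names `SharedLineCounters` and `hammer_unpadded` are hypothetical.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

/// Two independent counters, 16 bytes in total, so on typical hardware
/// they will very likely share a single cache line.
#[derive(Default)]
struct SharedLineCounters {
    a: AtomicU64,
    b: AtomicU64,
}

/// Each thread updates only its own counter, yet every write can
/// invalidate the cache line that also holds the other counter.
fn hammer_unpadded(iterations: u64) -> (u64, u64) {
    let counters = SharedLineCounters::default();
    thread::scope(|scope| {
        scope.spawn(|| {
            for _ in 0..iterations {
                counters.a.fetch_add(1, Ordering::Relaxed);
            }
        });
        scope.spawn(|| {
            for _ in 0..iterations {
                counters.b.fetch_add(1, Ordering::Relaxed);
            }
        });
    });
    (
        counters.a.load(Ordering::Relaxed),
        counters.b.load(Ordering::Relaxed),
    )
}
```

The result is always correct; only the speed suffers. How large the slowdown is depends on the machine, and it tends to be more visible when the two threads run on different sockets.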

The author then explains how they fixed the problem by padding the struct so that each variable occupies its own cache line. They also provide code snippets to demonstrate the problem and the solution.
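One common way to express this kind of padding in Rust is an alignment attribute on a wrapper type. The sketch below is an assumption about how such a fix could look, not the author's actual change: `CachePadded`, `PaddedCounters`, and `hammer_padded` are hypothetical names, and the 64-byte line size is an assumption that holds on many x86-64 CPUs but not on all hardware.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

/// Wrapper that forces its contents onto a 64-byte boundary and rounds
/// the size up to 64 bytes. The cache line size of 64 is an assumption.
#[repr(align(64))]
#[derive(Default)]
struct CachePadded<T>(T);

/// Each counter now occupies its own 64-byte slot.
#[derive(Default)]
struct PaddedCounters {
    a: CachePadded<AtomicU64>,
    b: CachePadded<AtomicU64>,
}

/// Same workload as before: one thread per counter, but the writes no
/// longer land on a line that the other thread is using.
fn hammer_padded(iterations: u64) -> (u64, u64) {
    let counters = PaddedCounters::default();
    thread::scope(|scope| {
        scope.spawn(|| {
            for _ in 0..iterations {
                counters.a.0.fetch_add(1, Ordering::Relaxed);
            }
        });
        scope.spawn(|| {
            for _ in 0..iterations {
                counters.b.0.fetch_add(1, Ordering::Relaxed);
            }
        });
    });
    (
        counters.a.0.load(Ordering::Relaxed),
        counters.b.0.load(Ordering::Relaxed),
    )
}
```

The trade-off is memory: the struct grows from 16 to 128 bytes. The `crossbeam-utils` crate ships a ready-made `CachePadded` type that picks an alignment per target architecture, which may be preferable to hard-coding 64.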

This article highlights the importance of understanding the underlying hardware when writing parallel programs, and how a single line of code can have a significant impact on performance. It also provides valuable insights into how to avoid false sharing in Rust.