Tuning Rust Code for Performance

Since this guide assumes little background in operating systems, we won't talk much about low-level threading concepts such as synchronization and memory barriers. We will focus only on embarrassingly parallel problems, leveraging Rust's thread safety. Next, we will take a brief look at SIMD; this does not require you to handwrite SIMD code, as Rust provides a good abstraction for it. Finally, we will cover some ad-hoc optimizations for this challenge.

Parallelism

Everything we have discussed so far is single-threaded, which is not enough for modern computing: we need to leverage multiple cores to speed up our code. Rust provides a built-in way to do this with the std::thread module, but it is not very intuitive at first, so we will use a crate called rayon instead, which provides easy-to-use parallel iterators.

Before we start, you need to add rayon to your Cargo.toml:

cargo add rayon

Then you can use it like this:

use rayon::prelude::*;

fn main() {
    let data = vec![1, 2, 3, 4, 5];
    let doubled: Vec<_> = data.par_iter().map(|&x| x * 2).collect();
}

This snippet runs the map in parallel, which is safe because the elements of data are independent of each other. Such problems are called embarrassingly parallel: they can easily be divided into smaller independent tasks. If you need to run independent tasks that don't fit the iterator shape, you can use rayon::join (or rayon::spawn):

fn main() {
    let data = vec![5, 1, 8, 22, 0, 44];
    // Run the two closures potentially in parallel and wait for both.
    rayon::join(
        || println!("Sum: {}", foo(&data)),
        || println!("Double Sum: {}", bar(&data)),
    );
}

fn foo(arr: &[i32]) -> i32 {
    arr.iter().sum()
}

fn bar(arr: &[i32]) -> i32 {
    arr.iter().map(|x| x * 2).sum()
}

To get the most out of rayon, we have to understand where the bottleneck is. One problem above is that the per-element workload is trivial, so rayon ends up splitting the data into many tiny tasks. If scheduling overhead pops into your mind, you are right: the cost of dividing the work and synchronizing the results dwarfs the actual computation. To avoid this, we can use par_chunks to give each task a whole chunk of the data, so every core stays busy with a meaningful amount of work:

use rayon::prelude::*;

fn main() {
    let data = vec![1, 2, 3, 4, 5];
    // flat_map_iter (unlike flat_map) takes a serial iterator,
    // so each chunk is processed entirely on one thread.
    let doubled: Vec<_> = data
        .par_chunks(1024)
        .flat_map_iter(|chunk| chunk.iter().map(|&x| x * 2))
        .collect();
}

SIMD

Single Instruction, Multiple Data (SIMD) is a low-level CPU technique. Since most readers are not familiar with computer architecture, we won't go that deep; in short, SIMD executes the same operation on multiple data elements with a single instruction. It is much like rayon, but at the instruction level instead of the thread level. This is a powerful technique for speeding up numerical computing, but unfortunately Rust does not fully support it on stable yet. To turn on portable SIMD, you need to switch to the nightly toolchain and enable a feature flag:

rustup default nightly

The cheapest way to benefit from SIMD is to write loops the compiler can auto-vectorize, but for now the compiler is not smart enough to do this reliably in every case. SIMD also has its limitations: it runs at full speed only when the data is well aligned and the iterations are independent of each other.

Currently, the simplest explicit way to use SIMD in Rust is the portable std::simd API, gated behind the portable_simd feature. Let's take sum again as the example; the code below also uses the multiversion crate (cargo add multiversion) to compile one version of the function per target:

#![feature(portable_simd)]  // release the nightly compiler kraken

use multiversion::{multiversion, target::selected_target};
use std::simd::prelude::*;

#[multiversion(targets("x86_64+avx2+fma", "x86_64+avx", "x86_64+sse2"))]
fn simd_sum(x: &[f32]) -> f32 {
    // SIMD width suggested for whichever target this version was compiled for.
    // (Option::unwrap_or is not const, hence the if-let.)
    const SIMD_WIDTH: usize = const {
        if let Some(width) = selected_target!().suggested_simd_width::<f32>() {
            width
        } else {
            1
        }
    };

    // Split the slice into an unaligned head, an aligned SIMD body, and a tail.
    let (peel, body, tail) = x.as_simd::<SIMD_WIDTH>();
    let simd_sum: Simd<f32, SIMD_WIDTH> = body.iter().copied().sum();
    let scalar_sum: f32 = peel.iter().chain(tail).sum();
    simd_sum.reduce_sum() + scalar_sum
}

The most compatible, but also most painful, way to use SIMD in Rust is the std::arch module, where you write the platform intrinsics by hand. You don't need to do this for this challenge, but it's good to know that you can if you want.

Compilation

The last thing we need to talk about is compilation. Rust has a powerful optimization system, which you can enable by setting the release profile in your Cargo.toml:

[profile.release]
opt-level = 3
lto = "fat"

opt-level controls how aggressively the compiler optimizes, and lto enables link-time optimization, which lets the compiler optimize across crate boundaries, i.e., treat the program as a whole. Meanwhile, you can enable feature flags to optimize the code further; for example, some crates expose a simd feature flag that turns on SIMD code paths. You can also use multiversion to build multiple versions of the same function for different targets, as shown earlier.

When compiling, use the --release flag to enable the release profile:

cargo build --release
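If you will run the binary on the same machine that builds it (an assumption that may not hold in your setup), you can additionally let the compiler emit every instruction your local CPU supports:

```shell
# target-cpu=native tunes codegen for the build machine's CPU;
# the resulting binary may not run on older CPUs.
RUSTFLAGS="-C target-cpu=native" cargo build --release
```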