Parallelizing nested loops in rust with rayon

Solution 1:

What you're trying to do is not as trivial as you might imagine :D But let's give it a shot!

First, let's make a minimal reproducible example, - this is the common way to ask questions on stackoverflow. As you can imagine, we don't know what your code should do. Nor do we have the time to try and figure it out.
We would like to get a simple code piece, which fully describes the problem, copy-paste it, run it and derive a solution.

So here's my minimal example:

#[derive(Debug)]
pub struct Node {
    value: i32,
    other_value: i32,
}

fn repulsion_force(object: &mut [Node]) {
    for i in 0..object.len() {
        for j in i + 1..object.len() {
            let mi = 2 * object[i].value;
            let mj = mi + object[j].value;
            object[i].other_value -= mi;
            object[j].other_value += mj;
        }
    }
}

Firstly i've created a simple node type. Secondly, i've simplified the operations. Note that instead of passing a vector, i'm passing a mutable slice. This form retains more flexibility, in case I migth need to pass a slice form an array for exmaple. Since you're not using push(), there's no need to reference a vector.

So next let's reformulate the problem for parallel computation. First consider the structure of your loops and access pattern. Your're iterating over all the elements in the slice, but for each i iteration, you're only modifying the object at [i] and [j > i]. so let's split the slice according to that pattern

fn repulsion_force(object: &mut [Node]) {
    for i in 0..object.len() {
        let (left, right) = object.split_at_mut(i + 1);
        let mut node_i = &mut left[i];
        right.iter_mut().for_each(|node_j| {
            let mi = 2 * node_i.value;
            let mj = mi + node_j.value;
            node_i.other_value -= mi;
            node_j.other_value += mj;
        });
    }
}

By splitting the slice we are getting two slices. The left slice contains [i], the right slice contains [j > i]. next we rely on an iterator instead of indices for the iteration.

The next step would be to make the internal loop parallel. However, the internal loop modifies node_i at each iteration. That means more than one thread might try to write to node_i at the same time, causing a data race. As such the compiler won't allow it. The solution is to include a synchronization mechanism. For a general type, that might be a mutex. But since you're using standard mathematical operations i've opted for an atomic, as these are usually faster. So we modifiy the Node type and the internal loop to

#[derive(Debug)]
pub struct Node {
    value: i32,
    other_value: AtomicI32,
}

fn repulsion_force(object: &mut [Node]) {
    for i in 0..object.len() {
        let (left, right) = object.split_at_mut(i + 1);
        let mut node_i = &mut left[i];
        right.iter_mut().par_bridge().for_each(|node_j| {
            let mi = 2 * node_i.value;
            let mj = mi + node_j.value;
            node_i.other_value.fetch_sub(mi, Relaxed);
            node_j.other_value.fetch_add(mj, Relaxed);
        });
    }
}

you can test the code with the snippet

fn main() {
    // some arbitrary object vector
    let mut object: Vec<Node> = (0..100).map(|k| Node { value: k, other_value: AtomicI32::new(k) }).collect();
    repulsion_force(&mut object);
    println!("{:?}", object);
}

Hope this help! ;)