Remove Duplicates From Sorted Array

A (hard) variant of the filter collect.

So, today we are going to deduplicate the elements of a sorted vec. The algorithm has to work in place.

We take as a test the following vec:

let mut v: Vec<i32> = (0..SIZE)
    .flat_map(|i| std::iter::repeat(i as i32).take(10))
    .collect();

SIZE is large and each element is repeated 10 times. This is important for us: we assume no value is repeated too many times (otherwise we would need a slightly different solution).

Let’s start with an easy sequential solution: we can just do:

v.dedup()

Hooray! It is already there in the standard library :-) (Vec::dedup removes consecutive repeated elements, which is exactly what we need on a sorted vec.)

On my laptop I get around 93 ms.
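
For reference, the timing harness can be as simple as the sketch below; the exact SIZE is my assumption, chosen so that the total is 100 million elements as in the figures later in the post:

use std::time::Instant;

// Assumption: 10 million distinct values, each repeated 10 times,
// i.e. 100 million elements in total.
const SIZE: usize = 10_000_000;

fn main() {
    let mut v: Vec<i32> = (0..SIZE)
        .flat_map(|i| std::iter::repeat(i as i32).take(10))
        .collect();
    let start = Instant::now();
    v.dedup();
    println!("dedup: {:?}", start.elapsed());
    assert_eq!(v.len(), SIZE); // one copy of each value remains
}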

Now in order to prepare for the parallel code I will write a different sequential algorithm:

fn move_forward(s: &mut [i32]) -> usize {
    // i is the index of the last kept value, j the index being examined;
    // returns the length of the deduplicated prefix.
    if s.is_empty() {
        return 0;
    }
    (1..s.len()).fold(0, |i, j| {
        if s[i] == s[j] {
            // repetition: keep i and just move on to the next j
            i
        } else {
            // new value: move it forward right after the last kept one
            s[i + 1] = s[j];
            i + 1
        }
    }) + 1
}

fn remove_duplicates(nums: &mut Vec<i32>) {
    let new_len = move_forward(nums.as_mut_slice());
    nums.truncate(new_len);
}

In this code the move_forward function uses two indices: i marks the last kept (non-repeated) value and j the value currently being examined. If we detect a repetition we simply advance to the next j, while if we do not, the value at j is moved forward to position i + 1.

On my laptop this code is slightly faster than the standard library version, at around 90 ms.
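
Before moving on, a quick sanity check (a test sketch of mine) that this rewrite behaves like the standard dedup:

#[test]
fn matches_std_dedup() {
    // small sorted input with runs of various lengths
    let mut ours = vec![0, 0, 1, 2, 2, 2, 3, 5, 5];
    let mut std_version = ours.clone();
    remove_duplicates(&mut ours);
    std_version.dedup();
    assert_eq!(ours, std_version); // both give [0, 1, 2, 3, 5]
}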

Can we get any speedup on such a problem?

Parallel Solution

So, I hope you recognized it from the course: this is actually close to a filter collect. The handling of slices adds an extra layer of difficulty though.
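
As a reminder, the filter collect pattern looks roughly like this (my own sketch keeping even numbers; the per-thread Vec plus the LinkedList reduction is the point, the predicate is arbitrary):

use rayon::prelude::*;
use std::collections::LinkedList;

fn par_filter_collect(v: &[i32]) -> Vec<i32> {
    // each task folds its surviving elements into a local Vec,
    // the Vecs are then chained, in order, into a LinkedList
    let list: LinkedList<Vec<i32>> = v
        .par_iter()
        .filter(|&&x| x % 2 == 0)
        .fold(Vec::new, |mut local, &x| {
            local.push(x);
            local
        })
        .map(|local| std::iter::once(local).collect())
        .reduce(LinkedList::new, |mut l1, mut l2| {
            l1.append(&mut l2);
            l1
        });
    // final sequential copy into a single Vec
    list.into_iter().flatten().collect()
}

Our problem follows the same shape, except that the pieces are sub-slices deduplicated in place rather than locally collected values.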

Ok, let’s start. We are going to create a parallel iterator on sub-slices using rayon::iter::split.

fn par_remove_duplicates(nums: &mut Vec<i32>) {
    let mut list = rayon::iter::split((0, nums.as_mut_slice()), |(start, s)| {
        if s.len() < 100_000 {
            // small enough: do not split any further
            ((start, s), None)
        } else {
            let mid = s.len() / 2;
            let mid_value = s[mid];
            // index of the last value before mid differing from mid_value
            // (we assume mid_value does not repeat more than 50k times)
            let end = (0..mid).rev().find(|&i| s[i] != mid_value).unwrap();
            let (left, right) = s.split_at_mut(mid);
            // keep everything up to and including that last differing value;
            // the copies of mid_value between end+1 and mid are dropped since
            // mid_value survives at the start of the right piece
            let (far_left, _) = left.split_at_mut(end + 1);
            ((start, far_left), Some((start + mid, right)))
        }
    })

When asked to split, we cut our slice in two. At the same time we directly handle the midpoint value by making sure it only survives in the right sub-slice. Because the duplicated midpoints sitting just before mid are skipped entirely, we additionally need to remember the start index of each sub-slice within the original vector. The small standalone sketch below replays this splitting step on a toy input.
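
Here split_piece is a helper of mine that mirrors the closure above, with the size threshold lowered to 4 so the split triggers on a toy input; it is only an illustration, not part of the actual code.

fn split_piece(
    start: usize,
    s: &mut [i32],
) -> ((usize, &mut [i32]), Option<(usize, &mut [i32])>) {
    if s.len() < 4 {
        ((start, s), None)
    } else {
        let mid = s.len() / 2;
        let mid_value = s[mid];
        let end = (0..mid).rev().find(|&i| s[i] != mid_value).unwrap();
        let (left, right) = s.split_at_mut(mid);
        let (far_left, _) = left.split_at_mut(end + 1);
        ((start, far_left), Some((start + mid, right)))
    }
}

fn main() {
    let mut v = vec![1, 1, 2, 3, 3, 3, 4, 4];
    let ((left_start, left), right) = split_piece(0, &mut v);
    println!("left piece: start {left_start}, {left:?}"); // start 0, [1, 1, 2]
    if let Some((right_start, right)) = right {
        println!("right piece: start {right_start}, {right:?}"); // start 4, [3, 3, 4, 4]
    }
    // the duplicated 3 at index 3 is skipped entirely, which is why
    // each piece has to carry its start index
}

Back to the real pipeline: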

    .map(|(start, s)| (start, move_forward(s)))

Now we remove (in parallel) all duplicates in each slice.

    .map(|d: (usize, usize)| std::iter::once(d).collect())
    .reduce(LinkedList::new, |mut l1, mut l2| {
        l1.append(&mut l2);
        l1
    })

We collect a linked list containing, for each slice, its starting position and its deduplicated length. We will use these coordinates to move the blocks back to their final positions.

We do it sequentially:

    .into_iter();
    // the first block is already at its final position (it starts at index 0)
    let (_, pos) = list.next().unwrap();
    let new_len = list.fold(pos, |pos: usize, (start, size)| {
        nums.copy_within(start..start + size, pos);
        pos + size
    });
    nums.truncate(new_len);
}

The copy_within method is handy here: it copies a source range of the slice onto another position of the same slice, memmove style, so overlapping ranges are handled (see the small example below). Theoretically we could also move the blocks back in parallel but:

  • it is a bit of a cheat because we can only do it in parallel when there is no overlap between the pieces being moved
  • since the operation is so memory intensive there is no real point in doing so
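
If you have never met copy_within, here is what it does on a tiny example:

// copy_within copies a range of the slice onto another position of the
// same slice, memmove style, so overlapping ranges are allowed
let mut v = vec![10, 20, 30, 40, 50];
v.copy_within(3..5, 1); // copy the last two elements to index 1
assert_eq!(v, [10, 40, 50, 40, 50]);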

Ok, do we really get any performance out of that? Can you compute the Work and the Depth? Do you think it will work in practice?

Analysis

In our example something nice is that we go from 100 million integers down to 10 million. So the second sequential phase is pure overhead but represents less than 10% of the whole work.

The first phase of the algorithm does the moving forward, with the same total work as the sequential algorithm. There is an extra cost for the splits but we can consider each split to be O(1). The list reduction costs O(t), where t is the number of tasks created in the first phase (between p and p * log(n)): the reduction tree has t leaves, hence about 2t nodes in total, and each node costs O(1) (a list concatenation).

Now the last step is the sequential update. This is pure overhead, but how much? Well, the good news is that we already deduplicated the input, so we divided the number of elements by 10. What is more, all comparisons are already done; only memory movements are left. This means the remaining cost is O(n), but looking closely it is c * n/10 where c is less than 1 (in our example at most 10 million i32 values, about 40 MB, need to be moved).

The depth, on the other hand, is log(n) + c * n/10.

Now the last part of the analysis is to estimate a little the pressure on the memory bus. It does not look too good because there are a lot of memory movements for very few computations (essentially the comparisons). This means that we will not get a speedup of 2 (with 2 threads) and that adding more threads will most likely not improve performance.

Plugging this into the work-stealing formula we get an execution time of W/p + D = 45 + 9 = 54, so 54 ms without considering the memory impact (W/p is the roughly 90 ms of sequential work split over 2 threads, and D is dominated by the sequential copy phase, estimated at around 9 ms).

Running it I get 60ms (with no logs).

(Execution trace, one entry per task: both threads alternate between split and reduce tasks taking well under a microsecond and move_forward tasks taking around 6 ms each, about 13 ms for the first, larger ones; the final sequential copy runs on a single thread and takes about 8 ms, the whole run spanning roughly 66 ms.)

As you can see the update phase is not that costly here. Both threads are however competing for memory bandwidth (the move forward tasks complete faster when only one thread is running).