Remove Duplicates From Sorted Array

A (hard) variant of the filter collect.

So, today we are going to deduplicate the elements of a sorted vec. The algorithm has to work in place.

We take as a test the following vec:

let mut v: Vec<i32> = (0..SIZE)
    .flat_map(|i| std::iter::repeat(i as i32).take(10))
    .collect();

SIZE is large and each element is repeated 10 times. This is important for us: we assume no value is repeated too many times (otherwise we would need a slightly different solution).

Let’s start with an easy sequential solution: we can just do:

v.dedup()

Hooray! It is already there in the standard library :-) (Vec::dedup removes consecutive repeated elements, which is exactly what we need on a sorted vec.)

On my laptop I get around 93 ms.
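
For reference, the timing harness can be as simple as the sketch below; the exact SIZE is my assumption, chosen so that the total is 100 million elements as in the figures later in the post:

use std::time::Instant;

// Assumption: 10 million distinct values, each repeated 10 times,
// i.e. 100 million elements in total.
const SIZE: usize = 10_000_000;

fn main() {
    let mut v: Vec<i32> = (0..SIZE)
        .flat_map(|i| std::iter::repeat(i as i32).take(10))
        .collect();
    let start = Instant::now();
    v.dedup();
    println!("dedup: {:?}", start.elapsed());
    assert_eq!(v.len(), SIZE); // one copy of each value remains
}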

Now in order to prepare for the parallel code I will write a different sequential algorithm:

fn move_forward(s: &mut [i32]) -> usize {
    // i is the index of the last kept value, j the index being examined;
    // returns the length of the deduplicated prefix.
    if s.is_empty() {
        return 0;
    }
    (1..s.len()).fold(0, |i, j| {
        if s[i] == s[j] {
            // repetition: keep i and just move on to the next j
            i
        } else {
            // new value: move it forward right after the last kept one
            s[i + 1] = s[j];
            i + 1
        }
    }) + 1
}

fn remove_duplicates(nums: &mut Vec<i32>) {
    let new_len = move_forward(nums.as_mut_slice());
    nums.truncate(new_len);
}

In this code the move_forward function uses two indices: i marks the last kept (non-repeated) value and j the value currently being examined. If we detect a repetition we simply advance to the next j, while if we do not, the value at j is moved forward to position i + 1.

On my laptop this code is slightly faster than the standard library version, at around 90 ms.
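
Before moving on, a quick sanity check (a test sketch of mine) that this rewrite behaves like the standard dedup:

#[test]
fn matches_std_dedup() {
    // small sorted input with runs of various lengths
    let mut ours = vec![0, 0, 1, 2, 2, 2, 3, 5, 5];
    let mut std_version = ours.clone();
    remove_duplicates(&mut ours);
    std_version.dedup();
    assert_eq!(ours, std_version); // both give [0, 1, 2, 3, 5]
}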

Can we get any speedup on such a problem?

Parallel Solution

So, I hope you recognized it from the course: this is actually close to a filter collect. The handling of slices adds an extra layer of difficulty though.
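
As a reminder, the filter collect pattern looks roughly like this (my own sketch keeping even numbers; the per-thread Vec plus the LinkedList reduction is the point, the predicate is arbitrary):

use rayon::prelude::*;
use std::collections::LinkedList;

fn par_filter_collect(v: &[i32]) -> Vec<i32> {
    // each task folds its surviving elements into a local Vec,
    // the Vecs are then chained, in order, into a LinkedList
    let list: LinkedList<Vec<i32>> = v
        .par_iter()
        .filter(|&&x| x % 2 == 0)
        .fold(Vec::new, |mut local, &x| {
            local.push(x);
            local
        })
        .map(|local| std::iter::once(local).collect())
        .reduce(LinkedList::new, |mut l1, mut l2| {
            l1.append(&mut l2);
            l1
        });
    // final sequential copy into a single Vec
    list.into_iter().flatten().collect()
}

Our problem follows the same shape, except that the pieces are sub-slices deduplicated in place rather than locally collected values.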

Ok, let’s start. We are going to create a parallel iterator on sub-slices using rayon::iter::split.

fn par_remove_duplicates(nums: &mut Vec<i32>) {
    let mut list = rayon::iter::split((0, nums.as_mut_slice()), |(start, s)| {
        if s.len() < 100_000 {
            // small enough: do not split any further
            ((start, s), None)
        } else {
            let mid = s.len() / 2;
            let mid_value = s[mid];
            // index of the last value before mid differing from mid_value
            // (we assume mid_value does not repeat more than 50k times)
            let end = (0..mid).rev().find(|&i| s[i] != mid_value).unwrap();
            let (left, right) = s.split_at_mut(mid);
            // keep everything up to and including that last differing value;
            // the copies of mid_value between end+1 and mid are dropped since
            // mid_value survives at the start of the right piece
            let (far_left, _) = left.split_at_mut(end + 1);
            ((start, far_left), Some((start + mid, right)))
        }
    })

When asked to split, we cut our slice in two. At the same time we directly handle the midpoint value by making sure it only survives in the right sub-slice. Because the duplicated midpoints sitting just before mid are skipped entirely, we additionally need to remember the start index of each sub-slice within the original vector. The small standalone sketch below replays this splitting step on a toy input.
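
Here split_piece is a helper of mine that mirrors the closure above, with the size threshold lowered to 4 so the split triggers on a toy input; it is only an illustration, not part of the actual code.

fn split_piece(
    start: usize,
    s: &mut [i32],
) -> ((usize, &mut [i32]), Option<(usize, &mut [i32])>) {
    if s.len() < 4 {
        ((start, s), None)
    } else {
        let mid = s.len() / 2;
        let mid_value = s[mid];
        let end = (0..mid).rev().find(|&i| s[i] != mid_value).unwrap();
        let (left, right) = s.split_at_mut(mid);
        let (far_left, _) = left.split_at_mut(end + 1);
        ((start, far_left), Some((start + mid, right)))
    }
}

fn main() {
    let mut v = vec![1, 1, 2, 3, 3, 3, 4, 4];
    let ((left_start, left), right) = split_piece(0, &mut v);
    println!("left piece: start {left_start}, {left:?}"); // start 0, [1, 1, 2]
    if let Some((right_start, right)) = right {
        println!("right piece: start {right_start}, {right:?}"); // start 4, [3, 3, 4, 4]
    }
    // the duplicated 3 at index 3 is skipped entirely, which is why
    // each piece has to carry its start index
}

Back to the real pipeline: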

    .map(|(start, s)| (start, move_forward(s)))

Now we remove (in parallel) all duplicates in each slice.

    .map(|d: (usize, usize)| std::iter::once(d).collect())
    .reduce(LinkedList::new, |mut l1, mut l2| {
        l1.append(&mut l2);
        l1
    })

We collect a linked list containing, for each slice, its starting position and its deduplicated length. We will use these coordinates to move the blocks back to their final positions.

We do it sequentially:

    .into_iter();
    // the first block is already at its final position (it starts at index 0)
    let (_, pos) = list.next().unwrap();
    let new_len = list.fold(pos, |pos: usize, (start, size)| {
        nums.copy_within(start..start + size, pos);
        pos + size
    });
    nums.truncate(new_len);
}

The copy_within method is handy here: it copies a source range of the slice onto another position of the same slice, memmove style, so overlapping ranges are handled (see the small example below). Theoretically we could also move the blocks back in parallel but:

  • it is a bit of a cheat because we can only do it in parallel when there is no overlap between the pieces being moved
  • since the operation is so memory intensive there is no real point in doing so
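
If you have never met copy_within, here is what it does on a tiny example:

// copy_within copies a range of the slice onto another position of the
// same slice, memmove style, so overlapping ranges are allowed
let mut v = vec![10, 20, 30, 40, 50];
v.copy_within(3..5, 1); // copy the last two elements to index 1
assert_eq!(v, [10, 40, 50, 40, 50]);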

Ok, do we really get any performance out of that? Can you compute the Work and the Depth? Do you think it will work in practice?

Analysis

In our example something nice is that we go from 100 million integers down to 10 million. So the second sequential phase is pure overhead but represents less than 10% of the whole work.

The first phase of the algorithm does the moving forward, with the same total work as the sequential algorithm. There is an extra cost for the splits but we can consider each split to be O(1). The list reduction costs O(t), where t is the number of tasks created in the first phase (between p and p * log(n)): the reduction tree has t leaves, hence about 2t nodes in total, and each node costs O(1) (a list concatenation).

Now the last step is the sequential update. This is pure overhead, but how much? Well, the good news is that we already deduplicated the input, so we divided the number of elements by 10. What is more, all comparisons are already done; only memory movements are left. This means the remaining cost is O(n), but looking closely it is c * n/10 where c is less than 1 (in our example at most 10 million i32 values, about 40 MB, need to be moved).

The depth, on the other hand, is log(n) + c * n/10.

Now the last part of the analysis is to estimate a little the pressure on the memory bus. It does not look too good because there are a lot of memory movements for very few computations (essentially the comparisons). This means that we will not get a speedup of 2 (with 2 threads) and that adding more threads will most likely not improve performance.

Plugging this into the work-stealing formula we get an execution time of W/p + D = 45 + 9 = 54, so 54 ms without considering the memory impact (W/p is the roughly 90 ms of sequential work split over 2 threads, and D is dominated by the sequential copy phase, estimated at around 9 ms).

Running it I get 60ms (with no logs).

(Execution trace, one entry per task: both threads alternate between split and reduce tasks taking well under a microsecond and move_forward tasks taking around 6 ms each, about 13 ms for the first, larger ones; the final sequential copy runs on a single thread and takes about 8 ms, the whole run spanning roughly 66 ms.)

As you can see the update phase is not that costly here. Both threads are however competing for memory bandwidth (the move forward tasks complete faster when only one thread is running).