Number of Lines to Write a String

A nifty parallel algorithm which seems difficult at first glance.

Continuing with some leetcode problems, I present today my take on number of lines to write a string.

We have as input a String containing only letters (we will turn it into a bytes slice with .as_bytes()). What is more we also know the width of any letter, stored in a width slice such that width[0] contains the width of ‘a’, width[1] the width of ‘b’ and … width[25] the width of ‘z’.

We want to print a text but our page only has a total width of 100 so we need to print it on several lines. The algorithm will greedily fill each line sequentially until we try to fill the current line but there is not enough space left. If a character does not fit the remaining space we just start a new line and put it there. All chars have a width <= 100.

Your goal is to find the number of lines needed and the space used on the last line.

Here is a small project to complete.

pub fn number_of_lines(widths: &[i32], s: &str) -> (i32, i32) {
    unimplemented!()
}
Sequential Solution

Ok, here is the sequential solution.

pub fn number_of_lines(widths: &[i32], s: &str) -> (i32, i32) {
    s.as_bytes()

We start by looking at the string like a slice of bytes. This will speed up iterations and allow us some more parallel operations later on.

        .iter()

Let’s loop on all chars.

        .map(|c| *c as usize - 'a' as usize)

Turn each char into the corresponding index between 0 and 25.

        .map(|i| widths[i])

Now each index into the corresponding width so we end up looping on all widths.

        .fold((1, 0), |(lines, used_space), s| {
            if used_space + s > 100 {
                (lines + 1, s)
            } else {
                (lines, used_space + s)
            }
        })
}

Last step, we fold, producing the lines count and the space count. For each width s if it fits we update the space used. If it does not fit the current line we increase the line count and reset space use.

Parallel Solution

So, the parallel solution does not seem trivial because of the fold which propagates a sequential state. At first glance it seems even very bad. Let’s take for example the widths [45, 60, 13, 22, 17, 80]. If we divide and try to recurse we end up with

  • 45, 60, 13 which uses 2 lines with a use of 13 on the last line
  • 22, 17, 80 uses 2 lines with a use of 80 on the last line

The final answer should be [45] [60,13,22] [17, 80] so three lines and 97 space used on the last line.

Now how should we turn (2,13) and (2,80) into (3, 97) ??

It seems bad, right ?

Well, turns out there is a solution:

First I’ll start by adding an intermediate sequential function:

fn scan_string<'a>(
    init: (i32, i32),
    widths: &'a [i32],
    s: &'a [u8],
) -> impl Iterator<Item = (i32, i32)> + 'a {
    s.iter()
        .map(|c| *c as usize - 'a' as usize)
        .map(move |i| widths[i])
        .scan(init, |(ref mut lines, ref mut used_space), s| {
            if *used_space + s > 100 {
                *lines += 1;
                *used_space = s;
            } else {
                *used_space += s;
            }
            Some((*lines, *used_space))
        })
}

This is very similar to our previous code except that the fold is replaced by a scan. This will allow us to loop on all partial results.

Ok, let’s get started.

pub fn par_number_of_lines(widths: &[i32], s: &str) -> (i32, i32) {
    let letters = s.as_bytes();
    let n = letters.len();
    let mut res = letters
        .par_chunks(n / 8)

Let’s take a parallel iterator on 8 subslices. We could have gone with dynamic scheduling but it will be easier to control what’s going on this way.

        .map(|s| (s, scan_string((1, 0), &widths, s).last().unwrap()))

Each slice gets solved sequentially (this time using the scan). Note that we also return the slice itself since we will need it for fusing partial results later on.

        .map(|t| std::iter::once(t).collect::<LinkedList<_>>())
        .reduce(LinkedList::new, |mut l1, mut l2| {
            l1.append(&mut l2);
            l1
        })
        .into_iter();

We turn each result into a linked list and reduce to obtain a list of partial results which is then turned into a sequential iterator.

We now need to loop on all partial results and fuse them.

    let init = res.next().map(|(_, r)| r).unwrap_or((0, 0));

Let’s initialize the final result with the result from the first slice and continue by fusing all results sequentially.

    res.fold(
        init,
        |(previous_lines, previous_space): (i32, i32), (slice, (lines, space))| {
            let (l, s) = match scan_string((0, previous_space), &widths, slice)
                .zip(scan_string((1, 0), &widths, slice))
                .try_fold(
                    (0, 0),
                    |_, ((new_line, new_space), (old_line, old_space))| {
                        if old_space == new_space {
                            Err((lines + new_line - old_line, space))
                        } else {
                            Ok((new_line, new_space))
                        }
                    },
                ) {
                Ok(r) => r,
                Err(r) => r,
            };
            (previous_lines + l, s)
        },
    )
}

Wow wow wow, what is that ? Let’s take again an example, this time with sizes [30, 20, 10, 80, 17, 90, 40, 24]. We have:

  • on the left : [30, 20, 10] [80] which gives 2 lines and space 80,
  • on the right : [17] [90] [40, 24] which gives 3 lines and space 64,
  • the ‘real’ solution : [30, 20, 10] [80, 17] [90] [40, 24] which gives 4 lines and space 64.

Now the nice observation is that [90] [40, 24] is actually the same in the partial solution and in the final solution. This means that if we have the partial solutions then when the real solution reaches ‘90’ there is no need to continue the computations.

So what we can do is compare the solution and the partial solution until there is a match. We don’t store solutions but scan_string is able to loop on them lazily. We can play the real solution by calling:

scan_string((0, previous_space), &widths, slice)

(initial line does not count but we must continue from previous space) and the partial one by calling (again):

scan_string((1, 0), &widths, slice)

(we start a new line with no used space).

So we zip both iterators and replay them until the amount of space used matches. From then, we can abort the loop because both iterators will advance in sync. So we use a try_fold to be able to abort the loop when spaces become equal. The ’else’ branch correspond to a last on the real solution iterator but the ‘if’ branch allows us an early break.

The amazing stuff is that while the worst case is catastrophic O(n), it actually converges very fast in practice.

Here is a log of a run (the given code) if you don’t believe me:

start 0 end 82988 duration 82.00us start 128589 end 139450 duration 10.00us start 140483 end 140830 duration 347ns start 141243 end 17034976 duration 16.00ms start 17036947 end 33367923 duration 16.00ms start 33369099 end 33369568 duration 469ns start 33371106 end 33372127 duration 1.00us start 33373305 end 49673325 duration 16.00ms start 49675428 end 65978608 duration 16.00ms start 65979711 end 65979943 duration 232ns start 65980045 end 65980393 duration 348ns start 452588 end 461286 duration 8.00us start 462277 end 462524 duration 247ns start 462800 end 17200216 duration 16.00ms start 17201191 end 33798929 duration 16.00ms start 33799447 end 33799692 duration 245ns start 33800268 end 33800673 duration 405ns start 33801235 end 50242944 duration 16.00ms start 50243905 end 66632167 duration 16.00ms start 66632688 end 66632843 duration 155ns start 66632959 end 66633187 duration 228ns start 66687784 end 66691495 duration 3.00us

You can see all reduction tasks are instantaneous. The whole run takes 60ms (without logs) on my laptop with two threads while the sequential run takes 120ms. Additional threads don’t improve performances since we become memory bound.