**We are hiring a PhD student to do research on functional high performance computing, including work on Futhark itself. See here for more details.**

# Higher-order parallel programming

Whenever I explain parallel functional programming, whether to students or the barista at a coffee shop, one thing I must contend with is people’s prior experience with parallel programming. Quite often these experiences are with low-level multi-threaded programming, fraught with race conditions and other hazards. Their experience is that parallel programming is difficult and frustrating. And who can blame them? Shared-state multi-threaded programming is certainly one of the most difficult forms of programming I know of. However, this style of programming is neither necessary nor sufficient for parallelism. Concurrent programming can be useful with just a single processor, after all.

But still such preconceptions present a barrier when I have to
explain the kind of parallel programming we support in a language such
as Futhark. My strategy has become to point at NumPy, the Python array library, as an example of a
widely used parallel programming model; one that has shown that
high-level parallel programming can be just as accessible as
sequential programming. In this post I will elaborate on this theme
and show the limitations of NumPy’s *first-order* model, compared to
Futhark’s *higher-order* model.

## NumPy as a bulk-parallel programming model

NumPy is a Python library that makes available an array type,
along with various various functions and operators for manipulating
such arrays. The key feature is that most operations are implicitly
lifted to operate on *entire* arrays, rather than single elements at a
time. For example, if `x`

is a NumPy array, then `x+1`

produces
an array of the same size and type as `x`

, but with `1`

added to
each element. Similarly, if `x`

and `y`

are arrays, then `x+y`

produces their pairwise sum, corresponding to vector addition if `x`

and `y`

are one-dimensional, matrix addition if they are
two-dimensional, and so on. Good NumPy programming is based on these
*bulk operations* that operate on entire arrays, rather than on
individually manipulating elements.

The main advantage in NumPy is that these primitive operations are
implemented in efficient languages, such as C or Fortran, which will
run *much* faster than corresponding Python loops. But the advantage
goes deeper than that: an array addition `x+y`

is *potentially
parallel*. Without knowledge about anything else going on in the
program, we know that we can safely execute this addition in a
parallel manner, for example by using one thread for each element in
the result. Of course, literally launching one thread per element
would not be efficient, as a lone addition is far too little work to
amortise the cost of thread creation. But note how we are already
discussing how to make the parallelism *efficient*, not whether it is
safe or correct to parallelise in the first place!

Now, despite this potential parallelism, stock NumPy does not to my knowledge execute any of its operations in multiple threads. However, other implementations of the NumPy API, such as Numba or CuPy, do! Once we are working with a parallel programming model to start with, actually exploiting parallel execution becomes an engineering problem that is well within reach.

The thing that makes NumPy work as a parallel programming model is the
emphasis on bulk operations that operate on entire arrays. As long as
we have a sequentially consistent view of data *between* these
operations, the NumPy implementation can do whatever parallel tricks it
wishes *inside* them. We get the best of both worlds: sequential,
fully deterministic, line-at-a-time semantics, but (potentially)
efficient parallel execution. As human programmers we have to
express our code in terms of bulk operations, but we *never* have to
worry about race conditions or nondeterminism.

## Limitations of first-order parallel programming

For all its advantages, the NumPy model has weaknesses shared with all similar array programming models (such as the venerable APL). Ultimately, in these models, all you have available as a programmer is a large set of builtin array operations. If you want your own code to be efficient, in must be expressible in terms of these primitives. In some cases they are insufficiently flexible. In particular, it is tricky to define bulk operations that require per-element control flow.

As a somewhat contrived example, consider the problem of applying the following Python function to every element of a NumPy array:

```
def f(x):
if 0 < x:
return sqrt(x)
else:
return x
```

Now, while NumPy does provide a `map`

function, using it is rarely a
good idea. Since the function we apply to each element is an
arbitrary Python function, NumPy is no longer able to dispatch to some
highly-tuned native implementation. Worse, by providing an arbitrary
function, parallelisation is no longer guaranteed to be safe, as that
function may modify global data! To take advantage of the efficient
primitive operations, we end up having to encode the control flow as
data:

```
def sqrt_when_pos_1(x):
= x.copy()
x = x >= 0
x_nonneg = np.sqrt(x[x_nonneg])
x[x_nonneg] return x
```

This is not great. In particular the original function `f`

is
completely gone, so there is not any code re-use going on. Further,
those indexings are rather complex to parallelise, so we probably want
an even nastier formulation (which runs almost twice as fast, even
sequentially):

```
def sqrt_when_pos_2(x):
= x >= 0
x_nonneg = x < 0
x_neg = x * x_nonneg.astype(int)
x_zero_when_neg = x * x_neg.astype(int)
x_zero_when_nonneg = np.sqrt(x_zero_when_neg)
x_sqrt_or_zero return x_sqrt_or_zero + x_zero_when_nonneg
```

Nasty stuff. NumPy has a few ad-hoc mechanisms for “masked execution”, but it doesn’t change the fundamental fact that this is probably not how we like to think about our algorithms. Also, it gets worse. Let us consider the task of computing a Mandelbrot fractal, which essentially boils down to applying the following function to a bunch of independent complex numbers:

```
def divergence(c, d):
= 0
i = c
z while i < d and np.dot(z) < 4.0:
= c + z * z
z = i + 1
i return i
```

So how do we apply this function to every element of a NumPy array?
Handling `if`

in the previous example was bad enough. Handling a
`while`

loop is worse:

```
def mandelbrot(c, d):
= np.zeros(c.shape)
output = np.zeros(c.shape, np.complex32)
z for i in range(d):
= (z.real*z.real + z.imag*z.imag) < 4.0
notdone = i
output[notdone] = z[notdone]**2 + c[notdone]
z[notdone] return output
```

While this program is certainly expressed in terms of parallel bulk
operations, it does not spark joy. The control flow is obscured, it
always runs for `d`

iterations, and it causes a *lot* of memory
traffic, as the intermediate `output`

and `z`

arrays must be
manifested in memory. Compare this to the original `divergence`

function, which just involves a bunch of scalars that could in
principle be stored entirely in registers!

The problem is that NumPy is (practically) a *first-order* programming
model, in the sense that its operations are parameterised by values
(arrays and scalars), not functions. Put simply, NumPy lacks an
efficient `map`

.

## Futhark as a higher-order programming model

I am now going to show how Futhark allows us to expose parallelism
with nested control flow in a natural way. This is not intended as a
criticism of NumPy - higher-order parallel programming is a *very*
tricky thing to implement efficiently, and to a large extent it is
still an active research area, with implementations that are not as
robust as NumPy. In a metaphorical sense, Futhark is balancing on a
knife’s edge on promising more than the compiler can deliver.

But it does deliver here. For the square root problem, we just define our arbitrary scalar function, which looks like this in Futhark:

```
let f x = if 0 < x then f32.sqrt x
else x
```

In Futhark we can `map`

almost any function:

`let sqrt_when_pos xs = map f xs`

It just works, and will run quite fast too. What about Mandelbrot? Just as simple:

```
let divergence (c: complex) (d: i32): i32 =
let (_, i') =
loop (z, i) = (c, 0)
while i < d && dot(z) < 4.0 do
(add_complex c (mult_complex z z),1)
i + in i'
let mandelbrot [n][m] (c: [n][m]complex) (d: i32) : [n][m]i32 =
map (map (\x -> divergence x d)) c
```

For simplicity, I’m not using a complex number library, so things look a bit more awkward than they have to. The full code is available here.

What about performance? I mentioned that the NumPy-style Mandelbrot is inefficient because of excessive memory traffic, but how bad is it really? Comparing GPU-accelerated Futhark with sequential NumPy isn’t fair, but I can implement the NumPy approach in Futhark:

```
let numpy_mandelbrot [n][m] (c: [n][m]complex) (d: i32) : [n][m]i32 =
let nm = n*m
let c' = flatten_to nm c
let output = replicate nm 0
let z = replicate nm (0,0)
let (output, _) =
loop (output, z) for i < d do
let notdone = map (\(a,b) -> (a*a + b*b) < 4) z
let is = map2 (\b i -> if b then i else -1) notdone (iota nm)
let inc = map2 add_complex (map (\x -> mult_complex x x) z) c'
in (scatter output is (replicate nm i),
scatter z is inc)in unflatten n m output
```

This is actually a bit more efficient than the original NumPy
formulation, as I’m avoiding some expensive filters. It sure looks
nasty, but how fast is it? On my AMD Vega 64 GPU and for a 300x300
array, `numpy_mandelbrot`

runs in **7846 microseconds**, while
`mandelbrot`

runs in **110 microseconds**. That’s approaching two
orders of magnitude faster! This is entirely down to `mandelbrot`

being able to keep all its intermediate results in registers, and GPUs
are *ludicrously* fast when they never have to touch memory. In
contrast, `numpy_mandelbrot`

constantly has to shuffle data across
the relatively slow memory bus (350GiB/s), not to mention a lot of
extra synchronisation because many more discrete GPU kernels are
involved.

In conclusion, higher-order parallel programming is just as easy as first-order parallel programming, because it is still race-free and fully deterministic. But it allows us not just more powerful methods of abstraction, but also potentially much better performance.