Whenever I explain parallel functional programming, whether to students or the barista at a coffee shop, one thing I must contend with is people’s prior experience with parallel programming. Quite often these experiences are with low-level multi-threaded programming, fraught with race conditions and other hazards. Their experience is that parallel programming is difficult and frustrating. And who can blame them? Shared-state multi-threaded programming is certainly one of the most difficult forms of programming I know of. However, this style of programming is neither necessary nor sufficient for parallelism. Concurrent programming can be useful with just a single processor, after all.
But still such preconceptions present a barrier when I have to explain the kind of parallel programming we support in a language such as Futhark. My strategy has become to point at NumPy, the Python array library, as an example of a widely used parallel programming model; one that has shown that high-level parallel programming can be just as accessible as sequential programming. In this post I will elaborate on this theme and show the limitations of NumPy’s first-order model, compared to Futhark’s higher-order model.
NumPy is a Python library that makes available an array type,
along with various various functions and operators for manipulating
such arrays. The key feature is that most operations are implicitly
lifted to operate on entire arrays, rather than single elements at a
time. For example, if
x is a NumPy array, then
an array of the same size and type as
x, but with
1 added to
each element. Similarly, if
y are arrays, then
produces their pairwise sum, corresponding to vector addition if
y are one-dimensional, matrix addition if they are
two-dimensional, and so on. Good NumPy programming is based on these
bulk operations that operate on entire arrays, rather than on
individually manipulating elements.
The main advantage in NumPy is that these primitive operations are
implemented in efficient languages, such as C or Fortran, which will
run much faster than corresponding Python loops. But the advantage
goes deeper than that: an array addition
x+y is potentially
parallel. Without knowledge about anything else going on in the
program, we know that we can safely execute this addition in a
parallel manner, for example by using one thread for each element in
the result. Of course, literally launching one thread per element
would not be efficient, as a lone addition is far too little work to
amortise the cost of thread creation. But note how we are already
discussing how to make the parallelism efficient, not whether it is
safe or correct to parallelise in the first place!
Now, despite this potential parallelism, stock NumPy does not to my knowledge execute any of its operations in multiple threads. However, other implementations of the NumPy API, such as Numba or CuPy, do! Once we are working with a parallel programming model to start with, actually exploiting parallel execution becomes an engineering problem that is well within reach.
The thing that makes NumPy work as a parallel programming model is the emphasis on bulk operations that operate on entire arrays. As long as we have a sequentially consistent view of data between these operations, the NumPy implementation can do whatever parallel tricks it wishes inside them. We get the best of both worlds: sequential, fully deterministic, line-at-a-time semantics, but (potentially) efficient parallel execution. As human programmers we have to express our code in terms of bulk operations, but we never have to worry about race conditions or nondeterminism.
For all its advantages, the NumPy model has weaknesses shared with all similar array programming models (such as the venerable APL). Ultimately, in these models, all you have available as a programmer is a large set of builtin array operations. If you want your own code to be efficient, in must be expressible in terms of these primitives. In some cases they are insufficiently flexible. In particular, it is tricky to define bulk operations that require per-element control flow.
As a somewhat contrived example, consider the problem of applying the following Python function to every element of a NumPy array:
def f(x): if 0 < x: return sqrt(x) else: return x
Now, while NumPy does provide a
map function, using it is rarely a
good idea. Since the function we apply to each element is an
arbitrary Python function, NumPy is no longer able to dispatch to some
highly-tuned native implementation. Worse, by providing an arbitrary
function, parallelisation is no longer guaranteed to be safe, as that
function may modify global data! To take advantage of the efficient
primitive operations, we end up having to encode the control flow as
def sqrt_when_pos_1(x): = x.copy() x = x >= 0 x_nonneg = np.sqrt(x[x_nonneg]) x[x_nonneg] return x
This is not great. In particular the original function
completely gone, so there is not any code re-use going on. Further,
those indexings are rather complex to parallelise, so we probably want
an even nastier formulation (which runs almost twice as fast, even
def sqrt_when_pos_2(x): = x >= 0 x_nonneg = x < 0 x_neg = x * x_nonneg.astype(int) x_zero_when_neg = x * x_neg.astype(int) x_zero_when_nonneg = np.sqrt(x_zero_when_neg) x_sqrt_or_zero return x_sqrt_or_zero + x_zero_when_nonneg
Nasty stuff. NumPy has a few ad-hoc mechanisms for “masked execution”, but it doesn’t change the fundamental fact that this is probably not how we like to think about our algorithms. Also, it gets worse. Let us consider the task of computing a Mandelbrot fractal, which essentially boils down to applying the following function to a bunch of independent complex numbers:
def divergence(c, d): = 0 i = c z while i < d and np.dot(z) < 4.0: = c + z * z z = i + 1 i return i
So how do we apply this function to every element of a NumPy array?
if in the previous example was bad enough. Handling a
while loop is worse:
def mandelbrot(c, d): = np.zeros(c.shape) output = np.zeros(c.shape, np.complex32) z for i in range(d): = (z.real*z.real + z.imag*z.imag) < 4.0 notdone = i output[notdone] = z[notdone]**2 + c[notdone] z[notdone] return output
While this program is certainly expressed in terms of parallel bulk
operations, it does not spark joy. The control flow is obscured, it
always runs for
d iterations, and it causes a lot of memory
traffic, as the intermediate
z arrays must be
manifested in memory. Compare this to the original
function, which just involves a bunch of scalars that could in
principle be stored entirely in registers!
The problem is that NumPy is (practically) a first-order programming
model, in the sense that its operations are parameterised by values
(arrays and scalars), not functions. Put simply, NumPy lacks an
I am now going to show how Futhark allows us to expose parallelism with nested control flow in a natural way. This is not intended as a criticism of NumPy - higher-order parallel programming is a very tricky thing to implement efficiently, and to a large extent it is still an active research area, with implementations that are not as robust as NumPy. In a metaphorical sense, Futhark is balancing on a knife’s edge on promising more than the compiler can deliver.
But it does deliver here. For the square root problem, we just define our arbitrary scalar function, which looks like this in Futhark:
let f x = if 0 < x then f32.sqrt x else x
In Futhark we can
map almost any function:
let sqrt_when_pos xs = map f xs
It just works, and will run quite fast too. What about Mandelbrot? Just as simple:
let divergence (c: complex) (d: i32): i32 = let (_, i') = loop (z, i) = (c, 0) while i < d && dot(z) < 4.0 do (add_complex c (mult_complex z z),1) i + in i' let mandelbrot [n][m] (c: [n][m]complex) (d: i32) : [n][m]i32 = map (map (\x -> divergence x d)) c
For simplicity, I’m not using a complex number library, so things look a bit more awkward than they have to. The full code is available here.
What about performance? I mentioned that the NumPy-style Mandelbrot is inefficient because of excessive memory traffic, but how bad is it really? Comparing GPU-accelerated Futhark with sequential NumPy isn’t fair, but I can implement the NumPy approach in Futhark:
let numpy_mandelbrot [n][m] (c: [n][m]complex) (d: i32) : [n][m]i32 = let nm = n*m let c' = flatten_to nm c let output = replicate nm 0 let z = replicate nm (0,0) let (output, _) = loop (output, z) for i < d do let notdone = map (\(a,b) -> (a*a + b*b) < 4) z let is = map2 (\b i -> if b then i else -1) notdone (iota nm) let inc = map2 add_complex (map (\x -> mult_complex x x) z) c' in (scatter output is (replicate nm i), scatter z is inc)in unflatten n m output
This is actually a bit more efficient than the original NumPy
formulation, as I’m avoiding some expensive filters. It sure looks
nasty, but how fast is it? On my AMD Vega 64 GPU and for a 300x300
numpy_mandelbrot runs in 7846 microseconds, while
mandelbrot runs in 110 microseconds. That’s approaching two
orders of magnitude faster! This is entirely down to
being able to keep all its intermediate results in registers, and GPUs
are ludicrously fast when they never have to touch memory. In
numpy_mandelbrot constantly has to shuffle data across
the relatively slow memory bus (350GiB/s), not to mention a lot of
extra synchronisation because many more discrete GPU kernels are
In conclusion, higher-order parallel programming is just as easy as first-order parallel programming, because it is still race-free and fully deterministic. But it allows us not just more powerful methods of abstraction, but also potentially much better performance.