**We are hiring a PhD student to do research on functional high performance computing, including work on Futhark itself. See here for more details.**

# Cost models are contracts

The most precise way of determining the performance of a program is running it and measuring how long it takes. Unfortunately, this tells us nothing about how the program will behave on different workloads or on another machine. Also, profiling your code all the time is very tedious.

Instead, programmers use a *cost model* to reason about the
performance of their code in a more general manner. We usually (but
not always) disregard small constants, and use big O
notation to describe
the *asymptotic* behaviour of a program’s runtime relative to the
workload. Most cost models tend to be informal and even
unstated. Every day we reason in terms of assumptions such as “all
primitive operations take constant time” and “loops execute their
iterations one after another”. These assumptions may seem obvious or
unavoidable, and most programmers are probably not even aware that
they are employing a cost model at all. Formalising precisely what it
means for performance that `a; b`

runs first `a`

and then `b`

may seem
like pointless academic puffery. However, in more complicated
operational settings, when we add concepts such as parallelism,
concurrency, or nondeterminism, it stops being intuitive just how fast
some program will be, and writing down a precise cost model can be
quite useful academic puffery. This also helps programmers and
algorithm designers reason about performance without focusing on
specific machines or implementations. The cost model becomes a
*contract*: by applying its rules, we get a (usually asymptotic)
*bound* on the “cost” (usually runtime) of a program. A language
implementation promises that when the program is run in practice, it
must be behave at least as well as promised by the cost model. Now,
I’m an idealist hacker and the word “contract” does sound unpalatably
business-like to me. But ultimately a contract is really a *promise*,
and I don’t mind making promises to my friends.

We intend to specify Futhark in terms of a fairly simple
*language-based cost model*, although we’re not all the way there yet.
In this post I will show some examples to illustrate the idea,
demonstrate how the rules can be used to communicate crucial
operational properties and restrictions, and finally show how simple
cost models (as the one we’ll see here) are not able to express things
that are *almost* always true, but cannot be guaranteed. Cost models
are contracts, which means that we should not break them, but we
should make sure to read the fine print. We’ll also look at some
cases where I think a full formalisation of the cost model actually
ends up inhibiting readability of the details that matter to a
programmer. I’m still not quite sure myself how to use the cost model
in documentation.

## A language-based cost model for Futhark

First, let us discuss the idea of a *compositional language-based cost
model* (I promise this is the longest italicised term in this post).
The idea is to define a cost function *W(e)* that for some expression
*e* tells us the “cost” of evaluating *e*. For now this “cost” will
be the total amount of work needed, but it could in principle be any
quantity (e.g. peak memory usage), and we’ll see a more exotic cost
function later.

As a starting point, we define that evaluating a constant requires constant work:

`W(k) = 1`

Evaluating an addition *x+y* requires us to evaluate the
left-hand-side, the right-hand side, and finally perform the addition
itself:

`W(x+y) = W(x) + W(y) + 1`

Defining the cost of an expression in terms of the cost of its
subexpressions is what makes our model *compositional*: the cost of an
expression does not depend on its context. This allows us to analyse
the cost of a program in small pieces whose results are then combined,
which is much simpler than having to consider the program as a whole.
The cost model is *language-based* because the cost function is
defined directly on expressions of the user-visible language, rather
than by first transforming the language to some kind of machine
representation. Cost models that are neither compositional nor
language-based exist, and can be useful, but they are more complicated
to work with. For example, a cost model for a language such as
Haskell, with lazy evaluation, probably will not be compositional and
language-oriented, since the cost of a term is constant until its
value is needed, which depends on how it is used. You’d probably need
some kind of machine model to track that.

Even though we have defined just two trivial rules so far, we have already introduced an interesting detail: the cost of an addition does not depend on the value being added. This implies that numbers must be bounded in size - an implementation that transparently used bignums would not be faithful to the contract. And indeed, Futhark’s built-in number types are all of fixed size.

Now let us consider an expression where the cost model *does* depend
on a value at runtime. The expression `iota e`

creates an array of
`e`

elements. If we denote the value resulting from evaluating an
expression `e`

as `eval(e)`

, we can describe the cost of `iota e`

as:

`W(iota e) = W(e) + eval(e)`

This means that creating the array of `n`

elements takes *O(n)* time.
This allows an obvious implementation strategy: allocate the array and
write each of the values. But what if we instead defined the cost
like this?

`W(iota e) = W(e) + 1`

Now no matter the size of the array we produce, the cost is the same.
This also suggests an implementation strategy: instead of actually
constructing the array in memory, represent it symbolically (or
lazily) as just its size. This works because the element at position
`i`

of an `iota`

array is exactly `i`

. Generally, when we define a
cost model for a language, we are free to promise whatever we want,
but each promise makes the language more difficult to implement.
Nothing prevents us from defining a language where everything is
specified to take constant time, but we might find it challenging to
write a compiler that upholds that contract. This is why Futhark’s
`iota`

specifies work linear in the size of the resulting array. In
*most cases*, the compiler will do better than this, and not actually
manifest the array in memory, but there are edge cases where it is not
practical to guarantee constant cost. I’ll return to this later, but
first I want to talk about parallelism.

## A parallel cost model

Work is not the only measure of cost we can define for a language.
Space usage is also an obvious candidate. But for parallel languages
such as Futhark, the most interesting measure is the
*span*. Intuitively, the
span *S(e)* of an expression *e* is the length of the longest path in
the data dependency graph that occurs during execution. Or put
another way, the length of the longest chain of sequential
dependencies. On an infinitely parallel computer, we can execute a
program with span *s* in *s* steps. While we don’t have infinitely
parallel computers available to us (and if we did, they’d all be stuck
in cryptomining farms anyway), Brent’s
Lemma tells
us that we can simulate an infinitely parallel computer on a finitely
parallel computer, with overhead proportional to the amount of
parallelism that we are “missing” relative to what the program could
potentially exploit. This means that we can use the span of a program
as a theoretical model for quantifying “how parallel” a program or
algorithm is in principle, and expect that this is directly connected
to how it will run on a real parallel computer.

Let us take a look at what a parallel cost model looks like for Futhark. First, the span of constants:

`S(k) = 1`

Unsurprisingly, constant. Addition is more interesting, as we have a choice to make. One choice is to define the span like this:

`S(x+y) = S(x) + S(y) + 1`

This suggests an implementation that first evaluates `x`

to
completion, then `y`

to completion, then performs the addition - just
as in a sequential language. Essentially we are told the very useful
piece of information that *addition operator application is not a
source of parallelism*. Now consider another equally valid way of
defining the span:

`S(x+y) = max(S(x), S(y)) + 1`

This tells us that the span is the maximum of the span of `x`

and `y`

.
This requires an implementation that evaluates `x`

and `y`

in
parallel, waits for both of them to finish, and then performs the
addition. Such a strategy is perfectly valid in a pure language such
as Futhark, where expressions cannot have side effects, and any
evaluation order is valid.

So which of these rules for `x+y`

do we actually define for the
Futhark cost model? Even though Futhark is a parallel language, and
so it seems tempting to *maximise parallelism* even in the cost model,
we actually pick the former rule, the one that does *not* promise
parallel evaluation of the operands. This is because *cost models are
contracts*. They are not for pointing out what *can* be done, they
are for pointing out what *must* be done: what the programmer can
absolutely rely on, and use to describe the asymptotic complexity of
their programs. In practice, ensuring *efficient* parallel execution
of everything that could *in principle* be executed in parallel is is
very difficult. Efficiently scheduling huge quantities of tiny
unstructured and heterogeneous nuggets of parallelism would be a major
research *and* engineering challenge. So despite parallelism
definitely being *the point* of Futhark, there are actually very few
language constructs that promise parallel execution, with `map`

being
the most important one. A cost model is a clear and precise way of
communicating such details to the programmer.

The cost functions above are quite simple. If we also want to handle
variable bindings, we need either symbol tables or substitution
semantics. For example, we might define the work of a `let`

-binding
like this:

```
W(let x = e1 in e2) =
W(e1) + W(e2[x↦eval(e1)])
```

This says that the cost is the cost of `e1`

, plus the cost of `e2`

after we first replace any instances of `x`

with the result of
evaluating `e1`

(in principle we’d also need to define the evaluation
semantics for `eval(e1)`

to be meaningful).

The rules get more complex, but not more interesting, as we introduce
other binding constructs such as pattern matching. If we want loops,
we need to incorporate fixed points of some kind into our cost
functions. This can get quite technically hairy, and while necessary
if we want to fully formalise or mechanise the cost model, it is not
really useful when using the cost model as part of language or library
documentation. You can have a perfectly accurate intuition for
something, only to have your confidence shaken by seeing someone
typing it up precisely with a bunch of Greek letters. I’m not quite
sure where to draw the line in documentation meant for humans. Maybe
documenting the cost model in terms of a simplified subset, and then
letting the cost of the full language be the “intuitive” extension?
This can perhaps work, but I’m not yet sure how to extend it to
higher-order functions such as `map`

.

### The cost of `map`

So speaking of `map`

, let us consider how to specify its cost,
starting with the work. Suppose the expression is `map f x`

.
Intuitively, we must evaluate `f`

(which can be an expression that
*produces* a function value) and `x`

, and then sum the cost of
applying the function to every element of `x`

. Then we define the
work of `map f x`

:

`W(map f x) = W(f) + W(x) + sum({W(f x[i]) for x[i] in x})`

This is not too bad, since we can precisely specify exactly which
arguments `f`

will be applied to. Note that we have not actually
defined a rule for the cost of the function application `f x[i]`

in
this post. It looks much like the one for `let`

-bindings, and is not
otherwise interesting.

Now let’s consider the span:

`S(map f x) = S(f) + S(x) + max({S(f x[i]) for x[i] in x})`

This suggests an implementation strategy: Launch a thread for every
element of `x`

and wait for each thread to finish, which means the
total time will be the time of the slowest application. This is
really useful information, since it also tells the programmer that
load-imbalanced `map`

s are a bad idea: they will run at the speed of
the slowest element. In practice, oversubscription and automatic load
balancing may reduce the impact, but this is not a *promise* made by the
cost model. This guides us in the way we write programs.

Interestingly (or embarrassingly), the GPU backends of the Futhark
compiler actually break this contract. Since our flattening
algorithm
currently only supports *regular* nested parallelism, any inner
instances of `map`

will be sequentialised if their size is variant to
any of the outer parallel levels. If we wanted to express this in the
cost model, it would no longer be compositional, since the span of an
expression would depend on where it appears. Improving the compiler
to support full flattening when needed is part of the
plan,
and the multicore backends faithfully implement the cost model right
now.

Another interesting case is `reduce op ne x`

, which reduces an array
`x`

with the binary operator `op`

that has neutral element `ne`

. What
is the span of a reduction? If you took our
course you will know that
for an array of length *n*, the span ought to be *log(n)*, but what if
the reduction operator itself does not have constant cost? For `map`

we could easily enumerate all the arguments that we passed to the
function `f`

, but the reduction operator `op`

will be applied not just
to elements of the array `x`

, but also to results of previous
applications of `op`

. And the *precise* values depend on the specific
order of evaluation used by whatever reduction algorithm the language
implementation happens to use, which is something we *definitely*
don’t want to specify in the cost model (see here for some different
ways of implementing reduction in
Futhark).
Almost no reduction operator will actually be ill behaved (most of
them do constant work on scalars), but *cost models are contracts*, so
we should be sure we do not promise too much. Currently, this is the
best I can come up with:

`S(reduce op ne x) = S(op) + S(ne) + S(x) + log(length(eval(x))) * W(op a b)`

where `a`

and `b`

are those hypotehtical values that would lead to the
*slowest possible* execution of `op a b`

. In parts of the
documentation you might also see that we simply say *W(op)* and
understand this to mean the worst-case cost of applying *op* to
*whatever it happens to encounter during execution*. We also tend to
not mention the cost of evaluating the `reduce`

arguments `ne`

and
`x`

, as this is *obviously* needed and always the same, and so not
particularly interesting.

The reason I am so concerned with `reduce`

(and don’t want to make it
too complicated) is that there is actually something very interesting
in the above rule that risks disappearing among the bookkeeping: the
span of `reduce op ne x`

depends on the *work* of `op`

! This tells us
precisely that Futhark’s `reduce`

*does not exploit any parallelism in
the reduction operator!* That can be quite important to a programmer.

This decision could have been made differently, but was picked to
nudge the programmer in the direction of writing code that is easy to
execute efficiently on a parallel machine. While is is possible to
implement `reduce`

such that it can exploit inner parallelism in the
reduction operator (and you can do so yourself), it is tricky to do so
efficiently, and in most cases not worth it.

## The even finer print

A cost model is a contract that makes promises about the worst case.
Unfortunately, there are some language constructs that are *almost
always* asymptotically faster than in the worst case. For example,
the `reverse`

function is actually implemented by returning a *view*
of the array, not by allocating space for a new array and copying the
array elements in reverse order. This means `reverse`

runs in
constant time no matter the size of the array.

In Futhark, views are not part of the type system. To the user,
`reverse`

returns an array. Inside the compiler, such views are
implemented by tweaking how arrays are mapped to memory blocks. If
`ys=reverse xs`

, then both `xs`

and `ys`

are associated with the same
piece of memory, but differ in the *index function* that maps array
indexes to memory offsets. Such index functions are not stored
dynamically as part of the array, but are a piece of compile-time
information that is used to generate the actual index calculation code
during code generation. (This reminds me that we should really write
a post on Futhark’s internal array/memory representation some day -
it’s quite nifty, especially after Philip
Munksgaards improvements).

Unfortunately, there are some cases where those array views cannot be
maintained and we are forced to manifest the array in memory. One
case is when the array is returned from an entry point. Another is
when it is returned from an `if`

branch where the array returned by
the other branch uses a substantially different index function
(typically this requires heavy abuse of all of
`flatten`

/`unflatten`

/`transpose`

/`rotate`

). Although these cases may
be uncommon, we should still be careful to make only promises that we
keep. I see two options:

Specify the cost of

`reverse`

,`transpose`

and similar index transformations as having cost proportional to the size of the array, suggesting that they actually perform copies.Specify that the cost of

`if`

and every similar construct has cost proportional to the size of the arrays they return, as this might force manifestation of views.

Option *2* is attractive because it places the cost where it
operationally actually happens. However, I think option *1* is
actually more elegant, because the cost is more directly associated
with the construct that produces a value. Further, in almost all
cases where we use an array view, we soon after perform some operation
(such as `map`

ing the array) that has at least the same cost as a
copy, so at a program level, the total asymptotic cost will be the
same. Also, option *1* allows for naive implementations (such as our
very own interpreter) that don’t bother with any of this “view”
business and just put together the new array as instructed.

This is a distinction that it is unfortunately difficult to be precise
about: when making promises we must talk about the worst case, and I
don’t know of a simple way to precisely quantify that the common case
is *much* faster. We could certainly formalise it fully, but I don’t
think the resulting model would look very simple. A promise is no
good if the recipient does not understand what is being promised.