Search icon
Cart icon
Close icon
You have no products in your basket yet
Save more on your purchases!
Savings automatically calculated. No voucher code required
Arrow left icon
All Products
Best Sellers
New Releases
Learning Hub
Free Learning
Arrow right icon
Over 7,000 tech titles at $9.99 each with AI-powered learning assistants on new releases
Haskell High Performance Programming
Haskell High Performance Programming

Haskell High Performance Programming: Write Haskell programs that are robust and fast enough to stand up to the needs of today

By Samuli Thomasson
$43.99 $9.99
Book Sep 2016 408 pages 1st Edition
$43.99 $9.99
$15.99 Monthly
$43.99 $9.99
$15.99 Monthly

What do you get with eBook?

Product feature icon Instant access to your Digital eBook purchase
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
Buy Now
Table of content icon View table of contents Preview book icon Preview Book

Haskell High Performance Programming

Chapter 1. Identifying Bottlenecks

You have probably at least once written some very neat Haskell you were very proud of, until you test the code and it took ages to give an answer or even ran out of memory. This is very normal, especially if you are used to performance semantics in which performance can be analyzed on a step-by-step basis. Analyzing Haskell code requires a different mental model that is more akin to graph traversal.

Luckily, there is no reason to think that writing efficient Haskell is sorcery known only by math wizards or academics. Most bottlenecks are straightforward to identify with some understanding of Haskell's evaluation schema. This chapter will help you to reason about the performance of Haskell programs and to avoid some easily recognizable patterns of bad performance:

  • Understanding lazy evaluation schemas and their implications

  • Handling intended and unintended value memoization (CAFs)

  • Utilizing (guarded) recursion and the worker/wrapper pattern efficiently

  • Using accumulators correctly to avoid space leaks

  • Analyzing strictness and space usage of Haskell programs

  • Important compiler code optimizations, inlining and fusion

Meeting lazy evaluation

The default evaluation strategy in Haskell is lazy, which intuitively means that evaluation of values is deferred until the value is needed. To see lazy evaluation in action, we can fire up GHCi and use the :sprint command to inspect only the evaluated parts of a value. Consider the following GHCi session:

> let xs = enumFromTo 1 5 :: [Int]
> :sprint xs
xs = _
> xs !! 2
> :sprint xs
xs = 1 : 2 : 3 : _


The code bundle for the book is also hosted on GitHub at We also have other code bundles from our rich catalog of books and videos available at Check them out!

Underscores in the output of :sprint stand for unevaluated values. The enumFromTo function builds a linked list lazily. At first, xs is only an unevaluated thunk. Thunks are in a way pointers to some calculation that is performed when the value is needed. The preceding example illustrates this: after we have asked for the third element of the list, the list has been evaluated up to the third element. Note also how pure values are implicitly shared; by evaluating the third element after binding the list to a variable, the original list was evaluated up to the third element. It will remain evaluated as such in memory until we destroy the last reference to the list head.

The preceding figure is a graphical representation of how a list is stored in memory. A T stands for a thunk; simple arrows represent pointers.

The preceding scenario is otherwise identical to the previous one, but now the list is polymorphic. Polymorphic values are simply functions with implicit parameters that provide the required operations when the type is specialized.


Be careful with :sprint and ad hoc polymorphic values! For example, xs' = enumFromTo 1 5 is by default given the type Num a => [a]. To evaluate such an expression, the type for a must be filled in, meaning that in :sprint xs', the value xs' is different from its first definition. Fully polymorphic values such as xs'' = [undefined, undefined] are okay.

A shared value can be either a performance essential or an ugly space leak. Consider the following seemingly similar scenarios (run with ghci +RTS -M20m to not throttle your computer):

> Data.List.foldl' (+) 0 [1..10^6]

> let xs = [1..10^6] :: [Int]
> Data.List.foldl' (+) 0 xs
<interactive>: Heap exhausted;

So what happened? By just assigning the list to a variable, we exhausted the heap of a calculation that worked just fine previously. In the first calculation, the list could be garbage-collected as we folded over it. But in the second scenario, we kept a reference to the head of the linked list. Then the calculation blows up, because the elements cannot be garbage-collected due to the reference to xs.

Writing sum correctly

Notice that in the previous example we used a strict variant of left-fold called foldl' from Data.List and not the foldl exported from Prelude. Why couldn't we have just as well used the latter? After all, we are only asking for a single numerical value and, given lazy evaluation, we shouldn't be doing anything unnecessary. But we can see that this is not the case (again with ghci +RTS -M20m):

> Prelude.foldl (+) 0 [1..10^6]
<interactive>: Heap exhausted;

To understand the underlying issue here, let's step away from the fold abstraction for a moment and instead write our own sum function:

mySum :: [Int] -> Int
mySum     [] = 0
mySum (x:xs) = x + mySum xs

By testing it, we can confirm that mySum too exhausts the heap:

> :load sums.hs
> mySum [1..10^6]
<interactive>: Heap exhausted;

Because mySum is a pure function, we can expand its evaluation by hand as follows:

mySum [1..100]
    = 1 + mySum [2..100]
    = 1 + (2 + mySum [2..100])
    = 1 + (2 + (3 + mySum [2..100]))
    = ...
    = 1 + (2 + (... + mySum [100]))
 = 1 + (2 + (... + (100 + 0)))

From the expanded form we can see that we build up a huge sum chain and then start reducing it, starting from the very last element on the list. This means that we have all the elements of the list simultaneously in memory. From this observation, the obvious thing we could try is to evaluate the intermediate sum at each step. We could define a mySum2 which does just this:

mySum2 :: Int -> [Int] -> Int
mySum2 s []     = s
mySum2 s (x:xs) = let s' = s + x in mySum2 s' xs

But to our disappointment mySum2 blows up too! Let's see how it expands:

mySum2 0 [1..100]
    = let s1 = 0 + 1 in mySum2 s1 [2..100]
    = let s1 = 0 + 1
          s2 = s1 + 2
          in mySum2 s2 [2..100]
    = let s1 = 0 + 1
          s100 = s99 + 100
          in mySum2 s100 []
    = s100
    = s99 + 100
    = (s89 + 99) + 100
    = ((1 + 2) + ... ) + 100

Oops! We still build up a huge sum chain. It may not be immediately clear that this is happening. One could perhaps expect that 1 + 2 would be evaluated immediately as 3 as it is in strict languages. But in Haskell, this evaluation does not take place, as we can confirm with :sprint:

> let x = 1 + 2 :: Int
> :sprint x
x = _

Note that our mySum is a special case of foldr and mySum2 corresponds to foldl.

Weak head normal form

The problem in our mySum2 function was too much laziness. We built up a huge chain of numbers in memory only to immediately consume them in their original order. The straightforward solution then is to decrease laziness: if we could force the evaluation of the intermediate sums before moving on to the next element, our function would consume the list immediately. We can do this with a system function, seq:

mySum2' :: [Int] -> Int -> Int
mySum2' []     s = s
mySum2' (x:xs) s = let s' = s + x
                       in seq s' (mySum2' xs s')

This version won't blow up no matter how big a list you give it. Speaking very roughly, the seq function returns its second argument and ties the evaluation of its first argument to the evaluation of the second. In seq a b, the value of a is always evaluated before b. However, the notion of evaluation here is a bit ambiguous, as we will see soon.

When we evaluate a Haskell value, we mean one of two things:

  • Normal Form (NF): A fully evaluated value; for example when we show a value it will eventually be fully evaluated

  • Weak Head Normal Form (WHNF): Evaluated up to the first data constructor. seq evaluates its argument to WHNF only

Consider the following GHCi session:

> let t = const (Just "a") () :: Maybe String
> :sprint t
t = _
> t  `seq` ()
> :sprint t
t = Just _

Even though we seq the value of t, it was only evaluated up to the Just constructor. The list of characters inside was not touched at all. When deciding whether or not a seq is necessary, it is crucial to understand WHNF and your data constructors. You should take special care of accumulators, like those in the intermediate sums in the mySum* functions. Because the actual value of the accumulator is often used only after the iterator has finished, it is very easy to accidentally build long chains of unevaluated thunks.


We could have annotated strictness more cleanly with the strict cousin of ($), the ($!) operator: mySum2' s (x:xs) = mySum2' xs $! s + x.

($!) is defined as f $! x = x `seq` f x.

Folding correctly

Now going back to folds, we recall that both foldl and foldr failed miserably, while (+) . foldl' instead did the right thing. In fact, if you just turn on optimizations in GHC then foldl (+) 0 will be optimized into a tight constant-space loop. This is the mechanism behind why we can get away with Prelude.sum, which is defined as foldl (+) 0.

What do we then have the foldr for? Let's look at the evaluation of foldr f a xs:

foldr f a [x1,x2,x3,x4,...]
    = f x1 (foldr f a [x2,x3,x4,...])
    = f x1 (f x2 (foldr f a [x3,x4,...]))
    = f x1 (f x2 (f x3 (foldr f a [x4,...])))

Note that, if the operator f isn't strict in its second argument, then foldr does not build up chains of unevaluated thunks. For example, let's consider foldr (:) [] [1..5]. Because (:) is just a data constructor, it is for sure lazy in its second (and first) argument. That fold then simply expands to 1 : (2 : (3 : ...)), so it is the identity function for lists.

Monadic bind (>>) for the IO monad is another example of a function that is lazy in its second argument:

foldr (>>) (return ()) [putStrLn "Hello", putStrLn "World!"]

For those monads whose actions do not depend on later actions, (that is, printing "Hello" is independent from printing "World!" in the IO monad), bind is non-strict in its second argument. On the other hand, the list monad is an example where bind is generally non-strict in its second argument. (Bind for lists is strict unless the first argument is the empty list.)

To sum up, use a left-fold when you need an accumulator function that is strict in both its operands. Most of the time, though, a right-fold is the best option. And with infinite lists, right-fold is the only option.

Memoization and CAFs

Memoization is a dynamic programming technique where intermediate results are saved and later reused. Many string and graph algorithms make use of memoization. Calculating the Fibonacci sequence, instances of the knapsack problem, and many bioinformatics algorithms are almost inherently solvable only with dynamic programming. A classic example in Haskell is the algorithm for the nth Fibonacci number, of which one variant is the following:

-- file: fib.hs

fib_mem :: Int -> Integer
fib_mem = (map fib [0..] !!)
  where fib 0 = 1
        fib 1 = 1
        fib n = fib_mem (n-2) + fib_mem (n-1)

Try it with a reasonable input size (10000) to confirm it does memoize the intermediate numbers. The time for lookups grows in size with larger numbers though, so a linked list is not a very appropriate data structure here. But let's ignore that for the time being and focus on what actually enables the values of this function to be memoized.

Looking at the top level, fib_mem looks like a normal function that takes input, does a computation, returns a result, and forgets everything it did with regard to its internal state. But in reality, fib_mem will memoize the results of all inputs it will ever be called with during its lifetime. So if fib_mem is defined at the top level, the results will persist in memory over the lifetime of the program itself!

The short story of why memoization is taking place in fib_mem stems from the fact that in Haskell functions exist at the same level with normal values such as integers and characters; that is, they are all values. Because the parameter of fib_mem does not occur in the function body, the body can be reduced irrespective of the parameter value. Compare fib_mem to this fib_mem_arg:

fib_mem_arg :: Int -> Integer
fib_mem_arg x = map fib [0..] !! x
  where fib 0 = 1
        fib 1 = 1
        fib n = fib_mem_arg (n-2) + fib_mem_arg (n-1)

Running fib_mem_arg with anything but very small arguments, one can confirm it does no memoization. Even though we can see that map fib [0..] does not depend on the argument number and could be memorized, it will not be, because applying an argument to a function will create a new expression that cannot implicitly have pointers to expressions from previous function applications. This is equally true with lambda abstractions as well, so this fib_mem_lambda is similarly stateless:

fib_mem_lambda :: Int -> Integer
fib_mem_lambda = \x -> map fib [0..] !! x
  where fib 0 = 1
        fib 1 = 1
        fib n = fib_mem_lambda (n-2) + fib_mem_lambda (n-1)

With optimizations, both fib_mem_arg and fib_mem_lambda will get rewritten into a form similar to fib_mem. So in simple cases, the compiler will conveniently fix our mistakes, but sometimes it is necessary to reorder complex computations so that different parts are memoized correctly.


Be wary of memoization and compiler optimizations. GHC performs aggressive inlining (Explained in the section, Inlining and stream fusion) as a routine optimization, so it's very likely that values (and functions) get recalculated more often than was intended.

Constant applicative form

The formal difference between fib_mem and the others is that the fib_mem is something called a constant applicative form, or CAF for short. The compact definition of a CAF is as follows: a supercombinator that is not a lambda abstraction. We already covered the not-a-lambda abstraction, but what is a supercombinator?

A supercombinator is either a constant, say 1.5 or ['a'..'z'], or a combinator whose subexpressions are supercombinators. These are all supercombinators:

\n -> 1 + n
\f n -> f 1 n
\f -> f 1 . (\g n -> g 2 n)

But this one is not a supercombinator:

\f g -> f 1 . (\n -> g 2 n)

This is because g is not a free variable of the inner lambda abstraction.

CAFs are constant in the sense that they contain no free variables, which guarantees that all thunks a CAF references directly are also constants. Actually, the constant subvalues are a part of the value. Subvalues are automatically memoized within the value itself.

A top-level [Int], say, is just as valid a value as the fib_mem function for holding references to other values. You should pay attention to CAFs in your code because memoized values are space leaks when the memoization was unintended. All code that allocates lots of memory should be wrapped in functions that take one or more parameters.

Recursion and accumulators

Recursion is perhaps the most important pattern in functional programming. Recursive functions are more practical in Haskell than in imperative languages, due to referential transparency and laziness. Referential transparency allows the compiler to optimize the recursion away into a tight inner loop, and laziness means that we don't have to evaluate the whole recursive expression at once.

Next we will look at a few useful idioms related to recursive definitions: the worker/wrapper transformation, guarded recursion, and keeping accumulator parameters strict.

The worker/wrapper idiom

Worker/wrapper transformation is an optimization that GHC sometimes does, but worker/wrapper is also a useful coding idiom. The idiom consists of a (locally defined, tail-recursive) worker function and a (top-level) function that calls the worker. As an example, consider the following naive primality test implementation:

-- file: worker_wrapper.hs

isPrime :: Int -> Bool
isPrime n
    | n <= 1    = False
    | n <= 3    = True
    | otherwise = worker 2
         worker i | i >= n       = True
                  | mod n i == 0 = False
                  | otherwise    = worker (i+1)

Here, isPrime is the wrapper and worker is the worker function. This style has two benefits. First, you can rest assured it will compile into optimal code. Second, the worker/wrapper style is both concise and flexible; notice how we did preliminary checks in the wrapper code before invoking the worker, and how the argument n is also (conveniently) in the worker's scope too.

Guarded recursion

In strict languages, tail-call optimization is often a concern with recursive functions. A function f is tail-recursive if the result of a recursive call to f is the result. In a lazy language such as Haskell, tail-call "optimization" is guaranteed by the evaluation schema. Actually, because in Haskell evaluation is normally done only up to WHNF (outmost data constructor), we have something more general than just tail-calls, called guarded recursion. Consider this simple moving average implementation:

-- file: sma.hs
sma :: [Double] -> [Double]
sma (x0:x1:xs) = (x0 + x1) / 2 : sma (x1:xs)
sma         xs = xs

The sma function is not tail-recursive, but nonetheless it won't build up a huge stack like an equivalent in some other language might do. In sma, the recursive callis guarded by the (:) data constructor. Evaluating the first element of a call to sma does not yet make a single recursive call to sma. Asking for the second element initiates the first recursive call, the third the second, and so on.

As a more involved example, let's build a reverse polish notation (RPN) calculator. RPN is a notation where operands precede their operator, so that (3 1 2 + *) in RPN corresponds to ((3 + 1) * 2), for example. To make our program easier to understand, we wish to separate parsing the input from performing the calculation:

-- file: rpn.hs
data Lex = Number Double Lex
         | Plus Lex
         | Times Lex
         | End

lexRPN :: String -> Lex
lexRPN = go . words
  where go ("*":rest) = Times (go rest)
        go ("+":rest) = Plus (go rest)
        go (num:rest) = Number (read num) (go rest)
        go         [] = End

The Lex datatype represents a formula in RPN and is similar to the standard list type. The lexRPN function reads a formula from string format into our own datatype. Let's add an evalRPN function, which evaluates a parsed RPN formula:

evalRPN :: Lex -> Double
evalRPN = go []
    go stack (Number num rest)
       = go (num : stack) rest
    go (o1:o2:stack) (Plus rest)
       = let r = o1 + o2 in r `seq` go (r : stack) rest
    go (o1:o2:stack) (Times rest)
       = let r = o1 * o2 in r `seq` go (r : stack) rest
    go [res] End
       = res

We can test this implementation to confirm that it works:

> :load rpn.hs
> evalRPN $ lexRPN "5 1 2 + 4 * *"

The RPN expression (5 1 2 + 4 * *) is (5 * ((1 + 2) * 4)) in infix, which is indeed equal to 60.

Note how the lexRPN function makes use of guarded recursion when producing the intermediate structure. It reads the input string incrementally and yields the structure an element at a time. The evaluation function evalRPN consumes the intermediate structure from left to right and is tail-recursive, so we keep the minimum amount of things in memory at all times.


Linked lists equipped with guarded recursion (and lazy I/O) actually provide a lightweight streaming facility – for more on streaming see Chapter 6, I/O and Streaming.

Accumulator parameters

In our examples so far, we have encountered a few functions that used some kind of accumulator. mySum2 had an Int that increased on every step. The go worker function in evalRPN passed on a stack (a linked list). The former had a space leak, because we didn't require the accumulator's value until at the end, at which point it had grown into a huge chain of pointers. The latter case was okay because the stack didn't grow in size indefinitely and the parameter was sufficiently strict in the sense that we didn't unnecessarily defer its evaluation. The fix we applied in mySum2' was to force the accumulator to WHNF at every iteration, even though the result was not strictly speaking required in that iteration.

The final lesson is that you should apply special care to your accumulator's strictness properties. If the accumulator must always be fully evaluated in order to continue to the next step, then you're automatically safe. But if there is a danger of an unnecessary chain of thunks being constructed due to a lazy accumulator, then adding a seq (or a bang pattern, see Chapter 2, Choose the Correct Data Structures) is more than just a good idea.

Inspecting time and space usage

It is often necessary to have numbers about the time and space usage of Haskell programs, either to have an indicator of how well the program performs or to identify unnecessary allocations. The GHC Runtime System flag -s enables printing allocation and garbage-collection statistics when the program finishes.

Let's try this with an example program, which naively calculates the covariance of two lists:

-- file: time_and_space.hs
import Data.List (foldl')

sum' = foldl' (+) 0

mean :: [Double] -> Double
mean v = sum' v / fromIntegral (length v)

covariance :: [Double] -> [Double] -> Double
covariance xs ys =
    sum' (zipWith (\x y -> (x - mean xs) * (y - mean ys)) xs ys)
    / fromIntegral (length xs)

main = do
    let xs = [1, 1.1 .. 500]
        ys = [2, 2.1 .. 501]
    print $ covariance xs ys

To enable passing options for the Runtime System, we must compile with -rtsopts:

$ ghc -rtsopts time_and_space.hs

For the time being, we ignore optimizations GHC could do for us and compile the program without any:

$ ./time_and_space +RTS -s
     802,142,688 bytes allocated in the heap
       1,215,656 bytes copied during GC
         339,056 bytes maximum residency (2 sample(s))
          88,104 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      1529 colls,     0 par    0.008s   0.007s     0.0000s    0.0004s
  Gen  1         2 colls,     0 par    0.001s   0.001s     0.0003s    0.0006s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    1.072s  (  1.073s elapsed)
  GC      time    0.008s  (  0.008s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    1.083s  (  1.082s elapsed)

  %GC     time       0.8%  (0.7% elapsed)

  Alloc rate    747,988,284 bytes per MUT second

  Productivity  99.2% of total user, 99.3% of total elapsed

On the first line of output from the Runtime System, we see that we allocated over 800 megabytes of memory. This is quite a lot for a program that only handles two lists of 5,000 double-precision values. There is definitely something in our code that could be made a lot more efficient. The output also contains other useful information, such as the total memory in use and, more importantly, some statistics on garbage collection. Our program spent only 0.8% of time in GC, meaning the program was doing productive things 99.2% of the time. So our performance problem lies in the calculations our program performs themselves.

If we look at the definition of covariance, we can spot the many invocations to mean in the argument lambda to zipWith: we actually calculate the means of both lists thousands of times over. So let's optimize that away:

covariance' :: [Double] -> [Double] -> Double
covariance' xs ys =
    let mean_xs = mean xs
        mean_ys = mean ys
    sum' (zipWith (\x y -> (x - mean_xs) * (y - mean_ys)) xs ys)
    / fromIntegral (length xs)

With covariance' we get down to three megabytes of allocation:

       3,263,680 bytes allocated in the heap
         915,024 bytes copied during GC
         339,032 bytes maximum residency (2 sample(s))
         112,936 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)

     Tot time (elapsed)  Avg pause  Max pause
  Gen  0         5 colls,     0 par    0.002s   0.002s     0.0003s    0.0005s
  Gen  1         2 colls,     0 par    0.001s   0.001s     0.0005s    0.0010s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    0.003s  (  0.003s elapsed)
  GC      time    0.003s  (  0.003s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    0.008s  (  0.006s elapsed)

  %GC     time      35.3%  (44.6% elapsed)

  Alloc rate    1,029,648,194 bytes per MUT second

  Productivity  63.1% of total user, 79.6% of total elapsed

That's over a 250-fold decrease in heap allocation! With the new version, we now have a considerable amount of time going to GC, about a third. This is about as good as we can get without enabling compiler optimizations; if we compile with -O, we would get to under two megabytes of heap allocation. And if you tried the original covariance performance with optimizations on, you should get exactly the same performance as with the newer hand-optimized variant. In fact, both versions compile to the same assembly code. This is a demonstration of the sophistication of GHC's optimizer, which we will take a deeper look at in a later chapter.


GHCi tip:

By setting +s in the interpreter, you can get time and space statistics of every evaluation, which can be handy for quick testing. Keep in mind though that no optimizations can be enabled for interpreted code, so compiled code can have very different performance characteristics. To test with optimizations, you should compile the module with optimizations and then import it into GHCi.

Increasing sharing and minimizing allocation

In the covariance example, we observed that we could improve code performance by explicitly sharing the result of an expensive calculation. Alternatively, enabling compiler optimizations would have had that same effect (with some extras). Most of the time, the optimizer does the right thing, but that is not always the case. Consider the following versions of rather a silly function:

-- file: time_and_space_2.hs

goGen        u = sum [1..u] + product [1..u]
goGenShared  u = let xs = [1..u] in sum xs + product xs

Try reasoning which of these functions executes faster. The first one builds two possibly very large lists and then immediately consumes them, independent of each other. The second one shares the list between sum and product.

The list-sharing function is about 25% slower than the list-rebuilding function. When we share the list, we need to keep the whole list in memory, whereas by just enumerating the elements we can discard the elements as we go. The following table confirms our reasoning. The list-sharing function has a larger maximum residency in system memory and does more GC:

U = 10000


Allocated heap

Copied during GC

Maximum residency

Total memory

Time in GC



87 MB

10 MB

0.7 MB

6 MB




88 MB

29 MB

0.9 MB

7 MB


Recall that, in the covariance example, the compiler automatically shared the values of sin x and cos x for us when we enabled optimizations. But in the previous example, we didn't get implicit sharing of the lists, even though they are thunks just like the results of sin x and cos x. So what magic enabled the GHC optimizer to choose the best sharing schema in both cases? The optimizer is non-trivial, and unfortunately, in practice it's not feasible to blindly rely on the optimizer always doing "the right thing." If you need to be sure that some sharing will take place, you should test it yourself.

Let's go back to our previous example of sum and product. Surely we could do better than spending 60% of the time in GC. The obvious improvement would be to make only one pass through one list and calculate both the sum and product of the elements simultaneously. The code is then a bit more involved:

goGenOnePass u = su + pu
    (su, pu) = foldl f (0,1) [1..u]
    f (s, p) i = let s' = s+i
                     p' = p*i
                     in s' `seq` p' `seq` (s', p')

Note the sequential use of seq in the definition of goGenOnePass. This version has a much better performance: only 10% in GC and about 50% faster than our first version:

U = 10000


Allocated heap

Copied during GC

Maximum residency

Total memory

Time in GC



86 MB

0.9 MB

0.05 MB

2 MB


The takeaway message is that once again algorithmic complexity matters more than technical optimizations. The one-pass version executed in half the time of the original two-pass version, as would be expected.


With the Bang Patterns (BangPatterns) language extension (available since GHC 6.6) the f binding could have been written more cleanly as f (!s, !p) i = (s + i, p * I) with very slightly degraded performance (0.7%). Annotating a binding with a bang means that evaluation of that binding will be bound to the evaluation of its surrounding tuple.

Compiler code optimizations

Haskell compilers perform aggressive optimization transformations on code. GHC optimization passes are highly sophisticated, so much that one rarely needs to worry about performance. We have seen some of the effects of ghc -O1 in our examples so far; in all cases,-O1increased performance relative to no optimizations, or -Onot, and in some optimizations passes were the difference between constant and exponential complexity.

Inlining and stream fusion

GHC performs aggressive inlining, which simply means rewriting a function call with the function's definition. Because all values in Haskell are referentially transparent, any function can be inlined within the scope of its definition. Especially in loops, inlining improves performance drastically. The GHC inliner does inlining within a module, but also to some extent cross-module and cross-package.

Some rules of thumb regarding inlining:

  • If a definition is only used once, and isn't exported, it will always be inlined.

  • When a function body is small, it will almost certainly be inlined no matter where or how often it is used.

  • Bigger functions may be inlined cross-module. To ensure that foo is always inlined, add a {-# INLINE foo #-} pragma near the definition of foo.

With these easy rules, you rarely need to worry about problems from bad inlining. For completeness's sake, there is also a NOINLINE pragma which ensures a definition is never inlined. NOINLINE is mostly used for hacks that would break referential transparency; see Chapter 4, The Devil's in the Detail.

Another powerful technique is stream fusion. Behind that fancy name is just a bunch of equations that are used to perform code rewriting (see Chapter 4, The Devil's in the Detail for the technicalities).

When working with lists, you may be tempted to rewrite code like this:

map f . map g . map h

Rather than to use intermediate lists:

map (f . g . h)

But there is no other reason than cosmetics to do this, because with optimizations GHC performs stream fusion, after which both expressions are time- and space-equivalent. Stream fusion is also performed for other structures than [], which we will take a look at in the next chapter.

Polymorphism performance

In principle, (ad hoc) polymorphic programs should carry a performance cost. To evaluate a polymorphic function, a dictionary must be passed in, which contains the specializations for the type specified on the caller side. However, almost always GHC can fill in the dictionary already at compile time, reducing the cost of polymorphism to zero. The big and obvious exception is code that uses reflection (Typeable). Also, some sufficiently complex polymorphic code might defer the dictionary passing to runtime, although, most of the time you can expect a zero cost.

Either way, it might ease your mind to have some notion of the cost of dictionary passing in runtime. Let's write a program with both general and specialized versions of the same function, compile it without optimizations, and compare the performance. Our program will just iterate a simple calculation with double-precision values:

-- file: class_performance.hs

class Some a where
    next :: a -> a -> a

instance Some Double where
    next a b = (a + b) / 2

goGeneral :: Some a => Int -> a -> a
goGeneral 0 x = x
goGeneral n x = goGeneral (n-1) (next x x)

goSpecialized :: Int -> Double -> Double
goSpecialized 0 x = x
goSpecialized n x = goSpecialized (n-1) (next' x x)

next' :: Double -> Double -> Double
next' a b = (a + b) / 2

I compiled and ran both versions separately with their own main entry points using the following command lines:

ghc class_performance.hs
time ./class_performance +RTS -s

On my machine, with 5,000,000 iterations, the general version does 1.09 GB of allocation and takes 3.4s. The specialized version does 1.01 GB of allocation and runs in about 3.2s. So the extra memory cost was about 8%, which is considerable. But by enabling optimizations, both versions will have exactly the same performance.

Partial functions

Here's a puzzle: given the following definition, which is faster, partial or total?

partialHead :: [a] -> a
partialHead (x:_) = x

totalHead :: [a] -> Maybe a
totalHead []    = Nothing
totalHead (x:_) = Just x

partial = print $ partialHead [1..]

total = print $ case totalHead [1..] of
                  Nothing -> 1
                    Just n -> n

The total variant uses a head that wraps its result inside a new data constructor, whereas the partial one results in a crash when a case is not matched, but in exchange doesn't perform any extra wrapping. Surely the partial variant must be faster, right? Well, almost always it is not. Both functions have exactly the same time and space requirements.

Partial functions are justified in some situations, but performance is rarely if ever one of them. In the example, the Maybe-wrapper of total will have a zero performance cost. The performance cost of the case analysis will be left, however, but a similar analysis is done in the partial variant too; the error case must be handled anyway, so that the program can exit gracefully. Of course, even GHC is not a silver bullet and you should always keep in mind that it might miss some optimizations. If you absolutely need to rely on certain optimizations to take place, you should test your program to confirm the correct results.


In this chapter, we learned how lazy evaluation works, what weak head normal form is, and how to control it by increasing strictness with different methods. We considered the peculiarities of right-fold, left-fold, and strict left-fold, and in which situations one fold strategy works better than another. We introduced the concept of CAF along with memoization techniques, utilized the worker/wrapper pattern, and used guarded recursion to write clean and efficient recursive programs.

We used the :sprint command in GHCi to inspect unevaluated thunks and the Runtime System option -s to inspect the heap usage and GC activity of compiled programs. We took a look at inlining, stream fusion, and the performance costs of partial functions and polymorphism.

In the next chapter, we will take a look at other basic data and control structures, such as different array structures and some monads. But first, we will learn about the performance semantics of Haskell data types and related common optimization techniques.

Left arrow icon Right arrow icon
Download code icon Download Code

Key benefits

  • Explore the benefits of lazy evaluation, compiler features, and tools and libraries designed for high performance
  • Write fast programs at extremely high levels of abstraction
  • Work through practical examples that will help you address the challenges of writing efficient code


Haskell, with its power to optimize the code and its high performance, is a natural candidate for high performance programming. It is especially well suited to stacking abstractions high with a relatively low performance cost. This book addresses the challenges of writing efficient code with lazy evaluation and techniques often used to optimize the performance of Haskell programs. We open with an in-depth look at the evaluation of Haskell expressions and discuss optimization and benchmarking. You will learn to use parallelism and we'll explore the concept of streaming. We’ll demonstrate the benefits of running multithreaded and concurrent applications. Next we’ll guide you through various profiling tools that will help you identify performance issues in your program. We’ll end our journey by looking at GPGPU, Cloud and Functional Reactive Programming in Haskell. At the very end there is a catalogue of robust library recommendations with code samples. By the end of the book, you will be able to boost the performance of any app and prepare it to stand up to real-world punishment.

What you will learn

[*] Program idiomatic Haskell that s also surprisingly efficient [*] Improve performance of your code with data parallelism, inlining, and strictness annotations [*] Profile your programs to identify space leaks and missed opportunities for optimization [*] Find out how to choose the most efficient data and control structures [*] Optimize the Glasgow Haskell Compiler and runtime system for specific programs [*] See how to smoothly drop to lower abstractions wherever necessary [*] Execute programming for the GPU with Accelerate [*] Implement programming to easily scale to the cloud with Cloud Haskell

Product Details

Country selected

Publication date : Sep 26, 2016
Length 408 pages
Edition : 1st Edition
Language : English
ISBN-13 : 9781786464217
Vendor :
The Glasgow Haskell Team
Category :

What do you get with eBook?

Product feature icon Instant access to your Digital eBook purchase
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
Buy Now

Product Details

Publication date : Sep 26, 2016
Length 408 pages
Edition : 1st Edition
Language : English
ISBN-13 : 9781786464217
Vendor :
The Glasgow Haskell Team
Category :

Table of Contents

21 Chapters
Haskell High Performance Programming Chevron down icon Chevron up icon
Credits Chevron down icon Chevron up icon
About the Author Chevron down icon Chevron up icon
About the Reviewer Chevron down icon Chevron up icon Chevron down icon Chevron up icon
Preface Chevron down icon Chevron up icon
1. Identifying Bottlenecks Chevron down icon Chevron up icon
2. Choosing the Correct Data Structures Chevron down icon Chevron up icon
3. Profile and Benchmark to Your Heart's Content Chevron down icon Chevron up icon
4. The Devil's in the Detail Chevron down icon Chevron up icon
5. Parallelize for Performance Chevron down icon Chevron up icon
6. I/O and Streaming Chevron down icon Chevron up icon
7. Concurrency and Performance Chevron down icon Chevron up icon
8. Tweaking the Compiler and Runtime System (GHC) Chevron down icon Chevron up icon
9. GHC Internals and Code Generation Chevron down icon Chevron up icon
10. Foreign Function Interface Chevron down icon Chevron up icon
11. Programming for the GPU with Accelerate Chevron down icon Chevron up icon
12. Scaling to the Cloud with Cloud Haskell Chevron down icon Chevron up icon
13. Functional Reactive Programming Chevron down icon Chevron up icon
14. Library Recommendations Chevron down icon Chevron up icon
Index Chevron down icon Chevron up icon

Customer reviews

Top Reviews
Rating distribution
Empty star icon Empty star icon Empty star icon Empty star icon Empty star icon 0
(0 Ratings)
5 star 0%
4 star 0%
3 star 0%
2 star 0%
1 star 0%
Filter icon Filter
Top Reviews

Filter reviews by

No reviews found
Get free access to Packt library with over 7500+ books and video courses for 7 days!
Start Free Trial


How do I buy and download an eBook? Chevron down icon Chevron up icon

Where there is an eBook version of a title available, you can buy it from the book details for that title. Add either the standalone eBook or the eBook and print book bundle to your shopping cart. Your eBook will show in your cart as a product on its own. After completing checkout and payment in the normal way, you will receive your receipt on the screen containing a link to a personalised PDF download file. This link will remain active for 30 days. You can download backup copies of the file by logging in to your account at any time.

If you already have Adobe reader installed, then clicking on the link will download and open the PDF file directly. If you don't, then save the PDF file on your machine and download the Reader to view it.

Please Note: Packt eBooks are non-returnable and non-refundable.

Packt eBook and Licensing When you buy an eBook from Packt Publishing, completing your purchase means you accept the terms of our licence agreement. Please read the full text of the agreement. In it we have tried to balance the need for the ebook to be usable for you the reader with our needs to protect the rights of us as Publishers and of our authors. In summary, the agreement says:

  • You may make copies of your eBook for your own use onto any machine
  • You may not pass copies of the eBook on to anyone else
How can I make a purchase on your website? Chevron down icon Chevron up icon

If you want to purchase a video course, eBook or Bundle (Print+eBook) please follow below steps:

  1. Register on our website using your email address and the password.
  2. Search for the title by name or ISBN using the search option.
  3. Select the title you want to purchase.
  4. Choose the format you wish to purchase the title in; if you order the Print Book, you get a free eBook copy of the same title. 
  5. Proceed with the checkout process (payment to be made using Credit Card, Debit Cart, or PayPal)
Where can I access support around an eBook? Chevron down icon Chevron up icon
  • If you experience a problem with using or installing Adobe Reader, the contact Adobe directly.
  • To view the errata for the book, see and view the pages for the title you have.
  • To view your account details or to download a new copy of the book go to
  • To contact us directly if a problem is not resolved, use
What eBook formats do Packt support? Chevron down icon Chevron up icon

Our eBooks are currently available in a variety of formats such as PDF and ePubs. In the future, this may well change with trends and development in technology, but please note that our PDFs are not Adobe eBook Reader format, which has greater restrictions on security.

You will need to use Adobe Reader v9 or later in order to read Packt's PDF eBooks.

What are the benefits of eBooks? Chevron down icon Chevron up icon
  • You can get the information you need immediately
  • You can easily take them with you on a laptop
  • You can download them an unlimited number of times
  • You can print them out
  • They are copy-paste enabled
  • They are searchable
  • There is no password protection
  • They are lower price than print
  • They save resources and space
What is an eBook? Chevron down icon Chevron up icon

Packt eBooks are a complete electronic version of the print edition, available in PDF and ePub formats. Every piece of content down to the page numbering is the same. Because we save the costs of printing and shipping the book to you, we are able to offer eBooks at a lower cost than print editions.

When you have purchased an eBook, simply login to your account and click on the link in Your Download Area. We recommend you saving the file to your hard drive before opening it.

For optimal viewing of our eBooks, we recommend you download and install the free Adobe Reader version 9.