Moore's Law is over :(

Cray 1 - the first supercomputer to implement the vector processor design

- Intel
- SSE (Intel P6, 1999)
- SSE2 (Intel Pentium M, 2003)
- SSE3 (NetBurst - Pentium 4, 2004)
- SSE4.1 (Intel Core, 2006)
- SSE4.2 (Nehalem, 2008)
- AVX (SandyBridge, 2011)
- AVX2 (Haswell, 2013)
- AVX512 (Canon Lake, 2018)

- AMD

- GCC
- LLVM Toolchain
- Dyalog APL

Can a purely functional language support vectorisation?

The question is incomplete...

Can a purely functional language support vectorisation, *preserving its original syntax and semantics while being as efficient as low level systems programming languages*?

Parsing the previous statement :

- High Performance
- Declarative Syntax

We answer that in 2 steps.

- a
*low level*code generator - a
*high level*vector programming library

LLVM backend for NESL

NESL : An ML Dialect by Guy Blelloch (CMU) which supports Nested Data Parallelism

- Compiler written in Common Lisp
- Flattens almost every level of parallelism
- Small prototype language without a great I/O manager
- Hard to debug

State of the art optimising compiler

Where have we made code changes?

- Parser
- Add some new runtime representations
- STG -> Cmm bridge
- Cmm (multiple layers)
- Native code generator (x86 Assembly)

Fundamental Code changes:

compiler/nativeGen/Format.hs

```
data Format = ...
| VecFormat !Length !ScalarFormat !Width
type Length = Int
data Width = ... | W128 | W256 | W512
```

compiler/nativeGen/x86/Instr.hs

```
data Instr = VBROADCAST Format AddrMode Reg
| VEXTRACT Format Operand Reg Operand
...
-- number of other instructions like vector move,
-- shuffle, arithmetic, logical x86 instructions
```

compiler/cmm/CmmMachOp.hs

```
data MachOp = MO_VF_Broadcast Length Width
| MO_VF_Insert ....
```

70% code changes in compiler/nativeGen/X86/CodeGen.hs

Lots of seg-faults and code reviews later

*low level*code generator changes`Float`

and`Double`

SIMD support- redefine integers to support SIMD

The user-level API

```
-- constructors
packFloatX4# :: (# Float#, Float#, Float#, Float# #) -> FloatX4#
broadcastFloatX4# :: Float# -> FloatX4#
-- destructors
unpackFloatX4# :: FloatX4# (# Float#, Float#, Float#, Float# #)
-- operations
plusFloatX4# :: FloatX4# -> FloatX4# -> FloatX4#
minusFloatX4# :: FloatX4# -> FloatX4# -> FloatX4#
timesFloatX4# :: FloatX4# -> FloatX4# -> FloatX4#
divideFloatX4# :: FloatX4# -> FloatX4# -> FloatX4#
```

SSE as well as AVX support

Previously we had

```
data Int8 = I8# Int#
data Int16 = I16# Int#
data Int32 = I32# Int#
data Int64 = I64# Int#
```

Now GHC has :

`Int8#`

, `Word8#`

, `Int16#`

, `Word16#`

, `Int32#`

, `Word32#`

, `Int64#`

, `Word64#`

Example : Vectorizing dot product

This spoils the declarative nature of Haskell!

Imperative languages have powerful vector programming libraries

- Intel Math Kernel Library (MKL)
- Vc (https://github.com/VcDevel/Vc)
- AMD Libm

Polymorphic SIMD functions for vector programming

Lift-vector provides two interfaces

- Data.Primitive
- Data.Operations

Data.Primitive

```
class (Num v, Real (Elem v)) => SIMDVector v where
type Elem v
type ElemTuple v
nullVector :: v
vectorSize :: v -> Int
elementSize :: v -> Int
broadcastVector :: Elem v -> v
mapVector :: (Elem v -> Elem v) -> v -> v
zipVector :: (Elem v -> Elem v -> Elem v) -> v -> v -> v
foldVector :: (Elem v -> Elem v -> Elem v) -> v -> Elem v
sumVector :: v -> Elem v
packVector :: ElemTuple v -> v
unpackVector :: v -> ElemTuple v
```

From github/ajscholl/primitive

What does dot product look like with this?

Data.Operations

```
class (Num a, Num b) =>
ArithVector t a b
where
-- | The folding function should be commutative.
fold :: (a -> a -> a) -> (b -> b -> b) -> b -> t b -> b
zipVec ::
(a -> a -> a) -> (b -> b -> b) -> t b -> t b -> t b
fmap :: (a -> a) -> (b -> b) -> t b -> t b
-- | Work efficient Blelloch Scan
scanb :: (a -> a -> a) -> (b -> b -> b) -> b -> t b -> t b
```

- Vector Lists
- Vector Arrays (Shape polymorphic)

`newtype VecList a = VecList [a]`

A wrapper over plain lists

Wrappers around unboxed vectors.

```
import qualified Data.Vector.Unboxed as U
data VecArray sh a = VecArray !sh (U.Vector a)
```

```
data Z = Z
data tail :. head = !tail :. !head
type DIM0 = Z
type DIM1 = DIM0 :. Int
type DIM2 = DIM1 :. Int
type DIM3 = DIM2 :. Int
class Eq sh => Shape sh where
toIndex :: sh -> sh -> Int
...
instance Shape Z where
toIndex _ _ = 0
instance Shape sh => Shape (sh :. Int) where
toIndex (sh1 :. sh2) (sh1' :. sh2')
= toIndex sh1 sh1' * sh2 + sh2'
```

What does dot product look like with the parallel operators?

How are the `Data.Operations`

operators implemented?

`scan`

or prefix sum looks like an inherently sequential operation

```
> scanl (+) 0 [1, 2, 3, 4, 5, 6, 7, 8]
> [0,1,3,6,10,15,21,28,36]
```

This is parallel but does O(n log_{2} n) work.

Sequential scan does O(n) operations.

How long it would take if it were fully sequential (work)

How long it would take if it were as parallel as possible (span)

"Programming parallel algorithms" - Guy Blelloch, 1996

Can we have a work efficient parallel scan?

So are scans really that useful?

- Lexical string comparison
- Adding multi precison numbers
- Polynomial evaluation
- Radix Sort
- Implement parallel quicksort
- ...

"Prefix sums and their applications." - Guy Blelloch, 1990

Q: So are `folds`

, `maps`

, `zips`

and `scans`

sufficient enough combinators?

A: `Maybe`

Examples

Good benchmarks

Heavy reads and less write

Pearson Correlation Coefficient

Bad benchmarks

Heavy writes

Writes perform better now via `thawing`

and `freezing`

inside the ST monad.

- Automatic Vectorisation
- Loop vectorisation
- Throttled SLP algorithm

- An ideal cost model for vectorisation
- Mutable arrays supporting in place writes (inspiration from Futhark)
- More vectorised data structures
- Improving the API of lift-vector
- Alternate IR like SSA inside GHC

Almost any paper by Guy Blelloch

From the Haskell side:

"Automatic SIMD vectorization for Haskell" - Petersen, Orchard, Glew

"Exploiting vector instructions with generalized stream fusion" - Mainland, Leschinskiy, Jones

If you are lazy read my 5k words summary:

Haskell is already a great parallel language!

Haskell could be the state of art in vector programming!

Special Thanks to

- Carter Tazio Schonwald (my GSoC mentor)
- Graham Hutton (my thesis supervisor)
- Ben Gamari (my second GSoC mentor)
- Google for funding Haskell.org as part of GSoC 2018

Looking for PhD Advisors in the field of

High performance functional programming

(^ Not an oxymoron!)

THANK YOU FOR LISTENING

Contact me if this interests you:

Email : asiamgenius@gmail.com

Twitter : @catamorphic