Moore's Law is over :(
Cray-1: one of the first supercomputers to implement the vector processor design
Can a purely functional language support vectorisation?
The question is incomplete...
Can a purely functional language support vectorisation, preserving its original syntax and semantics while being as efficient as low level systems programming languages?
Parsing the previous statement:
We answer it in two steps.
LLVM backend for NESL
NESL: an ML dialect by Guy Blelloch (CMU) that supports nested data parallelism
A state-of-the-art optimising compiler
Where have we made code changes?
Fundamental code changes:
compiler/nativeGen/Format.hs

data Format = ...
            | VecFormat !Length !ScalarFormat !Width

type Length = Int
data Width  = ... | W128 | W256 | W512
compiler/nativeGen/X86/Instr.hs

data Instr = VBROADCAST Format AddrMode Reg
           | VEXTRACT   Format Operand Reg Operand
           ...
-- plus a number of other instructions: vector moves, shuffles,
-- and arithmetic/logical x86 vector instructions
compiler/cmm/CmmMachOp.hs

data MachOp = MO_VF_Broadcast Length Width
            | MO_VF_Insert ...
70% of the code changes are in compiler/nativeGen/X86/CodeGen.hs
Lots of seg-faults and code reviews later: Float and Double SIMD support!
The user-level API
-- constructors
packFloatX4# :: (# Float#, Float#, Float#, Float# #) -> FloatX4#
broadcastFloatX4# :: Float# -> FloatX4#
-- destructors
unpackFloatX4# :: FloatX4# -> (# Float#, Float#, Float#, Float# #)
-- operations
plusFloatX4# :: FloatX4# -> FloatX4# -> FloatX4#
minusFloatX4# :: FloatX4# -> FloatX4# -> FloatX4#
timesFloatX4# :: FloatX4# -> FloatX4# -> FloatX4#
divideFloatX4# :: FloatX4# -> FloatX4# -> FloatX4#
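A minimal usage sketch of these primops (the helper addFour is illustrative): it assumes the primops are in scope from GHC.Exts and that the module is compiled with SIMD support (e.g. the LLVM backend).

{-# LANGUAGE MagicHash, UnboxedTuples #-}
import GHC.Exts

-- Add two 4-tuples of Floats lane-wise: pack into FloatX4# values,
-- do one SIMD addition, then unpack the result.
addFour :: (Float, Float, Float, Float) -> (Float, Float, Float, Float)
        -> (Float, Float, Float, Float)
addFour (F# a, F# b, F# c, F# d) (F# e, F# f, F# g, F# h) =
  case unpackFloatX4# (plusFloatX4# (packFloatX4# (# a, b, c, d #))
                                    (packFloatX4# (# e, f, g, h #))) of
    (# x, y, z, w #) -> (F# x, F# y, F# z, F# w)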
SSE as well as AVX support
Previously we had
data Int8 = I8# Int#
data Int16 = I16# Int#
data Int32 = I32# Int#
data Int64 = I64# Int#
Now GHC has:
Int8#, Word8#, Int16#, Word16#, Int32#, Word32#, Int64#, Word64#
Example: Vectorizing dot product
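A sketch of what the hand-vectorised version looks like written directly against the primops (the fixed eight-element shape and the name dot8 are illustrative assumptions, not the original slide code):

{-# LANGUAGE MagicHash, UnboxedTuples #-}
import GHC.Exts

-- Dot product of two 8-element vectors, each held as two FloatX4#
-- halves: lane-wise multiplies and an add, then a horizontal reduction.
dot8 :: FloatX4# -> FloatX4# -> FloatX4# -> FloatX4# -> Float
dot8 x0 x1 y0 y1 =
  case unpackFloatX4# (plusFloatX4# (timesFloatX4# x0 y0)
                                    (timesFloatX4# x1 y1)) of
    (# a, b, c, d #) -> F# a + F# b + F# c + F# d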
This spoils the declarative nature of Haskell!
Imperative languages have powerful vector programming libraries
Polymorphic SIMD functions for vector programming
Lift-vector provides two interfaces
Data.Primitive
class (Num v, Real (Elem v)) => SIMDVector v where
    type Elem v
    type ElemTuple v
    nullVector      :: v
    vectorSize      :: v -> Int
    elementSize     :: v -> Int
    broadcastVector :: Elem v -> v
    mapVector       :: (Elem v -> Elem v) -> v -> v
    zipVector       :: (Elem v -> Elem v -> Elem v) -> v -> v -> v
    foldVector      :: (Elem v -> Elem v -> Elem v) -> v -> Elem v
    sumVector       :: v -> Elem v
    packVector      :: ElemTuple v -> v
    unpackVector    :: v -> ElemTuple v
From github/ajscholl/primitive
What does dot product look like with this?
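A sketch, assuming the input already comes chunked into a list of SIMD values (dotVec is an illustrative name; (+) and (*) come from the Num v superclass):

import Data.List (foldl')

-- Lane-wise multiply-accumulate, reducing the accumulator's lanes
-- with sumVector only once, at the very end.
dotVec :: SIMDVector v => [v] -> [v] -> Elem v
dotVec xs ys = sumVector (foldl' (+) nullVector (zipWith (*) xs ys))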
Data.Operations
class (Num a, Num b) => ArithVector t a b where
    -- | The folding function should be commutative.
    fold   :: (a -> a -> a) -> (b -> b -> b) -> b -> t b -> b
    zipVec :: (a -> a -> a) -> (b -> b -> b) -> t b -> t b -> t b
    fmap   :: (a -> a) -> (b -> b) -> t b -> t b
    -- | Work-efficient Blelloch scan
    scanb  :: (a -> a -> a) -> (b -> b -> b) -> b -> t b -> t b
newtype VecList a = VecList [a]
A wrapper over plain lists
Wrappers around unboxed vectors.
import qualified Data.Vector.Unboxed as U
data VecArray sh a = VecArray !sh (U.Vector a)
data Z = Z
data tail :. head = !tail :. !head
type DIM0 = Z
type DIM1 = DIM0 :. Int
type DIM2 = DIM1 :. Int
type DIM3 = DIM2 :. Int
class Eq sh => Shape sh where
    toIndex :: sh -> sh -> Int
    ...

instance Shape Z where
    toIndex _ _ = 0

instance Shape sh => Shape (sh :. Int) where
    toIndex (sh1 :. sh2) (sh1' :. sh2')
        = toIndex sh1 sh1' * sh2 + sh2'
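A quick GHCi check of the row-major arithmetic (extent 4×5, index (2,3)):

> toIndex (Z :. 4 :. 5) (Z :. 2 :. 3)
13  -- = (0 * 4 + 2) * 5 + 3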
What does dot product look like with the parallel operators?
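A sketch, assuming an ArithVector instance for one-dimensional VecArrays of Float is in scope: per the class above, the first function argument handles the SIMD-sized chunks and the second the scalar remainder, and because (+) and (*) are overloaded, the same operators serve both roles.

dotp :: VecArray DIM1 Float -> VecArray DIM1 Float -> Float
dotp xs ys = fold (+) (+) 0 (zipVec (*) (*) xs ys)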
How are the Data.Operations operators implemented?
Scan, or prefix sum, looks like an inherently sequential operation:
> scanl (+) 0 [1, 2, 3, 4, 5, 6, 7, 8]
[0,1,3,6,10,15,21,28,36]
The naive parallel scan (sketched below) does O(n log₂ n) work.
A sequential scan does only O(n) operations.
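A list-based sketch of that naive parallel scan (often called the Hillis-Steele scan; the name hillisSteele and the list representation are illustrative): log₂ n rounds, each updating all n elements independently, hence O(n log₂ n) work.

-- Inclusive scan. Each round doubles the offset d and combines
-- element i with element i - d; every update within a round is
-- independent, so a round is fully parallel.
hillisSteele :: (a -> a -> a) -> [a] -> [a]
hillisSteele f xs = go 1 xs
  where
    n = length xs
    go d ys
      | d >= n    = ys
      | otherwise = go (2 * d)
          [ if i >= d then f (ys !! (i - d)) y else y
          | (i, y) <- zip [0 ..] ys ]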
How long it would take if it were fully sequential (work)
How long it would take if it were as parallel as possible (span)
"Programming parallel algorithms" - Guy Blelloch, 1996
Can we have a work-efficient parallel scan?
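Yes: the Blelloch scan does O(n) work in O(log n) span. A list-based sketch of its exclusive variant for power-of-two lengths (illustrative, not the lift-vector implementation): the pairwise combine is the up-sweep, the recursion plus zipWith is the down-sweep.

scanBlelloch :: (a -> a -> a) -> a -> [a] -> [a]
scanBlelloch _ z [_] = [z]
scanBlelloch f z xs  = concat (zipWith expand prefixes pairs)
  where
    pairs    = chunk2 xs              -- up-sweep: pair up neighbours...
    sums     = map (uncurry f) pairs  -- ...and combine each pair
    prefixes = scanBlelloch f z sums  -- recurse on half the data
    expand s (a, _) = [s, f s a]      -- down-sweep: redistribute prefixes
    chunk2 (a : b : rest) = (a, b) : chunk2 rest
    chunk2 _              = []

> scanBlelloch (+) 0 [1, 2, 3, 4, 5, 6, 7, 8]
[0,1,3,6,10,15,21,28]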
So are scans really that useful?
"Prefix sums and their applications." - Guy Blelloch, 1990
Q: Are folds, maps, zips, and scans sufficient combinators?
A: Maybe
Examples
Good benchmarks
Read-heavy workloads with few writes
Pearson correlation coefficient
Bad benchmarks
Write-heavy workloads
Writes now perform better via thawing and freezing inside the ST monad (sketched below).
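A sketch of that thaw/mutate/freeze pattern with Data.Vector.Unboxed (bulkUpdate is an illustrative name):

import Control.Monad.ST (runST)
import qualified Data.Vector.Unboxed as U
import qualified Data.Vector.Unboxed.Mutable as M

-- Double every element: one O(n) copy in, cheap in-place writes,
-- one O(n) copy out, instead of a fresh immutable vector per update.
bulkUpdate :: U.Vector Float -> U.Vector Float
bulkUpdate v = runST $ do
  mv <- U.thaw v
  let go i
        | i >= M.length mv = pure ()
        | otherwise        = do
            x <- M.read mv i
            M.write mv i (2 * x)
            go (i + 1)
  go 0
  U.freeze mv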
Almost any paper by Guy Blelloch
From the Haskell side:
"Automatic SIMD vectorization for Haskell" - Petersen, Orchard, Glew
"Exploiting vector instructions with generalized stream fusion" - Mainland, Leschinskiy, Jones
If you are lazy, read my 5k-word summary:
Haskell is already a great parallel language!
Haskell could be the state of the art in vector programming!
Special Thanks to
Looking for PhD advisors in the field of
high-performance functional programming
(^ not an oxymoron!)
THANK YOU FOR LISTENING
Contact me if this interests you:
Email : asiamgenius@gmail.com
Twitter : @catamorphic