We focus on CPU vector programming.
Can a purely functional language support vectorisation?
Enriching the question...
Can a purely functional language support vectorisation, preserving its original syntax and semantics while being as efficient as low level systems programming languages?
We will attempt to answer this throughout the rest of the presentation
We will avoid talking about the engineering aspects and try to focus on the research side of the problem.
Float supports FloatX4, a vector data type
Double supports DoubleX2, a vector data type
The user-level API
-- constructors
packFloatX4# :: (# Float#, Float#, Float#, Float# #) -> FloatX4#
broadcastFloatX4# :: Float# -> FloatX4#
-- destructors
unpackFloatX4# :: FloatX4# -> (# Float#, Float#, Float#, Float# #)
-- operations
plusFloatX4# :: FloatX4# -> FloatX4# -> FloatX4#
minusFloatX4# :: FloatX4# -> FloatX4# -> FloatX4#
timesFloatX4# :: FloatX4# -> FloatX4# -> FloatX4#
divideFloatX4# :: FloatX4# -> FloatX4# -> FloatX4#
Supported on SSE as well as AVX
Example: Vectorizing dot product
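A sketch (not the exact code from the talk) of what the hand-vectorised inner loop looks like when written directly against these primops; it needs MagicHash, UnboxedTuples and GHC's LLVM backend:

{-# LANGUAGE MagicHash, UnboxedTuples #-}
import GHC.Exts

-- Four elements at a time are packed into FloatX4#, multiplied and
-- accumulated; any remainder falls back to a scalar loop.
dotHand :: [Float] -> [Float] -> Float
dotHand = go (broadcastFloatX4# 0.0#)
  where
    go acc (F# x0 : F# x1 : F# x2 : F# x3 : xs)
           (F# y0 : F# y1 : F# y2 : F# y3 : ys) =
      go (plusFloatX4# acc
            (timesFloatX4# (packFloatX4# (# x0, x1, x2, x3 #))
                           (packFloatX4# (# y0, y1, y2, y3 #))))
         xs ys
    go acc xs ys =                                   -- scalar remainder
      case unpackFloatX4# acc of
        (# a0, a1, a2, a3 #) ->
          F# a0 + F# a1 + F# a2 + F# a3 + sum (zipWith (*) xs ys)

All of the packing, unpacking and remainder handling is exposed to the programmer.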
This spoils the declarative nature of Haskell!
Polymorphic SIMD functions for vector programming
Lift-vector provides two interfaces
Data.Primitive
class (Num v, Real (Elem v)) => SIMDVector v where
type Elem v
type ElemTuple v
nullVector :: v
vectorSize :: v -> Int
elementSize :: v -> Int
broadcastVector :: Elem v -> v
mapVector :: (Elem v -> Elem v) -> v -> v
zipVector :: (Elem v -> Elem v -> Elem v) -> v -> v -> v
foldVector :: (Elem v -> Elem v -> Elem v) -> v -> Elem v
sumVector :: v -> Elem v
packVector :: ElemTuple v -> v
unpackVector :: v -> ElemTuple v
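A small usage sketch of the class above, assuming the FloatX4 instance (the name lanesDemo is illustrative):

-- pack four floats, add one to every lane, then sum the lanes
lanesDemo :: Float
lanesDemo = sumVector (mapVector (+ 1) (packVector (1, 2, 3, 4) :: FloatX4))
-- lanesDemo == 14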
Data.Operations
class (Num a, Num b) => ArithVector t a b where
  -- | The folding function should be commutative.
  fold   :: (a -> a -> a) -> (b -> b -> b) -> b -> t b -> b
  zipVec :: (a -> a -> a) -> (b -> b -> b) -> t b -> t b -> t b
  fmap   :: (a -> a) -> (b -> b) -> t b -> t b
  -- | Work-efficient Blelloch scan
  scanb  :: (a -> a -> a) -> (b -> b -> b) -> b -> t b -> t b
newtype VecList a = VecList [a]
A wrapper over plain lists
Wrappers around unboxed vectors.
import qualified Data.Vector.Unboxed as U
data VecArray sh a = VecArray !sh (U.Vector a)
What does dot product look like with the parallel operators?
Vector Lists
dotVec :: [Float] -> [Float] -> Float
dotVec xs ys =
  fold (\x y -> x + y :: FloatX4) (\x y -> x + y :: Float) 0 $
    zipVec (\x y -> x * y :: FloatX4)
           (\x y -> x * y :: Float)
           (toVecList xs)
           (toVecList ys)
Vector Arrays
dotVec' :: [Float] -> [Float] -> Float
dotVec' xs ys =
  let l1 = length xs
      l2 = length ys
  in fold (\x y -> x + y :: FloatX4) (\x y -> x + y :: Float) 0 $
       zipVec (\x y -> x * y :: FloatX4)
              (\x y -> x * y :: Float)
              (toVecArray (Z :. l1) xs)
              (toVecArray (Z :. l2) ys)
How are the Data.Operations operators implemented?
scan, or prefix sum, looks like an inherently sequential operation
> scanl (+) 0 [1, 2, 3, 4, 5, 6, 7, 8]
[0,1,3,6,10,15,21,28,36]
Lift-vector supports a work-efficient parallel scan
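A minimal sketch of the idea behind scanb (not lift-vector's implementation), on plain lists, assuming a power-of-two length and an associative operator; it is an exclusive scan, so it drops the final total that scanl keeps:

blelloch :: (a -> a -> a) -> a -> [a] -> [a]
blelloch _ _ []  = []
blelloch _ z [_] = [z]
blelloch f z xs  =
  let pairs = chunk2 xs                      -- upsweep: combine neighbours
      sums  = [ f a b | (a, b) <- pairs ]
      pre   = blelloch f z sums              -- recursively scan the partial sums
  in concat [ [s, f s a] | (s, (a, _)) <- zip pre pairs ]   -- downsweep
  where
    chunk2 (a : b : rest) = (a, b) : chunk2 rest
    chunk2 _              = []

-- blelloch (+) 0 [1,2,3,4,5,6,7,8] == [0,1,3,6,10,15,21,28]

Each level does O(n) work, so the whole scan stays O(n) while every level exposes independent, SIMD-friendly combines.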
Benchmarks
Good benchmarks
Heavy reads, few writes
Pearson Correlation Coefficient (see the sketch after this list)
Bad benchmarks
Heavy writes
e.g. Matrix Multiplication
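A sketch, not taken from the talk, of why a read-heavy kernel like Pearson's r maps well onto the fold/zipVec operators; it reuses dotVec from the earlier slide:

pearson :: [Float] -> [Float] -> Float
pearson xs ys =
  let n   = fromIntegral (length xs)
      sx  = total xs
      sy  = total ys
      sxy = dotVec xs ys               -- sums of products, vectorised as before
      sxx = dotVec xs xs
      syy = dotVec ys ys
  in (n * sxy - sx * sy)
       / (sqrt (n * sxx - sx * sx) * sqrt (n * syy - sy * sy))
  where
    total zs = fold (\a b -> a + b :: FloatX4) (\a b -> a + b :: Float) 0
                    (toVecList zs)

Everything here is a reduction over the inputs; nothing is written back element by element, which is why this kind of kernel benchmarks well.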
import qualified Data.Operations as Op
foo :: Float -> Float
foo x = x*x + 1
foo' :: [Float] -> [Float]
foo' xs =
Op.fromVecList $
Op.fmap (\x -> x + 1 :: Float) (\x -> x + 1 :: FloatX4) $
Op.fmap (\x -> x * x :: Float) (\x -> x * x :: FloatX4) $
Op.toVecList xs
Any call to map foo should be replaced by foo'
Now,
bar :: [Float] -> [Float]
bar xs = map foo xs
bar :: [Float] -> [Float]
bar = foo'
bar' :: [[Float]] -> [[Float]]
bar' xss = ???
Flatten all levels of nesting
concat :: (ArithVector t) => t (t a) -> t a
unconcat :: (ArithVector t) => t (t a) -> t b -> t (t b)
bar' :: [[Float]] -> [[Float]]
bar' xss = unconcat xss (bar (concat xss))
This is impossible to define manually in a library
We need the compiler to understand nested structures
Records are desugared to Product Types.
It is very common to have arrays of tuples.
VecArray (Float, Float, Float)
or VecList (Float, Float, Float)
Current hardware supports load and store vector instructions only for contiguous data
*The Xeon Phi instruction set supports non-contiguous (gather/scatter) loads and stores
In general a structure-of-arrays (SoA) layout is preferable to an array-of-structures (AoS) layout
This was not explored in the existing design of lift-vector
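An illustrative sketch of the two layouts for an array of 3-float records, reusing the VecList/VecArray types from the earlier slides (these particular point types are not part of lift-vector):

-- AoS: the element is the whole record, so the fields of consecutive
-- elements are interleaved from the point of view of a SIMD load.
type PointsAoS = VecList (Float, Float, Float)

-- SoA: one contiguous array per field, so four consecutive xs (or ys, or zs)
-- can be packed into a FloatX4 with a plain contiguous load.
data PointsSoA = PointsSoA
  { pointXs :: VecArray (Z :. Int) Float
  , pointYs :: VecArray (Z :. Int) Float
  , pointZs :: VecArray (Z :. Int) Float
  }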
What are we looking for?
Trade-off between:
Programming model
Whole-function vectorization: control flow to data flow (Karrenberg, 2017)
-- all functions have parallel semantics unless specified otherwise
quicksort :: [Float] -> [Float]
quicksort [] = []
quicksort xs =
  let pivot = srandom xs                      -- assume srandom is pure & sequential
      small = select (\p -> p <= pivot) xs
      large = select (\p -> p >  pivot) xs
  in quicksort small ++ [pivot] ++ quicksort large
-- the two recursive calls run on separate gangs (gang1, gang2)
-- irregular data sizes, but NDP can be made work-efficient
select simply uses relational operators: the CMPPS vector instruction in SSE, or VCMPPS in AVX
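For reference, a scalar definition of the select used in the quicksort sketch (the name and type are assumptions, not lift-vector API); a vectorising backend would evaluate the predicate four lanes at a time with CMPPS/VCMPPS and keep the elements whose mask bit is set:

select :: (a -> Bool) -> [a] -> [a]
select = filter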
Merge sort is easier to parallelize
[diagram: recursive halving, two blocks of 16 elems split into 8 8 8 8, then 4 4 4 4 4 4 4 4]
mergesort_spmd [x] = [x]
mergesort_spmd xs  =
  let mid      = smiddle xs            -- middle of the list
      (x1, x2) = split mid xs
  in merge (mergesort_spmd x1) (mergesort_spmd x2)
merge :: [a] -> [a] -> [a]   -- a vectorised SIMD merge (union) of the two sorted halves
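A scalar reference for merge (the SIMD version would replace this with a vectorised merge of four-element blocks; that is an assumption here, not lift-vector code):

merge :: Ord a => [a] -> [a] -> [a]
merge [] ys = ys
merge xs [] = xs
merge (x:xs) (y:ys)
  | x <= y    = x : merge xs (y:ys)
  | otherwise = y : merge (x:xs) ys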
xs = [9,7,8,1]
[diagram: xs is forked to four SPMD lanes: (fork xs) (fork xs) (fork xs) (fork xs)]
foo = pack (9,7,8,1)
bar = broadcast pivot
(a, b) = partition (foo < bar)   -- one vector compare (CMPPS/VCMPPS) across the four lanes
An existing model on GPUs - SPMD (OpenCL)!
Intel ispc compiler - SPMD on SIMD
So how do we implement our language?
Haskell has shown [1] that we do not have to give up expressivity for performance
There is always a sweet spot!
[1] Mainland, Leshchinskiy and Peyton Jones. Exploiting Vector Instructions with Generalized Stream Fusion. POPL 2013.
THANK YOU FOR LISTENING
EXTRA SLIDES
Convert IR to SSA form
Cytron's algorithm
Even on a hand-made graph, the SLP-vectorized version is slower than the sequential version of the program
Throttle the vectorization by cutting the graph
Throttling is NP-hard
Can we auto-vectorise faster?