parallel "Folding" in Haskell
I have a function with type below:
union :: a -> a -> a
And a
has additivity property. So we can regard union
as a version of (+)
Say, we have [a]
, and want to perform a parallel "folding"
, for non-parallel foldling we can do only:
foldl1' union [a]
But how to perform it in parallel?
I can demonstrate problem on Num
values and (+)
function.
For example, we have a list [1,2,3,4,5,6]
and (+)
In parallel we should split
[1,2,3] (+) [4,5,6]
[1,2] (+) [3] (+) [4,5] (+) [6]
([1] (+) [2]) (+) ([3] (+) [4]) (+) ([5] (+) [6])
then each (+)
operation we want to perform in parallel, and combine to answer
[3] (+) [7] (+) [11] = 21
Note, that we split list, or perform operations in any order, because of a
additivity.
Is there any ways to do that using any standard library?
You need to generalize your union
to any associative binary operator ⊕ such that (a ⊕ b) ⊕ c == a ⊕ (b ⊕ c). If at the same time you even have a unit element that is neutral with respect to ⊕, you have a monoid.
The important aspect of associativity is that you can arbitrarily group chunks of consecutive elements in a list, and ⊕ them in any order, since a ⊕ (b ⊕ (c ⊕ d)) == (a ⊕ b) ⊕ (c ⊕ d) - each bracket can be computed in parallel; then you'd need to "reduce" the "sums" of all brackets, and you've got your map-reduce sorted.
In order for this parallellisation to make sense, you need the chunking operation to be faster than ⊕ - otherwise, doing ⊕ sequentially is better than chunking. One such case is when you have a random access "list" - say, an array. Data.Array.Repa has plenty of parallellized folding functions.
If you are thinking of practicising to implement one yourself, you need to pick a good complex function ⊕ such that the benefit will show.
For example:
import Control.Parallel
import Data.List
pfold :: (Num a, Enum a) => (a -> a -> a) -> [a] -> a
pfold _ [x] = x
pfold mappend xs = (ys `par` zs) `pseq` (ys `mappend` zs) where
len = length xs
(ys', zs') = splitAt (len `div` 2) xs
ys = pfold mappend ys'
zs = pfold mappend zs'
main = print $ pfold (+) [ foldl' (*) 1 [1..x] | x <- [1..5000] ]
-- need a more complicated computation than (+) of numbers
-- so we produce a list of products of many numbers
Here I deliberately use a associative operation, which is called mappend
only locally, to show it can work for a weaker notion than a monoid - only associativity matters for parallelism; since parallelism makes sense only for non-empty lists anyway, no need for mempty
.
ghc -O2 -threaded a.hs
a +RTS -N1 -s
Gives 8.78 seconds total run time, whereas
a +RTS -N2 -s
Gives 5.89 seconds total run time on my dual core laptop. Obviously, no point trying more than -N2 on this machine.
What you've described is essentially a monoid. In GHCI:
Prelude> :m + Data.Monoid
Prelude Data.Monoid> :info Monoid
class Monoid a where
mempty :: a
mappend :: a -> a -> a
mconcat :: [a] -> a
As you can see a monoid has three associated functions:
-
The
mempty
function is sort of like the identity function of the monoid. For example aNum
can behave as a monoid apropos two operations: sum and product. For a summempty
is defined as0
. For a productmempty
is defined as1
.mempty `mappend` a = a a `mappend` mempty = a
The
mappend
function is similar to yourunion
function. For exampe for a sum ofNum
smappend
is defined as(+)
and for a product ofNum
smappend
is defined as(*)
.-
The
mconcat
function is similar to a fold. However because of the properties of a monoid it doesn't matter whether we fold from the left, fold from the right or fold from an arbitrary position. This is becausemappend
is supposed to be associative:(a `mappend` b) `mappend` c = a `mappend` (b `mappend` c)
Note however that Haskell doesn't enforce the monoid laws. Hence if you make a type an instance of the Monoid
typeclass then you're responsible to ensure that it satisfies the monoid laws.
In your case it's difficult to understand how union
behaves from its type signature: a -> a -> a
. Surely you can't make a type variable an instance of a typeclass. That's not allowed. You need to be more specific. What does union
actually do?
To give you an example of how to make a type an instance of the monoid typeclass:
newtype Sum a = Sum { getSum :: a }
instance Num a => Monoid (Sum a) where
mempty = 0
mappend = (+)
That's it. We don't need to define the mconcat
function because that has a default implementation that depends upon mempty
and mappend
. Hence when we define mempty
and mappend
we get mconcat
for free.
Now you can use Sum
as follows:
getSum . mconcat $ map Sum [1..6]
This is what's happening:
- You're mapping the
Sum
constructor over[1..6]
to produce[Sum 1, Sum 2, Sum 3, Sum 4, Sum 5, Sum 6]
. - You give the resulting list to
mconcat
which folds it toSum 21
. - You use
getSum
to extract theNum
fromSum 21
.
Note however that the default implementation of mconcat
is foldr mappend mempty
(i.e. it's a right fold). For most cases the default implementation is sufficient. However in your case you might want to override the default implementation:
foldParallel :: Monoid a => [a] -> a
foldParallel [] = mempty
foldParallel [a] = a
foldParallel xs = foldParallel left `mappend` foldParallel right
where size = length xs
index = (size + size `mod` 2) `div` 2
(left, right) = splitAt index xs
Now we can create a new instance of Monoid
as follows:
data Something a = Something { getSomething :: a }
instance Monoid (Something a) where
mempty = unionEmpty
mappend = union
mconcat = foldParallel
We use it as follows:
getSomething . mconcat $ map Something [1..6]
Note that I defined mempty
as unionEmpty
. I don't know what type of data the union
function acts on. Hence I don't know what mempty
should be defined as. Thus I'm simply calling it unionEmpty
. Define it as you see fit.