Why does adding concurrency slow down this golang code?

Solution 1:

The issue seems to come from your use of rand.Float64(), which uses a shared global object with a Mutex lock on it.

Instead, if for each CPU you create a separate rand.New(), pass it through to the interactions(), and use it to create the Float64(), there's a massive improvement.


Update to show the changes to the new example code in the question that now uses rand.New()

The test() function was modified to either use a given channel, or return the result.

func test(n int, c chan []int) []int {
    source := rand.NewSource(time.Now().UnixNano())
    generator := rand.New(source)
    simulations := make([]int, n)
    for i := range simulations {
        for _, v := range simulation(NUMBER_OF_INTERACTIONS, generator) {
            simulations[i] += v
        }   
    }   
    if c == nil {
        return simulations
    }   
    c <- simulations
    return nil 
}

The main() function was updated to run both tests, and output the timed result.

func main() {
    rand.Seed(time.Now().UnixNano())

    nCPU := runtime.NumCPU()
    runtime.GOMAXPROCS(nCPU)
    fmt.Println("Number of CPUs: ", nCPU)

    start := time.Now()
    fmt.Println("Successful interactions: ", len(test(NUMBER_OF_SIMULATIONS, nil)))
    fmt.Println(time.Since(start))

    start = time.Now()
    tests := make([]chan []int, nCPU)
    for i := range tests {
        c := make(chan []int)
        go test(NUMBER_OF_SIMULATIONS/nCPU, c)
        tests[i] = c
    }

    // Concatentate the test results
    results := make([]int, NUMBER_OF_SIMULATIONS)
    for i, c := range tests {
        start := (NUMBER_OF_SIMULATIONS/nCPU) * i
        stop := (NUMBER_OF_SIMULATIONS/nCPU) * (i+1)
        copy(results[start:stop], <-c)
    }
    fmt.Println("Successful interactions: ", len(results))
    fmt.Println(time.Since(start))
}

The output is I received:

> Number of CPUs:  2 
>
> Successful interactions:  1000 
> 1m20.39959s
>
> Successful interactions:  1000
> 41.392299s

Solution 2:

Testing your code on my Linux quad core i7 laptop I get this

Here is a Google Spreadsheet

Screenshot of Google Spreadsheet

This shows that under Linux at least the scaling is very nearly linear per core.

I think there may be two reasons why you aren't seeing this.

The first is that your macbook air only has 2 real cores. It has 4 hyperthreads though which is why it reports 4 as max cpus. A hyperthread typically only gives an extra 15% performance over a single core rather than the 100% you might expect. So stick to benchmarking 1 or 2 CPUs only on the macbook air!

The other reason might be OS X thread performance compared to Linux. They use different threading models which may be affecting performance.

Solution 3:

Your code is sampling a binomial random variable, B(N, p) where N is the number of trials (here 1M), and p is the probability of a successful individual trial (here 0.0003).

One way to do this is to build a table T of cumulative probabilities, where T[i] contains the probability that the total number of trials is less than or equal to i. To then produce a sample, you can pick a uniform random variable (via rand.Float64) and find the first index in the table that contains a probability greater than or equal to it.

It's a little more complicated here because you've got a really big N and a fairly small p, so if you try to build the table you get trouble with really small numbers and arithmetic accuracy. But you can build a smaller table (say 1000 large) and sample it 1000 times to get your 1 million trials.

Here's some code that does all this. It's not too elegant (1000 is hard-coded in), but it generates 1000 simulations in less than a second on my old laptop. It's easy to optimise further, by for example lifting the construction of the BinomialSampler out of the loop, or by using binary search rather than a linear scan to find the table index.

package main

import (
    "fmt"
    "math"
    "math/rand"
)

type BinomialSampler []float64

func (bs BinomialSampler) Sample() int {
    r := rand.Float64()
    for i := 0; i < len(bs); i++ {
        if bs[i] >= r {
            return i
        }
    }
    return len(bs)
}

func NewBinomialSampler(N int, p float64) BinomialSampler {
    r := BinomialSampler(make([]float64, N+1))
    T := 0.0
    choice := 1.0
    for i := 0; i <= N; i++ {
        T += choice * math.Pow(p, float64(i)) * math.Pow(1-p, float64(N-i))
        r[i] = T
        choice *= float64(N-i) / float64(i+1)
    }
    return r
}

func WowSample(N int, p float64) int {
    if N%1000 != 0 {
        panic("N must be a multiple of 1000")
    }
    bs := NewBinomialSampler(1000, p)
    r := 0
    for i := 0; i < N; i += 1000 {
        r += bs.Sample()
    }
    return r
}

func main() {
    for i := 0; i < 1000; i++ {
        fmt.Println(WowSample(1000000, 0.0003))
    }
}

Solution 4:

My results, which show substantial concurrency for 4 CPUs versus 1 CPU:

Intel Core 2 Quad CPU Q8300 @ 2.50GHz x 4

Source code: UPDATE (01/12/13 18:05)

$ go version
go version devel +adf4e96e9aa4 Thu Jan 10 09:57:01 2013 +1100 linux/amd64

$ time  go run temp.go
Number of CPUs:  1
real    0m30.305s
user    0m30.210s
sys     0m0.044s

$ time  go run temp.go
Number of CPUs:  4
real    0m9.980s
user    0m35.146s
sys     0m0.204s