Why does adding concurrency slow down this golang code?
Solution 1:
The issue seems to come from your use of rand.Float64()
, which uses a shared global object with a Mutex lock on it.
Instead, if for each CPU you create a separate rand.New()
, pass it through to the interactions()
, and use it to create the Float64()
, there's a massive improvement.
Update to show the changes to the new example code in the question that now uses rand.New()
The test()
function was modified to either use a given channel, or return the result.
func test(n int, c chan []int) []int {
source := rand.NewSource(time.Now().UnixNano())
generator := rand.New(source)
simulations := make([]int, n)
for i := range simulations {
for _, v := range simulation(NUMBER_OF_INTERACTIONS, generator) {
simulations[i] += v
}
}
if c == nil {
return simulations
}
c <- simulations
return nil
}
The main()
function was updated to run both tests, and output the timed result.
func main() {
rand.Seed(time.Now().UnixNano())
nCPU := runtime.NumCPU()
runtime.GOMAXPROCS(nCPU)
fmt.Println("Number of CPUs: ", nCPU)
start := time.Now()
fmt.Println("Successful interactions: ", len(test(NUMBER_OF_SIMULATIONS, nil)))
fmt.Println(time.Since(start))
start = time.Now()
tests := make([]chan []int, nCPU)
for i := range tests {
c := make(chan []int)
go test(NUMBER_OF_SIMULATIONS/nCPU, c)
tests[i] = c
}
// Concatentate the test results
results := make([]int, NUMBER_OF_SIMULATIONS)
for i, c := range tests {
start := (NUMBER_OF_SIMULATIONS/nCPU) * i
stop := (NUMBER_OF_SIMULATIONS/nCPU) * (i+1)
copy(results[start:stop], <-c)
}
fmt.Println("Successful interactions: ", len(results))
fmt.Println(time.Since(start))
}
The output is I received:
> Number of CPUs: 2 > > Successful interactions: 1000 > 1m20.39959s > > Successful interactions: 1000 > 41.392299s
Solution 2:
Testing your code on my Linux quad core i7 laptop I get this
Here is a Google Spreadsheet
This shows that under Linux at least the scaling is very nearly linear per core.
I think there may be two reasons why you aren't seeing this.
The first is that your macbook air only has 2 real cores. It has 4 hyperthreads though which is why it reports 4 as max cpus. A hyperthread typically only gives an extra 15% performance over a single core rather than the 100% you might expect. So stick to benchmarking 1 or 2 CPUs only on the macbook air!
The other reason might be OS X thread performance compared to Linux. They use different threading models which may be affecting performance.
Solution 3:
Your code is sampling a binomial random variable, B(N, p) where N is the number of trials (here 1M), and p is the probability of a successful individual trial (here 0.0003).
One way to do this is to build a table T of cumulative probabilities, where T[i] contains the probability that the total number of trials is less than or equal to i. To then produce a sample, you can pick a uniform random variable (via rand.Float64) and find the first index in the table that contains a probability greater than or equal to it.
It's a little more complicated here because you've got a really big N and a fairly small p, so if you try to build the table you get trouble with really small numbers and arithmetic accuracy. But you can build a smaller table (say 1000 large) and sample it 1000 times to get your 1 million trials.
Here's some code that does all this. It's not too elegant (1000 is hard-coded in), but it generates 1000 simulations in less than a second on my old laptop. It's easy to optimise further, by for example lifting the construction of the BinomialSampler out of the loop, or by using binary search rather than a linear scan to find the table index.
package main
import (
"fmt"
"math"
"math/rand"
)
type BinomialSampler []float64
func (bs BinomialSampler) Sample() int {
r := rand.Float64()
for i := 0; i < len(bs); i++ {
if bs[i] >= r {
return i
}
}
return len(bs)
}
func NewBinomialSampler(N int, p float64) BinomialSampler {
r := BinomialSampler(make([]float64, N+1))
T := 0.0
choice := 1.0
for i := 0; i <= N; i++ {
T += choice * math.Pow(p, float64(i)) * math.Pow(1-p, float64(N-i))
r[i] = T
choice *= float64(N-i) / float64(i+1)
}
return r
}
func WowSample(N int, p float64) int {
if N%1000 != 0 {
panic("N must be a multiple of 1000")
}
bs := NewBinomialSampler(1000, p)
r := 0
for i := 0; i < N; i += 1000 {
r += bs.Sample()
}
return r
}
func main() {
for i := 0; i < 1000; i++ {
fmt.Println(WowSample(1000000, 0.0003))
}
}
Solution 4:
My results, which show substantial concurrency for 4 CPUs versus 1 CPU:
Intel Core 2 Quad CPU Q8300 @ 2.50GHz x 4
Source code: UPDATE (01/12/13 18:05)
$ go version
go version devel +adf4e96e9aa4 Thu Jan 10 09:57:01 2013 +1100 linux/amd64
$ time go run temp.go
Number of CPUs: 1
real 0m30.305s
user 0m30.210s
sys 0m0.044s
$ time go run temp.go
Number of CPUs: 4
real 0m9.980s
user 0m35.146s
sys 0m0.204s