Why does MySQL Cluster not use multiple CPU cores?

I have a problem with the ndbmtd process. With the following configuration I expect both cores of our server's Intel(R) Pentium(R) CPU G6950 @ 2.80GHz to be fully utilized. Unfortunately, this is not happening: only the core with id=0 is used; the second one has no load.

My configuration:

[ndbd default]
MaxNoOfExecutionThreads=2
[ndbd]
HostName=192.168.1.4
NodeId=3
LockExecuteThreadToCPU=0,1
LockMaintThreadsToCPU=0

mpstat -P ALL

08:47:09 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
08:47:11 AM     all     44.64      0.00      1.75      1.25      0.00     52.37
08:47:11 AM       0     89.45      0.00      1.01      2.01      0.00      7.54
08:47:11 AM       1      0.99      0.00      1.98      0.00      0.00     97.03

However, "top" shows 90% usage for ndbmtd process (why?)

My topology: 2 data nodes, ndb_mgmd in a VM, mysqld in a VM.

Is my CPU not capable of this, have I misconfigured something, or is MySQL Cluster simply unable to fully load multi-core processors?


Solution 1:

I checked in with the MySQL Cluster development team, and Frazer Clement provided this detailed response. Let us know how your testing goes. A good place to ask questions specific to MySQL Cluster is the forum: forums.mysql.com/list.php?25

That CPU doesn't have Hyperthreading.

So it has 2 real cores.

According to http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-programs-ndbmtd.html, MaxNoOfExecutionThreads should be set to 2 for a 2-core host.

It also states that when set to 2, there will be:

- 1 local query handler (LQH) thread
- 1 transaction coordinator (TC) thread
- 1 transporter thread
- 1 subscription manager (SUMA) thread

With plain ndbd, all of these functions run in a single thread; with ndbmtd and MaxNoOfExecutionThreads = 2, they are split out as shown. Note that this is a 'functional' split: each thread has a different role and therefore requires a different amount of CPU to do its part of the work. For a given throughput, the amount of CPU consumed by each thread type will differ.

Higher values of MaxNoOfExecutionThreads will increase the number of LQH threads, which should each take an equal share of the 'LQH' work, and be balanced relative to each other. However, the other threads will have different amounts of CPU consumption.

Finally, the LockExecuteThreadToCPU=0,1 setting is used by ndbmtd in a kind of round-robin style. Unfortunately, there are too many execution threads (4) for the number of CPUs provided to give an even balance. So what happens is that the single LQH thread is given one CPU, and the other three threads share the other CPU. This can account for the imbalance seen.

Note that the mapping of threads to CPUs is printed on the stdout of each ndbmtd process (the node's out log) when it starts. Using a similar config, I see the following:

NDBMT: num_threads=4

Instantiating DBSPJ instanceNo=0

Lock threadId = 3936 to CPU id = 0

Lock threadId = 3935 to CPU id = 0

Lock threadId = 3937 to CPU id = 0

WARNING: Too few CPU's specified with LockExecuteThreadToCPU. Only 2 specified but 4 was needed, this may cause contention.

Assigning LQH threads to dedicated CPU(s) and other threads will share remaining

thr: 2 tid: 3940 cpu: 0 OK PGMAN(1) DBACC(1) DBLQH(1) DBTUP(1) BACKUP(1) DBTUX(1) RESTORE(1)

thr: 3 tid: 3933 cpu: 1 OK CMVMI(0)

thr: 1 tid: 3939 cpu: 1 OK BACKUP(0) DBLQH(0) DBACC(0) DBTUP(0) SUMA(0) DBTUX(0) TSMAN(0) LGMAN(0) PGMAN(0) RESTORE(0) DBINFO(0) PGMAN(5)

thr: 0 tid: 3938 cpu: 1 OK DBTC(0) DBDIH(0) DBDICT(0) NDBCNTR(0) QMGR(0) NDBFS(0) TRIX(0) DBUTIL(0) DBSPJ(0)

We can see that one execute thread (3940) is locked to CPU 0, and the others are locked to CPU 1. 3940 is an LQH worker thread (as it has a DBLQH block with a number > 0 (DBLQH(1))).

The CMVMI (network IO receiver), DBLQH(0)/SUMA(0), and DBTC(0) threads are all locked to CPU 1 in this example.

So depending on the traffic used, the amount of CPU consumed on CPU 0 vs CPU 1 will be out of balance. Note that the 'maintenance' threads (LockMaintThreadsToCPU=0) are also locked to CPU 0, which, if CPU 0 is saturated, may make things worse.
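If you want to check the mapping on your own node, the relevant lines can be pulled from the data node's out log. A sketch, assuming the default log name ndb_3_out.log for NodeId 3 and that it lives in the node's DataDir:

grep -E "Lock threadId|cpu:" /path/to/DataDir/ndb_3_out.log    # thread-to-CPU binding lines printed at startup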

If the bottleneck for this traffic type is LQH processing, then increasing the MaxNoOfExecutionThreads to 4 or higher will result in there being 2 LQH 'workers', which will each be assigned a core. However, the other threads will also be using one of the cores, which will limit the resources of the LQH worker on that core.

If LQH workers are not the bottleneck, then having extra LQH workers can reduce the CPU available for other threads, and reduce throughput.
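As a sketch of that experiment, the questioner's config.ini would become something like the following (same host, node ID and CPU locking as in the question; only MaxNoOfExecutionThreads changes):

[ndbd default]
MaxNoOfExecutionThreads=4
[ndbd]
HostName=192.168.1.4
NodeId=3
LockExecuteThreadToCPU=0,1
LockMaintThreadsToCPU=0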

I recommend experimenting with the traffic load, checking the ndbmtd output to understand the mapping, and measuring achievable throughput and latency as well as observing the balance and utilisation of the CPU cores.
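For the CPU-balance part of that measurement, the mpstat command from the question can simply be left running while the test traffic is applied, for example:

mpstat -P ALL 2    # per-core utilisation, refreshed every 2 seconds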

Solution 2:

I think you should set MaxNoOfExecutionThreads=4 when you have 2 CPU cores; this property should be set in the [ndbd] section:

[ndbd]
MaxNoOfExecutionThreads=4

I don't know why this parameter should be set to 2x the number of cores, but it works.
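Whichever value you settle on, ndbmtd only reads MaxNoOfExecutionThreads when it starts, so after editing config.ini the management node has to reload it and the data nodes need at least a rolling restart (check the reference manual for the exact restart type). A rough sketch, with the config path as a placeholder and the node ID taken from the question:

ndb_mgmd -f /path/to/config.ini --reload    # make the management server pick up the edited config
ndb_mgm -e "3 RESTART"                      # rolling-restart data node 3; repeat for the other data node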