How to increase performance of memcpy
Solution 1:
I have found a way to increase speed in this situation. I wrote a multi-threaded version of memcpy, splitting the area to be copied between threads. Here are some performance scaling numbers for a set block size, using the same timing code as found above. I had no idea that the performance, especially for this small size of block, would scale to this many threads. I suspect that this has something to do with the large number of memory controllers (16) on this machine.
Performance (10000x 4MB block memcpy):
1 thread : 1826 MB/sec
2 threads: 3118 MB/sec
3 threads: 4121 MB/sec
4 threads: 10020 MB/sec
5 threads: 12848 MB/sec
6 threads: 14340 MB/sec
8 threads: 17892 MB/sec
10 threads: 21781 MB/sec
12 threads: 25721 MB/sec
14 threads: 25318 MB/sec
16 threads: 19965 MB/sec
24 threads: 13158 MB/sec
32 threads: 12497 MB/sec
I don't understand the huge performance jump between 3 and 4 threads. What would cause a jump like this?
I've included the memcpy code that I wrote below for others who may run into this same issue. Please note that there is no error checking in this code - you may need to add it for your application.
#include <windows.h>
#include <string.h>

#define NUM_CPY_THREADS 4

HANDLE hCopyThreads[NUM_CPY_THREADS] = {0};
HANDLE hCopyStartSemaphores[NUM_CPY_THREADS] = {0};
HANDLE hCopyStopSemaphores[NUM_CPY_THREADS] = {0};

/* Per-thread work item: the slice of the buffer this worker copies. */
typedef struct
{
    int ct;              /* worker index, used to pick its semaphores */
    void * src, * dest;
    size_t size;
} mt_cpy_t;

mt_cpy_t mtParamters[NUM_CPY_THREADS] = {0};

/* Worker loop: wait for the start signal, copy the slice, signal completion. */
DWORD WINAPI thread_copy_proc(LPVOID param)
{
    mt_cpy_t * p = (mt_cpy_t * ) param;

    while(1)
    {
        WaitForSingleObject(hCopyStartSemaphores[p->ct], INFINITE);
        memcpy(p->dest, p->src, p->size);
        ReleaseSemaphore(hCopyStopSemaphores[p->ct], 1, NULL);
    }

    return 0;
}

/* Create the semaphores and worker threads once, before the first copy. */
int startCopyThreads()
{
    for(int ctr = 0; ctr < NUM_CPY_THREADS; ctr++)
    {
        hCopyStartSemaphores[ctr] = CreateSemaphore(NULL, 0, 1, NULL);
        hCopyStopSemaphores[ctr] = CreateSemaphore(NULL, 0, 1, NULL);
        mtParamters[ctr].ct = ctr;
        hCopyThreads[ctr] = CreateThread(0, 0, thread_copy_proc, &mtParamters[ctr], 0, NULL);
    }

    return 0;
}

/* Not reentrant: the shared parameter array allows one copy at a time. */
void * mt_memcpy(void * dest, void * src, size_t bytes)
{
    //set up parameters: worker ctr copies [ctr * bytes / N, (ctr + 1) * bytes / N)
    for(int ctr = 0; ctr < NUM_CPY_THREADS; ctr++)
    {
        mtParamters[ctr].dest = (char *) dest + ctr * bytes / NUM_CPY_THREADS;
        mtParamters[ctr].src = (char *) src + ctr * bytes / NUM_CPY_THREADS;
        mtParamters[ctr].size = (ctr + 1) * bytes / NUM_CPY_THREADS - ctr * bytes / NUM_CPY_THREADS;
    }

    //release semaphores to start computation
    for(int ctr = 0; ctr < NUM_CPY_THREADS; ctr++)
        ReleaseSemaphore(hCopyStartSemaphores[ctr], 1, NULL);

    //wait for all threads to finish
    WaitForMultipleObjects(NUM_CPY_THREADS, hCopyStopSemaphores, TRUE, INFINITE);

    return dest;
}

/* The workers never leave their loop, so they are terminated forcibly.
   Only call this when no copy is in progress. */
int stopCopyThreads()
{
    for(int ctr = 0; ctr < NUM_CPY_THREADS; ctr++)
    {
        TerminateThread(hCopyThreads[ctr], 0);
        CloseHandle(hCopyThreads[ctr]);   /* was leaked in the first version */
        CloseHandle(hCopyStartSemaphores[ctr]);
        CloseHandle(hCopyStopSemaphores[ctr]);
    }

    return 0;
}
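To see why the slices in mt_memcpy cover the whole buffer even when the size is not a multiple of the thread count, here is the same arithmetic pulled out into a standalone helper. This is only a sketch: `chunk_range` is a name made up for illustration, and it assumes `i * bytes` does not overflow `size_t`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: byte range handled by worker `i` of `n`,
   computed the same way mt_memcpy computes its per-thread slices. */
void chunk_range(size_t bytes, size_t n, size_t i, size_t *offset, size_t *length)
{
    assert(n > 0 && i < n);
    size_t begin = i * bytes / n;        /* first byte of this slice */
    size_t end = (i + 1) * bytes / n;    /* first byte of the next slice */
    *offset = begin;
    *length = end - begin;
}
```

Because each slice ends exactly where the next one begins, and the last one ends at `bytes`, the slices tile the buffer with no gaps or overlap. For 10 bytes over 4 workers the lengths come out as 2, 3, 2, 3.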
Solution 2:
I'm not sure whether it's done at run time or whether you have to do it at compile time, but you should have SSE or similar extensions enabled, as the vector unit can often write 128 bits to memory at a time compared to 64 bits for the CPU.
Try this implementation.
Yeah, and make sure that both the source and destination are aligned to 128 bits. If your source and destination are not aligned with respect to each other, your memcpy() will have to do some serious magic. :)
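As a rough illustration of the aligned 128-bit copy described here, a sketch using SSE2 intrinsics follows. The function name `copy_sse2` is made up for this example, and it assumes an x86 compiler that provides `<emmintrin.h>`. Many library memcpy implementations already use vector instructions, so measure before assuming this is any faster.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copies 16 bytes per iteration when both pointers are
   16-byte aligned, and falls back to plain memcpy otherwise. */
void *copy_sse2(void *dest, const void *src, size_t bytes)
{
    assert(dest != NULL && src != NULL);

    /* Either pointer misaligned: let the library routine handle it. */
    if ((((uintptr_t)dest | (uintptr_t)src) & 15u) != 0)
        return memcpy(dest, src, bytes);

    const __m128i *s = (const __m128i *)src;
    __m128i *d = (__m128i *)dest;
    size_t blocks = bytes / 16;

    for (size_t i = 0; i < blocks; i++)
        _mm_store_si128(d + i, _mm_load_si128(s + i));  /* aligned load/store */

    /* Copy the 0..15 trailing bytes that do not fill a whole block. */
    memcpy((char *)dest + blocks * 16, (const char *)src + blocks * 16, bytes % 16);
    return dest;
}
```

For large buffers that will not be read again soon, `_mm_stream_si128` in place of `_mm_store_si128` may be worth trying, since it writes around the cache.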
Solution 3:
One thing to be aware of is that your process (and hence the performance of memcpy()) is impacted by the OS scheduling of tasks - it's hard to say how much of a factor this is in your timings, but it is difficult to control. The device DMA operation isn't subject to this, since it isn't running on the CPU once it's kicked off. Since your application is an actual real-time application though, you might want to experiment with Windows' process/thread priority settings if you haven't already. Just keep in mind that you have to be careful about this because it can have a really negative impact on other processes (and the user experience on the machine).
Another thing to keep in mind is that the OS memory virtualization might have an impact here - if the memory pages you're copying to aren't actually backed by physical RAM pages, the memcpy() operation will fault to the OS to get that physical backing in place. Your DMA pages are likely to be locked into physical memory (since they have to be for the DMA operation), so the source memory to memcpy() is likely not an issue in this regard. You might consider using the Win32 VirtualAlloc() API to ensure that your destination memory for the memcpy() is committed (I think VirtualAlloc() is the right API for this, but there might be a better one that I'm forgetting - it's been a while since I've had a need to do anything like this).
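A minimal sketch of that idea: allocate the destination with VirtualAlloc() and touch every page once before the timed copies, so the page faults happen up front. The names `alloc_dest`, `free_dest` and `prefault` are invented for this example, and the non-Windows branch is only a stand-in so the sketch builds elsewhere. As far as I know, committed pages are still only given physical backing on first access, which is why the explicit touch is included.

```c
#include <assert.h>
#include <stddef.h>

#ifdef _WIN32
#include <windows.h>
/* Reserve and commit the destination buffer in one call. */
void *alloc_dest(size_t bytes)
{
    return VirtualAlloc(NULL, bytes, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
}
void free_dest(void *p) { VirtualFree(p, 0, MEM_RELEASE); }
#else
#include <stdlib.h>
/* Stand-in for other platforms (assumption: plain malloc is good enough here). */
void *alloc_dest(size_t bytes) { return malloc(bytes); }
void free_dest(void *p) { free(p); }
#endif

/* Write one byte per page so each page is faulted in before the real copy. */
void prefault(void *buf, size_t bytes, size_t page_size)
{
    assert(buf != NULL && page_size > 0);
    volatile unsigned char *p = (volatile unsigned char *)buf;
    for (size_t off = 0; off < bytes; off += page_size)
        p[off] = 0;
}
```

Typical use would be `prefault(dest, size, 4096)` once after allocation. The 4096 is an assumption, and on Windows the real page size can be read from GetSystemInfo().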
Finally, see if you can use the technique explained by Skizz to avoid the memcpy() altogether - that's your best bet if resources permit.
Solution 4:
You have a few barriers to obtaining the required memory performance:
Bandwidth - there is a limit to how quickly data can move from memory to the CPU and back again. According to this Wikipedia article, 266MHz DDR3 RAM has an upper limit of around 17GB/s. Now, with a memcpy you need to halve this to get your maximum transfer rate, since the data is read and then written. From your benchmark results, it looks like you're not running the fastest possible RAM in your system. If you can afford it, upgrade the motherboard / RAM (and it won't be cheap: Overclockers in the UK currently have 3x4GB PC16000 at £400).
The OS - Windows is a preemptive multitasking OS, so every so often your process will be suspended to allow other processes to have a look in and do stuff. This will clobber your caches and stall your transfer. In the worst case your entire process could be paged out to disk!
The CPU - the data being moved has a long way to go: RAM -> L2 Cache -> L1 Cache -> CPU -> L1 -> L2 -> RAM. There may even be an L3 cache. If you want to involve the CPU, you really want to be loading L2 whilst copying L1. Unfortunately, modern CPUs can run through an L1 cache block quicker than the time taken to load the L1. The CPU has a memory controller that helps a lot in these cases where you're streaming data into the CPU sequentially, but you're still going to have problems.
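To compare a machine against the bandwidth bound above (about half of the theoretical figure, so roughly 8.5GB/s for the 17GB/s example), a simple measurement loop is enough. This is only a sketch: `memcpy_mb_per_sec` is a made-up name, and it uses `clock()`, which reports processor time. For a multi-threaded copy a wall-clock timer such as QueryPerformanceCounter on Windows would be the better choice.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: copies `bytes` from src to dest `iterations` times
   and returns the measured rate in MB/sec (0.0 if the run was too short
   for the timer to resolve). */
double memcpy_mb_per_sec(void *dest, const void *src, size_t bytes, int iterations)
{
    assert(dest != NULL && src != NULL && iterations > 0);

    clock_t start = clock();
    for (int i = 0; i < iterations; i++)
        memcpy(dest, src, bytes);
    clock_t stop = clock();

    double seconds = (double)(stop - start) / CLOCKS_PER_SEC;
    if (seconds <= 0.0)
        return 0.0;

    return ((double)bytes * iterations) / (1024.0 * 1024.0) / seconds;
}
```

Use a block much larger than the last-level cache and enough iterations to run for a second or more, otherwise the figure mostly reflects cache speed and timer resolution.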
Of course, the fastest way to do something is to not do it. Can the captured data be written anywhere in RAM, or is the buffer at a fixed location? If you can write it anywhere, then you don't need the memcpy at all. If it's fixed, could you process the data in place and use a double buffer type system? That is, start capturing data and when the buffer is half full, start processing the first half of the data. When the buffer's full, start writing captured data to the start and process the second half. This requires that the algorithm can process the data faster than the capture card produces it. It also assumes that the data is discarded after processing. Effectively, this is a memcpy with a transformation as part of the copy process, so you've got:
load -> transform -> save
\--/                 \--/
capture card         RAM
buffer
instead of:
load -> save -> load -> transform -> save
\-----------/
memcpy from
capture card
buffer to RAM
Or get faster RAM!
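The double buffer scheme described above can be sketched as a single-threaded simulation. Everything here is illustrative: `capture_and_process`, `process_half` and the tiny buffer size are made up, the byte store stands in for the capture card's DMA write, and summing bytes stands in for the real transform.

```c
#include <assert.h>
#include <stddef.h>

enum { BUF_SIZE = 8, HALF = BUF_SIZE / 2 };  /* tiny sizes, for illustration only */

/* Hypothetical processing step: sums the bytes of one half, in place. */
static unsigned process_half(const unsigned char *half, size_t n)
{
    unsigned sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += half[i];
    return sum;
}

/* Simulates a capture device filling `buf` cyclically with `total` bytes.
   Each time one half becomes full it is processed where it lies, while the
   device would carry on writing into the other half. A trailing partial
   half is not processed. */
unsigned capture_and_process(const unsigned char *input, size_t total)
{
    assert(input != NULL);
    unsigned char buf[BUF_SIZE];
    unsigned result = 0;

    for (size_t i = 0; i < total; i++) {
        size_t pos = i % BUF_SIZE;
        buf[pos] = input[i];                       /* stands in for the DMA write */
        if (pos == HALF - 1)
            result += process_half(buf, HALF);           /* first half is full */
        else if (pos == BUF_SIZE - 1)
            result += process_half(buf + HALF, HALF);    /* second half is full */
    }
    return result;
}
```

In a real system the producer and consumer run concurrently, so the consumer has to finish a half before the device wraps around and overwrites it.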
EDIT: Another option is to process the data between the data source and the PC - could you put a DSP / FPGA in there at all? Custom hardware will always be faster than a general purpose CPU.
Another thought: It's been a while since I've done any high performance graphics stuff, but could you DMA the data into the graphics card and then DMA it out again? You could even take advantage of CUDA to do some of the processing. This would take the CPU out of the memory transfer loop altogether.
Solution 5:
First of all, you need to check that the memory is aligned on a 16-byte boundary, otherwise you incur penalties. This is the most important thing.
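A quick way to check this at run time is to look at the low bits of the address. The helper below is a sketch with a made-up name, `is_aligned`. It assumes the alignment is a power of two and that converting a pointer to `uintptr_t` yields the address, which holds on mainstream platforms.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if `p` is a multiple of `alignment`,
   which must be a power of two (16 for the case discussed here). */
int is_aligned(const void *p, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)p & (alignment - 1)) == 0;
}
```

Checking both `is_aligned(src, 16)` and `is_aligned(dest, 16)` before timing tells you whether alignment can be ruled out as the cause of a slow copy.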
If you don't need a standard-compliant solution, you could check whether things improve by using a compiler-specific extension such as memcpy64 (check your compiler documentation to see if something like it is available). The fact is that memcpy must be able to deal with single-byte copies, but moving 4 or 8 bytes at a time is much faster if you don't have this restriction.
Again, is it an option for you to write inline assembly code?