1D or 2D array, what's faster?
I'm in need of representing a 2D field (axes x, y) and I face a problem: Should I use an 1D array or a 2D array?
I can imagine, that recalculating indices for 1D arrays (y + x*n) could be slower than using 2D array (x, y) but I could image that 1D could be in CPU cache..
I did some googling, but only found pages regarding static array (and stating that 1D and 2D are basically the same). But my arrays must me dynamic.
So, what's
- faster,
- smaller (RAM)
dynamic 1D arrays or dynamic 2D arrays?
Solution 1:
tl;dr : You should probably use a one-dimensional approach.
Note: One cannot dig into detail affecting performance when comparing dynamic 1d or dynamic 2d storage patterns without filling books since the performance of code is dependent one a very large number of parameters. Profile if possible.
1. What's faster?
For dense matrices the 1D approach is likely to be faster since it offers better memory locality and less allocation and deallocation overhead.
2. What's smaller?
Dynamic-1D consumes less memory than the 2D approach. The latter also requires more allocations.
Remarks
I laid out a pretty long answer beneath with several reasons but I want to make some remarks on your assumptions first.
I can imagine, that recalculating indices for 1D arrays (y + x*n) could be slower than using 2D array (x, y)
Let's compare these two functions:
int get_2d (int **p, int r, int c) { return p[r][c]; }
int get_1d (int *p, int r, int c) { return p[c + C*r]; }
The (non-inlined) assembly generated by Visual Studio 2015 RC for those functions (with optimizations turned on) is:
?get_1d@@YAHPAHII@Z PROC
push ebp
mov ebp, esp
mov eax, DWORD PTR _c$[ebp]
lea eax, DWORD PTR [eax+edx*4]
mov eax, DWORD PTR [ecx+eax*4]
pop ebp
ret 0
?get_2d@@YAHPAPAHII@Z PROC
push ebp
mov ebp, esp
mov ecx, DWORD PTR [ecx+edx*4]
mov eax, DWORD PTR _c$[ebp]
mov eax, DWORD PTR [ecx+eax*4]
pop ebp
ret 0
The difference is mov
(2d) vs. lea
(1d).
The former has a latency of 3 cycles and a a maximum throughput of 2 per cycle while the latter has a latency of 2 cycles and a maximum throughput of 3 per cycle. (According to Instruction tables - Agner Fog
Since the differences are minor, I think there should not be a big performance difference arising from index recalculation. I expect it to be very unlikely to identify this difference itself to be the bottleneck in any program.
This brings us to the next (and more interesting) point:
... but I could image that 1D could be in CPU cache ...
True, but 2d could be in CPU cache, too. See The Downsides: Memory locality for an explanation why 1d is still better.
The long answer, or why dynamic 2 dimensional data storage (pointer-to-pointer or vector-of-vector) is "bad" for simple / small matrices.
Note: This is about dynamic arrays/allocation schemes [malloc/new/vector etc.]. A static 2d array is a contiguous block of memory and therefore not subject to the downsides I'm going to present here.
The Problem
To be able to understand why a dynamic array of dynamic arrays or a vector of vectors is most likely not the data storage pattern of choice, you are required to understand the memory layout of such structures.
Example case using pointer to pointer syntax
int main (void)
{
// allocate memory for 4x4 integers; quick & dirty
int ** p = new int*[4];
for (size_t i=0; i<4; ++i) p[i] = new int[4];
// do some stuff here, using p[x][y]
// deallocate memory
for (size_t i=0; i<4; ++i) delete[] p[i];
delete[] p;
}
The downsides
Memory locality
For this “matrix” you allocate one block of four pointers and four blocks of four integers. All of the allocations are unrelated and can therefore result in an arbitrary memory position.
The following image will give you an idea of how the memory may look like.
For the real 2d case:
- The violet square is the memory position occupied by
p
itself. - The green squares assemble the memory region
p
points to (4 xint*
). - The 4 regions of 4 contiguous blue squares are the ones pointed to by each
int*
of the green region
For the 2d mapped on 1d case:
- The green square is the only required pointer
int *
- The blue squares ensemble the memory region for all matrix elements (16 x
int
).
This means that (when using the left layout) you will probably observe worse performance than for a contiguous storage pattern (as seen on the right), due to caching for instance.
Let's say a cache line is "the amount of data transfered into the cache at once" and let's imagine a program accessing the whole matrix one element after another.
If you have a properly aligned 4 times 4 matrix of 32 bit values, a processor with a 64 byte cache line (typical value) is able to "one-shot" the data (4*4*4 = 64 bytes). If you start processing and the data isn't already in the cache you'll face a cache miss and the data will be fetched from main memory. This load can fetch the whole matrix at once since it fits into a cache line, if and only if it is contiguously stored (and properly aligned). There will probably not be any more misses while processing that data.
In case of a dynamic, "real two-dimensional" system with unrelated locations of each row/column, the processor needs to load every memory location seperately. Eventhough only 64 bytes are required, loading 4 cache lines for 4 unrelated memory positions would -in a worst case scenario- actually transfer 256 bytes and waste 75% throughput bandwidth. If you process the data using the 2d-scheme you'll again (if not already cached) face a cache miss on the first element. But now, only the first row/colum will be in the cache after the first load from main memory because all other rows are located somewhere else in memory and not adjacent to the first one. As soon as you reach a new row/column there will again be a cache miss and the next load from main memory is performed.
Long story short: The 2d pattern has a higher chance of cache misses with the 1d scheme offering better potential for performance due to locality of the data.
Frequent Allocation / Deallocation
- As many as
N + 1
(4 + 1 = 5) allocations (using either new, malloc, allocator::allocate or whatever) are necessary to create the desired NxM (4×4) matrix. - The same number of proper, respective deallocation operations must be applied as well.
Therefore, it is more costly to create/copy such matrices in contrast to a single allocation scheme.
This is getting even worse with a growing number of rows.
Memory consumption overhead
I'll asumme a size of 32 bits for int and 32 bits for pointers. (Note: System dependency.)
Let's remember: We want to store a 4×4 int matrix which means 64 bytes.
For a NxM matrix, stored with the presented pointer-to-pointer scheme we consume
-
N*M*sizeof(int)
[the actual blue data] + -
N*sizeof(int*)
[the green pointers] + -
sizeof(int**)
[the violet variable p] bytes.
That makes 4*4*4 + 4*4 + 4 = 84
bytes in case of the present example and it gets even worse when using std::vector<std::vector<int>>
.
It will require N * M * sizeof(int)
+ N * sizeof(vector<int>)
+ sizeof(vector<vector<int>>)
bytes, that is 4*4*4 + 4*16 + 16 = 144
bytes in total, intead of 64 bytes for 4 x 4 int.
In addition -depending on the used allocator- each single allocation may well (and most likely will) have another 16 bytes of memory overhead. (Some “Infobytes” which store the number of allocated bytes for the purpose of proper deallocation.)
This means the worst case is:
N*(16+M*sizeof(int)) + 16+N*sizeof(int*) + sizeof(int**)
= 4*(16+4*4) + 16+4*4 + 4 = 164 bytes ! _Overhead: 156%_
The share of the overhead will reduce as the size of the matrix grows but will still be present.
Risk of memory leaks
The bunch of allocations requires an appropriate exception handling in order to avoid memory leaks if one of the allocations will fail! You’ll need to keep track of allocated memory blocks and you must not forget them when deallocating the memory.
If new
runs of of memory and the next row cannot be allocated (especially likely when the matrix is very large), a std::bad_alloc
is thrown by new
.
Example:
In the above mentioned new/delete example, we'll face some more code if we want to avoid leaks in case of bad_alloc
exceptions.
// allocate memory for 4x4 integers; quick & dirty
size_t const N = 4;
// we don't need try for this allocation
// if it fails there is no leak
int ** p = new int*[N];
size_t allocs(0U);
try
{ // try block doing further allocations
for (size_t i=0; i<N; ++i)
{
p[i] = new int[4]; // allocate
++allocs; // advance counter if no exception occured
}
}
catch (std::bad_alloc & be)
{ // if an exception occurs we need to free out memory
for (size_t i=0; i<allocs; ++i) delete[] p[i]; // free all alloced p[i]s
delete[] p; // free p
throw; // rethrow bad_alloc
}
/*
do some stuff here, using p[x][y]
*/
// deallocate memory accoding to the number of allocations
for (size_t i=0; i<allocs; ++i) delete[] p[i];
delete[] p;
Summary
There are cases where "real 2d" memory layouts fit and make sense (i.e. if the number of columns per row is not constant) but in the most simple and common 2D data storage cases they just bloat the complexity of your code and reduce the performance and memory efficiency of your program.
Alternative
You should use a contiguous block of memory and map your rows onto that block.
The "C++ way" of doing it is probably to write a class that manages your memory while considering important things like
- What is The Rule of Three?
- What is meant by Resource Acquisition is Initialization (RAII)?
- C++ concept: Container (on cppreference.com)
Example
To provide an idea of how such a class may look like, here's a simple example with some basic features:
- 2d-size-constructible
- 2d-resizable
-
operator(size_t, size_t)
for 2d- row major element access -
at(size_t, size_t)
for checked 2d-row major element access - Fulfills Concept requirements for Container
Source:
#include <vector>
#include <algorithm>
#include <iterator>
#include <utility>
namespace matrices
{
template<class T>
class simple
{
public:
// misc types
using data_type = std::vector<T>;
using value_type = typename std::vector<T>::value_type;
using size_type = typename std::vector<T>::size_type;
// ref
using reference = typename std::vector<T>::reference;
using const_reference = typename std::vector<T>::const_reference;
// iter
using iterator = typename std::vector<T>::iterator;
using const_iterator = typename std::vector<T>::const_iterator;
// reverse iter
using reverse_iterator = typename std::vector<T>::reverse_iterator;
using const_reverse_iterator = typename std::vector<T>::const_reverse_iterator;
// empty construction
simple() = default;
// default-insert rows*cols values
simple(size_type rows, size_type cols)
: m_rows(rows), m_cols(cols), m_data(rows*cols)
{}
// copy initialized matrix rows*cols
simple(size_type rows, size_type cols, const_reference val)
: m_rows(rows), m_cols(cols), m_data(rows*cols, val)
{}
// 1d-iterators
iterator begin() { return m_data.begin(); }
iterator end() { return m_data.end(); }
const_iterator begin() const { return m_data.begin(); }
const_iterator end() const { return m_data.end(); }
const_iterator cbegin() const { return m_data.cbegin(); }
const_iterator cend() const { return m_data.cend(); }
reverse_iterator rbegin() { return m_data.rbegin(); }
reverse_iterator rend() { return m_data.rend(); }
const_reverse_iterator rbegin() const { return m_data.rbegin(); }
const_reverse_iterator rend() const { return m_data.rend(); }
const_reverse_iterator crbegin() const { return m_data.crbegin(); }
const_reverse_iterator crend() const { return m_data.crend(); }
// element access (row major indexation)
reference operator() (size_type const row,
size_type const column)
{
return m_data[m_cols*row + column];
}
const_reference operator() (size_type const row,
size_type const column) const
{
return m_data[m_cols*row + column];
}
reference at() (size_type const row, size_type const column)
{
return m_data.at(m_cols*row + column);
}
const_reference at() (size_type const row, size_type const column) const
{
return m_data.at(m_cols*row + column);
}
// resizing
void resize(size_type new_rows, size_type new_cols)
{
// new matrix new_rows times new_cols
simple tmp(new_rows, new_cols);
// select smaller row and col size
auto mc = std::min(m_cols, new_cols);
auto mr = std::min(m_rows, new_rows);
for (size_type i(0U); i < mr; ++i)
{
// iterators to begin of rows
auto row = begin() + i*m_cols;
auto tmp_row = tmp.begin() + i*new_cols;
// move mc elements to tmp
std::move(row, row + mc, tmp_row);
}
// move assignment to this
*this = std::move(tmp);
}
// size and capacity
size_type size() const { return m_data.size(); }
size_type max_size() const { return m_data.max_size(); }
bool empty() const { return m_data.empty(); }
// dimensionality
size_type rows() const { return m_rows; }
size_type cols() const { return m_cols; }
// data swapping
void swap(simple &rhs)
{
using std::swap;
m_data.swap(rhs.m_data);
swap(m_rows, rhs.m_rows);
swap(m_cols, rhs.m_cols);
}
private:
// content
size_type m_rows{ 0u };
size_type m_cols{ 0u };
data_type m_data{};
};
template<class T>
void swap(simple<T> & lhs, simple<T> & rhs)
{
lhs.swap(rhs);
}
template<class T>
bool operator== (simple<T> const &a, simple<T> const &b)
{
if (a.rows() != b.rows() || a.cols() != b.cols())
{
return false;
}
return std::equal(a.begin(), a.end(), b.begin(), b.end());
}
template<class T>
bool operator!= (simple<T> const &a, simple<T> const &b)
{
return !(a == b);
}
}
Note several things here:
-
T
needs to fulfill the requirements of the usedstd::vector
member functions -
operator()
doesn't do any "of of range" checks - No need to manage data on your own
- No destructor, copy constructor or assignment operators required
So you don't have to bother about proper memory handling for each application but only once for the class you write.
Restrictions
There may be cases where a dynamic "real" two dimensional structure is favourable. This is for instance the case if
- the matrix is very large and sparse (if any of the rows do not even need to be allocated but can be handled using a nullptr) or if
- the rows do not have the same number of columns (that is if you don't have a matrix at all but another two-dimensional construct).
Solution 2:
Unless you are talking about static arrays, 1D is faster.
Here’s the memory layout of a 1D array (std::vector<T>
):
+---+---+---+---+---+---+---+---+---+
| | | | | | | | | |
+---+---+---+---+---+---+---+---+---+
And here’s the same for a dynamic 2D array (std::vector<std::vector<T>>
):
+---+---+---+
| * | * | * |
+-|-+-|-+-|-+
| | V
| | +---+---+---+
| | | | | |
| | +---+---+---+
| V
| +---+---+---+
| | | | |
| +---+---+---+
V
+---+---+---+
| | | |
+---+---+---+
Clearly the 2D case loses the cache locality and uses more memory. It also introduces an extra indirection (and thus an extra pointer to follow) but the first array has the overhead of calculating the indices so these even out more or less.
Solution 3:
1D and 2D Static Arrays
-
Size: Both will require the same amount of memory.
-
Speed: You can assume that there will be no speed difference because the memory for both of these arrays should be contiguous (The whole 2D array should appear as one chunk in memory rather than a bunch of chunks spread across memory). (This could be compiler dependent however.)
1D and 2D Dynamic Arrays
-
Size: The 2D array will require a tiny bit more memory than the 1D array because of the pointers needed in the 2D array to point to the set of allocated 1D arrays. (This tiny bit is only tiny when we're talking about really big arrays. For small arrays, the tiny bit could be pretty big relatively speaking.)
-
Speed: The 1D array may be faster than the 2D array because the memory for the 2D array would not be contiguous, so cache misses would become a problem.
Use what works and seems most logical, and if you face speed problems, then refactor.