Any optimization for random access on a very big array when the value in 95% of cases is either 0 or 1?

Solution 1:

A simple possibility that comes to mind is to keep a compressed array of 2 bits per value for the common cases, and a separated 4 byte per value (24 bit for original element index, 8 bit for actual value, so (idx << 8) | value)) sorted array for the other ones.

When you look up a value, you first do a lookup in the 2bpp array (O(1)); if you find 0, 1 or 2 it's the value you want; if you find 3 it means that you have to look it up in the secondary array. Here you'll perform a binary search to look for the index of your interest left-shifted by 8 (O(log(n) with a small n, as this should be the 1%), and extract the value from the 4-byte thingie.

std::vector<uint8_t> main_arr;
std::vector<uint32_t> sec_arr;

uint8_t lookup(unsigned idx) {
    // extract the 2 bits of our interest from the main array
    uint8_t v = (main_arr[idx>>2]>>(2*(idx&3)))&3;
    // usual (likely) case: value between 0 and 2
    if(v != 3) return v;
    // bad case: lookup the index<<8 in the secondary array
    // lower_bound finds the first >=, so we don't need to mask out the value
    auto ptr = std::lower_bound(sec_arr.begin(), sec_arr.end(), idx<<8);
#ifdef _DEBUG
    // some coherency checks
    if(ptr == sec_arr.end()) std::abort();
    if((*ptr >> 8) != idx) std::abort();
#endif
    // extract our 8-bit value from the 32 bit (index, value) thingie
    return (*ptr) & 0xff;
}

void populate(uint8_t *source, size_t size) {
    main_arr.clear(); sec_arr.clear();
    // size the main storage (round up)
    main_arr.resize((size+3)/4);
    for(size_t idx = 0; idx < size; ++idx) {
        uint8_t in = source[idx];
        uint8_t &target = main_arr[idx>>2];
        // if the input doesn't fit, cap to 3 and put in secondary storage
        if(in >= 3) {
            // top 24 bits: index; low 8 bit: value
            sec_arr.push_back((idx << 8) | in);
            in = 3;
        }
        // store in the target according to the position
        target |= in << ((idx & 3)*2);
    }
}

For an array such as the one you proposed, this should take 10000000 / 4 = 2500000 bytes for the first array, plus 10000000 * 1% * 4 B = 400000 bytes for the second array; hence 2900000 bytes, i.e. less than one third of the original array, and the most used portion is all kept together in memory, which should be good for caching (it may even fit L3).

If you need more than 24-bit addressing, you'll have to tweak the "secondary storage"; a trivial way to extend it is to have a 256 element pointer array to switch over the top 8 bits of the index and forward to a 24-bit indexed sorted array as above.

Quick benchmark

#include <algorithm>
#include <vector>
#include <stdint.h>
#include <chrono>
#include <stdio.h>
#include <math.h>

using namespace std::chrono;

/// XorShift32 generator; extremely fast, 2^32-1 period, way better quality
/// than LCG but fail some test suites
struct XorShift32 {
    /// This stuff allows to use this class wherever a library function
    /// requires a UniformRandomBitGenerator (e.g. std::shuffle)
    typedef uint32_t result_type;
    static uint32_t min() { return 1; }
    static uint32_t max() { return uint32_t(-1); }

    /// PRNG state
    uint32_t y;

    /// Initializes with seed
    XorShift32(uint32_t seed = 0) : y(seed) {
        if(y == 0) y = 2463534242UL;
    }

    /// Returns a value in the range [1, 1<<32)
    uint32_t operator()() {
        y ^= (y<<13);
        y ^= (y>>17);
        y ^= (y<<15);
        return y;
    }

    /// Returns a value in the range [0, limit); this conforms to the RandomFunc
    /// requirements for std::random_shuffle
    uint32_t operator()(uint32_t limit) {
        return (*this)()%limit;
    }
};

struct mean_variance {
    double rmean = 0.;
    double rvariance = 0.;
    int count = 0;

    void operator()(double x) {
        ++count;
        double ormean = rmean;
        rmean     += (x-rmean)/count;
        rvariance += (x-ormean)*(x-rmean);
    }

    double mean()     const { return rmean; }
    double variance() const { return rvariance/(count-1); }
    double stddev()   const { return std::sqrt(variance()); }
};

std::vector<uint8_t> main_arr;
std::vector<uint32_t> sec_arr;

uint8_t lookup(unsigned idx) {
    // extract the 2 bits of our interest from the main array
    uint8_t v = (main_arr[idx>>2]>>(2*(idx&3)))&3;
    // usual (likely) case: value between 0 and 2
    if(v != 3) return v;
    // bad case: lookup the index<<8 in the secondary array
    // lower_bound finds the first >=, so we don't need to mask out the value
    auto ptr = std::lower_bound(sec_arr.begin(), sec_arr.end(), idx<<8);
#ifdef _DEBUG
    // some coherency checks
    if(ptr == sec_arr.end()) std::abort();
    if((*ptr >> 8) != idx) std::abort();
#endif
    // extract our 8-bit value from the 32 bit (index, value) thingie
    return (*ptr) & 0xff;
}

void populate(uint8_t *source, size_t size) {
    main_arr.clear(); sec_arr.clear();
    // size the main storage (round up)
    main_arr.resize((size+3)/4);
    for(size_t idx = 0; idx < size; ++idx) {
        uint8_t in = source[idx];
        uint8_t &target = main_arr[idx>>2];
        // if the input doesn't fit, cap to 3 and put in secondary storage
        if(in >= 3) {
            // top 24 bits: index; low 8 bit: value
            sec_arr.push_back((idx << 8) | in);
            in = 3;
        }
        // store in the target according to the position
        target |= in << ((idx & 3)*2);
    }
}

volatile unsigned out;

int main() {
    XorShift32 xs;
    std::vector<uint8_t> vec;
    int size = 10000000;
    for(int i = 0; i<size; ++i) {
        uint32_t v = xs();
        if(v < 1825361101)      v = 0; // 42.5%
        else if(v < 4080218931) v = 1; // 95.0%
        else if(v < 4252017623) v = 2; // 99.0%
        else {
            while((v & 0xff) < 3) v = xs();
        }
        vec.push_back(v);
    }
    populate(vec.data(), vec.size());
    mean_variance lk_t, arr_t;
    for(int i = 0; i<50; ++i) {
        {
            unsigned o = 0;
            auto beg = high_resolution_clock::now();
            for(int i = 0; i < size; ++i) {
                o += lookup(xs() % size);
            }
            out += o;
            int dur = (high_resolution_clock::now()-beg)/microseconds(1);
            fprintf(stderr, "lookup: %10d µs\n", dur);
            lk_t(dur);
        }
        {
            unsigned o = 0;
            auto beg = high_resolution_clock::now();
            for(int i = 0; i < size; ++i) {
                o += vec[xs() % size];
            }
            out += o;
            int dur = (high_resolution_clock::now()-beg)/microseconds(1);
            fprintf(stderr, "array:  %10d µs\n", dur);
            arr_t(dur);
        }
    }

    fprintf(stderr, " lookup |   ±  |  array  |   ±  | speedup\n");
    printf("%7.0f | %4.0f | %7.0f | %4.0f | %0.2f\n",
            lk_t.mean(), lk_t.stddev(),
            arr_t.mean(), arr_t.stddev(),
            arr_t.mean()/lk_t.mean());
    return 0;
}

(code and data always updated in my Bitbucket)

The code above populates a 10M element array with random data distributed as OP specified in their post, initializes my data structure and then:

performs a random lookup of 10M elements with my data structure
does the same through the original array.

(notice that in case of sequential lookup the array always wins by a huge measure, as it's the most cache-friendly lookup you can do)

These last two blocks are repeated 50 times and timed; at the end, the mean and standard deviation for each type of lookup are calculated and printed, along with the speedup (lookup_mean/array_mean).

I compiled the code above with g++ 5.4.0 (-O3 -static, plus some warnings) on Ubuntu 16.04, and ran it on some machines; most of them are running Ubuntu 16.04, some some older Linux, some some newer Linux. I don't think the OS should be relevant at all in this case.

            CPU           |  cache   |  lookup (µs)   |     array (µs)  | speedup (x)
Xeon E5-1650 v3 @ 3.50GHz | 15360 KB |  60011 ±  3667 |   29313 ±  2137 | 0.49
Xeon E5-2697 v3 @ 2.60GHz | 35840 KB |  66571 ±  7477 |   33197 ±  3619 | 0.50
Celeron G1610T  @ 2.30GHz |  2048 KB | 172090 ±   629 |  162328 ±   326 | 0.94
Core i3-3220T   @ 2.80GHz |  3072 KB | 111025 ±  5507 |  114415 ±  2528 | 1.03
Core i5-7200U   @ 2.50GHz |  3072 KB |  92447 ±  1494 |   95249 ±  1134 | 1.03
Xeon X3430      @ 2.40GHz |  8192 KB | 111303 ±   936 |  127647 ±  1503 | 1.15
Core i7 920     @ 2.67GHz |  8192 KB | 123161 ± 35113 |  156068 ± 45355 | 1.27
Xeon X5650      @ 2.67GHz | 12288 KB | 106015 ±  5364 |  140335 ±  6739 | 1.32
Core i7 870     @ 2.93GHz |  8192 KB |  77986 ±   429 |  106040 ±  1043 | 1.36
Core i7-6700    @ 3.40GHz |  8192 KB |  47854 ±   573 |   66893 ±  1367 | 1.40
Core i3-4150    @ 3.50GHz |  3072 KB |  76162 ±   983 |  113265 ±   239 | 1.49
Xeon X5650      @ 2.67GHz | 12288 KB | 101384 ±   796 |  152720 ±  2440 | 1.51
Core i7-3770T   @ 2.50GHz |  8192 KB |  69551 ±  1961 |  128929 ±  2631 | 1.85

The results are... mixed!

In general, on most of these machines there is some kind of speedup, or at least they are on a par.
The two cases where the array truly trumps the "smart structure" lookup are on a machines with lots of cache and not particularly busy: the Xeon E5-1650 above (15 MB cache) is a night build machine, at the moment quite idle; the Xeon E5-2697 (35 MB cache) is a machine for high performance calculations, in an idle moment as well. It does make sense, the original array fits completely in their huge cache, so the compact data structure only adds complexity.
At the opposite side of the "performance spectrum" - but where again the array is slightly faster, there's the humble Celeron that powers my NAS; it has so little cache that neither the array nor the "smart structure" fits in it at all. Other machines with cache small enough perform similarly.
The Xeon X5650 must be taken with some caution - they are virtual machines on a quite busy dual-socket virtual machine server; it may well be that, although nominally it has a decent amount of cache, during the time of the test it gets preempted by completely unrelated virtual machines several times.

Solution 2:

Another option could be

check if the result is 0, 1 or 2
if not do a regular lookup

In other words something like:

unsigned char lookup(int index) {
    int code = (bmap[index>>2]>>(2*(index&3)))&3;
    if (code != 3) return code;
    return full_array[index];
}

where bmap uses 2 bits per element with the value 3 meaning "other".

This structure is trivial to update, uses 25% more memory but the big part is looked up only in 5% of the cases. Of course, as usual, if it's a good idea or not depends on a lot of other conditions so the only answer is experimenting with real usage.

Solution 3:

This is more of a "long comment" than a concrete answer

Unless your data is something that is something well-known, I doubt anyone can DIRECTLY answer your question (and I'm not aware of anything that matches your description, but then I don't know EVERYTHING about all kinds of data patterns for all kinds of use-cases). Sparse data is a common problem in high performance computing, but it's typically "we have a very large array, but only some values are non-zero".

For not well known patterns like what I think yours is, nobody will KNOW directly which is better, and it depends on the details: how random is the random access - is the system accessing clusters of data items, or is it completely random like from a uniform random number generator. Is the table data completely random, or are there sequences of 0 then sequences of 1, with a scattering of other values? Run length encoding would work well if you have reasonably long sequences of 0 and 1, but won't work if you have "checkerboard of 0/1". Also, you'd have to keep a table of "starting points", so you can work your way to the relevant place reasonably quickly.

I know from a long time back that some big databases are just a large table in RAM (telephone exchange subscriber data in this example), and one of the problems there is that caches and page-table optimisations in the processor is pretty useless. The caller is so rarely the same as one recently calling someone, that there is no pre-loaded data of any kind, it's just purely random. Big page-tables is the best optimisation for that type of access.

In a lot of cases, compromising between "speed and small size" is one of those things you have to pick between in software engineering [in other engineering it's not necessarily so much of a compromise]. So, "wasting memory for simpler code" is quite often the preferred choice. In this sense, the "simple" solution is quite likely better for speed, but if you have "better" use for the RAM, then optimising for size of the table would give you sufficient performance and a good improvement on size. There are lots of different ways you could achieve this - as suggested in a comment, a 2 bit field where the two or three most common values are stored, and then some alternative data format for the other values - a hash-table would be my first approach, but a list or binary tree may work too - again, it depends on the patterns of where your "not 0, 1 or 2" are. Again, it depends on how the values are "scattered" in the table - are they in clusters or are they more of an evenly distributed pattern?

But a problem with that is that you are still reading the data from RAM. You are then spending more code processing the data, including some code to cope with the "this is not a common value".

The problem with most common compression algorithms is that they are based on unpacking sequences, so you can't random access them. And the overhead of splitting your big data into chunks of, say, 256 entries at a time, and uncompressing the 256 into a uint8_t array, fetching the data you want, and then throwing away your uncompressed data, is highly unlikely to give you good performance - assuming that's of some importance, of course.

In the end, you will probably have to implement one or a few of the ideas in comments/answers to test out, see if it helps solving your problem, or if memory bus is still the main limiting factor.

Solution 4:

What I've done in the past is use a hashmap in front of a bitset.

This halves the space compared to Matteo's answer, but may be slower if "exception" lookups are slow (i.e. there are many exceptions).

Often, however, "cache is king".

Solution 5:

Unless there is pattern to your data it is unlikely that there is any sensible speed or size optimisation, and - assuming you are targetting a normal computer - 10 MB isn't that big a deal anyway.

There are two assumptions in your questions:

The data is being poorly stored because you aren't using all the bits
Storing it better would make things faster.

I think both of these assumptions are false. In most cases the appropriate way to store data is to store the most natural representation. In your case, this is the one you've gone for: a byte for a number between 0 and 255. Any other representation will be more complex and therefore - all other things being equal - slower and more error prone. To need to divert from this general principle you need a stronger reason than potentially six "wasted" bits on 95% of your data.

For your second assumption, it will be true if, and only if, changing the size of the array results in substantially fewer cache misses. Whether this will happen can only be definitively determined by profiling working code, but I think it's highly unlikely to make a substantial difference. Because you will be randomly accessing the array in either case, the processor will struggle to know which bits of data to cache and keep in either case.