Cannot create index in mongodb, "key too large to index"

I am creating an index in MongoDB on a collection with 10 million records, but I get the following error:

db.logcollection.ensureIndex({"Module":1})
{
        "createdCollectionAutomatically" : false,
        "numIndexesBefore" : 3,
        "ok" : 0,
        "errmsg" : "Btree::insert: key too large to index, failing play.logcollection.$Module_1 1100 { : \"RezGainUISystem.Net.WebException: The request was aborted: The request was canceled.\r\n   at System.Net.ConnectStream.InternalWrite(Boolean async, Byte...\" }",
        "code" : 17282
}

Please help me understand how to create this index in MongoDB.


Solution 1:

MongoDB will not create an index on a collection if the index entry for an existing document exceeds the index key limit (1024 bytes). You can, however, create a hashed index or a text index instead:

db.logcollection.createIndex({"Module":"hashed"})

or

db.logcollection.createIndex({"Module":"text"})
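Before choosing a workaround, it can help to know which documents would trip the limit in the first place. A quick check over the field's UTF-8 byte length does the job; this is a sketch using plain dicts so it runs standalone, but with pymongo you would iterate a cursor from `db.logcollection.find({}, {"Module": 1})` the same way:

```python
# The index key limit applies to the BSON-encoded key; comparing raw
# UTF-8 byte lengths against 1024 is a close, slightly conservative check.
INDEX_KEY_LIMIT = 1024

def oversized(docs, field, limit=INDEX_KEY_LIMIT):
    """Yield documents whose `field` is too long to index."""
    for doc in docs:
        value = doc.get(field, "")
        if isinstance(value, str) and len(value.encode("utf-8")) > limit:
            yield doc

docs = [
    {"_id": 1, "Module": "short message"},
    {"_id": 2, "Module": "x" * 2000},  # would trigger "key too large"
]
print([d["_id"] for d in oversized(docs, "Module")])  # [2]
```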

Solution 2:

You can silence this behaviour by launching the mongod instance with the following command:

mongod --setParameter failIndexKeyTooLong=false

or by executing the following command from the mongo shell:

db.getSiblingDB('admin').runCommand( { setParameter: 1, failIndexKeyTooLong: false } )

If you are sure that your field will exceed the limit only very rarely, another way to solve this issue is to split the offending field into parts of byte length < 1 KB, e.g. for a field val you would split it into the fields val_1, val_2, and so on. MongoDB stores text as valid UTF-8, so you need a function that splits UTF-8 strings only at character boundaries.

def split_utf8(s, n):
    """
    Split s into chunks of at most n bytes of valid UTF-8.

    (s[k] & 0xc0) == 0x80 checks whether byte k is a continuation byte
    (an actual part of a multi-byte character) rather than a header byte
    indicating how many bytes there are in a multi-byte sequence.

    An interesting aside, by the way: you can classify bytes in a UTF-8
    stream as follows:

    - with the high bit set to 0, it's a single-byte character;
    - with the two high bits set to 10, it's a continuation byte;
    - otherwise, it's the first byte of a multi-byte sequence, and the
      number of leading 1 bits indicates how many bytes there are in
      total for this sequence (110... means two bytes, 1110... means
      three bytes, etc.).
    """
    s = s.encode('utf-8')
    while len(s) > n:
        k = n
        # Indexing a bytes object yields ints in Python 3; in Python 2,
        # wrap s[k] in ord().
        while (s[k] & 0xc0) == 0x80:  # step back to a character boundary
            k -= 1
        yield s[:k]
        s = s[k:]
    yield s
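To see the splitting in action, here is a small standalone run (the function is repeated from above so the snippet executes on its own); the field and chunk-size names are illustrative:

```python
def split_utf8(s, n):
    """Split s into chunks of at most n bytes of valid UTF-8 (as above)."""
    s = s.encode("utf-8")
    while len(s) > n:
        k = n
        while (s[k] & 0xc0) == 0x80:  # step back to a character boundary
            k -= 1
        yield s[:k]
        s = s[k:]
    yield s

# Each 'é' is 2 bytes in UTF-8, so this string is 1200 bytes encoded.
text = "é" * 600
chunks = list(split_utf8(text, 1000))

assert all(len(c) <= 1000 for c in chunks)       # every chunk fits
assert b"".join(chunks).decode("utf-8") == text  # nothing was corrupted

# Build the split fields to store on the document: val_1, val_2, ...
doc = {"val_%d" % (i + 1): c.decode("utf-8") for i, c in enumerate(chunks)}
print(sorted(doc))  # ['val_1', 'val_2']
```

Because the split never lands inside a multi-byte character, each val_i decodes cleanly and concatenating them reproduces the original value.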

Then you can define your compound index:

db.coll.createIndex({val_1: 1, val_2: 1, ...}, {background: true})

or a separate index for each val_i:

db.coll.createIndex({val_1: 1}, {background: true})
db.coll.createIndex({val_2: 1}, {background: true})
...
db.coll.createIndex({val_i: 1}, {background: true})

Important: if you plan to use your field in a compound index, be careful with the second argument to the split_utf8 function. For each document you need to subtract the byte sizes of the other field values that make up the index key, e.g. for an index (a: 1, b: 1, val: 1) the limit for val becomes 1024 - sizeof(value(a)) - sizeof(value(b)).
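The budget calculation above can be sketched as follows. Note the real limit applies to the BSON-encoded index key, so treating field values as plain UTF-8 byte lengths is only an approximation; leave yourself some headroom. The field names a, b, val follow the example above:

```python
# Remaining byte budget for `val` in a compound index (a: 1, b: 1, val: 1).
INDEX_KEY_LIMIT = 1024

def val_budget(doc, other_fields, limit=INDEX_KEY_LIMIT):
    """Approximate how many bytes of `val` still fit in the index key."""
    used = sum(len(str(doc.get(f, "")).encode("utf-8")) for f in other_fields)
    return limit - used

doc = {"a": "payments", "b": "eu-west-1", "val": "some very long text ..."}
budget = val_budget(doc, ["a", "b"])
print(budget)  # 1024 - 8 - 9 = 1007
```

You would then pass this per-document budget as the second argument (n) of split_utf8.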

In all other cases, use hashed or text indexes instead.