How to convert utf-8 byte offsets to utf-8 character offsets

I need to post-process the output of a legacy tool that reports utf-8 byte offsets instead of utf-8 character offsets. For example, it'll report [0, 1, 3, 4, 6] instead of [0, 1, 2, 3, 4] for the 5 characters in the seven-byte utf-8 string 'aβgδe', because the Greek letters 'β' and 'δ' are encoded as two-byte sequences. (The actual text may also contain 3-byte and 4-byte utf-8 sequences.)

Are there any built-in Python functions that I could use to convert utf-8 byte offsets to utf-8 character offsets?


Solution 1:

I don't think there is a built-in or std-lib utility for this, but you can write your own small function to create a byte-offset-to-codepoint-offset mapping.

Naive approach

import typing as t

def map_byte_to_codepoint_offset(text: str) -> t.Dict[int, int]:
    mapping = {}
    byte_offset = 0
    for codepoint_offset, character in enumerate(text):
        mapping[byte_offset] = codepoint_offset
        byte_offset += len(character.encode('utf8'))
    return mapping

Let's test this with your example:

>>> text = 'aβgδe'
>>> byte_offsets = [0, 1, 3, 4, 6]
>>> mapping = map_byte_to_codepoint_offset(text)
>>> mapping
{0: 0, 1: 1, 3: 2, 4: 3, 6: 4}
>>> [mapping[o] for o in byte_offsets]
[0, 1, 2, 3, 4]

Optimisation

I haven't benchmarked this, but it's probably not very efficient to call .encode() on every character separately. Furthermore, we're only interested in the byte length of the encoded character, which can only take one of four values, each corresponding to a contiguous range of codepoints. To find these ranges, you can study the UTF-8 encoding spec, look them up on the internet, or run a quick calculation in the Python REPL:

>>> import sys
>>> bins = {i: [] for i in (1, 2, 3, 4)}
>>> for codepoint in range(sys.maxunicode+1):
...     # 'surrogatepass' required to allow encoding surrogates in UTF-8
...     length = len(chr(codepoint).encode('utf8', errors='surrogatepass'))
...     bins[length].append(codepoint)
...
>>> for l, cps in bins.items():
...     print(f'{l}: {hex(min(cps))}..{hex(max(cps))}')
...
1: 0x0..0x7f
2: 0x80..0x7ff
3: 0x800..0xffff
4: 0x10000..0x10ffff

Furthermore, the mapping returned by the naive approach contains gaps: if we look up an offset that falls in the middle of a multi-byte character, we get a KeyError (e.g. there is no key 2 in the above example). To avoid this, we can fill the gaps by repeating the codepoint offsets. Since the resulting keys would be the successive integers starting from 0, we can use a list instead of a dict for the mapping.
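To see the gap concretely, here is a quick standalone check (repeating the naive function from above so the snippet runs on its own):

```python
def map_byte_to_codepoint_offset(text):
    # Naive dict-based mapping from the first snippet.
    mapping = {}
    byte_offset = 0
    for codepoint_offset, character in enumerate(text):
        mapping[byte_offset] = codepoint_offset
        byte_offset += len(character.encode('utf8'))
    return mapping

mapping = map_byte_to_codepoint_offset('aβgδe')
# Byte offset 2 falls inside 'β' (which occupies bytes 1-2),
# so it is missing from the mapping and would raise a KeyError.
print(2 in mapping)   # False
```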

TWOBYTES = 0x80
THREEBYTES = 0x800
FOURBYTES = 0x10000

def map_byte_to_codepoint_offset(text: str) -> t.List[int]:
    mapping = []
    for codepoint_offset, character in enumerate(text):
        mapping.append(codepoint_offset)
        codepoint = ord(character)
        for cue in (TWOBYTES, THREEBYTES, FOURBYTES):
            if codepoint >= cue:
                mapping.append(codepoint_offset)
            else:
                break
    return mapping

With the example from above:

>>> mapping = map_byte_to_codepoint_offset(text)
>>> mapping
[0, 1, 1, 2, 3, 3, 4]
>>> [mapping[o] for o in byte_offsets]
[0, 1, 2, 3, 4]
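Since the worked example only contains 2-byte characters, here is an extra sanity check (not part of the original example) on a string that also includes a 3-byte character ('€') and a 4-byte character ('𝄞'), using the optimised function from above:

```python
TWOBYTES = 0x80
THREEBYTES = 0x800
FOURBYTES = 0x10000

def map_byte_to_codepoint_offset(text):
    # Optimised list-based mapping from the answer above.
    mapping = []
    for codepoint_offset, character in enumerate(text):
        mapping.append(codepoint_offset)
        codepoint = ord(character)
        for cue in (TWOBYTES, THREEBYTES, FOURBYTES):
            if codepoint >= cue:
                mapping.append(codepoint_offset)
            else:
                break
    return mapping

text = 'a€𝄞β'   # 1-, 3-, 4- and 2-byte utf-8 sequences
mapping = map_byte_to_codepoint_offset(text)
print(mapping)   # [0, 1, 1, 1, 2, 2, 2, 2, 3, 3]
# One entry per encoded byte:
print(len(mapping) == len(text.encode('utf8')))   # True
```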