Iterate over individual bytes in Python 3
When iterating over a bytes object in Python 3, one gets the individual bytes as ints:
>>> [b for b in b'123']
[49, 50, 51]
How can I get 1-length bytes objects instead?
The following is possible, but it is not very obvious to the reader and most likely performs badly:
>>> [bytes([b]) for b in b'123']
[b'1', b'2', b'3']
Solution 1:
If you are concerned about the performance of this code, and an int as a byte is not a suitable interface in your case, then you should probably reconsider the data structures that you use, e.g., use str objects instead.
You could slice the bytes object to get 1-length bytes objects:
L = [bytes_obj[i:i+1] for i in range(len(bytes_obj))]
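The slicing approach works because a slice of a bytes object is itself a bytes object, while an integer index returns an int. A minimal sketch of the difference, where the helper name split_bytes is only an illustrative assumption:

```python
def split_bytes(bytes_obj):
    """Return a list of 1-length bytes objects (hypothetical helper name)."""
    # bytes_obj[i] would be an int; bytes_obj[i:i + 1] is a bytes object.
    return [bytes_obj[i:i + 1] for i in range(len(bytes_obj))]


data = b'123'
print(data[0])            # an int: 49
print(data[0:1])          # a 1-length bytes object: b'1'
print(split_bytes(data))  # [b'1', b'2', b'3']
```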
There is also PEP 467 -- Minor API improvements for binary sequences, which proposes a bytes.iterbytes() method. If adopted, it would allow:
>>> list(b'123'.iterbytes())
[b'1', b'2', b'3']
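Since bytes.iterbytes() is only a proposal, a small generator can stand in for it. The sketch below is an assumption about how such a helper might look, not the PEP's implementation; the name iter_bytes is chosen to match the helper imported in the timings further down, whose definition is not shown in this thread.

```python
def iter_bytes(bytes_obj):
    """Yield each byte of bytes_obj as a 1-length bytes object.

    Hypothetical stand-in for the proposed bytes.iterbytes() method.
    """
    for i in range(len(bytes_obj)):
        # Slicing keeps the bytes type; indexing would yield an int.
        yield bytes_obj[i:i + 1]


print(list(iter_bytes(b'123')))  # [b'1', b'2', b'3']
```

Because it is lazy, this avoids building the whole list when only a few leading bytes are needed.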
Solution 2:
int.to_bytes
int objects have a to_bytes method which can be used to convert an int to its corresponding byte:
>>> import sys
>>> [i.to_bytes(1, sys.byteorder) for i in b'123']
[b'1', b'2', b'3']
As with some of the other answers, it's not clear that this is more readable than the OP's original solution: I think the length and byteorder arguments make it noisier.
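Part of the noise is that byteorder has no effect when the length is 1, so either value gives the same result. A short sketch to illustrate this (to my knowledge, Python 3.11 and later give both arguments default values, but the example passes them explicitly so that it also runs on older versions):

```python
values = b'123'

# With a length of 1 there is only one byte, so its order cannot matter.
big = [i.to_bytes(1, 'big') for i in values]
little = [i.to_bytes(1, 'little') for i in values]

print(big)            # [b'1', b'2', b'3']
print(big == little)  # True
```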
struct.unpack
Another approach would be to use struct.unpack, though this might also be considered difficult to read, unless you are familiar with the struct module:
>>> import struct
>>> struct.unpack('3c', b'123')
(b'1', b'2', b'3')
(As jfs observes in the comments, the format string for struct.unpack can be constructed dynamically; in this case we know the number of individual bytes in the result must equal the number of bytes in the original bytestring, so struct.unpack(str(len(bytestring)) + 'c', bytestring) is possible.)
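Wrapped in a function, the dynamically built format string might look like the sketch below. The function name split_with_struct is an assumption for illustration, and the tuple is converted to a list only to match the other approaches.

```python
import struct


def split_with_struct(bytestring):
    """Split bytestring into 1-length bytes objects (hypothetical helper)."""
    # Build a format such as '3c': a repeat count followed by 'c' (one char).
    fmt = str(len(bytestring)) + 'c'
    # struct.unpack returns a tuple; convert it to a list.
    return list(struct.unpack(fmt, bytestring))


print(split_with_struct(b'123'))  # [b'1', b'2', b'3']
```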
Performance
>>> import random, timeit
>>> bs = bytes(random.randint(0, 255) for i in range(100))
>>> # OP's solution
>>> timeit.timeit(setup="from __main__ import bs",
...               stmt="[bytes([b]) for b in bs]")
46.49886950897053
>>> # Accepted answer from jfs
>>> timeit.timeit(setup="from __main__ import bs",
...               stmt="[bs[i:i+1] for i in range(len(bs))]")
20.91463226894848
>>> # Leon's answer
>>> timeit.timeit(setup="from __main__ import bs",
...               stmt="list(map(bytes, zip(bs)))")
27.476876026019454
>>> # guettli's answer
>>> timeit.timeit(setup="from __main__ import iter_bytes, bs",
...               stmt="list(iter_bytes(bs))")
24.107485140906647
>>> # user38's answer (with Leon's suggested fix)
>>> timeit.timeit(setup="from __main__ import bs",
...               stmt="[chr(i).encode('latin-1') for i in bs]")
45.937552741961554
>>> # Using int.to_bytes
>>> timeit.timeit(setup="from __main__ import bs;from sys import byteorder",
...               stmt="[x.to_bytes(1, byteorder) for x in bs]")
32.197659170022234
>>> # Using struct.unpack, converting the resulting tuple to list
>>> # to be fair to other methods
>>> timeit.timeit(setup="from __main__ import bs;from struct import unpack",
...               stmt="list(unpack('100c', bs))")
1.902243083808571
struct.unpack seems to be at least an order of magnitude faster than other methods, presumably because it operates at the byte level. int.to_bytes, on the other hand, performs worse than most of the "obvious" approaches.
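To rerun a comparison like this as a script, something along the following lines could be used. It is only a sketch: it covers three of the approaches, checks that they agree before timing them, and uses a much smaller repeat count than timeit's default (an assumption made to keep the run short). Absolute numbers will differ by machine and Python version, and the lambda wrappers add a small call overhead to every candidate.

```python
import random
import struct
import timeit

bs = bytes(random.randint(0, 255) for _ in range(100))

# Each candidate returns a list of 1-length bytes objects.
candidates = {
    "bytes([b])": lambda: [bytes([b]) for b in bs],
    "slicing": lambda: [bs[i:i + 1] for i in range(len(bs))],
    "struct.unpack": lambda: list(struct.unpack(str(len(bs)) + 'c', bs)),
}

expected = candidates["slicing"]()
for name, func in candidates.items():
    # Verify correctness first, so only equivalent code is compared.
    assert func() == expected, name
    seconds = timeit.timeit(func, number=10_000)
    print(f"{name:15s} {seconds:.4f} s for 10,000 runs")
```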
Solution 3:
I thought it might be useful to compare the runtimes of the different approaches, so I made a benchmark (using my library simple_benchmark).
Probably unsurprisingly, the NumPy solution is by far the fastest for large bytes objects.
But if a list is desired as the result, then both the NumPy solution (with tolist()) and the struct solution are much faster than the other alternatives.
I didn't include guettli's answer because it's almost identical to jfs's solution; it just uses a generator function instead of a comprehension.
import numpy as np
import struct
import sys
from simple_benchmark import BenchmarkBuilder

b = BenchmarkBuilder()

@b.add_function()
def jfs(bytes_obj):
    return [bytes_obj[i:i+1] for i in range(len(bytes_obj))]

@b.add_function()
def snakecharmerb_tobytes(bytes_obj):
    return [i.to_bytes(1, sys.byteorder) for i in bytes_obj]

@b.add_function()
def snakecharmerb_struct(bytes_obj):
    return struct.unpack(str(len(bytes_obj)) + 'c', bytes_obj)

@b.add_function()
def Leon(bytes_obj):
    return list(map(bytes, zip(bytes_obj)))

@b.add_function()
def rusu_ro1_format(bytes_obj):
    return [b'%c' % i for i in bytes_obj]

@b.add_function()
def rusu_ro1_numpy(bytes_obj):
    return np.frombuffer(bytes_obj, dtype='S1')

@b.add_function()
def rusu_ro1_numpy_tolist(bytes_obj):
    return np.frombuffer(bytes_obj, dtype='S1').tolist()

@b.add_function()
def User38(bytes_obj):
    return [chr(i).encode() for i in bytes_obj]

@b.add_arguments('byte object length')
def argument_provider():
    for exp in range(2, 18):
        size = 2**exp
        yield size, b'a' * size

r = b.run()
r.plot()
Solution 4:
Since Python 3.5, you can use % formatting with bytes and bytearray:
[b'%c' % i for i in b'123']
output:
[b'1', b'2', b'3']
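As a quick sanity check, the % formatting approach can be run over every possible byte value. The sketch below assumes Python 3.5 or later and confirms that each result is exactly one byte long and that joining the pieces gives back the original data:

```python
all_bytes = bytes(range(256))

# b'%c' accepts an int in range(256) and produces a single byte.
pieces = [b'%c' % i for i in all_bytes]

print(all(len(p) == 1 for p in pieces))  # True
print(b''.join(pieces) == all_bytes)     # True
```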
The % formatting solution is 2-3 times faster than your initial approach. If you want an even faster solution, I suggest using numpy.frombuffer:
import numpy as np
np.frombuffer(b'123', dtype='S1')
output:
array([b'1', b'2', b'3'],
dtype='|S1')
The second solution is ~10% faster than struct.unpack (I used the same performance test as @snakecharmerb, against 100 random bytes).
Solution 5:
A trio of map(), bytes() and zip() does the trick:
>>> list(map(bytes, zip(b'123')))
[b'1', b'2', b'3']
However, I don't think that it is any more readable than [bytes([b]) for b in b'123'] or that it performs better.
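To see why this works, it may help to look at the intermediate values: zip() over a single iterable wraps each int in a 1-tuple, and bytes() turns a tuple of ints into a bytes object of the same length. A short sketch:

```python
data = b'123'

# zip with one argument yields 1-tuples of the individual ints.
print(list(zip(data)))              # [(49,), (50,), (51,)]

# bytes() of an iterable of ints builds a bytes object from them.
print(bytes((49,)))                 # b'1'

# Combining both steps with map gives the 1-length bytes objects.
print(list(map(bytes, zip(data))))  # [b'1', b'2', b'3']
```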