how do I clear a stringio object?

Solution 1:

TL;DR

Don't bother clearing it, just create a new one—it’s faster.

The method

Python 2

Here's how I would find such things out:

>>> from StringIO import StringIO
>>> dir(StringIO)
['__doc__', '__init__', '__iter__', '__module__', 'close', 'flush', 'getvalue', 'isatty', 'next', 'read', 'readline', 'readlines', 'seek', 'tell', 'truncate', 'write', 'writelines']
>>> help(StringIO.truncate)
Help on method truncate in module StringIO:

truncate(self, size=None) unbound StringIO.StringIO method
    Truncate the file's size.

    If the optional size argument is present, the file is truncated to
    (at most) that size. The size defaults to the current position.
    The current file position is not changed unless the position
    is beyond the new file size.

    If the specified size exceeds the file's current size, the
    file remains unchanged.

So, you want .truncate(0). But it's probably cheaper (and easier) to initialise a new StringIO. See below for benchmarks.

Python 3

(Thanks to tstone2077 for pointing out the difference.)

>>> from io import StringIO
>>> dir(StringIO)
['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__enter__', '__eq__', '__exit__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__iter__', '__le__', '__lt__', '__ne__', '__new__', '__next__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '_checkClosed', '_checkReadable', '_checkSeekable', '_checkWritable', 'close', 'closed', 'detach', 'encoding', 'errors', 'fileno', 'flush', 'getvalue', 'isatty', 'line_buffering', 'newlines', 'read', 'readable', 'readline', 'readlines', 'seek', 'seekable', 'tell', 'truncate', 'writable', 'write', 'writelines']
>>> help(StringIO.truncate)
Help on method_descriptor:

truncate(...)
    Truncate size to pos.

    The pos argument defaults to the current file position, as
    returned by tell().  The current file position is unchanged.
    Returns the new absolute position.

It is important to note with this that now the current file position is unchanged, whereas truncating to size zero would reset the position in the Python 2 variant.

Thus, for Python 2, you only need

>>> from cStringIO import StringIO
>>> s = StringIO()
>>> s.write('foo')
>>> s.getvalue()
'foo'
>>> s.truncate(0)
>>> s.getvalue()
''
>>> s.write('bar')
>>> s.getvalue()
'bar'

If you do this in Python 3, you won't get the result you expected:

>>> from io import StringIO
>>> s = StringIO()
>>> s.write('foo')
3
>>> s.getvalue()
'foo'
>>> s.truncate(0)
0
>>> s.getvalue()
''
>>> s.write('bar')
3
>>> s.getvalue()
'\x00\x00\x00bar'

So in Python 3 you also need to reset the position:

>>> from cStringIO import StringIO
>>> s = StringIO()
>>> s.write('foo')
3
>>> s.getvalue()
'foo'
>>> s.truncate(0)
0
>>> s.seek(0)
0
>>> s.getvalue()
''
>>> s.write('bar')
3
>>> s.getvalue()
'bar'

If using the truncate method in Python 2 code, it's safer to call seek(0) at the same time (before or after, it doesn't matter) so that the code won't break when you inevitably port it to Python 3. And there's another reason why you should just create a new StringIO object!

Times

Python 2

>>> from timeit import timeit
>>> def truncate(sio):
...     sio.truncate(0)
...     return sio
... 
>>> def new(sio):
...     return StringIO()
... 

When empty, with StringIO:

>>> from StringIO import StringIO
>>> timeit(lambda: truncate(StringIO()))
3.5194039344787598
>>> timeit(lambda: new(StringIO()))
3.6533868312835693

With 3KB of data in, with StringIO:

>>> timeit(lambda: truncate(StringIO('abc' * 1000)))
4.3437709808349609
>>> timeit(lambda: new(StringIO('abc' * 1000)))
4.7179079055786133

And the same with cStringIO:

>>> from cStringIO import StringIO
>>> timeit(lambda: truncate(StringIO()))
0.55461597442626953
>>> timeit(lambda: new(StringIO()))
0.51241087913513184
>>> timeit(lambda: truncate(StringIO('abc' * 1000)))
1.0958449840545654
>>> timeit(lambda: new(StringIO('abc' * 1000)))
0.98760509490966797

So, ignoring potential memory concerns (del oldstringio), it's faster to truncate a StringIO.StringIO (3% faster for empty, 8% faster for 3KB of data), but it's faster ("fasterer" too) to create a new cStringIO.StringIO (8% faster for empty, 10% faster for 3KB of data). So I'd recommend just using the easiest one—so presuming you're working with CPython, use cStringIO and create new ones.

Python 3

The same code, just with seek(0) put in.

>>> def truncate(sio):
...     sio.truncate(0)
...     sio.seek(0)
...     return sio
... 
>>> def new(sio):
...     return StringIO()
...

When empty:

>>> from io import StringIO
>>> timeit(lambda: truncate(StringIO()))
0.9706327870007954
>>> timeit(lambda: new(StringIO()))
0.8734330690022034

With 3KB of data in:

>>> timeit(lambda: truncate(StringIO('abc' * 1000)))
3.5271066290006274
>>> timeit(lambda: new(StringIO('abc' * 1000)))
3.3496507499985455

So for Python 3 creating a new one instead of reusing a blank one is 11% faster and creating a new one instead of reusing a 3K one is 5% faster. Again, create a new StringIO rather than truncating and seeking.

Solution 2:

There is something important to note (at least with Python 3.2):

seek(0) IS needed before truncate(0). Here is some code without the seek(0):

from io import StringIO
s = StringIO()
s.write('1'*3)
print(repr(s.getvalue()))
s.truncate(0)
print(repr(s.getvalue()))
s.write('1'*3)
print(repr(s.getvalue()))

Which outputs:

'111'
''
'\x00\x00\x00111'

with seek(0) before the truncate, we get the expected output:

'111'
''
'111'

Solution 3:

How I managed to optimise my processing (read in chunks, process each chunk, write processed stream out to file) of many files in a sequence is that I reuse the same cStringIO.StringIO instance, but always reset() it after using, then write to it, and then truncate(). By doing this, I'm only truncating the part at the end that I don't need for the current file. This seems to have given me a ~3% performance increase. Anybody who's more expert on this could confirm if this indeed optimises memory allocation.

sio = cStringIO.StringIO()
for file in files:
    read_file_chunks_and_write_to_sio(file, sio)
    sio.truncate()
    with open('out.bla', 'w') as f:
        f.write(sio.getvalue())
    sio.reset()