Why are empty strings returned in split() results?
What is the point of '/segment/segment/'.split('/') returning ['', 'segment', 'segment', '']?
Notice the empty elements. If you're splitting on a delimiter that happens to be at position one and at the very end of a string, what extra value does it give you to have the empty string returned from each end?
str.split complements str.join, so "/".join(['', 'segment', 'segment', '']) gets you back the original string.
If the empty strings were not there, the first and last '/' would be missing after the join().
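As a quick check of that round trip, here is a minimal sketch; the variable names are only illustrative.

```python
# Round trip: joining the pieces with the same delimiter restores the input.
original = '/segment/segment/'
parts = original.split('/')
restored = '/'.join(parts)

# Dropping the empty strings first loses the leading and trailing '/'.
lossy = '/'.join(p for p in parts if p)

print(parts)                 # ['', 'segment', 'segment', '']
print(restored == original)  # True
print(lossy)                 # segment/segment
```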
More generally, to remove empty strings returned in split() results, you may want to look at the filter function.
Example:
    f = filter(None, '/segment/segment/'.split('/'))
    s_all = list(f)
after which s_all is
['segment', 'segment']
There are two main points to consider here:
- Expecting the result of '/segment/segment/'.split('/') to be equal to ['segment', 'segment'] is reasonable, but then this loses information. If split() worked the way you wanted and I told you that a.split('/') == ['segment', 'segment'], you couldn't tell me what a was.
- What should the result of 'a//b'.split('/') be: ['a', 'b'] or ['a', '', 'b']? I.e., should split() merge adjacent delimiters? If it did, it would be very hard to parse data that is delimited by a character and in which some of the fields can be empty. I am fairly sure there are many people who do want the empty values in the result for the above case!
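Both points can be seen in a short sketch. The helper name drop_empty and the colon-delimited record are made up for illustration; they are not from any particular file format.

```python
def drop_empty(s, delim):
    # Hypothetical split() variant that discards empty strings.
    return [x for x in s.split(delim) if x]

# Point 1: two different inputs become indistinguishable.
a1 = '/segment/segment/'
a2 = 'segment/segment'
print(drop_empty(a1, '/') == drop_empty(a2, '/'))  # True
print(a1.split('/') == a2.split('/'))              # False

# Point 2: an empty middle field keeps its position with plain split().
record = 'alice::1000'
fields = record.split(':')
print(fields)                   # ['alice', '', '1000']
print(drop_empty(record, ':'))  # ['alice', '1000']
```

With the empty string kept, the third field is always at index 2; once it is dropped, you can no longer tell which field was missing.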
In the end, it boils down to two things:
- Consistency: if I have n delimiters in a, I get n+1 values back after the split().
- It should be possible to do complex things, and easy to do simple things: if you want to ignore empty strings as a result of the split(), you can always do:
    def mysplit(s, delim=None):
        return [x for x in s.split(delim) if x]
but if one doesn't want to ignore the empty values, one should be able to.
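To make the n+1 rule concrete, here is a small sketch that checks it on a few sample strings (chosen arbitrarily) and shows what mysplit() returns for the same inputs. It assumes an explicit single-character delimiter.

```python
def mysplit(s, delim=None):
    # Same helper as above: drop empty strings from the split result.
    return [x for x in s.split(delim) if x]

for s in ['/segment/segment/', 'a//b', 'abc', '']:
    parts = s.split('/')
    # n delimiters always give n + 1 pieces, even for the empty string.
    assert len(parts) == s.count('/') + 1
    print(repr(s), parts, mysplit(s, '/'))
```

Note that ''.split('/') gives [''], one piece for zero delimiters, while mysplit('', '/') gives an empty list.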
The language has to pick one definition of split(); there are too many different use cases to satisfy everyone's requirements as a default. I think that Python's choice is a good one, and is the most logical. (As an aside, one of the reasons I don't like C's strtok() is that it merges adjacent delimiters, making it extremely hard to do serious parsing/tokenization with it.)
There is one exception: a.split() without an argument squeezes consecutive whitespace (and ignores leading and trailing whitespace), but one can argue that this is the right thing to do in that case. If you don't want that behavior, you can always do a.split(' ').
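The difference between the two calls is easiest to see side by side; the sample string here is arbitrary.

```python
a = '  two  words '

# No argument: runs of whitespace are merged and the ends are ignored.
no_arg = a.split()
# Explicit ' ': every single space is a delimiter, so empty strings appear.
with_space = a.split(' ')

print(no_arg)      # ['two', 'words']
print(with_space)  # ['', '', 'two', '', 'words', '']
```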