Splitting a string into words and punctuation

This is more or less the way to do it:

>>> import re
>>> re.findall(r"[\w']+|[.,!?;]", "Hello, I'm a string!")
['Hello', ',', "I'm", 'a', 'string', '!']

The trick is not to think about where to split the string, but about what to include in the tokens.
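
To make that contrast concrete, here is a small sketch that runs a whitespace split and the token pattern on the same sentence. The variable names are only for illustration.

```python
import re

text = "Hello, I'm a string!"

# Splitting on whitespace leaves punctuation glued to the words.
split_tokens = re.split(r"\s+", text)
print(split_tokens)   # ['Hello,', "I'm", 'a', 'string!']

# Describing the tokens instead yields words and punctuation separately.
found_tokens = re.findall(r"[\w']+|[.,!?;]", text)
print(found_tokens)   # ['Hello', ',', "I'm", 'a', 'string', '!']
```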

Caveats:

The underscore (_) counts as a word character. Replace \w if you don't want that.
Because the apostrophe is part of the word pattern, single quotes used as quotation marks stay attached to the neighboring word.
Put any additional punctuation marks you want to use in the right half of the regular expression.
Anything not explicitly mentioned in the regular expression is silently dropped.
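
A quick way to see these caveats is to run the same pattern on a few awkward inputs. This is just a sketch; the name TOKEN_RE and the sample strings are made up for the demonstration.

```python
import re

# Same pattern as above: word characters and apostrophes, or one listed punctuation mark.
TOKEN_RE = re.compile(r"[\w']+|[.,!?;]")

# The underscore counts as a word character, so snake_case stays one token.
print(TOKEN_RE.findall("snake_case works"))   # ['snake_case', 'works']

# Single quotes used as quotation marks stick to the neighboring word.
print(TOKEN_RE.findall("He said 'hi'."))      # ['He', 'said', "'hi'", '.']

# Characters listed in neither alternative (here ':' and '-') are dropped.
print(TOKEN_RE.findall("note: a-b"))          # ['note', 'a', 'b']
```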

Here is a Unicode-aware version:

re.findall(r"\w+|[^\w\s]", text, re.UNICODE)

The first alternative catches sequences of word characters (as defined by Unicode, so "résumé" won't turn into ['r', 'sum']); the second catches individual characters that are neither word characters nor whitespace, so whitespace is skipped.

Note that, unlike the top answer, this treats the single quote as separate punctuation (e.g. "I'm" -> ['I', "'", 'm']). This appears to be standard in NLP, so I consider it a feature.
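
For a quick check of this behavior, the expression can be wrapped in a small helper. The function name tokenize is hypothetical, and the accented letters in the sample are assumed to be precomposed characters.

```python
import re

def tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # In Python 3, str patterns are Unicode-aware by default, so the flag is
    # redundant there; it is kept to mirror the expression above.
    return re.findall(r"\w+|[^\w\s]", text, re.UNICODE)

print(tokenize("I'm sending my r\u00e9sum\u00e9: it's up-to-date!"))
# ['I', "'", 'm', 'sending', 'my', 'résumé', ':', 'it', "'", 's',
#  'up', '-', 'to', '-', 'date', '!']
```

Unlike the first pattern, nothing except whitespace is dropped here: every punctuation character comes back as its own token, including the hyphens and the colon.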
