Splitting a string into words and punctuation
This is more or less the way to do it:
>>> import re
>>> re.findall(r"[\w']+|[.,!?;]", "Hello, I'm a string!")
['Hello', ',', "I'm", 'a', 'string', '!']
The trick is not to think about where to split the string, but about what to include in the tokens.
Caveats:
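- The underscore (_) counts as a word character. Replace \w with an explicit character class if you don't want that.
- Single quotes used as quotation marks are not handled: because ' is part of the word class, a quoted 'word' comes out as one token with the quotes still attached.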
- Put any additional punctuation marks you want to use in the right half of the regular expression.
- Anything not explicitly mentioned in the regular expression is silently dropped.
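To make the caveats concrete, here is a small sketch. The helper names (tokenize, tokenize_no_underscore) and the ASCII-only character class are my own choices for illustration, not part of the pattern above:

```python
import re

# Pattern from above: runs of word characters/apostrophes, or one listed punctuation mark.
TOKEN_RE = re.compile(r"[\w']+|[.,!?;]")

# Hypothetical variant: an explicit class instead of \w, so "_" is no longer a word character.
# Assumption: ASCII letters and digits are enough for your input.
NO_UNDERSCORE_RE = re.compile(r"[A-Za-z0-9']+|[.,!?;]")

def tokenize(text):
    return TOKEN_RE.findall(text)

def tokenize_no_underscore(text):
    return NO_UNDERSCORE_RE.findall(text)

print(tokenize("snake_case rules"))
# ['snake_case', 'rules']
print(tokenize_no_underscore("snake_case rules"))
# ['snake', 'case', 'rules']   (the underscore itself is dropped)
print(tokenize("She said 'hi' (twice): yes"))
# ['She', 'said', "'hi'", 'twice', 'yes']   (quotes stay attached; parentheses and colon vanish)
```

If you need the parentheses or the colon as tokens, add them to the punctuation class on the right.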
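Here is a Unicode-aware version (on Python 3, str patterns are already Unicode-aware, so the re.UNICODE flag only matters on Python 2):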
re.findall(r"\w+|[^\w\s]", text, re.UNICODE)
The first alternative catches sequences of word characters (as defined by Unicode, so "résumé" won't turn into ['r', 'sum']); the second catches individual non-word characters, ignoring whitespace.
Note that, unlike the top answer, this treats the single quote as separate punctuation (e.g. "I'm" -> ['I', "'", 'm']). This appears to be standard in NLP, so I consider it a feature.
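A quick sketch of both behaviours, assuming Python 3. The function name tokenize_unicode is made up for this example, and re.ASCII is used only to show what ASCII-only matching does to accented words:

```python
import re

def tokenize_unicode(text):
    # Python 3 str patterns are Unicode-aware by default, so no flag is passed here.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize_unicode("My résumé, I'm told, is fine."))
# ['My', 'résumé', ',', 'I', "'", 'm', 'told', ',', 'is', 'fine', '.']

# For contrast: with ASCII-only \w, the accented letters break the word apart.
print(re.findall(r"\w+", "résumé", re.ASCII))
# ['r', 'sum']
```

Unlike the first pattern, nothing except whitespace is dropped here: every non-word, non-space character becomes its own token.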