Extract all words, treating hyphenated words as single words [duplicate]
I found this tokenizer in another Stack Overflow question, but I need to modify it and am struggling. It currently splits hyphenated words into separate tokens, but I want each hyphenated word to be a single token.
Tokenizer:
[(m.start(0), m.end(0), m.group()) for m in re.finditer(r"\w+|\$[\d\.]+|\S+", target_sentence)]
Given the sentence "half-life is a single token", it should give the following tokens (plus the character offset info):
['half-life', 'is', 'a', 'single', 'token']
Instead it gives:
[(0, 4, 'half'),
(4, 9, '-life'),
(10, 12, 'is'),
(13, 14, 'a'),
(15, 21, 'single'),
(22, 27, 'token')]
EDIT: I want the character offsets, not just the word tokens, so string.split is not going to cut it.
Solution 1:
Your regex matches half using \w+ and then matches the remaining -life using the last alternative, \S+.
You may use this regex to capture optional hyphenated words:
\w+(?:-\w+)*|\$[\d.]+|\S+
\w+(?:-\w+)* matches one or more words separated by hyphens, so half-life is consumed as a single token.
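As a minimal sketch of how the suggested pattern could be dropped into the tokenizer from the question: the names TOKEN_RE and tokenize below are hypothetical, chosen only for illustration, and the pattern is written as a raw string so the backslashes reach the regex engine unchanged.

```python
import re

# Pattern from the answer: a word optionally followed by hyphen-joined words,
# then a dollar amount, then any other run of non-whitespace characters.
TOKEN_RE = re.compile(r"\w+(?:-\w+)*|\$[\d.]+|\S+")


def tokenize(text):
    """Return (start, end, token) tuples, keeping hyphenated words whole."""
    return [(m.start(), m.end(), m.group()) for m in TOKEN_RE.finditer(text)]


tokens = tokenize("half-life is a single token")
for start, end, token in tokens:
    print(start, end, token)
```

For the sentence in the question this should yield (0, 9, 'half-life') as the first tuple, with the remaining offsets unchanged. A hyphen that is not followed by a word character, as in "well-", is still left to the \S+ alternative and comes out as its own token.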