Extract all words, treating hyphenated words as single words [duplicate]

I found this tokenizer in another Stack question, but I need to modify it and am struggling. It currently splits hyphenated words into separate tokens, but I want each hyphenated word to be a single token.

tokenizer:

import re
[(m.start(0), m.end(0), m.group()) for m in re.finditer(r"\w+|\$[\d\.]+|\S+", target_sentence)]

Given the sentence "half-life is a single token", it should give the following tokens (plus the character offset info):

['half-life', 'is', 'a', 'single', 'token']

Instead it gives:

[(0, 4, 'half'),
(4, 9, '-life'),
(10, 12, 'is'),
(13, 14, 'a'),
(15, 21, 'single'),
(22, 27, 'token')]

EDIT: I want the character offsets, not just the word tokens, so str.split is not going to cut it.

Solution 1:

Your regex matches half with \w+, which stops at the hyphen because \w does not match -. The remaining -life is then matched by the last alternative, \S+.
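To see which alternative produces each token, here is a small sketch that wraps the three alternatives of the original pattern in named groups. The group names word, money, and other are assumptions added for illustration only; they are not part of the question's regex.

```python
import re

# Same three alternatives as the question's pattern, each wrapped in a
# named group (names are illustrative) so we can see which one matched.
pattern = re.compile(r"(?P<word>\w+)|(?P<money>\$[\d.]+)|(?P<other>\S+)")

sentence = "half-life is a single token"

# m.lastgroup is the name of the group that matched for this token.
matched_by = [(m.group(), m.lastgroup) for m in pattern.finditer(sentence)]

for token, alternative in matched_by:
    print(f"{token!r} matched by {alternative}")
```

The first two entries should show half coming from the \w+ branch and -life coming from the \S+ fallback.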

You may use this regex to capture optional hyphenated words:

\w+(?:-\w+)*|\$[\d.]+|\S+


\w+(?:-\w+)* will match one or more words separated by hyphens.
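A minimal sketch of the tokenizer from the question with the suggested pattern dropped in. The names TOKEN_RE and tokenize are assumptions chosen for this example; only the regex itself comes from the answer.

```python
import re

# Pattern from the answer: a word optionally followed by "-word" parts,
# or a dollar amount, or any other run of non-whitespace characters.
TOKEN_RE = re.compile(r"\w+(?:-\w+)*|\$[\d.]+|\S+")


def tokenize(sentence):
    """Return (start, end, token) tuples for each token in sentence."""
    return [(m.start(), m.end(), m.group()) for m in TOKEN_RE.finditer(sentence)]


tokens = tokenize("half-life is a single token")
print(tokens)
# [(0, 9, 'half-life'), (10, 12, 'is'), (13, 14, 'a'), (15, 21, 'single'), (22, 27, 'token')]
```

Because the hyphenated-word alternative comes first, half-life is consumed as one match before the \S+ fallback gets a chance to pick up -life.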
