How to remove all the duplicated words on every line using Notepad++?
I'm working on a file containing lines with keywords and some lines contain duplicated keywords.
For example:
dangerous,dangerous,hazardous,perilous
I want Notepad++ to remove every duplicated word on each line. In this example, one dangerous, would be removed, leaving:
dangerous,hazardous,perilous
I have a bunch of lines like that and that's why I'm looking for an automated way of doing this.
You can use a regular expression to remove consecutive duplicated words in a line; however, I don't think it's possible to remove duplicated words that are not consecutive (e.g. dangerous, hazardous, dangerous).
Use the regex below in the Find what box of the Replace window in Notepad++, and don't forget to select "Regular expression" as the Search Mode option.
This regex will remove all consecutive duplicated words, whether a word is repeated 2 times or 10 times in a row: \b(\w+)(?:,\s*\1\b)+
The \s* allows optional whitespace after the comma, so both dangerous,dangerous and dangerous, dangerous are matched.
The equivalent regex for words separated only by whitespace (no commas) would be: \b(\w+)(?:\s+\1\b)+ (might be useful for other users).
If you want a regex specifically for only two duplicated words (doubles), use this regex: (\b\w+\b)\W+\1\b
The trailing \b keeps it from matching when the second word merely starts with the first one.
Put this in the Replace with box to keep one occurrence of the word (otherwise every copy, including the first, will be removed): ${1}
These regular expressions will fix a situation like the one you described in your question. The first regex works for any number of consecutive duplicates (e.g. dangerous, dangerous, dangerous, dangerous, hazardous), while the second version only works for a word that appears twice in a row (e.g. dangerous, dangerous, hazardous).
Note: The regular expression only applies to the format described in the question (single words separated by commas). Inputs such as
two words, two words, anotherword
two-words, two-words, anotherword
three words expression, three words expression, anotherword
won't be changed, because the regex doesn't match them.
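For anyone who wants to check the behaviour outside Notepad++, here is a minimal Python sketch of the same consecutive-duplicate pattern. It is only an illustration: the function name dedupe_consecutive is made up for this example, the pattern allows optional whitespace after the comma (\s*) so that both dangerous,dangerous and dangerous, dangerous are handled, and Python's re module writes the replacement as \1 instead of ${1}.

```python
import re

# A word followed by one or more ",<optional whitespace><same word>" repeats.
CONSECUTIVE_DUPES = re.compile(r"\b(\w+)(?:,\s*\1\b)+")


def dedupe_consecutive(line: str) -> str:
    """Collapse a run of the same comma-separated word into one occurrence.

    Hypothetical helper for illustration; not part of Notepad++.
    """
    # r"\1" keeps the first copy of the word and drops the repeats.
    return CONSECUTIVE_DUPES.sub(r"\1", line)


if __name__ == "__main__":
    samples = [
        "dangerous,dangerous,hazardous,perilous",
        "dangerous, dangerous, dangerous, dangerous, hazardous",
        "dangerous, hazardous, dangerous",    # not consecutive: left as-is
        "two words, two words, anotherword",  # multi-word entry: left as-is
    ]
    for sample in samples:
        print(dedupe_consecutive(sample))
```

The last two samples come back unchanged, which matches the limitations described above.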
Here is a way to do the job; this removes duplicate words even when they are not contiguous:
- Ctrl+H
- Find what: (?:^|\G)(\b\w+\b),?(?=.*\1)
- Replace with: LEAVE EMPTY
- check Wrap around
- check Regular expression
- DO NOT CHECK . matches newline
- Replace all
Explanation:
(?:^|\G) : non-capturing group, beginning of line or position where the last match ended
(\b\w+\b) : group 1, one or more word characters (i.e. [a-zA-Z0-9_]), surrounded by word boundaries
,? : optional comma
(?=.*\1) : positive lookahead, checks that the same word (contained in group 1) appears somewhere later on the line
Given an input like:
dangerous,dangerous,hazardous,perilous,dangerous,dangerous,hazardous,perilous
We get:
dangerous,hazardous,perilous
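If scripting is an option, the same result can be sketched in Python. This is not a port of the regex above (Python's built-in re module has no \G anchor); it swaps in a plain split-and-deduplicate approach instead. The names dedupe_words and dedupe_text are made up for this example, and the sketch assumes comma-separated words, strips surrounding whitespace, and drops empty entries. It keeps the first occurrence of each word, whereas the regex removes the earlier copies, so the surviving order can differ on other inputs.

```python
def dedupe_words(line: str, sep: str = ",") -> str:
    """Remove repeated words from one line, keeping the first occurrence.

    Hypothetical helper for illustration; not a Notepad++ feature.
    """
    words = (word.strip() for word in line.split(sep))
    # dict.fromkeys keeps insertion order (Python 3.7+), so each word
    # survives once, in the position where it first appeared.
    return sep.join(dict.fromkeys(word for word in words if word))


def dedupe_text(text: str, sep: str = ",") -> str:
    """Apply dedupe_words to every line of a multi-line string."""
    return "\n".join(dedupe_words(line, sep) for line in text.splitlines())


if __name__ == "__main__":
    sample = (
        "dangerous,dangerous,hazardous,perilous,"
        "dangerous,dangerous,hazardous,perilous"
    )
    print(dedupe_words(sample))
```

To process a whole file, read it into a string, pass it to dedupe_text, and write the result back out.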