Deleting duplicated words in a very large wordlist

I'm a beginner and I wrote a program that generates a wordlist following a specific pattern. The problem is that it produces duplicates.

So I'm looking for a way to make the code iterate through the given range (the number of words to generate) without producing duplicate words,

OR to write another program that goes through the wordlist the first program made and deletes any duplicated words in that file, which will take time but is worth it.

Each generated word should look like X4K7GB9y: 8 characters in length, following the pattern [A-Z][0-9][A-Z][0-9][A-Z][A-Z][0-9][a-z]. The code is this:

import random
import string

random.seed(0)  # fixed seed, so the run is reproducible
NUM_WORDS = 100000000  # 10^8 words

with open("wordlist.txt", "w", encoding="utf-8") as ofile:
    for _ in range(NUM_WORDS):
        # sample() draws without replacement, so the four uppercase
        # letters (and likewise the three digits) are distinct within one word
        uppc = random.sample(string.ascii_uppercase, k=4)
        lowc = random.sample(string.ascii_lowercase, k=1)
        digi = random.sample(string.digits, k=3)
        # assemble in the pattern [A-Z][0-9][A-Z][0-9][A-Z][A-Z][0-9][a-z]
        word = uppc[0] + digi[0] + uppc[1] + digi[1] + uppc[2] + uppc[3] + digi[2] + lowc[0]
        print(word, file=ofile)

I'd appreciate it if you could modify the code so it doesn't produce duplicates, or write another program that checks the wordlist for duplicates and deletes them. Thank you so much in advance.


Solution 1:

Given that your algorithm creates a list of words (unique or not), you can use a set to retain only the unique words. Look at the example below.

word_list = ["word1", "word2", "word3", "word1"]
unique_words = set(word_list)

This produces a set, unique_words, containing only {"word1", "word2", "word3"}. Note that a set is unordered; if you need a list again, convert it back with list(unique_words).
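The same idea can be applied to the file your first program already wrote, streaming line by line instead of loading everything into a list first. This is a minimal sketch; the function name and file paths are illustrative, not part of your original code:

```python
def dedupe_file(in_path, out_path):
    """Copy in_path to out_path, keeping only the first occurrence of each line."""
    seen = set()
    with open(in_path, encoding="utf-8") as infile, \
         open(out_path, "w", encoding="utf-8") as outfile:
        for line in infile:
            word = line.rstrip("\n")
            if word not in seen:  # set membership is O(1) on average
                seen.add(word)
                outfile.write(word + "\n")
```

You would call it as dedupe_file("wordlist.txt", "wordlist_unique.txt"). Be aware that with 10^8 words the set itself may need several gigabytes of RAM; if that is too much, an external tool such as sort -u on the file is a common alternative.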