Removing list of words from a string

I have a list of stopwords and a search string. I want to remove the stopwords from the string.

As an example:

stopwords=['what','who','is','a','at','is','he']
query='What is hello'

Now the code should strip 'What' and 'is'. However, in my case it also strips 'a' and 'at'. I have given my code below. What could I be doing wrong?

for word in stopwords:
    if word in query:
        print(word)
        query=query.replace(word,"")

If the input query is "What is Hello", I get the output as:
wht s llo

Why does this happen?


Solution 1:

This is one way to do it:

query = 'What is hello'
stopwords = ['what', 'who', 'is', 'a', 'at', 'is', 'he']
querywords = query.split()

resultwords = [word for word in querywords if word.lower() not in stopwords]
result = ' '.join(resultwords)

print(result)

I noticed that you want to also remove a word if its lower-case variant is in the list, so I've added a call to lower() in the condition check.
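As for why the original loop misbehaves: str.replace removes every occurrence of the substring, wherever it appears, so 'a' is also removed from inside 'What'. Splitting first compares whole words instead. A minimal sketch of that difference, where remove_stopwords is a hypothetical helper name that is not part of the question:

```python
def remove_stopwords(query, stopwords):
    """Return query without the whole words found in stopwords (case-insensitive)."""
    stopset = {w.lower() for w in stopwords}
    return ' '.join(w for w in query.split() if w.lower() not in stopset)

stopwords = ['what', 'who', 'is', 'a', 'at', 'is', 'he']

# Substring replacement: the 'a' inside 'What' is removed as well
print('What is hello'.replace('a', ''))              # Wht is hello

# Whole-word comparison: only complete words are dropped
print(remove_stopwords('What is hello', stopwords))  # hello
```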

Solution 2:

The accepted answer works when the words are separated by whitespace, but real-world text often uses punctuation to separate words as well. In that case, re.split is needed to split on non-word characters.

Also, storing the stopwords in a set makes lookup faster, although with only a few words the cost of hashing each string can offset the gain.
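A rough way to check the lookup claim on your own machine is timeit. This is only a sketch: the variable names and the probe word 'zzz' are made up for illustration, and with a list this short the two timings may come out close.

```python
import timeit

stopwords_list = ['what', 'who', 'is', 'a', 'at', 'he']
stopwords_set = set(stopwords_list)

# Both containers give the same answer to a membership test...
assert ('is' in stopwords_list) == ('is' in stopwords_set)

# ...but the list is scanned item by item, while the set uses a hash lookup.
# 'zzz' is absent, so the list has to be scanned to the end.
list_time = timeit.timeit(lambda: 'zzz' in stopwords_list, number=100_000)
set_time = timeit.timeit(lambda: 'zzz' in stopwords_set, number=100_000)
print(f'list: {list_time:.4f}s  set: {set_time:.4f}s')
```

The gap should widen as the stopword list grows, since a list lookup is linear in the number of words.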

My proposal:

import re

query = 'What is hello? Says Who?'
stopwords = {'what','who','is','a','at','is','he'}

resultwords = [word for word in re.split(r"\W+", query) if word.lower() not in stopwords]
print(resultwords)

output (as list of words):

['hello', 'Says', '']

There's an empty string at the end because re.split returns an empty field when the string ends with a separator (here the final '?'), and that needs filtering out. Two solutions:

resultwords = [word for word in re.split(r"\W+", query) if word and word.lower() not in stopwords]  # filter out empty words

or add the empty string to the set of stopwords :)

stopwords = {'what','who','is','a','at','is','he',''}

Now the code prints:

['hello', 'Says']
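Another option, swapping re.split for re.findall, is to match the words themselves rather than the separators, so no empty strings are produced in the first place. A minimal sketch using the same query and stopwords:

```python
import re

query = 'What is hello? Says Who?'
stopwords = {'what', 'who', 'is', 'a', 'at', 'he'}

# findall returns only the runs of word characters, never an empty field
resultwords = [word for word in re.findall(r"\w+", query) if word.lower() not in stopwords]
print(resultwords)  # ['hello', 'Says']
```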

Solution 3:

Building on what karthikr said, try:

' '.join(filter(lambda x: x.lower() not in stopwords,  query.split()))

explanation:

query.split()  # splits query on runs of whitespace, i.e. "What is hello" -> ["What", "is", "hello"]

filter(func, iterable)  # takes a function and an iterable (list, string, etc.) and
                        # keeps only the items for which the function, called on one
                        # item at a time, returns a true value

lambda x: x.lower() not in stopwords   # anonymous function that takes one word,
                                       # converts it to lower case, and returns True if
                                       # the word is not in the iterable stopwords

' '.join(iterable)  # joins all items of the iterable (items must be strings)
                    # using the string in front of the dot, here ' ', as the separator,
                    # e.g. ["What", "is", "hello"] -> "What is hello"
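Put together as a runnable snippet, assuming the query and stopwords from the question (here stored as a set). In Python 3, filter returns a lazy iterator rather than a list, which ' '.join accepts directly:

```python
query = 'What is hello'
stopwords = {'what', 'who', 'is', 'a', 'at', 'he'}

# filter yields only the words whose lower-case form is not a stopword
kept = filter(lambda x: x.lower() not in stopwords, query.split())
result = ' '.join(kept)
print(result)  # hello
```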