Can a stemming dictionary be used as rejection criteria in R?
I am struggling through some text analysis, and I'm not sure I'm doing the stemming correctly. Right now, my command for single-term stemming is
text_stem <- text_clean %>% mutate(stem = wordStem(word, language = "english"))
Is it possible to use this not only as a stemmer, but also as a filter? For example, if "text_clean" contains the word aksdjhgla and that word is not in whatever SnowballC uses as a dictionary, could the stemming step reject it? Or is there another command that does this kind of filtering?
wordStem does not employ a dictionary but uses grammatical rules to do stemming (which is a rather crude approximation to lemmatisation btw). Here is an example:
words <- c("win", "winning")
words2 <- c("aksdjhglain", "aksdjhglainning")
SnowballC::wordStem(words, language = "english")
#> [1] "win" "win"
SnowballC::wordStem(words2, language = "english")
#> [1] "aksdjhglain" "aksdjhglain"
As you can see, wordStem does exactly the same thing whether the words actually exist or are complete rubbish. All that matters is the word ending (i.e. the suffix).
As @Kat suggested, you probably want to look at the hunspell package, which actually uses dictionaries. To find out which words exist in the dictionary, use hunspell_check:
hunspell::hunspell_check(c(words, words2))
#> [1] TRUE TRUE FALSE FALSE
Inside your existing code, you could use this to remove misspelled words:
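library(dplyr)
library(SnowballC)
text_stem <- text_clean %>%
  mutate(stem = wordStem(word, language = "english")) %>%
  # dict is an argument of hunspell_check(), not of filter()
  filter(hunspell::hunspell_check(word, dict = hunspell::dictionary("en_US")))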
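If you want to try this end to end, here is a minimal sketch on toy data. It assumes text_clean has one token per row in a column called word, as in the question; the data frame contents and the name en_dict are only illustrative. It filters before stemming, so rejected words are never stemmed, and it loads the dictionary once instead of on every call.

```r
library(dplyr)
library(SnowballC)
library(hunspell)

# Toy stand-in for text_clean (assumption: one token per row in `word`)
text_clean <- data.frame(
  word = c("win", "winning", "aksdjhglain", "aksdjhglainning"),
  stringsAsFactors = FALSE
)

# Load the en_US dictionary once and reuse it (en_dict is an illustrative name)
en_dict <- dictionary("en_US")

text_stem <- text_clean %>%
  # keep only the rows whose word is found in the dictionary
  filter(hunspell_check(word, dict = en_dict)) %>%
  # stem the surviving words with the rule-based Snowball stemmer
  mutate(stem = wordStem(word, language = "english"))

text_stem
```

Note that the check is done on word, not on stem: the stems produced by wordStem are not guaranteed to be real words themselves, so filtering on the stem column could throw away valid rows.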