Count the number of all words in a string
Is there a function to count the number of words in a string? For example:
str1 <- "How many words are in this sentence"
to return a result of 7.
Use the regular expression symbol \\W
to match non-word characters, using +
to indicate one or more in a row, along with gregexpr
to find all matches in a string. Words are the number of word separators plus 1.
lengths(gregexpr("\\W+", str1)) + 1
This will fail with blank strings at the beginning or end of the character vector, when a "word" doesn't satisfy \\W
's notion of non-word (one could work with other regular expressions, \\S+
, [[:alpha:]]
, etc., but there will always be edge cases with a regex approach), etc. It is likely more efficient than strsplit
solutions, which will allocate memory for each word. Regular expressions are described in ?regex
.
Update As noted in the comments and in a different answer by @Andri the approach fails with (zero) and one-word strings, and with trailing punctuation
str1 = c("", "x", "x y", "x y!" , "x y! z")
lengths(gregexpr("[A-z]\\W+", str1)) + 1L
# [1] 2 2 2 3 3
Many of the other answers also fail in these or similar (e.g., multiple spaces) cases. I think my answer's caveat about 'notion of one word' in the original answer covers problems with punctuation (solution: choose a different regular expression, e.g., [[:space:]]+
), but the zero and one word cases are a problem; @Andri's solution fails to distinguish between zero and one words. So taking a 'positive' approach to finding words one might
sapply(gregexpr("[[:alpha:]]+", str1), function(x) sum(x > 0))
Leading to
sapply(gregexpr("[[:alpha:]]+", str1), function(x) sum(x > 0))
# [1] 0 1 2 2 3
Again the regular expression might be refined for different notions of 'word'.
I like the use of gregexpr()
because it's memory efficient. An alternative using strsplit()
(like @user813966, but with a regular expression to delimit words) and making use of the original notion of delimiting words is
lengths(strsplit(str1, "\\W+"))
# [1] 0 1 2 2 3
This needs to allocate new memory for each word that is created, and for the intermediate list-of-words. This could be relatively expensive when the data is 'big', but probably it's effective and understandable for most purposes.
Most simple way would be:
require(stringr)
str_count("one, two three 4,,,, 5 6", "\\S+")
... counting all sequences on non-space characters (\\S+
).
But what about a little function that lets us also decide which kind of words we would like to count and which works on whole vectors as well?
require(stringr)
nwords <- function(string, pseudo=F){
ifelse( pseudo,
pattern <- "\\S+",
pattern <- "[[:alpha:]]+"
)
str_count(string, pattern)
}
nwords("one, two three 4,,,, 5 6")
# 3
nwords("one, two three 4,,,, 5 6", pseudo=T)
# 6
I use the str_count
function from the stringr
library with the escape sequence \w
that represents:
any ‘word’ character (letter, digit or underscore in the current locale: in UTF-8 mode only ASCII letters and digits are considered)
Example:
> str_count("How many words are in this sentence", '\\w+')
[1] 7
Of all other 9 answers that I was able to test, only two (by Vincent Zoonekynd, and by petermeissner) worked for all inputs presented here so far, but they also require stringr
.
But only this solution works with all inputs presented so far, plus inputs such as "foo+bar+baz~spam+eggs"
or "Combien de mots sont dans cette phrase ?"
.
Benchmark:
library(stringr)
questions <-
c(
"", "x", "x y", "x y!", "x y! z",
"foo+bar+baz~spam+eggs",
"one, two three 4,,,, 5 6",
"How many words are in this sentence",
"How many words are in this sentence",
"Combien de mots sont dans cette phrase ?",
"
Day after day, day after day,
We stuck, nor breath nor motion;
"
)
answers <- c(0, 1, 2, 2, 3, 5, 6, 7, 7, 7, 12)
score <- function(f) sum(unlist(lapply(questions, f)) == answers)
funs <-
c(
function(s) sapply(gregexpr("\\W+", s), length) + 1,
function(s) sapply(gregexpr("[[:alpha:]]+", s), function(x) sum(x > 0)),
function(s) vapply(strsplit(s, "\\W+"), length, integer(1)),
function(s) length(strsplit(gsub(' {2,}', ' ', s), ' ')[[1]]),
function(s) length(str_match_all(s, "\\S+")[[1]]),
function(s) str_count(s, "\\S+"),
function(s) sapply(gregexpr("\\W+", s), function(x) sum(x > 0)) + 1,
function(s) length(unlist(strsplit(s," "))),
function(s) sapply(strsplit(s, " "), length),
function(s) str_count(s, '\\w+')
)
unlist(lapply(funs, score))
Output (11 is the maximum possible score):
6 10 10 8 9 9 7 6 6 11