Count the number of all words in a string

Is there a function to count the number of words in a string? For example:

str1 <- "How many words are in this sentence"

to return a result of 7.

Use the regular expression symbol \\W to match non-word characters, using + to indicate one or more in a row, along with gregexpr to find all matches in a string. Words are the number of word separators plus 1.

lengths(gregexpr("\\W+", str1)) + 1

This will fail with blank strings at the beginning or end of the character vector, when a "word" doesn't satisfy \\W's notion of non-word (one could work with other regular expressions, \\S+, [[:alpha:]], etc., but there will always be edge cases with a regex approach), etc. It is likely more efficient than strsplit solutions, which will allocate memory for each word. Regular expressions are described in ?regex.

Update As noted in the comments and in a different answer by @Andri the approach fails with (zero) and one-word strings, and with trailing punctuation

str1 = c("", "x", "x y", "x y!" , "x y! z")
lengths(gregexpr("[A-z]\\W+", str1)) + 1L
# [1] 2 2 2 3 3

Many of the other answers also fail in these or similar (e.g., multiple spaces) cases. I think my answer's caveat about 'notion of one word' in the original answer covers problems with punctuation (solution: choose a different regular expression, e.g., [[:space:]]+), but the zero and one word cases are a problem; @Andri's solution fails to distinguish between zero and one words. So taking a 'positive' approach to finding words one might

sapply(gregexpr("[[:alpha:]]+", str1), function(x) sum(x > 0))

Leading to

sapply(gregexpr("[[:alpha:]]+", str1), function(x) sum(x > 0))
# [1] 0 1 2 2 3

Again the regular expression might be refined for different notions of 'word'.

I like the use of gregexpr() because it's memory efficient. An alternative using strsplit() (like @user813966, but with a regular expression to delimit words) and making use of the original notion of delimiting words is

lengths(strsplit(str1, "\\W+"))
# [1] 0 1 2 2 3

This needs to allocate new memory for each word that is created, and for the intermediate list-of-words. This could be relatively expensive when the data is 'big', but probably it's effective and understandable for most purposes.

Most simple way would be:

require(stringr)
str_count("one,   two three 4,,,, 5 6", "\\S+")

... counting all sequences on non-space characters (\\S+).

But what about a little function that lets us also decide which kind of words we would like to count and which works on whole vectors as well?

require(stringr)
nwords <- function(string, pseudo=F){
  ifelse( pseudo, 
          pattern <- "\\S+", 
          pattern <- "[[:alpha:]]+" 
        )
  str_count(string, pattern)
}

nwords("one,   two three 4,,,, 5 6")
# 3

nwords("one,   two three 4,,,, 5 6", pseudo=T)
# 6

I use the str_count function from the stringr library with the escape sequence \w that represents:

any ‘word’ character (letter, digit or underscore in the current locale: in UTF-8 mode only ASCII letters and digits are considered)

Example:

> str_count("How many words are in this sentence", '\\w+')
[1] 7

Of all other 9 answers that I was able to test, only two (by Vincent Zoonekynd, and by petermeissner) worked for all inputs presented here so far, but they also require stringr.

But only this solution works with all inputs presented so far, plus inputs such as "foo+bar+baz~spam+eggs" or "Combien de mots sont dans cette phrase ?".

Benchmark:

library(stringr)

questions <-
  c(
    "", "x", "x y", "x y!", "x y! z",
    "foo+bar+baz~spam+eggs",
    "one,   two three 4,,,, 5 6",
    "How many words are in this sentence",
    "How  many words    are in this   sentence",
    "Combien de mots sont dans cette phrase ?",
    "
    Day after day, day after day,
    We stuck, nor breath nor motion;
    "
  )

answers <- c(0, 1, 2, 2, 3, 5, 6, 7, 7, 7, 12)

score <- function(f) sum(unlist(lapply(questions, f)) == answers)

funs <-
  c(
    function(s) sapply(gregexpr("\\W+", s), length) + 1,
    function(s) sapply(gregexpr("[[:alpha:]]+", s), function(x) sum(x > 0)),
    function(s) vapply(strsplit(s, "\\W+"), length, integer(1)),
    function(s) length(strsplit(gsub(' {2,}', ' ', s), ' ')[[1]]),
    function(s) length(str_match_all(s, "\\S+")[[1]]),
    function(s) str_count(s, "\\S+"),
    function(s) sapply(gregexpr("\\W+", s), function(x) sum(x > 0)) + 1,
    function(s) length(unlist(strsplit(s," "))),
    function(s) sapply(strsplit(s, " "), length),
    function(s) str_count(s, '\\w+')
  )

unlist(lapply(funs, score))

Output (11 is the maximum possible score):

6 10 10  8  9  9  7  6  6 11

Count the number of all words in a string

Related

Recent Posts