Extract last word from a sentence minus any end punctuation

Solution 1:

You are definitely on the right track with the RIGHT() and SUBSTITUTE() functions.

Let's walk through the process of generalising and combining the formulas.

For this, we'll work with the sentence What to do? in A1, your RIGHT() formula in B1, and your SUBSTITUTION() formula in C1. Thus, your formulas would be =RIGHT(A1,3) and =SUBSTITUTE(B1,"?",""):

Workbook Screenshot

Step 1

We need to generalise:

<1> RIGHT(A1,3)

The 3, which is the length of the last word including any following punctuation, needs to be modified to work with any sentence. One way to do that is to realise that the last word length can be calculated from the length of the sentence and the one-based index of the last space.

For our example, the length is 11 and the index of the last space is 8,

What to do?
       ↑  ↑
12345678  11

and 11 - 8 → 3

The length can be calculated with:

<2> LEN(A1)

The index is a bit trickier, as FIND() can only be used to find the index of the first space, not any other one. However, we can use SUBSTITUTE() to help out as it does allow us to substitute any specific space. If we change the last space to a character that doesn't exist elsewhere in the sentence, we could then use FIND() to find it, and thus the index.

The formula to do the substitution is

<3>  SUBSTITUTE(A1," ","§",2)

where 2 is the occurrence number of the last space, i.e. the 2nd space, and hopefully § doesn't exist.

We can then use <3> to find the index of the last space like so:

<4>  FIND("§",SUBSTITUTE(A1," ","§",2))

But how do we work out the 2? That's a little tricky. Notice that the occurrence number of the last space in the sentence is the same as the count of the spaces in the sentence.

The count can be calculated by removing the spaces from the sentence. That reduces its length by the count. From this and the original length, we can obtain the count:

<5>  LEN(A1)-LEN(SUBSTITUTE(A1," ",""))

Plugging this back into <4> leads to the formula for the index of the last space:

<6>  FIND("§",SUBSTITUTE(A1," ","§",LEN(A1)-LEN(SUBSTITUTE(A1," ",""))))

Subtracting this from <2> leads to the length of the last word:

<7>  LEN(A1)-FIND("§",SUBSTITUTE(A1," ","§",LEN(A1)-LEN(SUBSTITUTE(A1," ",""))))

Plugging <7> back into <1> leads to the generalised

<8>  RIGHT(A1,LEN(A1)-FIND("§",SUBSTITUTE(A1," ","§",LEN(A1)-LEN(SUBSTITUTE(A1," ","")))))


Step 2

We need to generalise:

<9> SUBSTITUTE(B1,"?","")

Now, since SUBSTITUTE() returns the unmodified target string if the search string can't be found in it, we can "chain" more than one of this function to remove multiple possible punctuation characters. For example, to also remove a ! we would do this:

<10> SUBSTITUTE(<9>,"!","")

Plugging <9> into <10> gives:

<11> SUBSTITUTE(SUBSTITUTE(B1,"?",""),"!","")

Since there are only three punctuation characters at the end of a sentence, ?, !, and . (I hope), <10> can be fully generalised to this:

<12>  SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(B1,"?",""),"!",""),".","")


Step 3

Combining the two formula is a simple matter of replacing the B1 in <12> with <8>, which leads to the solution:

=SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(RIGHT(A1,LEN(A1)-FIND("§",SUBSTITUTE(A1," ","§",LEN(A1)-LEN(SUBSTITUTE(A1," ",""))))),"?",""),"!",""),".","")

Notes:

If there is no punctuation at the end of the sentence (which is the case for sentence 21 seen in your screenshot), the SUBSTITUTE() functions effectively do nothing and the last word is correctly returned.

If the sentences in your data set can end in some other character, say like :, the formula can be easily modified to cater for this.

Finally, since there is always a space between the sentence number and the first word, there is no need for error checking in the formula. It will however return an error if there is no sentence.

Solution 2:

2018 means Excel 2016 unless otherwise mentioned so FILTERXML() was available. It can b used in a helpful way in that if one has a well-behaved string, one can separate it into individual pieces using the delimiter that lets it be called well-behaved. The spaces between words in this case.

So that can give an array of words in each sentence. It has the feature of being able to have an individual element of the array chosen and more helpfully, one can also directly specify that the one selected be the last() one, so no need to count elements. That wouldn't've been hard using the LEN() of the string beforehand minus its LEN() after SUBSTITUTE() removed the delimiters, but still, better to not even need to do that much!

So that gets you the last word plus any punctuation. To remove the punctuation you can just take the RIGHT()-most character of the last word and compare it to a list of punctuation you think you might see. Use an IF() wrapped around it to return a 1 for any match (one of them, obviously!) and a 2 for all failures and wrap that all with a SUM() that not only will give you an odd or even value, but also collapses the array the IF() generated into the summed value, not a passel of 2's with or without a 1. Use MOD() with 2 as the divisor to check even (no punctuation, or odd, and either present the output or shave the last character off with Left() and you have it:

=IF(MOD(SUM(IF(RIGHT(FILTERXML("<Outer><Inner>"&SUBSTITUTE(A1," ","</Inner><Inner>")&"</Inner></Outer>","/Outer/Inner[last()]"),1)={".",",",";"},1,2)),2)=0,FILTERXML("<Outer><Inner>"&SUBSTITUTE(A1," ","</Inner><Inner>")&"</Inner></Outer>","/Outer/Inner[last()]"),LEFT(FILTERXML("<Outer><Inner>"&SUBSTITUTE(A1," ","</Inner><Inner>")&"</Inner></Outer>","/Outer/Inner[last()]"),LEN(FILTERXML("<Outer><Inner>"&SUBSTITUTE(A1," ","</Inner><Inner>")&"</Inner></Outer>","/Outer/Inner[last()]"))-1))

Looks HORRIBLE, but that's just because FILTERXML("<Outer><Inner>"&SUBSTITUTE(A1," ","</Inner><Inner>")&"</Inner></Outer>","/Outer/Inner[last()]") occurs four times. If you replace it with "LAST" it reads much more nicely:

=IF(MOD(SUM(IF(RIGHT(LAST,1)={".",",",";"},1,2)),2)=0,LAST,LEFT(LAST,LEN(LAST)-1))

(The above doesn't work, of course, just shows the "bones" of the formula.)

See the string of punctuation in it? You just add anything you think you might encounter to that string. Or set some cells aside somewhere and enter one punctuation character after another in them, then use that range instead of the array constant like above.

Moving forward to 2021 and the LET() function, you could use this:

=LET(Punctuation,{".",",",";"},  Word,FILTERXML("<Outer><Inner>"&SUBSTITUTE(A1," ","</Inner><Inner>")&"</Inner></Outer>","/Outer/Inner[last()]"),IF(MOD(SUM(IF(RIGHT(Word,1)=Punctuation,1,2)),2)=0,Word,LEFT(Word,LEN(Word)-1)))

This has the advantage of being much more readable (in Excel, not here) because the FILTERXML() part appears only once, but also the advantage that the punctuation array appears at the very start, easy to add to if desired.

Of course, LAMDA() cometh, so someone may add something more soon. (I hope soon...)