Regex to match only uppercase "words" with some exceptions

I have technical strings such as the following:

"The thing P1 must connect to the J236 thing in the Foo position."

I would like to use a regular expression to match only the all-uppercase words (here, P1 and J236). The problem is that I don't want to match the first letter of the sentence when it is a one-letter word.

For example, in:

"A thing P1 must connect ..."

I want P1 only, not A and P1. By doing that, I know that I can miss a real "word" (like in "X must connect to Y") but I can live with it.

Additionally, I don't want to match uppercase words if the sentence is all uppercase.

Example:

"THING P1 MUST CONNECT TO X2."

Of course, ideally, I would like to match the technical words P1 and X2 here but since they are "hidden" in the all-uppercase sentence and since these technical words have no specific pattern, it's impossible. Again I can live with it because all-uppercase sentences are not so frequent in my files.

Thanks!

Solution 1:

To some extent, this is going to vary by the "flavour" of regex you're using. The following is based on .NET regex, which uses \b for word boundaries. The last example also uses negative lookarounds, (?<!...) and (?!...), as well as non-capturing parentheses, (?:...). Note that the variable-length lookbehind in that last example is supported by .NET but not by many other flavours.

Basically, though, if the terms always consist of at least one uppercase letter followed by at least one digit, you can use

\b[A-Z]+[0-9]+\b

For all-uppercase and numbers (total must be 2 or more):

\b[A-Z0-9]{2,}\b

For all-uppercase and numbers, but starting with at least one letter:

\b[A-Z][A-Z0-9]+\b
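
As a rough illustration of how these three simple patterns differ, here is a small sketch in Python (my choice for the example, not the .NET flavour discussed above; \b behaves the same way for these inputs). The constant names and the find_terms helper are made up for the example.

```python
import re

SENTENCE = "The thing P1 must connect to the J236 thing in the Foo position."

# At least one uppercase letter followed by at least one digit.
LETTERS_THEN_DIGITS = re.compile(r"\b[A-Z]+[0-9]+\b")
# Two or more uppercase letters or digits, in any order.
UPPER_OR_DIGITS = re.compile(r"\b[A-Z0-9]{2,}\b")
# One uppercase letter followed by one or more uppercase letters or digits.
LETTER_FIRST = re.compile(r"\b[A-Z][A-Z0-9]+\b")


def find_terms(pattern, text):
    """Return every non-overlapping match of a compiled pattern in text."""
    return pattern.findall(text)


if __name__ == "__main__":
    for pattern in (LETTERS_THEN_DIGITS, UPPER_OR_DIGITS, LETTER_FIRST):
        print(pattern.pattern, find_terms(pattern, SENTENCE))
```

All three return P1 and J236 for the sample sentence. They only diverge on inputs such as "A thing 42 must connect to P1 and XY", where the first pattern finds just P1, the second also picks up 42 and XY, and the third finds P1 and XY.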

The granddaddy, to return items that have any combination of uppercase letters and numbers, but which are not single letters at the beginning of a line and which are not part of a line that is all uppercase:

(?:(?<!^)[A-Z]\b|(?<!^[A-Z0-9 ]*)\b[A-Z0-9]+\b(?![A-Z0-9 ]$))

Breakdown:

The regex starts with (?:. The ?: signifies that, although what follows is in parentheses, I'm not interested in capturing the result. This is called "non-capturing parentheses." Here, I'm using the parentheses because I'm using alternation (see below).

Inside the non-capturing parens, I have two separate clauses separated by the pipe symbol |. This is alternation -- like an "or". The regex can match the first expression or the second. The two cases here are "is this the first word of the line" or "everything else," because we have the special requirement of excluding one-letter words at the beginning of the line.

Now, let's look at each expression in the alternation.

The first expression is: (?<!^)[A-Z]\b. The main clause here is [A-Z]\b, which is any one capital letter followed by a word boundary, which could be punctuation, whitespace, linebreak, etc. The part before that is (?<!^), which is a "negative lookbehind." This is a zero-width assertion, which means it doesn't "consume" characters as part of a match -- not really important to understand that here. The syntax for negative lookbehind in .NET is (?<!x), where x is the expression that must not exist before our main clause. Here that expression is simply ^, or start-of-line, so this side of the alternation translates as "any word consisting of a single, uppercase letter that is not at the beginning of the line."

Okay, so we're matching one-letter, uppercase words that are not at the beginning of the line. We still need to match words consisting of all numbers and uppercase letters.

That is handled by a relatively small portion of the second expression in the alternation: \b[A-Z0-9]+\b. The \bs represent word boundaries, and the [A-Z0-9]+ matches one or more numbers and capital letters together.

The rest of the expression consists of other lookarounds. (?<!^[A-Z0-9 ]*) is another negative lookbehind, where the expression is ^[A-Z0-9 ]*. This means what precedes must not be all capital letters and numbers.

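The second lookaround is (?![A-Z0-9 ]$), which is a negative lookahead. The intent is that what follows must not be all capital letters and numbers. Note that, as written, it only rejects a single such character sitting right before the end of the line; to test the whole remainder of the line it would need a * quantifier, as in (?![A-Z0-9 ]*$).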

So, altogether, we are matching words of all capital letters and numbers, and excluding one-letter, uppercase words at the start of the line and everything from lines that are all uppercase.

There is at least one weakness here in that the lookarounds in the second alternation expression act independently, so a sentence like "A P1 should connect to the J9" will match J9, but not P1, because everything before P1 is capitalized.

It is possible to get around this issue, but it would almost triple the length of the regex. Trying to do so much in a single regex is seldom, if ever, justified. You'll be better off breaking up the work either into multiple regexes or a combination of regex and standard string processing commands in your programming language of choice.

Solution 2:

Why do you need to do this in one monster-regex? You can use actual code to implement some of these rules, and doing so would be much easier to modify if those requirements change later.

For example:

if (/^[A-Z0-9\s]*$/) {
    # sentence is all uppercase, so just fail out
    return 0;
}

# Carry on with matching uppercase terms

Solution 3:

Maybe you can run this regex first to see if the line is all caps:

^[A-Z \d\W]+$

That will match only if it's a line like THING P1 MUST CONNECT TO X2.

Otherwise, you should be able to pull out the individual uppercase phrases with this:

[A-Z][A-Z\d]+

That should match "P1" and "J236" in The thing P1 must connect to the J236 thing in the Foo position.
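
To make the two-step idea concrete, here is a minimal sketch in Python using the two patterns above unchanged. The function name uppercase_terms and the constant names are hypothetical, and treating an all-caps line as "no matches" is my assumption about what you want.

```python
import re

# Step 1: the line contains only uppercase letters, spaces, digits and non-word characters.
ALL_CAPS_LINE = re.compile(r"^[A-Z \d\W]+$")
# Step 2: an uppercase letter followed by one or more uppercase letters or digits.
UPPER_TERM = re.compile(r"[A-Z][A-Z\d]+")


def uppercase_terms(line):
    """Return the uppercase terms in line, or an empty list for an all-caps line."""
    if ALL_CAPS_LINE.match(line):
        return []
    return UPPER_TERM.findall(line)


if __name__ == "__main__":
    print(uppercase_terms("The thing P1 must connect to the J236 thing in the Foo position."))
    print(uppercase_terms("THING P1 MUST CONNECT TO X2."))
```

Because the second pattern needs at least two characters, a one-letter word such as the leading "A" in "A thing P1 must connect ..." is never matched, which covers that requirement without any lookbehind.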

Solution 4:

Don't use classes like [A-Z] or [0-9]. Use \p{Lu} and \d instead, so that non-ASCII uppercase letters are handled too. This works in Perl-based regex flavours, including Java.

I would suggest that you don't build one huge regex. First split the text into sentences, then tokenize each one (split it into words). Use a regex to check each token. Skip the first token of each sentence. Check beforehand whether all tokens are uppercase, and if so skip the whole sentence or alter the regex for that case.
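
The steps above could look roughly like the following Python sketch. Everything named here (technical_terms, the three patterns) is made up for illustration, the sentence splitter is deliberately naive, and since Python's built-in re module does not support \p{Lu}, the sketch falls back to ASCII classes.

```python
import re

# Assumption: a sentence ends at ., ! or ? followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
# A token is a run of ASCII letters and digits; punctuation is dropped.
TOKEN = re.compile(r"[A-Za-z0-9]+")
# A candidate term consists only of uppercase ASCII letters and digits.
TERM = re.compile(r"[A-Z0-9]+")


def technical_terms(text):
    """Collect uppercase tokens, skipping first tokens and all-caps sentences."""
    terms = []
    for sentence in SENTENCE_END.split(text):
        tokens = TOKEN.findall(sentence)
        # Skip sentences in which every token is uppercase (or that are empty).
        if all(TERM.fullmatch(token) for token in tokens):
            continue
        # Skip the first token, which may be a capitalized one-letter word.
        terms.extend(token for token in tokens[1:] if TERM.fullmatch(token))
    return terms


if __name__ == "__main__":
    print(technical_terms("A thing P1 must connect. THING P1 MUST CONNECT TO X2."))
```

Skipping the first token unconditionally means a sentence that starts with a real term loses it; if that matters, you could skip it only when it is a single letter. Purely numeric tokens also count as terms in this sketch.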

Solution 5:

I'm not a regex guru by any means. But try:

\<[A-Z0-9][A-Z0-9]+\>

\<          start of word
[A-Z0-9]    one character
[A-Z0-9]+   and one or more of them
\>          end of word

I won't try for the bonus points of the whole upper case sentence. hehe