Regex: Find HTML tags that contain at least 3 of a set of keywords

I have this list of words, at least three of which appear in any English sentence:

was, where, were, some, then, than, that, can, by, the, and, with, over, there, is, as, also, through, from, while, just, like, for, such, if, else, still, again, want, will, wish, make, made, well, have, had, has, it, be, do, say, others, go, know, see, think, look, give, use, find, tell, ask, work, seem, feel, try, leave, call, get, take, too, in, addition, to, could, who, he, she, because, of, your, yours, their, doesn't, are, an, these, this, those, but, at, whom, or, out, how, when, between, his, her, they, them, my, without, maybe, even, show, can't, must, couldn't, now, i'm, many, come, own, self, seen, it’s, we, any, other, coming, so, found, more, much, all, very, same, did, which, does, on

I also have these two HTML tags, but only the content of the first one is in English:

<meta name="description" content="Simply Red are a British soul and pop band which formed in Manchester in 1985. The lead vocalist of the band is singer and songwriter Mick Hucknall by">

and the second tag is in Russian:

<meta name="description" content="Simply Red - британская соул- и поп-группа, образованная в Манчестере в 1985 году. Ведущим вокалистом группы является певец и автор песен Мик Хакнелл.">

So, I want to find all HTML files that contain tags whose content is written in English. To do this, I need to find the HTML tags that contain at least 3 of the keywords listed at the beginning.

My regex, with just a few words (short version), looks like this:

SEARCH: (?-s)<meta name="description".+?(?:(was|is|as|on|and|in)).+>

and the larger version will be:

(?-s)<meta name="description".*?(was|where|were|some|then|than|that|can|by|the|and|with|over|there|is|as|also|through|from|while|just|like|for|such|if|else|still|again|want|will|wish|make|made|well|have|had|has|it|be|do|say|others|go|know|see|think|look|give|use|find|tell|ask|work|seem|feel|try|leave|call|get|take|too|in|addition|to|could|who|he|she|because|of|your|yours|their|doesn't|are|an|these|this|those|but|at|whom|or|out|how|when|between|his|her|they|them|my|without|maybe|even|show|can't|must|couldn't|now|i'm|many|come|own|self|seen|it’s|we|any|other|coming|so|found|more|much|all|very|same|did|which|does|on).+>

OK, the problem is that my regex also finds the second tag, whose content is written in Russian. I need to find only the first one (in English).


Your list is too long to work with here, so to demonstrate the technique, here is an example using a small list of four words: one, two, three, four.

Here is an explanation of the search string: (one|two|three|four).*(?-1).*(?-1)

  • (one|two|three|four) : Match one of the words and capture it in a group
  • .* : Match any number of characters
  • (?-1) : Re-apply the pattern of the nearest preceding group (a relative subroutine call), so it matches any word from the list again, not only the word that was captured
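
For anyone who wants to try the same idea outside the editor, here is a minimal Python sketch. Python's built-in re module has no subroutine calls such as (?-1), so the alternation is simply repeated three times. Two things here are my own additions and not part of the pattern above: the \b word boundaries (without them "on" also matches inside the attribute name "content") and the [^"]* parts that keep the match inside the quoted content value. The name has_three_keywords is hypothetical, and the keyword list is the short one from the question.

```python
import re

# Short keyword list from the question; the full list can be substituted.
KEYWORDS = ["was", "is", "as", "on", "and", "in"]

# One keyword as a whole word. \b keeps "on" from matching inside "content".
WORD = r"\b(?:" + "|".join(re.escape(w) for w in KEYWORDS) + r")\b"

# [^"]* keeps every keyword match inside the quoted content value.
PATTERN = re.compile(
    r'<meta name="description" content="'
    + r'[^"]*?' + WORD      # first keyword
    + r'[^"]*?' + WORD      # second keyword
    + r'[^"]*?' + WORD      # third keyword
    + r'[^"]*">',
    re.IGNORECASE,
)


def has_three_keywords(tag: str) -> bool:
    """Return True if the description tag contains at least three keywords."""
    return PATTERN.search(tag) is not None


english = ('<meta name="description" content="Simply Red are a British soul '
           'and pop band which formed in Manchester in 1985. The lead vocalist '
           'of the band is singer and songwriter Mick Hucknall by">')
russian = ('<meta name="description" content="Simply Red - британская соул- и '
           'поп-группа, образованная в Манчестере в 1985 году. Ведущим '
           'вокалистом группы является певец и автор песен Мик Хакнелл.">')

print(has_three_keywords(english))  # True
print(has_three_keywords(russian))  # False
```

The same structure should carry over to the search string above by wrapping the group in \b on both sides, assuming the regex engine in use supports \b.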