longest common substring in R finding non-contiguous matches between the two strings

Solution 1:

Here are three possible solutions.

library(stringi)
library(stringdist)

a <- "hello"
b <- "hel123l5678o"

## get all forward substrings of 'b'
sb <- stri_sub(b, 1, 1:nchar(b))
## extract them from 'a' if they exist
sstr <- na.omit(stri_extract_all_coll(a, sb, simplify=TRUE))
## match the longest one
sstr[which.max(nchar(sstr))]
# [1] "hel"

There are also adist() and agrep() in base R, and the stringdist package has a few functions that run the LCS method. Here's a look at stringsidt. It returns the number of unpaired characters.

stringdist(a, b, method="lcs")
# [1] 7

Filter("!", mapply(
    stringdist, 
    stri_sub(b, 1, 1:nchar(b)),
    stri_sub(a, 1, 1:nchar(b)),
    MoreArgs = list(method = "lcs")
))
#  h  he hel 
#  0   0   0

Now that I've explored this a bit more, I think adist() might be the way to go. If we set counts=TRUE we get a sequence of Matches, Insertions, etc. So if you give that to stri_locate() we can use that matrix to get the matches from a to b.

ta <- drop(attr(adist(a, b, counts=TRUE), "trafos")))
# [1] "MMMIIIMIIIIM"

So the M values denote straight across matches. We can go and get the substrings with stri_sub()

stri_sub(b, stri_locate_all_regex(ta, "M+")[[1]])
# [1] "hel" "l"   "o"

Sorry I haven't explained that very well as I'm not well versed in string distance algorithms.

Solution 2:

Leveraging @RichardScriven's insight that adist could be used (it calculates "approximate string distance". I made a function to be more comprehensive. Please note "trafos" stands for the "transformations" used to determine the "distance" between two strings (example at bottom)

EDIT This answer can produce wrong/unexpected results; as pointed out by @wdkrnls:

I ran your function against "apple" and "big apple bagels" and it returned "appl". I would have expected "apple".

See the explanation below for the wrong result. We start with a function to get the longest_string in a list:

longest_string <- function(s){return(s[which.max(nchar(s))])}

Then we can use @RichardSriven's work and the stringi library:

library(stringi)
lcsbstr <- function(a,b) { 
  sbstr_locations<- stri_locate_all_regex(drop(attr(adist(a, b, counts=TRUE), "trafos")), "M+")[[1]]
  cmn_sbstr<-stri_sub(longest_string(c(a,b)), sbstr_locations)
  longest_cmn_sbstr <- longest_string(cmn_sbstr)
   return(longest_cmn_sbstr) 
}

Or we can rewrite our code to avoid the use of any external libraries (still using R's native adist function):

lcsbstr_no_lib <- function(a,b) { 
    matches <- gregexpr("M+", drop(attr(adist(a, b, counts=TRUE), "trafos")))[[1]];
    lengths<- attr(matches, 'match.length')
    which_longest <- which.max(lengths)
    index_longest <- matches[which_longest]
    length_longest <- lengths[which_longest]
    longest_cmn_sbstr  <- substring(longest_string(c(a,b)), index_longest , index_longest + length_longest - 1)
    return(longest_cmn_sbstr ) 
}

Both above functions identify only 'hello ' as the longest common substring, instead of 'hello r' (regardless of which argument is the longer of the two):

identical('hello',
    lcsbstr_no_lib('hello', 'hello there'), 
    lcsbstr(       'hello', 'hello there'),
    lcsbstr_no_lib('hello there', 'hello'), 
    lcsbstr(       'hello there', 'hello'))

LAST EDIT Note some odd behavior with this result:

lcsbstr('hello world', 'hello')
#[1] 'hell'

I was expecting 'hello', but since the transformation actually moves (via deletion) the "o" in world to become the "o" in hello -- only the hell part is considered a match according to the M:

drop(attr(adist('hello world', 'hello', counts=TRUE), "trafos"))
#[1] "MMMMDDDMDDD"
#[1]  vvvv   v
#[1] "hello world"

This behavior is observed using this Levenstein tool -- it gives two possible solutions, equivalent to these two transformations

#[1] "MMMMDDDMDDD"
#[1] "MMMMMDDDDDD"

I don't know if we can configure adist to prefer one solution over another? (the transformations have the same "weight" -- the same number of "M" and "D"'s -- don't know how to prefer the transformations with the greater number of consecutive M)

Finally, don't forget adist allows you to pass in ignore.case = TRUE (FALSE is the default)

Key to the "trafos" property of adist; the "transformations" to get from one string to another:

the transformation sequences are returned as the "trafos" attribute of the return value, as character strings with elements M, I, D and S indicating a match, insertion, deletion and substitution

Solution 3:

I'm not sure what you did to get your output of "hello". Based on trial-and-error experiments below, it appears that the LCS function will (a) not regard a string as an LCS if a character follows what would otherwise be an LCS; (b) find multiple, equally-long LCS's (unlike sub() that finds just the first); (c) the order of the elements in the strings doesn't matter -- which has no illustration below; and (b) the order of the string in the LCS call doesn't matter -- also not shown.

So, your "hello" of a had no LCS in b since the "hel" of b was followed by a character. Well, that's my current hypothesis.

Point A above:

a= c("hello", "hel", "abcd")
b= c("hello123l5678o", "abcd") 
print(LCS(a, b)[4]) # "abcd" - perhaps because it has nothing afterwards, unlike hello123...

a= c("hello", "hel", "abcd1") # added 1 to abcd
b= c("hello123l5678o", "abcd") 
print(LCS(a, b)[4]) # no LCS!, as if anything beyond an otherwise LCS invalidates it

a= c("hello", "hel", "abcd") 
b= c("hello1", "abcd") # added 1 to hello
print(LCS(a, b)[4]) # abcd only, since the b hello1 has a character

Point B above:

a= c("hello", "hel", "abcd") 
b= c("hello", "abcd") 
print(LCS(a, b)[4]) # found both, so not like sub vs gsub of finding first or all

Solution 4:

df <- data.frame(A. = c("Australia", "Network"),
                 B. = c("Austria", "Netconnect"), stringsAsFactors = FALSE)

 auxFun <- function(x) {

   a <- strsplit(x[[1]], "")[[1]]
   b  <- strsplit(x[[2]], "")[[1]]
   lastchar <- suppressWarnings(which(!(a == b)))[1] - 1

   if(lastchar > 0){
     out <- paste0(a[1:lastchar], collapse = "")
   } else {
     out <- ""
   }

   return(out)
 }

 df$C. <- apply(df, 1, auxFun)

 df
 A.         B.    C.
 1 Australia    Austria Austr
 2   Network Netconnect   Net

longest common substring in R finding non-contiguous matches between the two strings

Solution 1:

Solution 2:

Solution 3:

Solution 4:

Related

Recent Posts