Longest common prefix of two strings in bash
I have two strings. For the sake of the example they are set like this:
string1="test toast"
string2="test test"
What I want is to find the overlap starting at the beginning of the strings. With overlap I mean the string "test t" in my above example.
# So I look for the command
command "$string1" "$string2"
# that outputs:
"test t"
If the strings were string1="atest toast"; string2="test test"
they would have no overlap since the check starts from the beginning and the "a" at the start of string1
.
In sed, assuming the strings don't contain any newline characters:
string1="test toast"
string2="test test"
printf "%s\n%s\n" "$string1" "$string2" | sed -e 'N;s/^\(.*\).*\n\1.*$/\1/'
An improved version of the sed example, this finds the common prefix of N strings (N>=0):
string1="test toast"
string2="test test"
string3="teaser"
{ echo "$string1"; echo "$string2"; echo "$string3"; } | sed -e 'N;s/^\(.*\).*\n\1.*$/\1\n\1/;D'
If the strings are stored in an array, they can be piped to sed with printf:
strings=("test toast" "test test" "teaser")
printf "%s\n" "${strings[@]}" | sed -e '$!{N;s/^\(.*\).*\n\1.*$/\1\n\1/;D;}'
You can also use a here-string:
strings=("test toast" "test test" "teaser")
oIFS=$IFS
IFS=$'\n'
<<<"${strings[*]}" sed -e '$!{N;s/^\(.*\).*\n\1.*$/\1\n\1/;D;}'
IFS=$oIFS
# for a local IFS:
(IFS=$'\n'; sed -e '$!{N;s/^\(.*\).*\n\1.*$/\1\n\1/;D;}' <<<"${strings[*]}")
The here-string (as with all redirections) can go anywhere within a simple command.
Yet another variant, using GNU grep:
$ string1="test toast"
$ string2="test test"
$ grep -zPo '(.*).*\n\K\1' <<< "$string1"$'\n'"$string2"
test t
This can be done entirely inside bash. Although doing string manipulation in a loop in bash is slow, there is a simple algorithm that is logarithmic in the number of shell operations, so pure bash is a viable option even for long strings.
longest_common_prefix () {
local prefix= n
## Truncate the two strings to the minimum of their lengths
if [[ ${#1} -gt ${#2} ]]; then
set -- "${1:0:${#2}}" "$2"
else
set -- "$1" "${2:0:${#1}}"
fi
## Binary search for the first differing character, accumulating the common prefix
while [[ ${#1} -gt 1 ]]; do
n=$(((${#1}+1)/2))
if [[ ${1:0:$n} == ${2:0:$n} ]]; then
prefix=$prefix${1:0:$n}
set -- "${1:$n}" "${2:$n}"
else
set -- "${1:0:$n}" "${2:0:$n}"
fi
done
## Add the one remaining character, if common
if [[ $1 = $2 ]]; then prefix=$prefix$1; fi
printf %s "$prefix"
}
The standard toolbox includes cmp
to compare binary files. By default, it indicates the byte offset of the first differing bytes. There is a special case when one string is a prefix of the other: cmp
produces a different message on STDERR; an easy way to deal with this is to take whichever string is the shortest.
longest_common_prefix () {
local LC_ALL=C offset prefix
offset=$(export LC_ALL; cmp <(printf %s "$1") <(printf %s "$2") 2>/dev/null)
if [[ -n $offset ]]; then
offset=${offset%,*}; offset=${offset##* }
prefix=${1:0:$((offset-1))}
else
if [[ ${#1} -lt ${#2} ]]; then
prefix=$1
else
prefix=$2
fi
fi
printf %s "$prefix"
}
Note that cmp
operates on bytes, but bash's string manipulation operates on characters. This makes a difference in multibyte locales, for examples locales using the UTF-8 character set. The function above prints the longest prefix of a byte string. To handle character strings with this method, we can first convert the strings to a fixed-width encoding. Assuming the locale's character set is a subset of Unicode, UTF-32 fits the bill.
longest_common_prefix () {
local offset prefix LC_CTYPE="${LC_ALL:=LC_CTYPE}"
offset=$(unset LC_ALL; LC_MESSAGES=C cmp <(printf %s "$1" | iconv -t UTF-32)
<(printf %s "$2" | iconv -t UTF-32) 2>/dev/null)
if [[ -n $offset ]]; then
offset=${offset%,*}; offset=${offset##* }
prefix=${1:0:$((offset/4-1))}
else
if [[ ${#1} -lt ${#2} ]]; then
prefix=$1
else
prefix=$2
fi
fi
printf %s "$prefix"
}
Grep short variant (idea borrowed from sed one):
$ echo -e "String1\nString2" | grep -zoP '^(.*)(?=.*?\n\1)'
String
Assumes string have no new line character. But easy may be tuned to use any delimiter.
Update at 2016-10-24: On modern versions of grep you may receive complain grep: unescaped ^ or $ not supported with -Pz
, just use \A
instead of ^
:
$ echo -e "String1\nString2" | grep -zoP '\A(.*)(?=.*?\n\1)'
String