Why is True returned when checking if an empty string is in another?
My limited brain cannot understand why this happens:
>>> print '' in 'lolsome'
True
In PHP, an equivalent comparison returns false:
var_dump(strpos('', 'lolsome'));
From the documentation:
For the Unicode and string types, x in y is true if and only if x is a substring of y. An equivalent test is y.find(x) != -1. Note, x and y need not be the same type; consequently, u'ab' in 'abc' will return True. Empty strings are always considered to be a substring of any other string, so "" in "abc" will return True.
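To see the equivalence from the quote in action, here is a small sketch. It uses Python 3 syntax (the question is on 2.x, but the behaviour is the same as far as I know), and the variable names are just for illustration.

```python
# Python 3 syntax; variable names are illustrative only.
haystack = "lolsome"

for needle in ["", "lol", "some", "xyz"]:
    via_in = needle in haystack
    via_find = haystack.find(needle) != -1
    # The two tests should agree for every needle, including the empty one.
    print(repr(needle), via_in, via_find)

# find() locates the empty string at index 0, so the result is not -1.
empty_index = haystack.find("")
print(empty_index)
```

The empty needle is "found" at the very start of the haystack, which is why the membership test succeeds.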
From looking at your print call, you're using 2.x.
To go deeper, look at the bytecode:
>>> import dis
>>> def answer():
...     '' in 'lolsome'
...
>>> dis.dis(answer)
  2           0 LOAD_CONST               1 ('')
              3 LOAD_CONST               2 ('lolsome')
              6 COMPARE_OP               6 (in)
              9 POP_TOP
             10 LOAD_CONST               0 (None)
             13 RETURN_VALUE
COMPARE_OP is where we are doing our boolean operation, and looking at the source code for in reveals where the comparison happens:
TARGET(COMPARE_OP)
{
    w = POP();
    v = TOP();
    if (PyInt_CheckExact(w) && PyInt_CheckExact(v)) {
        /* INLINE: cmp(int, int) */
        register long a, b;
        register int res;
        a = PyInt_AS_LONG(v);
        b = PyInt_AS_LONG(w);
        switch (oparg) {
        case PyCmp_LT: res = a < b; break;
        case PyCmp_LE: res = a <= b; break;
        case PyCmp_EQ: res = a == b; break;
        case PyCmp_NE: res = a != b; break;
        case PyCmp_GT: res = a > b; break;
        case PyCmp_GE: res = a >= b; break;
        case PyCmp_IS: res = v == w; break;
        case PyCmp_IS_NOT: res = v != w; break;
        default: goto slow_compare;
        }
        x = res ? Py_True : Py_False;
        Py_INCREF(x);
    }
    else {
      slow_compare:
        x = cmp_outcome(oparg, v, w);
    }
    Py_DECREF(v);
    Py_DECREF(w);
    SET_TOP(x);
    if (x == NULL) break;
    PREDICT(POP_JUMP_IF_FALSE);
    PREDICT(POP_JUMP_IF_TRUE);
    DISPATCH();
}
and since cmp_outcome is in the same file, it's easy to find our next clue:
res = PySequence_Contains(w, v);
which is in abstract.c:
{
    Py_ssize_t result;
    if (PyType_HasFeature(seq->ob_type, Py_TPFLAGS_HAVE_SEQUENCE_IN)) {
        PySequenceMethods *sqm = seq->ob_type->tp_as_sequence;
        if (sqm != NULL && sqm->sq_contains != NULL)
            return (*sqm->sq_contains)(seq, ob);
    }
    result = _PySequence_IterSearch(seq, ob, PY_ITERSEARCH_CONTAINS);
    return Py_SAFE_DOWNCAST(result, Py_ssize_t, int);
}
and to come up for air from the source, we find this next function in the documentation:
objobjproc PySequenceMethods.sq_contains
This function may be used by PySequence_Contains() and has the same signature. This slot may be left to NULL, in this case PySequence_Contains() simply traverses the sequence until it finds a match.
and further down in the same documentation:
int PySequence_Contains(PyObject *o, PyObject *value)
Determine if o contains value. If an item in o is equal to value, return 1, otherwise return 0. On error, return -1. This is equivalent to the Python expression value in o.
Since '' isn't null, the sequence 'lolsome' can be thought to contain it.
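The two paths in PySequence_Contains have rough Python-level counterparts: a type that defines __contains__ (the analogue of the sq_contains slot) and a type that only supports iteration. The sketch below uses two made-up classes, WithContains and IterOnly, to show both paths in Python 3 syntax. It also suggests that the '' in 'lolsome' result comes from the string type's own containment check rather than from the element-by-element fallback.

```python
class WithContains:
    """Hypothetical type defining __contains__, the analogue of sq_contains."""

    def __init__(self, items):
        self.items = list(items)

    def __contains__(self, value):
        return value in self.items


class IterOnly:
    """Hypothetical type without __contains__; `in` falls back to iteration."""

    def __init__(self, items):
        self.items = list(items)
        self.visited = []  # records which items the fallback search looked at

    def __iter__(self):
        for item in self.items:
            self.visited.append(item)
            yield item


found_direct = "b" in WithContains("abc")

fallback = IterOnly("abc")
found_fallback = "b" in fallback  # walks the items until one equals "b"

# The element-wise fallback never finds '', because no single character
# equals the empty string. str answers True through its own substring search.
empty_in_chars = "" in IterOnly("lolsome")
empty_in_str = "" in "lolsome"
```

After running this, fallback.visited holds only the items examined before the match, which matches the "traverses the sequence until it finds a match" wording in the documentation.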
Quoting from PHP's strpos documentation,
mixed strpos ( string $haystack , mixed $needle [, int $offset = 0 ] )
Find the numeric position of the first occurrence of needle in the haystack string.
So what you have actually tried is similar to the Python construct seen below
>>> print 'lolsome' in ''
False
So, to get the corresponding comparison in PHP, you should actually have written
var_dump(strpos('lolsome', ''));
Even then it issues a warning and returns false.
PHP Warning: strpos(): Empty needle in /home/thefourtheye/Desktop/Test.php on line 3
bool(false)
I dug deeper and found the source code corresponding to the strpos function,
if (!Z_STRLEN_P(needle)) {
    php_error_docref(NULL, E_WARNING, "Empty needle");
    RETURN_FALSE;
}
They treat searching for the empty string as a problematic case, so they issue a warning and return false. Apart from this, I couldn't find any document discussing why it is considered a problem.
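To make the argument order and the empty-needle guard concrete, here is a rough Python 3 imitation of the behaviour described above. The function php_style_strpos is a made-up helper, not anything from PHP or Python, and it only mirrors the branch quoted from the source (the warning is left out).

```python
def php_style_strpos(haystack, needle):
    """Hypothetical helper imitating the strpos() behaviour quoted above."""
    if len(needle) == 0:
        # Mirrors the "Empty needle" branch: refuse to search at all.
        return False
    index = haystack.find(needle)
    # Return the position when found, False otherwise.
    return index if index != -1 else False


# The call from the question: haystack is '', needle is 'lolsome'.
swapped = php_style_strpos("", "lolsome")

# The intended call: haystack is 'lolsome', needle is ''.
intended = php_style_strpos("lolsome", "")

# A non-empty needle behaves like an ordinary search.
position = php_style_strpos("lolsome", "some")
```

Both swapped and intended come out as False here, but for different reasons: the first because 'lolsome' cannot occur in an empty haystack, the second because of the explicit guard.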
As far as Python is concerned, this behaviour is well defined in the Comparisons section,
Empty strings are always considered to be a substring of any other string, so "" in "abc" will return True.
Basically, from math:
The empty set is a subset of every set
The same logic works here. You can consider '' an empty set, and therefore a subset of every other string, since both are of the same type.
>>> a = ""
>>> b = "Python"
>>> a in b
True
>>> set(a).issubset(b)
True
>>> a = set() #empty set
>>> b = set([1,2,3])
>>> a.issubset(b)
True
>>>
But be careful! A subset and a membership are different things.
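A quick sketch of where the analogy breaks down, in Python 3 syntax with example values of my own: the in operator on strings asks for a contiguous substring, while issubset only asks whether each character occurs somewhere.

```python
a, b = "ac", "abc"

# Substring test: the characters must appear together and in order.
is_substring = a in b

# Subset test: each character only has to occur somewhere in b.
is_subset = set(a).issubset(b)

# Order matters for substrings but not for sets.
reversed_is_substring = "ba" in b
reversed_is_subset = set("ba").issubset(b)
```

Here is_subset is True while is_substring is False, so the set picture is an intuition for the empty case rather than a definition of how in works on strings.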
The empty string is the unique string of length zero.
The empty string is the identity element of the concatenation operation.
The empty string precedes any other string under lexicographical order, because it is the shortest of all strings.
The empty string is a legitimate string, upon which most string operations should work.
Wikipedia
> strlen("");
=> 0
> "a" . "" == "a";
=> true
> "" . "a" == "a";
=> true
> "" < "\0";
=> true
From above, it seems PHP treats the empty string as a valid string.
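For comparison, the same four properties can be checked in Python. This is only a sketch in Python 3 syntax, with variable names chosen for illustration.

```python
empty = ""

length = len(empty)                  # the unique string of length zero
left_identity = empty + "a" == "a"   # identity element for concatenation
right_identity = "a" + empty == "a"
sorts_first = empty < "\0"           # precedes even the NUL character

print(length, left_identity, right_identity, sorts_first)
```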
> strstr("lolsome", "");
strstr(): Empty needle :1
But PHP doesn't seem to consider the empty string a fully legitimate one. Most probably PHP is the only language which doesn't allow an empty string as the substring to search for.
Is it a defensive mechanism? Obviously, programmers don't have to protect the needle with an if. If so, why do other languages allow this test to pass? Language designers have to answer.
What's a Python string made up of?
>>> ''.count('')
1
Obviously, the empty string contains one empty string.
>>> 'a'.count('')
2
A one-element string contains two empty strings.
>>> 'ab'.count('')
3
So it seems a Python string is a concatenation of one-element strings, where each element is sandwiched between two empty strings.
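One way to check the "sandwiched" picture is to list every index at which the empty string is found. The helper empty_match_positions below is a made-up name for this sketch (Python 3 syntax); it just calls str.find repeatedly with an increasing start offset.

```python
def empty_match_positions(s):
    """Hypothetical helper: indices at which s.find('') reports a match."""
    positions = []
    index = s.find("", 0)
    while index != -1:
        positions.append(index)
        # Resume one position further; find() returns -1 past the end.
        index = s.find("", index + 1)
    return positions


for text in ["", "a", "ab"]:
    positions = empty_match_positions(text)
    # The number of positions should equal text.count('').
    print(repr(text), positions, text.count(""))
```

For a string of length n this gives n + 1 positions: one before each character and one after the last.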
>>> "lolsome".split('')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: empty separator
But here Python contradicts the validity of the empty string. Is it a bug?
Ruby and JavaScript pass the test here.
> "lolsome".split("")
=> ["l", "o", "l", "s", "o", "m", "e"]
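In Python the usual way to get the same per-character result is list(), not split(). A minimal sketch in Python 3 syntax, with the error case included for contrast:

```python
s = "lolsome"

# list() iterates over the string and yields one-character strings.
chars = list(s)
print(chars)

# split('') is rejected rather than treated as "split between characters".
split_rejected = False
try:
    s.split("")
except ValueError:
    split_rejected = True
print(split_rejected)
```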
I've compiled several language examples from Rosetta Code. It's interesting to note that they all allow the empty string in a substring search and return true.
AWK
awk 'BEGIN { print index("lolsome", "") != 0 }'
C
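#include <stdio.h>
#include <string.h>
int main() {
    /* strstr returns the haystack itself for an empty needle */
    printf("%d\n", strstr("lolsome", "") != NULL);
    return 0;
}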
C++
#include <iostream>
#include <string>
int main() {
    std::string s = "lolsome";
    // compare against npos, the documented "not found" value
    std::cout << (s.find("") != std::string::npos) << "\n";
    return 0;
}
C#
using System;
class MainClass {
    public static void Main (string[] args) {
        string s = "lolsome";
        Console.WriteLine(s.IndexOf("", 0, s.Length) != -1);
    }
}
Clojure
(println (.indexOf "lolsome" ""))
Go
package main
import (
    "fmt"
    "strings"
)
func main() {
    fmt.Println(strings.Index("lolsome", "") != -1)
}
Groovy
println 'lolsome'.indexOf('')
returns 0, on error returns -1
Java
class Main {
    public static void main(String[] args) {
        System.out.println("lolsome".indexOf("") != -1);
    }
}
JavaScript
"lolsome".indexOf("") != -1
Lua
s = "lolsome"
print(s:find "" ~= nil)
Perl
print index("lolsome", "") != -1;
Python
"lolsome".find("") != -1
Ruby
"lolsome".index("") != nil