Why is True returned when checking if an empty string is in another?

My limited brain cannot understand why this happens:

>>> print '' in 'lolsome'
True

In PHP, an equivalent comparison returns false:

var_dump(strpos('', 'lolsome'));

From the documentation:

For the Unicode and string types, x in y is true if and only if x is a substring of y. An equivalent test is y.find(x) != -1. Note, x and y need not be the same type; consequently, u'ab' in 'abc' will return True. Empty strings are always considered to be a substring of any other string, so "" in "abc" will return True.

From looking at your print call, you're using 2.x.
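The equivalence stated in the quoted documentation (x in y versus y.find(x) != -1) can be checked directly. This is only a minimal sketch, written for Python 3 rather than 2.x, and the helper name contains_via_find is made up for illustration:

```python
def contains_via_find(needle, haystack):
    """Mirror the documented equivalence: x in y  <=>  y.find(x) != -1."""
    return haystack.find(needle) != -1

# (needle, haystack) pairs, including the empty-needle cases from the question
cases = [("", "lolsome"), ("lol", "lolsome"), ("xyz", "lolsome"), ("", "")]

# Each entry pairs the result of `in` with the result of the find-based test
results = [(n in h, contains_via_find(n, h)) for n, h in cases]

for (needle, haystack), (via_in, via_find) in zip(cases, results):
    print(repr(needle), "in", repr(haystack), "->", via_in, via_find)
```

Both columns agree for every pair, because find reports the empty string at index 0 instead of returning -1.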

To go deeper, look at the bytecode:

>>> import dis
>>> def answer():
...   '' in 'lolsome'

>>> dis.dis(answer)
  2           0 LOAD_CONST               1 ('')
              3 LOAD_CONST               2 ('lolsome')
              6 COMPARE_OP               6 (in)
              9 POP_TOP
             10 LOAD_CONST               0 (None)
             13 RETURN_VALUE

COMPARE_OP is where the boolean operation happens, and the source code for in (in ceval.c) shows where the comparison takes place:

    TARGET(COMPARE_OP)
    {
        w = POP();
        v = TOP();
        if (PyInt_CheckExact(w) && PyInt_CheckExact(v)) {
            /* INLINE: cmp(int, int) */
            register long a, b;
            register int res;
            a = PyInt_AS_LONG(v);
            b = PyInt_AS_LONG(w);
            switch (oparg) {
            case PyCmp_LT: res = a <  b; break;
            case PyCmp_LE: res = a <= b; break;
            case PyCmp_EQ: res = a == b; break;
            case PyCmp_NE: res = a != b; break;
            case PyCmp_GT: res = a >  b; break;
            case PyCmp_GE: res = a >= b; break;
            case PyCmp_IS: res = v == w; break;
            case PyCmp_IS_NOT: res = v != w; break;
            default: goto slow_compare;
            }
            x = res ? Py_True : Py_False;
            Py_INCREF(x);
        }
        else {
          slow_compare:
            x = cmp_outcome(oparg, v, w);
        }
        Py_DECREF(v);
        Py_DECREF(w);
        SET_TOP(x);
        if (x == NULL) break;
        PREDICT(POP_JUMP_IF_FALSE);
        PREDICT(POP_JUMP_IF_TRUE);
        DISPATCH();
    }

cmp_outcome is in the same file, so it's easy to find our next clue:

res = PySequence_Contains(w, v);

which is in abstract.c:

int
PySequence_Contains(PyObject *seq, PyObject *ob)
{
    Py_ssize_t result;
    if (PyType_HasFeature(seq->ob_type, Py_TPFLAGS_HAVE_SEQUENCE_IN)) {
        PySequenceMethods *sqm = seq->ob_type->tp_as_sequence;
        if (sqm != NULL && sqm->sq_contains != NULL)
            return (*sqm->sq_contains)(seq, ob);
    }
    result = _PySequence_IterSearch(seq, ob, PY_ITERSEARCH_CONTAINS);
    return Py_SAFE_DOWNCAST(result, Py_ssize_t, int);
}

and to come up for air from the source, we find this next function in the documentation:

objobjproc PySequenceMethods.sq_contains
This function may be used by PySequence_Contains() and has the same signature. This slot may be left to NULL, in this case PySequence_Contains() simply traverses the sequence until it finds a match.

and further down in the same documentation:

int PySequence_Contains(PyObject *o, PyObject *value)
Determine if o contains value. If an item in o is equal to value, return 1, otherwise return 0. On error, return -1. This is equivalent to the Python expression value in o.

Since '' is not null, the sequence 'lolsome' can be thought of as containing it.
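The dispatch described in the documentation above can be observed from Python itself. The sketch below is only an illustration in Python 3; the classes WithContains and IterOnly are hypothetical. A class that defines __contains__ (the Python-level counterpart of sq_contains) answers in directly, while a class that only supports iteration gets the traverse-and-compare fallback:

```python
class WithContains:
    """Defines __contains__, so `in` calls it directly."""
    def __contains__(self, item):
        return item == ""  # arbitrary rule, purely for demonstration

class IterOnly:
    """No __contains__: `in` falls back to iterating and comparing items."""
    def __init__(self, items):
        self._items = items

    def __iter__(self):
        return iter(self._items)

uses_slot = "" in WithContains()            # answered by __contains__
uses_iteration = "" in IterOnly("lolsome")  # '' is compared with 'l', 'o', ...
str_result = "" in "lolsome"                # str supplies its own substring test

print(uses_slot, uses_iteration, str_result)
```

The middle result is False because no single character equals ''. str gives True because its own containment test is a substring search, not an element-by-element comparison.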

Quoting from PHP's strpos documentation:

mixed strpos ( string $haystack , mixed $needle [, int $offset = 0 ] )
Find the numeric position of the first occurrence of needle in the haystack string.

So what you actually tried is similar to the Python construct below:

>>> print 'lolsome' in ''
False

So, to get the corresponding comparison in PHP, you should have written:

var_dump(strpos('lolsome', ''));

Even then it issues a warning and returns false.

PHP Warning: strpos(): Empty needle in /home/thefourtheye/Desktop/Test.php on line 3

bool(false)

I dug deeper and found the source code corresponding to the strpos function,

    if (!Z_STRLEN_P(needle)) {
        php_error_docref(NULL, E_WARNING, "Empty needle");
        RETURN_FALSE;
    }

They treat searching for an empty string as a problematic case, so they issue a warning and return false. Apart from this, I couldn't find any document explaining why it is considered a problem.
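To make the contrast with Python concrete, here is a rough Python 3 imitation of the check quoted above. It is a sketch under assumptions, not PHP's actual implementation: the function name legacy_strpos is made up, and warnings.warn stands in for php_error_docref.

```python
import warnings

def legacy_strpos(haystack, needle, offset=0):
    """Hypothetical imitation of the quoted empty-needle check."""
    if len(needle) == 0:
        # Same early exit as the quoted C snippet: warn, then report failure
        warnings.warn("Empty needle")
        return False
    pos = haystack.find(needle, offset)
    return False if pos == -1 else pos

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    empty_needle = legacy_strpos("lolsome", "")   # warns and gives False

swapped_arguments = legacy_strpos("", "lolsome")  # needle longer than haystack
python_answer = "lolsome".find("")                # Python reports index 0

print(empty_needle, len(caught), swapped_arguments, python_answer)
```

The only difference from str.find is the early exit for a zero-length needle. Without it, the search would succeed at position 0.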

As far as Python is concerned, this behaviour is well defined in the Comparisons section,

Empty strings are always considered to be a substring of any other string, so "" in "abc" will return True.

Basically, from math:

The empty set is a subset of every set

The same logic works here. You can think of '' as the empty set; it is therefore a subset of every string, viewed as a set of characters, since both are of the same type.

>>> a = ""
>>> b = "Python"
>>> a in b
True
>>> set(a).issubset(b)
True
>>> a = set() #empty set
>>> b = set([1,2,3])
>>> a.issubset(b)
True
>>>

But be careful! A subset and a membership are different things.
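A short Python 3 sketch of that difference (the variable names are only illustrative): the empty set is a subset of {1, 2, 3} but not one of its elements, and the empty string is a substring of "Python" but not one of its characters.

```python
empty = frozenset()
numbers = {1, 2, 3}

is_subset = empty <= numbers   # subset test: every element of empty is in numbers
is_member = empty in numbers   # membership test: is the empty set itself an element?

# For str, `in` is a substring test, which is closer to the subset relation
is_substring = "" in "Python"
# Membership in a list of characters compares whole elements instead
is_element = "" in list("Python")

print(is_subset, is_member, is_substring, is_element)
```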


The empty string is the unique string of length zero.
The empty string is the identity element of the concatenation operation.
The empty string precedes any other string under lexicographical order, because it is the shortest of all strings.
The empty string is a legitimate string, upon which most string operations should work.
Wikipedia

 > strlen("");
=> 0
 > "a" . "" == "a";
=> true
 > "" . "a" == "a";
=> true   
 > "" < "\0";
=> true

From above, it seems PHP treats the empty string as a valid string.

> strstr("lolsome", "");
strstr(): Empty needle :1

But it doesn't seem to consider the empty string a fully legitimate one. Most probably PHP is the only language that doesn't allow the substring being searched for to be an empty string.

Is it a defensive mechanism? Obviously, programmers then don't have to guard the needle with an if. If so, why do other languages allow this test to pass? The language designers will have to answer that.

What's a Python string made up of?

>>> ''.count('')
1

Obviously, the empty string contains one empty string.

>>> 'a'.count('')
2

A one-element string contains two empty strings.

>>> 'ab'.count('')
3

So it seems a Python string is a concatenation of one-element strings, and each element is sandwiched between two empty strings.
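One way to picture the counts above is that a zero-length slice exists at every index from 0 to len(s), which gives len(s) + 1 places where the empty string matches. A minimal Python 3 sketch, with a hypothetical helper name:

```python
def empty_match_positions(s):
    """Indexes i where the zero-length slice s[i:i] equals ''."""
    return [i for i in range(len(s) + 1) if s[i:i] == ""]

for s in ("", "a", "ab"):
    positions = empty_match_positions(s)
    # count('') agrees with the number of zero-length slices
    print(repr(s), positions, s.count(""))
```

For 'ab' the positions are before 'a', between 'a' and 'b', and after 'b'.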

>>> "lolsome".split('')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: empty separator

But here Python contradicts the validity of the empty string. Is it a bug?
Ruby and JavaScript pass the test here.

 > "lolsome".split("")
=> ["l", "o", "l", "s", "o", "m", "e"]
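For comparison, this is how the same result is usually obtained in Python. It is a small Python 3 sketch: str.split rejects an empty separator, while list() yields the individual characters that the Ruby call above produces.

```python
word = "lolsome"

try:
    word.split("")
except ValueError as exc:
    split_error = str(exc)  # message of the error shown in the traceback above

chars = list(word)          # one-character strings, in order

print(split_error)
print(chars)
```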

I've compiled examples in several languages from Rosetta Code. It's interesting to note that they all allow the empty string in a substring search and return true.

AWK

awk 'BEGIN { print index("lolsome", "") != 0 }'

C

#include <stdio.h>
#include <string.h>

int main() {
    printf("%d\n", strstr("lolsome", "") != NULL);
    return 0;
}

C++

#include <iostream>
#include <string>

int main() {
    std::string s = "lolsome";
    std::cout << (s.find("") != std::string::npos) << "\n";
    return 0;
}

C#

using System;
class MainClass {
  public static void Main (string[] args) {
    string s = "lolsome";
    Console.WriteLine(s.IndexOf("", 0, s.Length) != -1);
  }
}

Clojure

(println (.indexOf "lolsome" ""))

Go

package main

import (
    "fmt"
    "strings"
)
func main() {
    fmt.Println(strings.Index("lolsome", "") != -1)
}

Groovy

println 'lolsome'.indexOf('')

returns 0, on error returns -1

Java

class Main {
  public static void main(String[] args) {
    System.out.println("lolsome".indexOf("") != -1);
  }
}

JavaScript

"lolsome".indexOf("") != -1

Lua

s = "lolsome"
print(s:find "" ~= nil)

Perl

print index("lolsome", "") != -1;

Python

"lolsome".find("") != -1

Ruby

"lolsome".index("") != nil
