Ignore case in Python strings [duplicate]
What is the easiest way to compare strings in Python, ignoring case?
Of course one can do (str1.lower() <= str2.lower()), etc., but this created two additional temporary strings (with the obvious alloc/g-c overheads).
I guess I'm looking for an equivalent to C's stricmp().
[Some more context requested, so I'll demonstrate with a trivial example:]
Suppose you want to sort a looong list of strings. You simply do theList.sort(). This is O(n * log(n)) string comparisons and no memory management (since all strings and list elements are some sort of smart pointers). You are happy.
Now, you want to do the same, but ignore the case (let's simplify and say all strings are ascii, so locale issues can be ignored). You can do theList.sort(key=lambda s: s.lower()), but then you cause two new allocations per comparison, plus burden the garbage-collector with the duplicated (lowered) strings. Each such memory-management noise is orders-of-magnitude slower than simple string comparison.
Now, with an in-place stricmp()-like function, you do: theList.sort(cmp=stricmp) and it is as fast and as memory-friendly as theList.sort(). You are happy again.
The problem is any Python-based case-insensitive comparison involves implicit string duplications, so I was expecting to find a C-based comparisons (maybe in module string).
Could not find anything like that, hence the question here. (Hope this clarifies the question).
Solution 1:
Here is a benchmark showing that using str.lower
is faster than the accepted answer's proposed method (libc.strcasecmp
):
#!/usr/bin/env python2.7
import random
import timeit
from ctypes import *
libc = CDLL('libc.dylib') # change to 'libc.so.6' on linux
with open('/usr/share/dict/words', 'r') as wordlist:
words = wordlist.read().splitlines()
random.shuffle(words)
print '%i words in list' % len(words)
setup = 'from __main__ import words, libc; gc.enable()'
stmts = [
('simple sort', 'sorted(words)'),
('sort with key=str.lower', 'sorted(words, key=str.lower)'),
('sort with cmp=libc.strcasecmp', 'sorted(words, cmp=libc.strcasecmp)'),
]
for (comment, stmt) in stmts:
t = timeit.Timer(stmt=stmt, setup=setup)
print '%s: %.2f msec/pass' % (comment, (1000*t.timeit(10)/10))
typical times on my machine:
235886 words in list
simple sort: 483.59 msec/pass
sort with key=str.lower: 1064.70 msec/pass
sort with cmp=libc.strcasecmp: 5487.86 msec/pass
So, the version with str.lower
is not only the fastest by far, but also the most portable and pythonic of all the proposed solutions here.
I have not profiled memory usage, but the original poster has still not given a compelling reason to worry about it. Also, who says that a call into the libc module doesn't duplicate any strings?
NB: The lower()
string method also has the advantage of being locale-dependent. Something you will probably not be getting right when writing your own "optimised" solution. Even so, due to bugs and missing features in Python, this kind of comparison may give you wrong results in a unicode context.
Solution 2:
Your question implies that you don't need Unicode. Try the following code snippet; if it works for you, you're done:
Python 2.5.2 (r252:60911, Aug 22 2008, 02:34:17)
[GCC 4.3.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.setlocale(locale.LC_COLLATE, "en_US")
'en_US'
>>> sorted("ABCabc", key=locale.strxfrm)
['a', 'A', 'b', 'B', 'c', 'C']
>>> sorted("ABCabc", cmp=locale.strcoll)
['a', 'A', 'b', 'B', 'c', 'C']
Clarification: in case it is not obvious at first sight, locale.strcoll seems to be the function you need, avoiding the str.lower or locale.strxfrm "duplicate" strings.
Solution 3:
Are you using this compare in a very-frequently-executed path of a highly-performance-sensitive application? Alternatively, are you running this on strings which are megabytes in size? If not, then you shouldn't worry about the performance and just use the .lower() method.
The following code demonstrates that doing a case-insensitive compare by calling .lower() on two strings which are each almost a megabyte in size takes about 0.009 seconds on my 1.8GHz desktop computer:
from timeit import Timer
s1 = "1234567890" * 100000 + "a"
s2 = "1234567890" * 100000 + "B"
code = "s1.lower() < s2.lower()"
time = Timer(code, "from __main__ import s1, s2").timeit(1000)
print time / 1000 # 0.00920499992371 on my machine
If indeed this is an extremely significant, performance-critical section of code, then I recommend writing a function in C and calling it from your Python code, since that will allow you to do a truly efficient case-insensitive search. Details on writing C extension modules can be found here: https://docs.python.org/extending/extending.html
Solution 4:
I can't find any other built-in way of doing case-insensitive comparison: The python cook-book recipe uses lower().
However you have to be careful when using lower for comparisons because of the Turkish I problem. Unfortunately Python's handling for Turkish Is is not good. ı is converted to I, but I is not converted to ı. İ is converted to i, but i is not converted to İ.
Solution 5:
There's no built in equivalent to that function you want.
You can write your own function that converts to .lower() each character at a time to avoid duplicating both strings, but I'm sure it will very cpu-intensive and extremely inefficient.
Unless you are working with extremely long strings (so long that can cause a memory problem if duplicated) then I would keep it simple and use
str1.lower() == str2.lower()
You'll be ok