Sorting list of string with specific locale in python

Solution 1:

You could use a PyICU's collator to avoid changing global settings:

import icu # PyICU

def sorted_strings(strings, locale=None):
    if locale is None:
       return sorted(strings)
    collator = icu.Collator.createInstance(icu.Locale(locale))
    return sorted(strings, key=collator.getSortKey)

Example:

>>> L = [u'sandwiches', u'angel delight', u'custard', u'éclairs', u'glühwein']
>>> sorted_strings(L)
['angel delight', 'custard', 'glühwein', 'sandwiches', 'éclairs']
>>> sorted_strings(L, 'en_US')
['angel delight', 'custard', 'éclairs', 'glühwein', 'sandwiches']

Disadvantage: dependency on PyICU library; the behavior is slightly different from locale.strcoll.


I don't know how to get locale.strxfrm function given a locale name without changing it globally. As a hack you could run your function in a different child process:

pool = multiprocessing.Pool()
# ...
pool.apply(locale_aware_sort, [strings, loc])

Disadvantage: might be slow, resource hungry


Using ordinary threading.Lock won't work unless you can control every place where locale aware functions (they are not limited to locale module e.g., re) could be called from multiple threads.


You could compile your function using Cython to synchronize access using GIL. GIL will make sure that no other Python code can be executed while your function is running.

Disadvantage: not pure Python

Solution 2:

The ctypes solution is fine, but if anyone in the future would like just to modify your original solution, here is a way how to do so:

Temporary changes of global settings can safely be accomplished with a context manager.

from contextlib import contextmanager
import locale

@contextmanager
def changedlocale(newone):
    old_locale = locale.getlocale(locale.LC_COLLATE)
    try:
        locale.setlocale(locale.LC_COLLATE, newone)
        yield locale.strcoll
    finally:
        locale.setlocale(locale.LC_COLLATE, old_locale)

def sort_strings(strings, locale_=None):
    if locale_ is None:
        return sorted(strings)

    with changedlocale(locale_) as strcoll:
        return sorted(strings, cmp=strcoll)

    return sorted_strings

This ensures a clean restoration of the original locale - as long as you don't use threading.

Solution 3:

Glibc does support a locale API with an explicit state. Here's a quick wrapper for that API made with ctypes.

# -*- coding: utf-8
import ctypes


class Locale(object):
    def __init__(self, locale):
        LC_ALL_MASK = 8127
        # LC_COLLATE_MASK = 8
        self.libc = ctypes.CDLL("libc.so.6")
        self.ctx = self.libc.newlocale(LC_ALL_MASK, locale, 0)



    def strxfrm(self, src, iteration=1):
        size = 3 * iteration * len(src)
        dest =  ctypes.create_string_buffer('\000' * size)
        n = self.libc.strxfrm_l(dest, src, size,  self.ctx)
        if n < size:
            return dest.value
        elif iteration<=4:
            return self.strxfrm(src, iteration+1)
        else:
            raise Exception('max number of iterations trying to increase dest reached')


    def __del__(self):
        self.libc.freelocale(self.ctx)

and a short test

locale1 = Locale('C')
locale2 = Locale('mk_MK.UTF-8')

a_list = ['а', 'б', 'в', 'ј', 'ќ', 'џ', 'ш']
import random
random.shuffle(a_list)

assert sorted(a_list, key=locale1.strxfrm) == ['а', 'б', 'в', 'ш', 'ј', 'ќ', 'џ']
assert sorted(a_list, key=locale2.strxfrm) == ['а', 'б', 'в', 'ј', 'ќ', 'џ', 'ш']

what's left to do is implement all the locale functions, support for python unicode strings (with wchar* functions I guess), and automatically import the include file definitions or something