Remove punctuation from Unicode formatted strings

I have a function that removes punctuation from a list of strings:

def strip_punctuation(input):
    x = 0
    for word in input:
        input[x] = re.sub(r'[^A-Za-z0-9 ]', "", input[x])
        x += 1
    return input

I recently modified my script to use Unicode strings so I could handle other non-Western characters. This function breaks when it encounters these special characters and just returns empty Unicode strings. How can I reliably remove punctuation from Unicode formatted strings?


Solution 1:

You could use unicode.translate() method:

import unicodedata
import sys

tbl = dict.fromkeys(i for i in xrange(sys.maxunicode)
                      if unicodedata.category(unichr(i)).startswith('P'))
def remove_punctuation(text):
    return text.translate(tbl)

You could also use r'\p{P}' that is supported by regex module:

import regex as re

def remove_punctuation(text):
    return re.sub(ur"\p{P}+", "", text)

Solution 2:

If you want to use J.F. Sebastian's solution in Python 3:

import unicodedata
import sys

tbl = dict.fromkeys(i for i in range(sys.maxunicode)
                      if unicodedata.category(chr(i)).startswith('P'))
def remove_punctuation(text):
    return text.translate(tbl)

Solution 3:

You can iterate through the string using the unicodedata module's category function to determine if the character is punctuation.

For possible outputs of category, see unicode.org's doc on General Category Values

import unicodedata.category as cat
def strip_punctuation(word):
    return "".join(char for char in word if cat(char).startswith('P'))
filtered = [strip_punctuation(word) for word in input]

Additionally, make sure that you're handling encodings and types correctly. This presentation is a good place to start: http://bit.ly/unipain

Solution 4:

A little shorter version based on Daenyth answer

import unicodedata

def strip_punctuation(text):
    """
    >>> strip_punctuation(u'something')
    u'something'

    >>> strip_punctuation(u'something.,:else really')
    u'somethingelse really'
    """
    punctutation_cats = set(['Pc', 'Pd', 'Ps', 'Pe', 'Pi', 'Pf', 'Po'])
    return ''.join(x for x in text
                   if unicodedata.category(x) not in punctutation_cats)

input_data = [u'somehting', u'something, else', u'nothing.']
without_punctuation = map(strip_punctuation, input_data)