How to filter a string for unwanted characters using regex?

Edited based on your update:

dirtyString.replaceAll("[^a-zA-Z0-9]","")
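If you call this a lot, here is a minimal sketch of how the one-liner might be wrapped for reuse. The class name StringFilter and the method name keepAlphanumeric are made up for illustration. Precompiling the pattern avoids recompiling the regex on every call, which String.replaceAll otherwise does:

```java
import java.util.regex.Pattern;

class StringFilter {
    // Matches any single character that is NOT an ASCII letter or digit.
    private static final Pattern NOT_ALPHANUMERIC = Pattern.compile("[^a-zA-Z0-9]");

    // Returns the input with every non-alphanumeric character removed.
    static String keepAlphanumeric(String dirty) {
        return NOT_ALPHANUMERIC.matcher(dirty).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(keepAlphanumeric("he*llo wo#rld 42&")); // prints "helloworld42"
    }
}
```

Note that this whitelist is ASCII-only, so accented letters such as "é" are stripped as well.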

If you're using Guava in your project (and if you're not, I believe you should consider it), the CharMatcher class handles this very nicely.

Your first example might be:

result = CharMatcher.whitespace().removeFrom(dirtyString); // older Guava releases used the constant CharMatcher.WHITESPACE

while your second might be:

result = CharMatcher.anyOf(" *#&").removeFrom(dirtyString);
// or alternatively
result = CharMatcher.noneOf(" *#&").retainFrom(dirtyString);

Or, if you want to be more flexible about whitespace (tabs, etc.), you can combine matchers rather than writing your own:

CharMatcher illegal = CharMatcher.whitespace().or(CharMatcher.anyOf("*#&")); // CharMatcher.WHITESPACE in older Guava releases
result = illegal.removeFrom(dirtyString);

Or you might instead specify the legal characters, which, depending on your requirements, might be one of:

// Alternatives: pick ONE of these declarations.
// (Older Guava releases used the constants CharMatcher.JAVA_LETTER and CharMatcher.ASCII.)
CharMatcher legal = CharMatcher.javaLetter(); // based on Unicode char class
CharMatcher legal = CharMatcher.ascii().and(CharMatcher.javaLetter()); // only letters which are also ASCII, as in your examples
CharMatcher legal = CharMatcher.inRange('a', 'z'); // lowercase only
CharMatcher legal = CharMatcher.inRange('a', 'z').or(CharMatcher.inRange('A', 'Z')); // either case

followed by legal.retainFrom(dirtyString), as above.

Very nice, powerful API.


Use replaceAll.
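For instance, a rough sketch of the blacklist case (removing whitespace plus the characters *, # and &) using only the JDK might look like this. The class name BlacklistFilter and the method name removeIllegal are hypothetical, and the set of illegal characters is just an assumption taken from the examples discussed here:

```java
class BlacklistFilter {
    // Removes all whitespace and the characters '*', '#' and '&'.
    static String removeIllegal(String dirty) {
        // \\s matches any whitespace character (space, tab, newline, ...);
        // inside a character class, *, # and a single & are plain literals.
        return dirty.replaceAll("[\\s*#&]", "");
    }

    public static void main(String[] args) {
        System.out.println(removeIllegal("a b\t*c#d&e")); // prints "abcde"
    }
}
```

Unlike the whitelist approach, this keeps everything you did not explicitly list, so it only works well when the set of unwanted characters is small and known.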