Remove all non-"word characters" from a String in Java, leaving accented characters?
Apparently Java's Regex flavor counts Umlauts and other special characters as non-"word characters" when I use Regex.
"TESTÜTEST".replaceAll( "\\W", "" )
returns "TESTTEST" for me. What I want is for only all truly non-"word characters" to be removed. Any way to do this without having something along the lines of
"[^A-Za-z0-9äöüÄÖÜßéèáàúùóò]"
only to realize I forgot ô?
Solution 1:
Use [^\p{L}\p{Nd}]+
- this matches all (Unicode) characters that are neither letters nor (decimal) digits.
In Java:
String resultString = subjectString.replaceAll("[^\\p{L}\\p{Nd}]+", "");
Edit:
I changed \p{N}
to \p{Nd}
because the former also matches some number symbols like ¼
; the latter doesn't. See it on regex101.com.
Solution 2:
I was trying to achieve the exact opposite when I bumped on this thread. I know it's quite old, but here's my solution nonetheless. You can use blocks, see here. In this case, compile the following code (with the right imports):
> String s = "äêìóblah";
> Pattern p = Pattern.compile("[\\p{InLatin-1Supplement}]+"); // this regex uses a block
> Matcher m = p.matcher(s);
> System.out.println(m.find());
> System.out.println(s.replaceAll(p.pattern(), "#"));
You should see the following output:
true
#blah
Best,
Solution 3:
At times you do not want to simply remove the characters, but just remove the accents. I came up with the following utility class which I use in my Java REST web projects whenever I need to include a String in an URL:
import java.text.Normalizer;
import java.text.Normalizer.Form;
import org.apache.commons.lang.StringUtils;
/**
* Utility class for String manipulation.
*
* @author Stefan Haberl
*/
public abstract class TextUtils {
private static String[] searchList = { "Ä", "ä", "Ö", "ö", "Ü", "ü", "ß" };
private static String[] replaceList = { "Ae", "ae", "Oe", "oe", "Ue", "ue",
"sz" };
/**
* Normalizes a String by removing all accents to original 127 US-ASCII
* characters. This method handles German umlauts and "sharp-s" correctly
*
* @param s
* The String to normalize
* @return The normalized String
*/
public static String normalize(String s) {
if (s == null)
return null;
String n = null;
n = StringUtils.replaceEachRepeatedly(s, searchList, replaceList);
n = Normalizer.normalize(n, Form.NFD).replaceAll("[^\\p{ASCII}]", "");
return n;
}
/**
* Returns a clean representation of a String which might be used safely
* within an URL. Slugs are a more human friendly form of URL encoding a
* String.
* <p>
* The method first normalizes a String, then converts it to lowercase and
* removes ASCII characters, which might be problematic in URLs:
* <ul>
* <li>all whitespaces
* <li>dots ('.')
* <li>(semi-)colons (';' and ':')
* <li>equals ('=')
* <li>ampersands ('&')
* <li>slashes ('/')
* <li>angle brackets ('<' and '>')
* </ul>
*
* @param s
* The String to slugify
* @return The slugified String
* @see #normalize(String)
*/
public static String slugify(String s) {
if (s == null)
return null;
String n = normalize(s);
n = StringUtils.lowerCase(n);
n = n.replaceAll("[\\s.:;&=<>/]", "");
return n;
}
}
Being a German speaker I've included proper handling of German umlauts as well - the list should be easy to extend for other languages.
HTH
EDIT: Note that it may be unsafe to include the returned String in an URL. You should at least HTML encode it to prevent XSS attacks.
Solution 4:
Well, here is one solution I ended up with, but I hope there's a more elegant one...
StringBuilder result = new StringBuilder();
for(int i=0; i<name.length(); i++) {
char tmpChar = name.charAt( i );
if (Character.isLetterOrDigit( tmpChar) || tmpChar == '_' ) {
result.append( tmpChar );
}
}
result
ends up with the desired result...
Solution 5:
You might want to remove the accents and diacritic signs first, then on each character position check if the "simplified" string is an ascii letter - if it is, the original position shall contain word characters, if not, it can be removed.