Fastest way to strip all non-printable characters from a Java String
Solution 1:
using 1 char array could work a bit better
int length = s.length();
char[] oldChars = new char[length];
s.getChars(0, length, oldChars, 0);
int newLen = 0;
for (int j = 0; j < length; j++) {
char ch = oldChars[j];
if (ch >= ' ') {
oldChars[newLen] = ch;
newLen++;
}
}
s = new String(oldChars, 0, newLen);
and I avoided repeated calls to s.length();
another micro-optimization that might work is
int length = s.length();
char[] oldChars = new char[length+1];
s.getChars(0, length, oldChars, 0);
oldChars[length]='\0';//avoiding explicit bound check in while
int newLen=-1;
while(oldChars[++newLen]>=' ');//find first non-printable,
// if there are none it ends on the null char I appended
for (int j = newLen; j < length; j++) {
char ch = oldChars[j];
if (ch >= ' ') {
oldChars[newLen] = ch;//the while avoids repeated overwriting here when newLen==j
newLen++;
}
}
s = new String(oldChars, 0, newLen);
Solution 2:
If it is reasonable to embed this method in a class which is not shared across threads, then you can reuse the buffer:
char [] oldChars = new char[5];
String stripControlChars(String s)
{
final int inputLen = s.length();
if ( oldChars.length < inputLen )
{
oldChars = new char[inputLen];
}
s.getChars(0, inputLen, oldChars, 0);
etc...
This is a big win - 20% or so, as I understand the current best case.
If this is to be used on potentially large strings and the memory "leak" is a concern, a weak reference can be used.
Solution 3:
Well I've beaten the current best method (freak's solution with the preallocated array) by about 30% according to my measures. How? By selling my soul.
As I'm sure everyone that has followed the discussion so far knows this violates pretty much any basic programming principle, but oh well. Anyways the following only works if the used character array of the string isn't shared between other strings - if it does whoever has to debug this will have every right deciding to kill you (without calls to substring() and using this on literal strings this should work as I don't see why the JVM would intern unique strings read from an outside source). Though don't forget to make sure the benchmark code doesn't do it - that's extremely likely and would help the reflection solution obviously.
Anyways here we go:
// Has to be done only once - so cache those! Prohibitively expensive otherwise
private Field value;
private Field offset;
private Field count;
private Field hash;
{
try {
value = String.class.getDeclaredField("value");
value.setAccessible(true);
offset = String.class.getDeclaredField("offset");
offset.setAccessible(true);
count = String.class.getDeclaredField("count");
count.setAccessible(true);
hash = String.class.getDeclaredField("hash");
hash.setAccessible(true);
}
catch (NoSuchFieldException e) {
throw new RuntimeException();
}
}
@Override
public String strip(final String old) {
final int length = old.length();
char[] chars = null;
int off = 0;
try {
chars = (char[]) value.get(old);
off = offset.getInt(old);
}
catch(IllegalArgumentException e) {
throw new RuntimeException(e);
}
catch(IllegalAccessException e) {
throw new RuntimeException(e);
}
int newLen = off;
for(int j = off; j < off + length; j++) {
final char ch = chars[j];
if (ch >= ' ') {
chars[newLen] = ch;
newLen++;
}
}
if (newLen - off != length) {
// We changed the internal state of the string, so at least
// be friendly enough to correct it.
try {
count.setInt(old, newLen - off);
// Have to recompute hash later on
hash.setInt(old, 0);
}
catch(IllegalArgumentException e) {
e.printStackTrace();
}
catch(IllegalAccessException e) {
e.printStackTrace();
}
}
// Well we have to return something
return old;
}
For my teststring that gets 3477148.18ops/s
vs. 2616120.89ops/s
for the old variant. I'm quite sure the only way to beat that could be to write it in C (probably not though) or some completely different approach nobody has thought about so far. Though I'm absolutely not sure if the timing is stable across different platforms - produces reliable results on my box (Java7, Win7 x64) at least.
Solution 4:
You could split the task into a several parallel subtasks, depending of processor's quantity.
Solution 5:
I was so free and wrote a small benchmark for different algorithms. It's not perfect, but I take the minimum of 1000 runs of a given algorithm 10000 times over a random string (with about 32/200% non printables by default). That should take care of stuff like GC, initialization and so on - there's not so much overhead that any algorithm shouldn't have at least one run without much hindrance.
Not especially well documented, but oh well. Here we go - I included both of ratchet freak's algorithms and the basic version. At the moment I randomly initialize a 200 chars long string with uniformly distributed chars in the range [0, 200).