Default character encoding for java console output
I'm assuming that your console still runs under cmd.exe. I doubt your console is really expecting UTF-8 - I expect it is really an OEM DOS encoding (e.g. 850 or 437.)
Java encodes characters to bytes using the default encoding, which is fixed during JVM initialization.
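To see which encoding your JVM actually picked, something like the following sketch may help. The class name ShowEncoding is made up for illustration. Note that stdout.encoding is only defined on newer JDKs, so expect null for it on older ones.

```java
import java.nio.charset.Charset;

public class ShowEncoding {
    // Name of the charset the JVM selected as its default at startup.
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println("default charset: " + defaultCharsetName());
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));
        // Only defined on newer JDKs; prints "null" where the property is absent.
        System.out.println("stdout.encoding: " + System.getProperty("stdout.encoding"));
    }
}
```

Running it once plainly and once with -Dfile.encoding=UTF-8 should show whether the switch changes what Java uses on your machine.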
Reproducing on my PC:
java Foo
Java encodes as windows-1252; console decodes as IBM850. Result: Mojibake
java -Dfile.encoding=UTF-8 Foo
Java encodes as UTF-8; console decodes as IBM850. Result: Mojibake
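The two mismatches above can be simulated without a console by encoding with one charset and decoding with another. This is only a sketch: MojibakeDemo and misdecode are names I made up, and IBM850 is an extended charset that may not be present in every runtime, so the code falls back to ISO-8859-1 if it is missing.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Encodes text with one charset and decodes the resulting bytes with
    // another, which is what happens when Java and the console disagree.
    static String misdecode(String text, Charset written, Charset read) {
        return new String(text.getBytes(written), read);
    }

    public static void main(String[] args) {
        String text = "xx\u00e4\u00f1xx"; // xxäñxx
        // Assumption: IBM850 stands in for the console; fall back if unavailable.
        String consoleName = Charset.isSupported("IBM850") ? "IBM850" : "ISO-8859-1";
        Charset console = Charset.forName(consoleName);

        // Java writes UTF-8, console reads its own code page: garbled.
        System.out.println(misdecode(text, StandardCharsets.UTF_8, console));
        // Both sides agree: the text survives the round trip.
        System.out.println(misdecode(text, console, console));
    }
}
```

What you see when you run this still depends on how your terminal decodes the output, so comparing the returned strings in a debugger or test is more reliable than eyeballing the console.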
cat test.txt
cat decodes file as UTF-8; cat encodes as IBM850; console decodes as IBM850.
java Foo | cat
Java encodes as windows-1252; cat decodes as windows-1252; cat encodes as IBM850; console decodes as IBM850
java -Dfile.encoding=UTF-8 Foo | cat
Java encodes as UTF-8; cat decodes as UTF-8; cat encodes as IBM850; console decodes as IBM850
This implementation of cat must use heuristics to determine whether the character data is UTF-8, then transcode it from either UTF-8 or ANSI (e.g. windows-1252) to the console encoding (e.g. IBM850).
This can be confirmed with the following commands:
$ java HexDump utf8.txt
78 78 c3 a4 c3 b1 78 78
$ cat utf8.txt
xxäñxx
$ java HexDump ansi.txt
78 78 e4 f1 78 78
$ cat ansi.txt
xxäñxx
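The cat command can make this determination because e4 f1 is not a valid UTF-8 sequence: e4 opens a three-byte sequence, but f1 is not a continuation byte (those lie in the range 80 to bf).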
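I don't know how this cat actually implements its check, but one plausible form of the heuristic is a strict UTF-8 decode that reports malformed input. The sketch below (Utf8Check and isValidUtf8 are hypothetical names) applies that to the two byte sequences from the hex dumps.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    // Returns true only if the bytes decode as UTF-8 without any errors.
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = {0x78, 0x78, (byte) 0xc3, (byte) 0xa4, (byte) 0xc3, (byte) 0xb1, 0x78, 0x78};
        byte[] ansi = {0x78, 0x78, (byte) 0xe4, (byte) 0xf1, 0x78, 0x78};
        System.out.println(isValidUtf8(utf8)); // bytes of utf8.txt
        System.out.println(isValidUtf8(ansi)); // bytes of ansi.txt
    }
}
```

A check like this is still a guess: pure ASCII passes as UTF-8, and some ANSI byte sequences happen to be valid UTF-8 as well.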
You can correct the Java output by:
- Setting the console encoding to the system ANSI value (e.g. with chcp 1252)
- Using the Console type (java.io.Console, obtained from System.console())
- Using some shim layer, as you are doing with cat
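As a rough sketch of the Console option, with a fallback for when no console is attached (for example when output is piped): ConsoleOutput and withEncoding are names I made up, and the fallback charset name is an assumption you would replace with your console's real code page.

```java
import java.io.Console;
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class ConsoleOutput {
    // Wraps a byte stream so characters are encoded with the named charset.
    static PrintStream withEncoding(OutputStream out, String charsetName) {
        try {
            return new PrintStream(out, true, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String text = "xx\u00e4\u00f1xx"; // xxäñxx
        Console console = System.console();
        if (console != null) {
            // Console handles the encoding of the attached terminal itself.
            console.printf("%s%n", text);
        } else {
            // No console (e.g. piped output): encode explicitly instead.
            // Assumption: the charset name is passed as the first argument.
            String name = args.length > 0 ? args[0] : "UTF-8";
            withEncoding(System.out, name).println(text);
        }
    }
}
```

On a cmd.exe console using code page 850 you might run it piped as java ConsoleOutput IBM850 | more, assuming your runtime supports that charset.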
HexDump is a trivial Java application:
import java.io.*;

class HexDump {
    public static void main(String[] args) throws IOException {
        // args[0] is the path of the file to dump
        try (InputStream in = new FileInputStream(args[0])) {
            int r;
            // read() returns one byte as 0..255, or -1 at end of file
            while ((r = in.read()) != -1) {
                System.out.format("%02x ", 0xFF & r);
            }
            System.out.println();
        }
    }
}