How to read a text file with mixed encodings in Scala or Java?
This is how I managed to do it in Java:
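import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

String result = null;
// Decoder that silently drops malformed byte sequences instead of throwing.
CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
decoder.onMalformedInput(CodingErrorAction.IGNORE);
// try-with-resources closes the reader (and the underlying stream) even on failure.
try (BufferedReader bufferedReader = new BufferedReader(
        new InputStreamReader(new FileInputStream("invalid.txt"), decoder))) {
    StringBuilder sb = new StringBuilder();
    String line = bufferedReader.readLine();
    while (line != null) {
        // readLine() strips line terminators, so lines are concatenated as-is.
        sb.append(line);
        line = bufferedReader.readLine();
    }
    result = sb.toString();
} catch (IOException e) {
    // Also covers FileNotFoundException, which is a subclass of IOException.
    e.printStackTrace();
}
System.out.println(result);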
The invalid file is created with bytes:
0x68, 0x80, 0x65, 0x6C, 0x6C, 0xC3, 0xB6, 0xFE, 0x20, 0x77, 0xC3, 0xB6, 0x9C, 0x72, 0x6C, 0x64, 0x94
That is hellö wörld in UTF-8 with 4 invalid bytes (0x80, 0xFE, 0x9C, 0x94) mixed in.
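For anyone who wants to reproduce this, here is a minimal sketch of how such a file could be written. The class name InvalidFileWriter and the helper write are made up for illustration; only the byte values and the file name invalid.txt come from the post.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper class, not part of the original answer.
public class InvalidFileWriter {
    // The 17 bytes listed above: valid UTF-8 text plus the
    // stray bytes 0x80, 0xFE, 0x9C and 0x94.
    public static final byte[] BYTES = {
        0x68, (byte) 0x80, 0x65, 0x6C, 0x6C, (byte) 0xC3, (byte) 0xB6, (byte) 0xFE, 0x20,
        0x77, (byte) 0xC3, (byte) 0xB6, (byte) 0x9C, 0x72, 0x6C, 0x64, (byte) 0x94
    };

    // Writes the raw bytes to the given path, replacing any existing file.
    public static Path write(Path target) throws IOException {
        return Files.write(target, BYTES);
    }

    public static void main(String[] args) throws IOException {
        write(Paths.get("invalid.txt"));
    }
}
```

Writing raw bytes (rather than a String) matters here, because going through a String would re-encode the text and lose the invalid bytes.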
With .REPLACE you see the standard Unicode replacement character (U+FFFD) being used:
//"h�ellö� wö�rld�"
With .IGNORE, you see the invalid bytes ignored:
//"hellö wörld"
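The two actions can also be compared without touching the file system. The sketch below decodes the same bytes in memory with CharsetDecoder.decode(ByteBuffer); the class name MalformedInputDemo and the decode helper are hypothetical names chosen for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original answer.
public class MalformedInputDemo {
    // Same bytes as the invalid file described in the post.
    public static final byte[] BYTES = {
        0x68, (byte) 0x80, 0x65, 0x6C, 0x6C, (byte) 0xC3, (byte) 0xB6, (byte) 0xFE, 0x20,
        0x77, (byte) 0xC3, (byte) 0xB6, (byte) 0x9C, 0x72, 0x6C, 0x64, (byte) 0x94
    };

    // Decodes the bytes as UTF-8, applying the given action to malformed input.
    public static String decode(byte[] bytes, CodingErrorAction action) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(action)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Expected only when the action is CodingErrorAction.REPORT.
            throw new IllegalArgumentException("Input is not valid UTF-8", e);
        }
    }

    public static void main(String[] args) {
        // One U+FFFD per invalid byte.
        System.out.println(decode(BYTES, CodingErrorAction.REPLACE));
        // Invalid bytes dropped entirely.
        System.out.println(decode(BYTES, CodingErrorAction.IGNORE));
    }
}
```

Note that IGNORE loses information silently, so REPLACE may be the safer choice when you need to notice that a file was damaged.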
Without calling .onMalformedInput, the decoder keeps its default action (CodingErrorAction.REPORT) and you get:
java.nio.charset.MalformedInputException: Input length = 1
at java.nio.charset.CoderResult.throwException(Unknown Source)
at sun.nio.cs.StreamDecoder.implRead(Unknown Source)
at sun.nio.cs.StreamDecoder.read(Unknown Source)
at java.io.InputStreamReader.read(Unknown Source)
at java.io.BufferedReader.fill(Unknown Source)
at java.io.BufferedReader.readLine(Unknown Source)
at java.io.BufferedReader.readLine(Unknown Source)
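That strict default can be useful in its own right, for example to check whether some bytes are valid UTF-8 before deciding how to read them. A rough sketch, where StrictDecodeDemo and isValidUtf8 are hypothetical names for illustration:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original answer.
public class StrictDecodeDemo {
    // Returns true if the bytes decode cleanly as UTF-8.
    // A decoder from newDecoder() reports malformed input by default,
    // so decode() throws a CharacterCodingException on the first bad byte.
    public static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] broken = {0x68, (byte) 0x80, 0x65};           // 'h', stray 0x80, 'e'
        byte[] clean = {0x68, (byte) 0xC3, (byte) 0xB6};     // 'h', then a two-byte sequence
        System.out.println(isValidUtf8(broken));
        System.out.println(isValidUtf8(clean));
    }
}
```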
Scala's Codec has a decoder method which returns a java.nio.charset.CharsetDecoder:
import java.nio.charset.CodingErrorAction
import scala.io.{Codec, Source}
val decoder = Codec.UTF8.decoder.onMalformedInput(CodingErrorAction.IGNORE)
Source.fromFile(filename)(decoder).getLines().toList
The solution for Scala's Source (based on @Esailija's answer):
def toSource(inputStream: java.io.InputStream): scala.io.BufferedSource = {
  import java.nio.charset.{Charset, CodingErrorAction}
  // Drop malformed byte sequences instead of throwing MalformedInputException.
  val decoder = Charset.forName("UTF-8").newDecoder()
  decoder.onMalformedInput(CodingErrorAction.IGNORE)
  // The CharsetDecoder is implicitly converted to a scala.io.Codec here.
  scala.io.Source.fromInputStream(inputStream)(decoder)
}