All inclusive Charset to avoid "java.nio.charset.MalformedInputException: Input length = 1"?
I'm creating a simple wordcount program in Java that reads through a directory's text-based files.
However, I keep on getting the error:
java.nio.charset.MalformedInputException: Input length = 1
from this line of code:
BufferedReader reader = Files.newBufferedReader(file,Charset.forName("UTF-8"));
I know I probably get this because I used a Charset that didn't include some of the characters in the text files, some of which included characters of other languages. But I want to include those characters.
I later read in the Javadocs that the Charset argument is optional and only used for more efficient reading of the files, so I changed the code to:
BufferedReader reader = Files.newBufferedReader(file);
But some files still throw the MalformedInputException. I don't know why.
I was wondering if there is an all-inclusive Charset that will allow me to read text files with many different types of characters?
Thanks.
Solution 1:
You probably want to have a list of supported encodings. For each file, try each encoding in turn, maybe starting with UTF-8. Every time you catch the MalformedInputException, try the next encoding.
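A minimal sketch of that loop, under some assumptions: the class name, the method names and the candidate list are all made up for illustration, and you would put your own encodings in the list. It catches CharacterCodingException (the parent of MalformedInputException) so that unmappable-character errors also move on to the next encoding.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class CharsetFallbackReader {
    // Hypothetical candidate list, strictest first. ISO-8859-1 accepts any
    // byte sequence, so it only makes sense as the last entry.
    static final Charset[] CANDIDATES = {
        StandardCharsets.UTF_8,
        StandardCharsets.ISO_8859_1
    };

    // Decodes the bytes with the first candidate charset that accepts them.
    static String decodeWithFallback(byte[] bytes) throws CharacterCodingException {
        for (Charset cs : CANDIDATES) {
            try {
                // A fresh decoder reports malformed input by default,
                // so a wrong guess throws instead of silently replacing.
                return cs.newDecoder().decode(ByteBuffer.wrap(bytes)).toString();
            } catch (CharacterCodingException e) {
                // This charset does not fit; try the next one.
            }
        }
        throw new CharacterCodingException();
    }

    // Convenience wrapper: reads the whole file, then decodes it.
    static String readWithFallback(Path file) throws IOException {
        return decodeWithFallback(Files.readAllBytes(file));
    }
}
```

Note that the first encoding that decodes without an error is not necessarily the right one, so the order of the list matters.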
Solution 2:
Creating a BufferedReader with Files.newBufferedReader:
Files.newBufferedReader(Paths.get("a.txt"), StandardCharsets.UTF_8);
may throw the following exception when the application runs:
java.nio.charset.MalformedInputException: Input length = 1
But
new BufferedReader(new InputStreamReader(new FileInputStream("a.txt"),"utf-8"));
works well.
The difference is that the former uses the CharsetDecoder default action:
The default action for malformed-input and unmappable-character errors is to report them.
The latter uses the REPLACE action:
cs.newDecoder().onMalformedInput(CodingErrorAction.REPLACE).onUnmappableCharacter(CodingErrorAction.REPLACE)
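If you want the REPLACE behaviour explicitly rather than relying on what InputStreamReader does internally, something like the following should work. It is only a sketch: the class and method names are made up, and UTF-8 is assumed as the charset.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class LenientReaders {
    // Returns a reader that substitutes U+FFFD for bytes that are not valid
    // UTF-8 instead of throwing MalformedInputException.
    static BufferedReader newLenientUtf8Reader(Path file) throws IOException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        return new BufferedReader(
                new InputStreamReader(Files.newInputStream(file), decoder));
    }
}
```

The trade-off is that bad bytes are lost silently: they come out as the replacement character, which may or may not matter for a word count.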
Solution 3:
ISO-8859-1 is an all-inclusive charset in the sense that it's guaranteed not to throw MalformedInputException, because every byte value maps to a character. So it's good for debugging, even if your input is not in this charset. For example:
req.setCharacterEncoding("ISO-8859-1");
I had some double-right-quote/double-left-quote characters in my input, and both US-ASCII and UTF-8 threw MalformedInputException on them, but ISO-8859-1 worked.
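To illustrate what "worked" means here, a small sketch (class and method names are made up): even a strict ISO-8859-1 decoder accepts every byte, but if the input was really written in another encoding, the characters you get back are not the ones the author typed.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

class Latin1Demo {
    // Strict decode: errors would be reported, but ISO-8859-1 maps each of
    // the 256 byte values to exactly one character, so none occur.
    static String decodeLatin1(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.ISO_8859_1.newDecoder()
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }
}
```

For instance, the UTF-8 bytes of "é" (0xC3 0xA9) decode without error, but as the two characters "Ã©". No exception does not mean the text was read correctly.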