Reading a large file in byte chunks with a dynamic buffer size

I'm trying to read a large file in chunks and save them in an ArrayList of byte arrays.

My code, in short, looks like this:

public ArrayList<byte[]> packets = new ArrayList<>();
FileInputStream fis = new FileInputStream("random_text.txt");
byte[] buffer = new byte[512];
while (fis.read(buffer) > 0) {
    packets.add(buffer);
}
fis.close();

But it has a behavior that I don't know how to solve. For example, if a file contains only the words "hello world", that chunk does not need to be 512 bytes long. What I want is for each chunk to be at most 512 bytes, not for every chunk to be exactly that size.


Solution 1:

First of all, what you are doing is probably a bad idea. Storing a file's contents in memory like this is liable to be a waste of heap space ... and can lead to OutOfMemoryError exceptions and/or a requirement for an excessively large heap if you process large enough input files.

The second problem is that your code is wrong. You are repeatedly reading the data into the same byte array. Each time you do, it overwrites what was there before. So you will end up with a list containing many references to a single byte array ... containing just the last chunk of data that you read.
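To see the aliasing for yourself, here is a minimal, self-contained sketch (the class name AliasingDemo is mine, purely for illustration):

import java.util.ArrayList;

public class AliasingDemo {
    public static void main(String[] args) {
        ArrayList<byte[]> packets = new ArrayList<>();
        byte[] buffer = new byte[4];

        buffer[0] = 1;
        packets.add(buffer);   // stores a reference to buffer, not a copy
        buffer[0] = 2;         // this also changes what the list "contains"

        System.out.println(packets.get(0)[0]);   // prints 2, not 1
    }
}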

To solve the problem that you asked about¹, you will need to copy each chunk that you read into a new (smaller) byte array.

Something like this:

import java.io.FileInputStream;
import java.util.ArrayList;
import java.util.Arrays;

public ArrayList<byte[]> packets = new ArrayList<>();

try (FileInputStream fis = new FileInputStream("random_text.txt")) {
    byte[] buffer = new byte[512];
    int len;
    while ((len = fis.read(buffer)) > 0) {
        // Copy only the 'len' bytes actually read, so a short final chunk
        // is stored at its true size rather than padded to 512 bytes.
        packets.add(Arrays.copyOf(buffer, len));
    }
}

Note that this also deals with the second problem I mentioned. And it fixes a potential resource leak by using try-with-resources syntax to manage closing the input stream.
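As a quick sanity check (a usage sketch, assuming the packets list from the code above), you can confirm that the stored chunks add up to the number of bytes read:

long total = 0;
for (byte[] packet : packets) {
    total += packet.length;   // with Arrays.copyOf, every byte here is valid data
}
System.out.println("Read " + total + " bytes in " + packets.size() + " chunks");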

A final issue: If this is really a text file that you are reading, you probably should be using a Reader to read it, and char[] or String to hold it.

But even if you do that, there are some awkward edge cases if your text contains Unicode code points outside plane 0 (the Basic Multilingual Plane); for example, emoji. The edge cases involve a code point that is represented as a surrogate pair being split across a chunk boundary. Reading and storing the text as lines would avoid that.
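For the record, here is a sketch of the line-based approach, assuming the file is UTF-8 text (Files.readAllLines reads UTF-8 by default):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Each line comes back as a complete String, so a surrogate pair can
// never be split the way it could with fixed-size byte buffers.
List<String> lines = Files.readAllLines(Paths.get("random_text.txt"));

(For very large files, Files.lines would let you stream the lines instead of holding them all in memory at once, which also addresses the heap concern above.)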


¹ - The issue here is not the "wasted" space. Unless you are reading and caching a large number of small files, any space wastage due to "short" chunks will be unimportant. The important issue is knowing which bytes in each byte[] are actually valid data.