Download large zip file from azure blob and unzip

I currently have the code below, which downloads a zip file from blob storage using a SAS URI, unzips it, and uploads the contents to a new container:

        var response = await new BlobClient(new Uri(sasUri)).DownloadAsync();
        using (ZipArchive archive = new ZipArchive(response.Value.Content))
        {
            foreach (ZipArchiveEntry entry in archive.Entries)
            {
                BlobClient blobClient = _blobServiceClient.GetBlobContainerClient(containerName).GetBlobClient(entry.FullName);
                using (var fileStream = entry.Open())
                {
                    await blobClient.UploadAsync(fileStream, true);
                }
            }
        }

For me, the code fails with a "Stream was too long" exception:

    System.IO.IOException: Stream was too long.
       at System.IO.MemoryStream.Write(Byte[] buffer, Int32 offset, Int32 count)
       at System.IO.Stream.CopyTo(Stream destination, Int32 bufferSize)
       at System.IO.Compression.ZipArchive.Init(Stream stream, ZipArchiveMode mode, Boolean leaveOpen)

My zip file is 9 GB. What would be a better way to get around this exception? I'd like to avoid writing any files to disk.


Solution 1:

So the issue here is:

  1. .NET has a finite maximum size for a single array (depending on the platform).
  2. Arrays back streams, acting as their buffer or in-memory data store; a MemoryStream in particular is backed by one contiguous array.
  3. On 64-bit platforms the default maximum size of any single object, arrays included, is 2 gigabytes (the snippet after this list reproduces that limit in isolation).
  4. ZipArchive needs a seekable stream, and the blob download stream isn't one, so ZipArchive.Init copies the whole download into a MemoryStream; in other words, you are trying to put a 9 gig stream (backed by an array) on the Large Object Heap.
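You can reproduce the limit on its own, outside of ZipArchive. A minimal sketch, assuming a 64-bit process with default GC settings and a couple of gigabytes of free memory; the write that pushes the stream past int.MaxValue bytes throws the same IOException:

    using System;
    using System.IO;

    var chunk = new byte[1024 * 1024]; // 1 MB per write
    using var ms = new MemoryStream();
    for (int i = 0; i < 3 * 1024; i++) // attempt to buffer 3 GB
    {
        // Throws System.IO.IOException: Stream was too long
        // once the backing array would need to exceed ~2 GB.
        ms.Write(chunk, 0, chunk.Length);
    }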

So, you will need to allow larger objects (somehow)

Allow large objects

  • In .NET Framework 4.5+ you can enable the <gcAllowVeryLargeObjects> element in your application config (see the snippet below)
  • In .NET Core you will need to set the environment variable COMPlus_gcAllowVeryLargeObjects
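For the .NET Framework case, the element goes in the runtime section of your app.config:

    <!-- app.config, .NET Framework 4.5+ -->
    <configuration>
      <runtime>
        <gcAllowVeryLargeObjects enabled="true" />
      </runtime>
    </configuration>

On .NET Core, set the variable before the process starts:

    COMPlus_gcAllowVeryLargeObjects=1

Note that even with this enabled, a single byte[] is still capped at roughly int.MaxValue elements, which is one more reason not to buffer 9 GB into a single MemoryStream.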

However, putting 9 gigs of anything on the Large Object Heap is problematic: it's inefficient for the GC, among other issues, and you should really avoid the LOH as much as you can.

Note that depending on the library, and what you have access to, there might be less LOH-heavy ways to do this. If you can supply your own streams / data structures, there are libraries that can break buffers up so they don't get allocated aggressively on the LOH, via things like ReadOnlySequence<T> and Microsoft's little-known RecyclableMemoryStream (see the sketch below).
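A minimal sketch of that last idea, assuming the Microsoft.IO.RecyclableMemoryStream NuGet package; the ZipUnpacker class and method names are hypothetical, while _blobServiceClient, containerName, and sasUri mirror the question. The download is copied into a pooled, chunked stream, so ZipArchive never forces everything into one contiguous array. Whether the pool can actually hold the full 9 GB depends on the package version and the manager's configured limits, so treat it as a starting point rather than a drop-in fix:

    using System;
    using System.IO.Compression;
    using System.Threading.Tasks;
    using Azure.Storage.Blobs;
    using Microsoft.IO;

    public class ZipUnpacker
    {
        // One manager per process; it owns the pooled block buffers.
        private static readonly RecyclableMemoryStreamManager _streamManager =
            new RecyclableMemoryStreamManager();

        private readonly BlobServiceClient _blobServiceClient;

        public ZipUnpacker(BlobServiceClient blobServiceClient) =>
            _blobServiceClient = blobServiceClient;

        public async Task UnzipToContainerAsync(string sasUri, string containerName)
        {
            var response = await new BlobClient(new Uri(sasUri)).DownloadAsync();

            // Buffer the download into pooled blocks instead of letting
            // ZipArchive copy the non-seekable stream into one MemoryStream.
            using (var buffered = _streamManager.GetStream("zip-download"))
            {
                await response.Value.Content.CopyToAsync(buffered);
                buffered.Position = 0; // ZipArchive reads from the start

                using (var archive = new ZipArchive(buffered, ZipArchiveMode.Read))
                {
                    foreach (ZipArchiveEntry entry in archive.Entries)
                    {
                        BlobClient blobClient = _blobServiceClient
                            .GetBlobContainerClient(containerName)
                            .GetBlobClient(entry.FullName);

                        using (var entryStream = entry.Open())
                        {
                            await blobClient.UploadAsync(entryStream, overwrite: true);
                        }
                    }
                }
            }
        }
    }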