C# Parallel.ForEach() memory usage keeps growing
public string SavePath { get; set; } = @"I:\files\";

public void DownloadList(List<string> list)
{
    var rest = ExcludeDownloaded(list);
    var result = Parallel.ForEach(rest, link =>
    {
        Download(link);
    });
}

private void Download(string link)
{
    using (var net = new System.Net.WebClient())
    {
        var data = net.DownloadData(link);
        var fileName = code to generate unique fileName;
        if (File.Exists(fileName))
            return;
        File.WriteAllBytes(fileName, data);
    }
}

var downloader = new DownloaderService();
var links = downloader.GetLinks();
downloader.DownloadList(links);
I observed that the project's RAM usage keeps growing.
I guess there is something wrong with the Parallel.ForEach(), but I cannot figure it out.
Is there a memory leak, or what is happening?
Update 1
After changing to the new code:
private void Download(string link)
{
    using (var net = new System.Net.WebClient())
    {
        var fileName = code to generate unique fileName;
        if (File.Exists(fileName))
            return;
        // DownloadFile returns void, so there is no result to assign.
        net.DownloadFile(link, fileName);
        Track theTrack = new Track(fileName);
        theTrack.Title = GetCDName();
        theTrack.Save();
    }
}
I still observed increasing memory use after running it for 9 hours, although the usage now grows much more slowly.
Just wondering, is it because I didn't free the memory used by theTrack?
By the way, I use the ALT package to update the file metadata; unfortunately, it doesn't implement the IDisposable interface.
Solution 1:
Use WebClient.DownloadFile() to download directly to a file, so you don't hold the whole file in memory.
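A minimal sketch of that suggestion, assuming the target path has already been computed by the caller. The DirectDownloader class and DownloadToFile method are hypothetical names used only for illustration:

```csharp
using System.IO;
using System.Net;

static class DirectDownloader
{
    // Hypothetical helper: writes the response straight to disk instead of
    // returning it as a byte[] the way DownloadData does.
    public static void DownloadToFile(string link, string filePath)
    {
        // Skip files that already exist before making any network request.
        if (File.Exists(filePath)) return;

        using (var net = new WebClient())
        {
            net.DownloadFile(link, filePath);
        }
    }
}
```

Checking File.Exists before the request also avoids downloading data that would only be thrown away.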
Solution 2:
The Parallel.ForEach method is intended for parallelizing CPU-bound workloads. Downloading a file is an I/O-bound workload, so Parallel.ForEach is not ideal for this case because it needlessly blocks ThreadPool threads. The correct way to do it is asynchronously, with async/await. The recommended class for making asynchronous web requests is HttpClient, and for controlling the level of concurrency an excellent option is the TPL Dataflow library. For this case it is enough to use the simplest component of this library, the ActionBlock class:
async Task DownloadListAsync(List<string> list)
{
    using (var httpClient = new HttpClient())
    {
        var rest = ExcludeDownloaded(list);
        var block = new ActionBlock<string>(async link =>
        {
            await DownloadFileAsync(httpClient, link);
        }, new ExecutionDataflowBlockOptions()
        {
            MaxDegreeOfParallelism = 10
        });
        foreach (var link in rest)
        {
            await block.SendAsync(link);
        }
        block.Complete();
        await block.Completion;
    }
}
async Task DownloadFileAsync(HttpClient httpClient, string link)
{
    var fileName = Guid.NewGuid().ToString(); // code to generate unique fileName;
    var filePath = Path.Combine(SavePath, fileName);
    if (File.Exists(filePath)) return;
    // ResponseHeadersRead returns as soon as the headers arrive, so the body
    // is streamed to disk instead of being buffered in memory first.
    var response = await httpClient.GetAsync(link,
        HttpCompletionOption.ResponseHeadersRead);
    response.EnsureSuccessStatusCode();
    using (var contentStream = await response.Content.ReadAsStreamAsync())
    using (var fileStream = new FileStream(filePath, FileMode.Create,
        FileAccess.Write, FileShare.None, 32768, FileOptions.Asynchronous))
    {
        await contentStream.CopyToAsync(fileStream);
    }
}
The code for downloading a file with HttpClient is not as simple as WebClient.DownloadFile(), but it's what you have to do in order to keep the whole process asynchronous (both reading from the web and writing to the disk).
Caveat: Asynchronous filesystem operations are currently not implemented efficiently in .NET. For maximum efficiency it may be preferable to avoid using the FileOptions.Asynchronous option in the FileStream constructor.
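As a rough sketch of that alternative, the copy step could open the FileStream without FileOptions.Asynchronous and still await CopyToAsync. The StreamSaver class and SaveStreamAsync method are hypothetical names, and the buffer size is simply carried over from the code above:

```csharp
using System.IO;
using System.Threading.Tasks;

static class StreamSaver
{
    // Hypothetical helper: copies any source stream to a file.
    // The FileStream is opened without FileOptions.Asynchronous, so the
    // file handle is a regular synchronous one; the method is still awaitable.
    public static async Task SaveStreamAsync(Stream source, string filePath)
    {
        using (var fileStream = new FileStream(filePath, FileMode.Create,
            FileAccess.Write, FileShare.None, 32768))
        {
            await source.CopyToAsync(fileStream);
        }
    }
}
```

Whether this is actually faster depends on the runtime version and the disk, so it is worth measuring both variants before settling on one.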