Efficient way to download a huge load of files in parallel


Question

I'm trying to download a huge number of files (pictures) from the internet. I'm struggling with async/parallel code, because:

a) I can't tell in advance whether a file exists. I was just given a million links, each of which returns either a single picture (300 KB to 3 MB) or a 404 "page does not exist" response. To avoid saving 0-byte files, I request each URL twice: once to check for a 404, and then again for the picture. The alternative would be to download everything, including the 0-byte files, and delete millions of them afterwards, which keeps Windows 10 stuck on that task until I reboot.

b) While the (very slow) download is in progress, whenever I look at any of the "successfully downloaded" files, it has been created with 0 bytes and doesn't contain the picture. What do I need to change so that each file is really downloaded before the next one starts?

How do I fix both issues? Is there a better way to download thousands or millions of files? (Compressing them or creating a .zip on the server is not possible.)

           //loopResult = Parallel.ForEach(_downloadLinkList, new ParallelOptions { MaxDegreeOfParallelism = 10 }, DownloadFilesParallel);    
            private async void DownloadFilesParallel(string path)
            {
                string downloadToDirectory = ""; 
                string x = ""; //in case x fails, i get 404 from webserver and therefore no download is needed
                System.Threading.Interlocked.Increment(ref downloadCount);
                OnNewListEntry(downloadCount.ToString() + " / " + linkCount.ToString() + " heruntergeladen"); //tell my gui to update
                try
                {
                    using(WebClient webClient = new WebClient())
                    {
                        downloadToDirectory = Path.Combine(savePathLocalComputer, Path.GetFileName(path)); //path on local computer

                        webClient.Credentials = CredentialCache.DefaultNetworkCredentials;
                        x = await webClient.DownloadStringTaskAsync(new Uri(path)); //if this throws an exception, ignore this link
                        Directory.CreateDirectory(Path.GetDirectoryName(downloadToDirectory)); //if request is successfull, create -if needed- the folder on local pc
                        await webClient.DownloadFileTaskAsync(new Uri(path), @downloadToDirectory); //should download the file, release 1 parallel task to get the next file. instead there is a 0-byte file and the next one will be downloaded
                    }
                }
                catch(WebException wex)
                {
                }
                catch(Exception ex)
                {
                    System.Diagnostics.Debug.WriteLine(ex.Message);
                }
                finally
                {
                    
                }
            }

// The pictures are SFW; the links are NSFW.

Recommended answer

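Here's an example using HttpClient with a limit on the maximum number of concurrent downloads. Each URL is requested only once: the status code is checked first, and a file is created only when the response is successful.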

private static readonly HttpClient client = new HttpClient();

private async Task DownloadAndSaveFileAsync(string path, SemaphoreSlim semaphore, IProgress<int> status)
{
    try
    {
        status?.Report(semaphore.CurrentCount);
        using (HttpResponseMessage response = await client.GetAsync(path, HttpCompletionOption.ResponseHeadersRead).ConfigureAwait(false))
        {
            if (response.IsSuccessStatusCode) // ignoring if not success
            {
                string filePath = Path.Combine(savePathLocalComputer, Path.GetFileName(path));
                string dir = Path.GetDirectoryName(filePath);
                if (!Directory.Exists(dir)) Directory.CreateDirectory(dir);
                using (Stream responseStream = await response.Content.ReadAsStreamAsync().ConfigureAwait(false))
                using (FileStream fileStream = File.Create(filePath))
                {
                    await responseStream.CopyToAsync(fileStream).ConfigureAwait(false);
                }
            }
        }
    }
    finally
    {
        semaphore.Release();
    }
}

Running the downloads concurrently:

client.BaseAddress = new Uri("http://somesite"); // BaseAddress is a Uri, not a string
int downloadCount = 0;
List<string> pathList = new List<string>();
// fill the list here

List<Task> tasks = new List<Task>();
int maxConcurrentTasks = Environment.ProcessorCount * 2; // 16 for me

IProgress<int> status = new Progress<int>(availableTasks =>
{
    downloadCount++;
    OnNewListEntry(downloadCount + " / " + pathList.Count + " heruntergeladen\r\nRunning " + (maxConcurrentTasks - availableTasks) + " downloads.");
});

using (SemaphoreSlim semaphore = new SemaphoreSlim(maxConcurrentTasks))
{
    foreach (string path in pathList)
    {
        await semaphore.WaitAsync();
        tasks.Add(DownloadAndSaveFileAsync(path, semaphore, status));
    }
    try
    {
        await Task.WhenAll(tasks);
    }
    catch (Exception ex)
    {
        // handle the Exception here
    }
}

Progress<T> here simply executes the callback on the UI thread, because it captures the SynchronizationContext that is current when it is constructed. Thus Interlocked is not needed inside the callback, and it's safe to update the UI from it.
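One caveat, under the assumption that the code might also run outside a UI app: in a console app or service there is no SynchronizationContext to capture, so Progress<T> callbacks are posted to the thread pool and can overlap. In that case a plain `downloadCount++` is no longer safe. A minimal sketch of a thread-safe counter for that situation; `ProgressCounter` and `ProgressCounterDemo` are hypothetical names, not part of the original answer:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: counts completed downloads safely when callbacks
// may run on several thread-pool threads at once (no UI context).
public sealed class ProgressCounter
{
    private int _completed;

    // Current total, read with acquire semantics.
    public int Completed => Volatile.Read(ref _completed);

    // Safe to call from any thread; returns the new total.
    public int Increment() => Interlocked.Increment(ref _completed);
}

public static class ProgressCounterDemo
{
    // Starts `workers` thread-pool tasks, each standing in for one
    // finished download, and returns the final count.
    public static async Task<int> RunAsync(int workers)
    {
        var counter = new ProgressCounter();
        var tasks = new Task[workers];
        for (int i = 0; i < workers; i++)
        {
            tasks[i] = Task.Run(() => { counter.Increment(); });
        }
        await Task.WhenAll(tasks).ConfigureAwait(false);
        return counter.Completed;
    }
}
```

In a WinForms or WPF app where the Progress<T> instance is created on the UI thread, none of this is needed and the original callback is fine as written.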

On .NET Framework, you can make this faster by adding the following line to the app startup code; it raises the per-host connection limit, which defaults to 2 for client applications. On .NET Core the setting has no effect on HttpClient and isn't needed.

ServicePointManager.DefaultConnectionLimit = 10;
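To check the throttling pattern above in isolation, without hitting a real server, the download can be replaced by a short delay while the number of in-flight operations is tracked. This is only a sketch under that assumption: `ThrottleDemo` and `SimulatedDownloadAsync` are hypothetical names, and `Task.Delay` stands in for the network I/O.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical test harness for the SemaphoreSlim throttle: same loop
// shape as the answer, with a delay in place of the HTTP request.
public sealed class ThrottleDemo
{
    private int _running; // operations currently in flight
    private int _peak;    // highest value _running ever reached

    public int Peak => Volatile.Read(ref _peak);

    public async Task RunAsync(int itemCount, int maxConcurrent)
    {
        var tasks = new List<Task>(itemCount);
        using (var semaphore = new SemaphoreSlim(maxConcurrent))
        {
            for (int i = 0; i < itemCount; i++)
            {
                // Waits asynchronously until one of the slots is free.
                await semaphore.WaitAsync().ConfigureAwait(false);
                tasks.Add(SimulatedDownloadAsync(semaphore));
            }
            await Task.WhenAll(tasks).ConfigureAwait(false);
        }
    }

    private async Task SimulatedDownloadAsync(SemaphoreSlim semaphore)
    {
        int now = Interlocked.Increment(ref _running);
        try
        {
            // Raise the recorded peak with a compare-and-swap loop.
            int seen;
            while ((seen = Volatile.Read(ref _peak)) < now)
            {
                if (Interlocked.CompareExchange(ref _peak, now, seen) == seen) break;
            }
            await Task.Delay(5).ConfigureAwait(false); // stands in for network I/O
        }
        finally
        {
            // Leave the counter before freeing the slot, so _running
            // can never exceed the semaphore's capacity.
            Interlocked.Decrement(ref _running);
            semaphore.Release();
        }
    }
}
```

After `await new ThrottleDemo().RunAsync(50, 4)`, the `Peak` property should never be above 4. The same harness can be used to try different values for `maxConcurrentTasks` before pointing the real code at the server.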
