Optimizing download of many html files


Question

I have about a million URLs pointing to HTML pages on a public web server that I want to save to my disk. Each of these is about the same size, ~30 kilobytes. My URL list is split about evenly across 20 folders on disk, so for simplicity I create one Task per folder, and in each task I download one URL after the other, sequentially. That gives me about 20 parallel requests at any time. I'm on a relatively crappy 5 Mbps DSL connection.

This represents several gigabytes of data, so I'm expecting the process to take several hours, but I'm wondering if I could make the approach any more efficient. Am I likely making the most of my connection? How can I measure that? Is 20 parallel downloads a good number, or should I dial it up or down?

The language is F#. I'm using WebClient.DownloadFile for every URL, one WebClient per task.
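
For illustration, here is a minimal sketch of the setup described above. It assumes each folder's URL list is a text file with one URL per line; the paths (C:\urls, C:\pages), the file naming scheme and the downloadFolder helper are made up for the example:

open System.IO
open System.Net
open System.Threading.Tasks

// One task per folder; within a task, URLs are downloaded sequentially.
let downloadFolder (urlListFile: string) (outputDir: string) =
    Directory.CreateDirectory outputDir |> ignore
    use webClient = new WebClient()
    File.ReadLines urlListFile
    |> Seq.iteri (fun i url ->
        // Ask for a compressed response on every request (see the EDIT below).
        webClient.Headers.Add(HttpRequestHeader.AcceptEncoding, "gzip,deflate")
        webClient.DownloadFile(url, Path.Combine(outputDir, sprintf "%d.html" i)))

// ~20 folders means ~20 requests in flight at any time.
let tasks =
    Directory.GetFiles(@"C:\urls", "*.txt")   // hypothetical layout
    |> Array.map (fun listFile ->
        Task.Run(fun () ->
            downloadFolder listFile (Path.Combine(@"C:\pages", Path.GetFileNameWithoutExtension listFile))))

Task.WaitAll tasks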

==================================

EDIT: One thing that made a huge difference was adding a certain header to the request:

open System.Net
// Tell the server we accept compressed (gzip/deflate) responses
let webClient = new WebClient()
webClient.Headers.Add(HttpRequestHeader.AcceptEncoding, "gzip,deflate")

This cut the size of downloads from about 32 KB to 9 KB, resulting in enormous speed gains and disk space savings. Thanks to TerryE for mentioning it!

Answer

If you are using a downloader API, then make sure that it is issuing an

Accept-Encoding: gzip, deflate

request header, so that the site you are scraping knows to return compressed HTML. (Most web servers will be configured to compress HTML data streams if the client uses this request header to let the server know that it will accept compressed data streams.)

This will reduce the data transferred by roughly a factor of 4. (E.g. this page is about 40K of raw HTML, but only about 10K was transferred to my browser, because the HTML was compressed.)
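
If the downloader is the WebClient used in the question, one way to make sure compressed transfers are requested, and that the response is transparently decompressed before it is written to disk, is to enable automatic decompression on the underlying HttpWebRequest. This is only a sketch under that assumption; the class name is illustrative:

open System
open System.Net

// WebClient subclass that advertises gzip/deflate support and decompresses
// the response automatically, so saved files contain plain HTML.
type DecompressingWebClient() =
    inherit WebClient()
    override this.GetWebRequest(address: Uri) =
        let request = base.GetWebRequest(address)
        match request with
        | :? HttpWebRequest as http ->
            http.AutomaticDecompression <- DecompressionMethods.GZip ||| DecompressionMethods.Deflate
        | _ -> ()
        request

Used in place of new WebClient() in the download loop, this should make adding the Accept-Encoding header by hand unnecessary: setting AutomaticDecompression causes the header to be sent and the compressed stream to be inflated on the fly.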
