Mirror HTTP website, excluding certain files
Question
I'd like to mirror a simple password-protected web portal to some data that I'd like to keep mirrored and up to date. Essentially this website is just a directory listing with data organised into folders, and I don't really care about keeping HTML files and other formatting elements. However, some of the file types are too large to download, so I want to ignore these.
Using the wget -m -R/--reject flag nearly does what I want, except that all files get downloaded, and then any that match the -R flag get deleted.
Here's how I'm using wget:
wget --http-user userName --http-password password -R index.html,*tiff,*bam,*bai -m http://web.server.org/
This produces output like the following, confirming that an excluded file (index.html) (a) gets downloaded and (b) then gets deleted:
...
--2012-05-23 09:38:38-- http://web.server.org/folder/
Reusing existing connection to web.server.org:80.
HTTP request sent, awaiting response... 401 Authorization Required
Reusing existing connection to web.server.org:80.
HTTP request sent, awaiting response... 200 OK
Length: 2677 (2.6K) [text/html]
Saving to: `web.server.org/folder/index.html'
100%[======================================================================================================================>] 2,677 --.-K/s in 0s
Last-modified header missing -- time-stamps turned off.
2012-05-23 09:38:39 (328 MB/s) - `web.server.org/folder/index.html' saved [2677/2677]
Removing web.server.org/folder/index.html since it should be rejected.
...
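To see which files were fetched only to be thrown away, the log can be scanned for the removal messages. This is a minimal sketch, assuming the wget output was saved to a file and that the message wording matches the excerpt above; the function name `list_rejected` and the file name `wget.log` are hypothetical.

```sh
# Print the path of every file that wget downloaded and then deleted
# because it matched -R. Assumes log lines of the form:
#   Removing <path> since it should be rejected.
list_rejected() {
  sed -n 's/^Removing \(.*\) since it should be rejected\.$/\1/p' "$1"
}

# Usage (wget.log is a hypothetical file holding the saved output):
#   list_rejected wget.log
```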
Is there a way to force wget to reject the file before downloading it?
Is there an alternative that I should consider?
Also, why do I get a 401 Authorization Required error for every downloaded file, despite supplying a username and password? It's as if wget tries to connect unauthenticated every time before trying the username/password.
Thanks, Mark
Recommended answer
Pavuk (http://www.pavuk.org) looked like a promising alternative that allows you to mirror websites while excluding files based on URL patterns and filename extensions. However, pavuk 0.9.35 segfaults or dies randomly in the middle of long transfers and does not appear to be actively developed (this version was built in Nov 2008).
FYI, here's how I was using it:
pavuk -mode mirror -force_reget -preserve_time -progress -Robots -auth_scheme 3 -auth_name x -auth_passwd x -dsfx 'html,bam,bai,tiff,jpg' -dont_leave_site -remove_old -cdir /path/to/root -subdir /path/to/root -skip_url_pattern '*icons*' -skip_url_pattern '*styles*' -skip_url_pattern '*images*' -skip_url_pattern '*bam*' -skip_url_pattern '*solidstats*' http://web.server.org/folder 2>&1 | tee pavuk-date.log
In the end, wget --exclude-directories did the trick:
wget --mirror --continue --progress=dot:mega --no-parent \
--no-host-directories --cut-dirs=1 \
--http-user x --http-password x \
--exclude-directories='folder/*/folder_containing_large_data*' --reject "index.html*" \
--directory-prefix /path/to/local/mirror \
http://my.server.org/folder
Since the --exclude-directories wildcards don't span '/', you need to form your patterns quite specifically to avoid downloading entire folders.
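The effect of wildcards not spanning '/' can be illustrated with a small shell function that compares a pattern and a directory path one segment at a time. This is only a rough sketch of the general idea, not wget's actual implementation; the function name `matches_dir` and the sample directory names are hypothetical.

```sh
# Succeed (exit 0) only if every '/'-separated segment of the path matches
# the corresponding segment of the pattern, so '*' never crosses a '/'.
matches_dir() {
  pattern=$1
  path=$2
  while :; do
    pat_seg=${pattern%%/*}    # first segment of the pattern
    path_seg=${path%%/*}      # first segment of the path
    case $path_seg in
      $pat_seg) ;;            # unquoted, so it is treated as a glob
      *) return 1 ;;
    esac
    # Drop the segment just compared and note whether more remain.
    case $pattern in */*) pat_more=1; pattern=${pattern#*/} ;; *) pat_more=0 ;; esac
    case $path in */*) path_more=1; path=${path#*/} ;; *) path_more=0 ;; esac
    [ "$pat_more" -eq "$path_more" ] || return 1   # different depths
    [ "$pat_more" -eq 1 ] || return 0              # all segments matched
  done
}

# Hypothetical directory names, one and two levels below "folder":
for dir in folder/sampleA/folder_containing_large_data_01 \
           folder/sampleA/run1/folder_containing_large_data_01; do
  if matches_dir 'folder/*/folder_containing_large_data*' "$dir"; then
    echo "excluded: $dir"
  else
    echo "kept:     $dir"
  fi
done
```

Under these assumptions the pattern covers only directories exactly one level below folder/, so data nested more deeply would need its own pattern with an additional `*/` segment.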