Mirror HTTP website, excluding certain files


Question

I'd like to mirror a simple password-protected web portal that serves some data I want to keep mirrored and up to date. Essentially, the website is just a directory listing with data organised into folders, and I don't really care about keeping HTML files and other formatting elements. However, some file types are too large to download, so I want to ignore these.

Using wget -m with the -R/--reject flag nearly does what I want, except that all files get downloaded and are then deleted if they match the -R pattern.

Here's how I'm using wget:

wget --http-user userName --http-password password -R index.html,*tiff,*bam,*bai -m http://web.server.org/

This produces output like the following, confirming that an excluded file (index.html) (a) gets downloaded and (b) then gets deleted:

...
--2012-05-23 09:38:38-- http://web.server.org/folder/
Reusing existing connection to web.server.org:80.
HTTP request sent, awaiting response... 401 Authorization Required
Reusing existing connection to web.server.org:80.
HTTP request sent, awaiting response... 200 OK
Length: 2677 (2.6K) [text/html]
Saving to: `web.server.org/folder/index.html'
100%[======================================================================================================================>] 2,677 --.-K/s in 0s

Last-modified header missing -- time-stamps turned off.
2012-05-23 09:38:39 (328 MB/s) - `web.server.org/folder/index.html' saved [2677/2677]

Removing web.server.org/folder/index.html since it should be rejected.

...

Is there a way to force wget to reject the file before downloading it?
Is there an alternative that I should consider?

Also, why do I get a 401 Authorization Required error for every downloaded file, despite supplying a username and password? It's as if wget tries an unauthenticated connection every time before sending the username and password.

Thanks, Mark

Recommended answer

Pavuk (http://www.pavuk.org) looked like a promising alternative: it lets you mirror websites while excluding files based on URL patterns and filename extensions. However, pavuk 0.9.35 seg-faults or dies randomly in the middle of long transfers and does not appear to be actively developed (this version was built in Nov 2008).

FYI, here's how I was using it:
pavuk -mode mirror -force_reget -preserve_time -progress -Robots -auth_scheme 3 -auth_name x -auth_passwd x -dsfx 'html,bam,bai,tiff,jpg' -dont_leave_site -remove_old -cdir /path/to/root -subdir /path/to/root -skip_url_pattern '*icons*' -skip_url_pattern '*styles*' -skip_url_pattern '*images*' -skip_url_pattern '*bam*' -skip_url_pattern '*solidstats*' http://web.server.org/folder 2>&1 | tee pavuk-date.log

In the end, wget --exclude-directories did the trick:

wget --mirror --continue --progress=dot:mega --no-parent \
--no-host-directories --cut-dirs=1 \
--http-user x --http-password x \
--exclude-directories='folder/*/folder_containing_large_data*' --reject "index.html*" \
--directory-prefix /path/to/local/mirror \
http://my.server.org/folder

Since the --exclude-directories wildcards don't span '/', you need to write the patterns quite specifically to avoid downloading entire folders.
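To make the "wildcards don't span '/'" point concrete, here is a small sketch in plain POSIX shell. It is not wget's code: segment_match is a hypothetical helper that only emulates the behaviour described above, comparing a pattern and a directory path one '/'-separated segment at a time, so a '*' can never swallow a slash. The directory names are made up for illustration.

```sh
#!/bin/sh
# Hypothetical helper (not part of wget): succeed only if PATH matches PATTERN
# when each '*' is confined to a single '/'-separated segment.
segment_match() {
  pattern=$1
  path=$2
  while :; do
    pat_seg=${pattern%%/*}    # first segment of the pattern
    path_seg=${path%%/*}      # first segment of the path
    # The unquoted variable in a case pattern is interpreted as a glob.
    case $path_seg in
      $pat_seg) ;;
      *) return 1 ;;
    esac
    # Check whether each side still has further segments, and drop the one just compared.
    case $pattern in */*) pat_more=1; pattern=${pattern#*/} ;; *) pat_more=0 ;; esac
    case $path in */*) path_more=1; path=${path#*/} ;; *) path_more=0 ;; esac
    # A different number of segments can never match.
    if [ "$pat_more" -ne "$path_more" ]; then return 1; fi
    if [ "$pat_more" -eq 0 ]; then return 0; fi
  done
}

for dir in folder/a/bigdata folder/a/b/bigdata folder/bigdata; do
  if segment_match 'folder/*/big*' "$dir"; then
    echo "excluded:     $dir"
  else
    echo "not excluded: $dir"
  fi
done
```

Under this model, only the directory exactly one level below folder/ is matched by 'folder/*/big*'. If the large directories can sit at several depths, one option is to list a pattern per depth, since --exclude-directories accepts a comma-separated list (for example 'folder/*/big*,folder/*/*/big*').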
