wget reject still downloads file


Question

I only want the folder structure, but I couldn't figure out how to get it with wget. Instead I am using this:

wget -R pdf,css,gif,txt,png -np -r http://example.com

This should reject every file type listed after -R, but it seems that wget still downloads each file and then deletes it.

Is there a better way to just get the folder structure?

HTTP request sent, awaiting response... 200 OK
Length: 136796 (134K) [application/x-download]
Saving to: "example.com/file.pdf"

100%[=====================================>] 136,796 853K/s in 0.2s

2012-10-03 03:51:41 (853 KB/s) - "example.com/file.pdf" saved [136796/136796]

Removing example.com/file.pdf since it should be rejected.

In case anyone is wondering, this is for a client. They could tell me the structure, but it's a hassle since their IT guy has to do it, so I wanted to get it myself.

Recommended answer

That appears to be how wget was designed to work. When performing recursive downloads, non-leaf files that match the reject list are still downloaded so they can be harvested for links, then deleted.

From the in-code comments (recur.c):

Either --delete-after was specified, or we loaded this otherwise rejected (e.g. by -R) HTML file just so we could harvest its hyperlinks -- in either case, delete the local file.
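
The condition that comment describes can be modelled roughly as follows. This is a simplified sketch, not wget's actual code: the struct fetch_result and the function should_remove_local_copy() are hypothetical names invented for illustration. The point it shows is that the reject check is applied to a file that has already been fetched.

```c
#include <stdbool.h>

/* Hypothetical model of the state after one file has been retrieved.
   These names are illustrative and do not appear in wget. */
struct fetch_result {
    bool delete_after_opt;   /* models --delete-after being set */
    bool rejected_by_rules;  /* models the saved file matching the -R list */
};

/* Returns true when the local copy should be removed after the download.
   Either condition is sufficient, mirroring the "in either case" wording
   of the quoted comment. */
bool should_remove_local_copy(struct fetch_result r)
{
    return r.delete_after_opt || r.rejected_by_rules;
}
```

Under this model a rejected file still costs a full download; the reject list only decides whether the copy is kept afterwards.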

We ran into this on a past project where we had to mirror an authenticated site, and wget kept hitting the logout pages even though it was meant to reject those URLs. We could not find any option to change this behaviour.

The solution we ended up with was to download, hack and build our own version of wget. There's probably a more elegant approach to this, but the quick fix we used was to add the following rules to the end of the download_child_p() routine (modified to match your requirements):

  /* Extra rules */
  if (match_tail(url, ".pdf", 0)) goto out;
  if (match_tail(url, ".css", 0)) goto out;
  if (match_tail(url, ".gif", 0)) goto out;
  if (match_tail(url, ".txt", 0)) goto out;
  if (match_tail(url, ".png", 0)) goto out;
  /* --- end extra rules --- */

  /* The URL has passed all the tests.  It can be placed in the
     download queue. */
  DEBUGP (("Decided to load it.\n"));

  return 1;

 out:
  DEBUGP (("Decided NOT to load it.\n"));

  return 0;
}
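
To try the suffix rules outside the wget source tree, a standalone sketch along these lines may help. It assumes that match_tail() with a final argument of 0 performs a case-sensitive "ends with" comparison. The helpers ends_with() and should_skip() are hypothetical stand-ins written for this example, not wget functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a case-sensitive tail match:
   returns true if s ends with tail. */
bool ends_with(const char *s, const char *tail)
{
    assert(s != NULL && tail != NULL);
    size_t slen = strlen(s);
    size_t tlen = strlen(tail);
    if (tlen > slen)
        return false;
    return strcmp(s + (slen - tlen), tail) == 0;
}

/* Same suffix list as the extra rules in the patch. */
static const char *const rejected_suffixes[] = {
    ".pdf", ".css", ".gif", ".txt", ".png"
};

/* Returns true when the URL should be kept out of the download queue,
   i.e. where the patch would "goto out". */
bool should_skip(const char *url)
{
    size_t n = sizeof rejected_suffixes / sizeof rejected_suffixes[0];
    for (size_t i = 0; i < n; i++) {
        if (ends_with(url, rejected_suffixes[i]))
            return true;
    }
    return false;
}
```

With this sketch, should_skip("http://example.com/file.pdf") is true, while a directory URL such as "http://example.com/docs/" passes through. Because the comparison looks only at the end of the string and is case-sensitive, URLs like "file.PDF" or "file.pdf?x=1" are not skipped; handling those would need extra rules.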
