Why does wget only download the index.html for some websites?


Problem Description

I'm trying to use the wget command:

wget -p http://www.example.com 

to fetch all the files on the main page. For some websites it works, but in most cases it only downloads index.html. I've tried the wget -r command, but it doesn't work either. Does anyone know how to fetch all the files on a page, or just get a list of the files and the corresponding URLs on the page?

Recommended Answer

wget is also able to download an entire website. But because this can put a heavy load on the server, wget obeys the robots.txt file by default.

 wget -r -p http://www.example.com

The -p parameter tells wget to include all page requisites, such as images, stylesheets, and scripts. This means the downloaded HTML pages will look the way they should.

So what if you don't want wget to obey the robots.txt file? You can simply add -e robots=off to the command, like this:

 wget -r -p -e robots=off http://www.example.com

Many sites will not let you download the entire site and will check your browser's identity (the User-Agent header). To get around this, use -U mozilla to present a browser-like identity:

 wget -r -p -e robots=off -U mozilla http://www.example.com

Many website owners will not like the fact that you are downloading their entire site. If the server sees that you are downloading a large number of files, it may automatically add you to its blacklist. The way around this is to wait a few seconds between downloads. The way to do this with wget is to include --wait=X (where X is the number of seconds).
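For example, a crawl that pauses between retrievals would look like this (the two-second value is just an illustration; pick whatever delay the site can tolerate):

```shell
# Recursive download with page requisites, ignoring robots.txt,
# presenting a browser-like User-Agent, and pausing 2 seconds
# between retrievals to avoid hammering the server.
wget --wait=2 -r -p -e robots=off -U mozilla http://www.example.com
```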

You can also use the --random-wait parameter to let wget choose a random number of seconds to wait. To include this in the command:

wget --random-wait -r -p -e robots=off -U mozilla http://www.example.com
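As for the other half of the question, getting just a list of the files and their URLs, one approach (not part of the original answer) is wget's spider mode, which visits pages without saving them; the URLs can then be pulled out of its log output:

```shell
# Crawl one level deep without saving anything (--spider) and
# extract the unique URLs wget visits from its log on stderr.
wget --spider -r -l 1 -e robots=off http://www.example.com 2>&1 \
  | grep -o 'https\?://[^ ]*' \
  | sort -u
```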
