Why does wget only download the index.html for some websites?


Problem Description

I'm trying to use the wget command:

wget -p http://www.example.com 

to fetch all the files on the main page. For some websites it works, but in most cases it only downloads index.html. I've tried the wget -r command, but it doesn't work either. Does anyone know how to fetch all the files on a page, or just get a list of the files and the corresponding URLs on the page?

Recommended Answer

wget is also able to download an entire website. But because this can put a heavy load on the server, wget obeys the robots.txt file by default.

 wget -r -p http://www.example.com

The -p parameter tells wget to include all page requisites, such as images, stylesheets, and scripts. This means the downloaded HTML pages will look the way they should.

So what if you don't want wget to obey the robots.txt file? You can simply add -e robots=off to the command, like this:

 wget -r -p -e robots=off http://www.example.com

Many sites will not let you download the entire site and will check your browser's identity (the User-Agent header). To get around this, use -U mozilla to present a browser-like identity:

 wget -r -p -e robots=off -U mozilla http://www.example.com

Many website owners will not like the fact that you are downloading their entire site. If the server sees that you are downloading a large number of files, it may automatically add you to its blacklist. The way around this is to wait a few seconds between downloads. The way to do this with wget is to include --wait=X (where X is the number of seconds).
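For example, a crawl that pauses between retrievals would look like this (the two-second value is just an illustration; pick whatever delay the site can tolerate):

```shell
# Recursive download with page requisites, ignoring robots.txt,
# presenting a browser-like User-Agent, and pausing 2 seconds
# between retrievals to avoid hammering the server.
wget --wait=2 -r -p -e robots=off -U mozilla http://www.example.com
```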

You can also use the --random-wait parameter to let wget choose a random number of seconds to wait. To include this in the command:

wget --random-wait -r -p -e robots=off -U mozilla http://www.example.com
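As for the other half of the question, getting just a list of the files and their URLs, one approach (not part of the original answer) is wget's spider mode, which visits pages without saving them; the URLs can then be pulled out of its log output:

```shell
# Crawl one level deep without saving anything (--spider) and
# extract the unique URLs wget visits from its log on stderr.
wget --spider -r -l 1 -e robots=off http://www.example.com 2>&1 \
  | grep -o 'https\?://[^ ]*' \
  | sort -u
```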
