Using wget to fake a browser?

Problem description

I'd like to crawl a web site to build its sitemap.

The problem is, the site uses an htaccess file to block spiders, so the following command only downloads the homepage (index.html) and stops, even though it does contain links to other pages:

wget -mkEpnp -e robots=off -U Mozilla http://www.acme.com
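
For reference, the bundled short options expand as follows: -m is --mirror, -k is --convert-links, -E is --adjust-extension, -p is --page-requisites and -np is --no-parent. Spelled out, the same command (assuming GNU Wget's standard long option names) is:

wget --mirror --convert-links --adjust-extension --page-requisites --no-parent -e robots=off --user-agent=Mozilla http://www.acme.com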

Since I have no problem accessing the rest of the site with a browser, I assume the "-e robots=off -U Mozilla" options aren't enough to have wget pretend it's a browser.

Are there other options I should know about? Does wget handle cookies by itself?

Thanks.

--

I added those to wget.ini, to no avail:

hsts=0
robots = off
header = Accept-Language: en-us,en;q=0.5
header = Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
header = Connection: keep-alive
user_agent = Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0
referer = /
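
For what it's worth, the same settings can also be passed directly on the command line; a rough equivalent (a sketch, assuming a wget recent enough to support --no-hsts) would be:

wget -mkEpnp -e robots=off --no-hsts \
     --header="Accept-Language: en-us,en;q=0.5" \
     --header="Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8" \
     --header="Connection: keep-alive" \
     --user-agent="Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0" \
     --referer=/ \
     http://www.acme.com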

--

Found it.

The pages linked from the homepage were on a remote server, so wget was ignoring them. Just add "--span-hosts" to tell wget to follow them, and "-D www.remote.site.com" if you want to restrict spidering to that domain.
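
Putting it together, the working invocation would presumably look something like this (a sketch reusing the example host names from the post; -H is the short form of --span-hosts, -D of --domains, and -D takes a comma-separated list, so naming both hosts keeps the crawl confined to those two domains):

wget -mkEpnp -e robots=off -U Mozilla -H -D www.acme.com,www.remote.site.com http://www.acme.com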

Recommended answer

You might want to set the User-Agent to something more than just Mozilla, for example:

wget --user-agent="Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"
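
Applied to the mirroring command from the question, that might look like the following (a sketch reusing the example URL from the question):

wget -mkEpnp -e robots=off --user-agent="Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0" http://www.acme.com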
