Crawl website using wget and limit total number of crawled links


Problem description


I want to learn more about crawlers by playing around with the wget tool. I'm interested in crawling my department's website, and finding the first 100 links on that site. So far, the command below is what I have. How do I limit the crawler to stop after 100 links?

wget -r -o output.txt -l 0 -t 1 --spider -w 5 -A html -e robots=on "http://www.example.com"

Solution

You can't. wget doesn't support this, so if you want something like that, you would have to write a tool yourself.

You could fetch the main file, parse the links manually, and fetch them one by one with a limit of 100 items. But it's not something that wget supports.
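A rough sketch of that approach, using only wget plus standard shell tools (this is my own illustration, not a wget feature; it only handles absolute links inside href attributes, so relative links would first have to be resolved against the base URL):

  # Fetch the start page without saving it, pull out href targets,
  # keep the first 100, and spider each one.
  wget -q -O - "http://www.example.com" \
    | grep -oE 'href="[^"]+"' \
    | sed -e 's/^href="//' -e 's/"$//' \
    | head -n 100 \
    | while read -r url; do
        wget --spider "$url"
        sleep 5   # keep the 5-second politeness delay from the original command
      done

Because --spider only checks that each link exists, nothing is downloaded; drop it if you actually want the pages saved.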

You could take a look at HTTrack for website crawling too, it has quite a few extra options for this: http://www.httrack.com/
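For example, a minimal HTTrack invocation might look like the following; the output directory and the depth value of 2 are just placeholders, and HTTrack's own documentation lists the further options for capping the size of the crawl:

  httrack "http://www.example.com" -O ./mirror -r2

Here -O sets where the mirror is written and -r limits the recursion depth.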
