Nutch not crawling URLs except the one specified in seed.txt


Question

I am using Apache Nutch 1.12, and the URL I am trying to crawl is something like https://www.mywebsite.com/abc-def/, which is the only entry in my seed.txt file. Since I don't want any page that doesn't have "abc-def" in the URL to be crawled, I have put the following line in regex-urlfilter.txt:

+^https://www.mywebsite.com/abc-def/(.+)*$
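For reference, a minimal regex-urlfilter.txt sketch along these lines, with the dots escaped (a small tightening of the pattern above) and a couple of the stock skip rules kept. Keep in mind that the first matching rule wins, and URLs that match no rule are rejected:

# skip file:, ftp:, and mailto: URLs (stock rule)
-^(file|ftp|mailto):

# skip URLs containing characters that usually indicate queries, anchors, etc. (stock rule)
-[?*!@=]

# accept only pages under /abc-def/
+^https://www\.mywebsite\.com/abc-def/

# reject everything else (unmatched URLs are rejected anyway; this just makes it explicit)
-.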

When I try to run the following crawl command:

/bin/crawl -i -D solr.server.url=http://mysolr:3737/solr/coreName $NUTCH_HOME/urls/ $NUTCH_HOME/crawl 3
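A rough reading of that invocation, going by the bin/crawl usage in Nutch 1.x (the Solr URL and paths are specific to this setup):

# -i                           index the crawled data into the configured indexer (Solr here)
# -D solr.server.url=...       Java property forwarded to the underlying Nutch jobs
# $NUTCH_HOME/urls/            seed directory containing seed.txt
# $NUTCH_HOME/crawl            crawl directory (crawldb, linkdb, segments)
# 3                            number of crawl rounds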

It crawls and indexes just the one seed.txt URL, and in the 2nd iteration it just says:

Generator: starting at 2017-02-28 09:51:36
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: 0 records selected for fetching, exiting ...
Generate returned 1 (no new segments created)
Escaping loop: no more URLs to fetch now

When I change regex-urlfilter.txt to allow everything (+.), it starts indexing every URL on https://www.mywebsite.com, which I certainly don't want.

If anyone happens to have had the same problem, please share how you got past it.

Answer

Got that working after trying multiple things over the last 2 days. Here is the solution:

Since the website I was crawling was very heavy, the http.content.limit property (set in nutch-default.xml) was truncating each page to 65536 bytes, the default. The links I wanted to crawl unfortunately fell outside the truncated part, so Nutch wasn't crawling them. When I removed the limit by overriding the property in nutch-site.xml with the following values, it started crawling my pages:

<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the http://
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>
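For anyone unsure where this override lives: nutch-site.xml uses the usual Hadoop-style configuration layout, so the property goes inside the <configuration> root element. A minimal sketch (only the override shown; anything else in the file is up to your setup):

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <!-- Remove the 64 KB download limit so pages are fetched in full -->
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>
</configuration>

Note that pages already fetched under the old limit won't be re-fetched right away (the crawldb schedules re-fetches by interval), so starting from a fresh crawl directory is the simplest way to see the change take effect.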
