Why do I have different document counts in status and index?


Question

I'm following the StormCrawler Elasticsearch tutorial and playing around with it.

When searching in Kibana, I noticed that the number of hits for the 'status' index is far greater than for the 'index' index.

Example:

On the top left, you can see that there are 846 hits for the 'status' index. I assume that means it has crawled through 846 pages.

The 'index' index, by contrast, shows only 31 hits.

I understand that, functionally, index and status are different, as status is just responsible for the link metadata. The problem is that StormCrawler seems to be parsing many pages without indexing them.

So what I would like is to have the same number of hits on 'index' too, with the content displayed, instead of just 31.

Recommended answer

The 'status' index contains information about all the URLs the crawler has either fetched or discovered. This is roughly the equivalent of the crawldb in Nutch. The 'index' index contains the pages that have been fetched, parsed and, well, indexed.

Now if you look at the 'status' field within the status index, you'll find that there are different values indicating whether a URL has been DISCOVERED, FETCHED, etc. See the wiki page about the status stream. The URLs marked as DISCOVERED haven't been fetched yet and therefore can't be in the 'index' index. If you filter the content of the status index by status:FETCHED, you should see a number comparable to that of the target index.
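The comparison described above can also be done outside Kibana. Below is a minimal sketch in Python that uses only the standard library and the Elasticsearch `_count` and `_search` REST endpoints. It assumes a node at `http://localhost:9200`, the index names 'status' and 'index' from the question, and that the 'status' field is mapped as a keyword so that a terms aggregation works on it; the helper names are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: a local Elasticsearch node


def count_url(index, query=None):
    """Build the _count URL for an index, optionally filtered by a query string."""
    url = f"{ES_URL}/{index}/_count"
    if query:
        url += "?" + urllib.parse.urlencode({"q": query})
    return url


def get_count(index, query=None):
    """Call the _count API and return the number of matching documents."""
    with urllib.request.urlopen(count_url(index, query)) as resp:
        return json.load(resp)["count"]


def status_breakdown_body():
    """Search body that groups documents by the value of the 'status' field."""
    # assumption: 'status' is a keyword field, so it can be aggregated on
    return {"size": 0, "aggs": {"by_status": {"terms": {"field": "status"}}}}


def get_status_breakdown(index="status"):
    """Return a dict such as {'DISCOVERED': n, 'FETCHED': m} for the status index."""
    req = urllib.request.Request(
        f"{ES_URL}/{index}/_search",
        data=json.dumps(status_breakdown_body()).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        buckets = json.load(resp)["aggregations"]["by_status"]["buckets"]
    return {b["key"]: b["doc_count"] for b in buckets}
```

With a crawl running, `get_count("status", "status:FETCHED")` and `get_count("index")` should return numbers of the same order, while `get_count("status")` also includes the URLs that are still DISCOVERED.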

The Elasticsearch module in StormCrawler contains templates for Kibana that let you see the breakdown of URLs per status. If you haven't done so already, I'd recommend that you look at the video tutorials on YouTube.

"So what I would like to have is the same amount of hits on 'index' too with the content displayed. Instead of just 31."

It will eventually get there; you just need to give the crawler time to do its job (and to do so politely). Bear in mind that a crawler discovers URLs more quickly than it fetches them. Before you ask about speed, please read the FAQ.
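To see why the two counts drift apart, here is a toy model in Python. It is not StormCrawler code, and every number in it is hypothetical: it assumes a fixed fetch rate, a fixed number of previously unseen outlinks per fetched page, and no errors, redirects or duplicate links.

```python
def simulate(minutes, fetches_per_minute=10, new_links_per_page=5, seeds=1):
    """Toy model: return (urls_known, pages_fetched) after the given number of minutes."""
    known = seeds    # every URL the crawler has heard of (what the 'status' index holds)
    fetched = 0      # pages actually fetched and indexed (what the 'index' index holds)
    for _ in range(minutes):
        # Politeness caps how many of the pending URLs can be fetched each minute.
        batch = min(fetches_per_minute, known - fetched)
        fetched += batch
        # Each fetched page contributes newly discovered URLs to the backlog.
        known += batch * new_links_per_page
    return known, fetched


if __name__ == "__main__":
    for m in (1, 2, 3, 10):
        known, fetched = simulate(m)
        print(f"after {m:2d} min: {known} URLs known, {fetched} pages fetched")
```

In this model the number of known URLs always runs ahead of the number of fetched pages, which matches the pattern in the question: many hits in 'status', far fewer in 'index'.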
