Facing issue in Elasticsearch mapping of Nutch crawled document

Views: 123 · Published: 2017/8/7 3:16:48 · Tags: mysql, elasticsearch, web-crawler, nutch


Problem description


I am facing some serious issues while using Nutch and Elasticsearch for crawling.

We have two data storage engines in our app:

MySQL
Elasticsearch

Let's say I have 10 URLs stored in the urls table of a MySQL database. I want to fetch these URLs from the table at run time and write them into seed.txt for crawling, all in one go. Crawling then starts, and I index the crawled documents into an Elasticsearch index (let's say the url index). However, the Elasticsearch index only contains crawled data, so I want to keep a reference to the MySQL row inside each indexed document. That way I can fetch the crawled details of a particular URL for analytics. For example:

My table structure in MySQL is:

Table Url:

id url

1 www.google.com

The Elasticsearch index mapping I want is:

Index url:

{
  _id: "www.google.com",
  type: "doc",
  content: "Hello world",
  url_id: 1,
  . . .
}

Here url_id is the value of the id column for the crawled URL in the urls table.

I could create a separate index for each URL, but that is not ideal because I would end up with a large number of indices. How can I achieve this after crawling? Do I have to modify the Elasticsearch indexer? I am using Nutch 1.12 and Elasticsearch 1.7.1. Any help would be greatly appreciated.

Solution

You should pass url_id as additional metadata in your seed list and use the urlmeta and index-metadata plugins, so that the key/value pair gets passed on to the outlinks (if necessary) or is at least available at indexing time.

See the Nutch wiki for an explanation of how to index metatags.
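As a rough illustration of the first step, here is a minimal sketch that turns rows from the urls table into seed-list lines carrying url_id as metadata. It relies on the Nutch injector's seed format, where extra metadata follows the URL as tab-separated key=value pairs. The function names `build_seed_lines` and `write_seed_file`, the hard-coded sample row, and the choice to prepend `http://` to scheme-less URLs are assumptions made for this example, not part of the original answer; fetching the rows from MySQL is left out.

```python
from typing import Iterable, List, Tuple


def build_seed_lines(rows: Iterable[Tuple[int, str]]) -> List[str]:
    """Turn (id, url) rows into Nutch seed lines of the form 'url<TAB>url_id=<id>'."""
    lines = []
    for url_id, url in rows:
        # Assumption: URLs stored without a scheme are treated as plain HTTP.
        if "://" not in url:
            url = "http://" + url
        # Seed metadata is appended to the URL as tab-separated key=value pairs.
        lines.append(f"{url}\turl_id={url_id}")
    return lines


def write_seed_file(rows: Iterable[Tuple[int, str]], path: str = "seed.txt") -> None:
    """Write one seed line per row to the given file (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(build_seed_lines(rows)) + "\n")


if __name__ == "__main__":
    # Sample row mirroring the table in the question; real rows would come from MySQL.
    sample_rows = [(1, "www.google.com")]
    for line in build_seed_lines(sample_rows):
        print(line)
```

For the metadata to reach the index, the two plugins also have to be enabled in plugin.includes and told which key to carry. To the best of my knowledge that is done through the `urlmeta.tags` and `index.db.md` properties in nutch-site.xml, each set to `url_id`, but check the property names against the nutch-default.xml shipped with your Nutch version.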
