Facing issue in Elasticsearch mapping of Nutch crawled document


Problem description



    I am facing some serious issues while using Nutch and Elasticsearch for crawling.

    We have two data storage engines in our App.

    1. MySQL

    2. Elasticsearch

    Let's say I have 10 URLs stored in the urls table of a MySQL database. I want to fetch these URLs from the table at run time and write them into seed.txt for crawling, and I have written all of them into seed.txt in one go. The crawl then starts, and I index the crawled documents in Elasticsearch in a single index (let's say the url index). However, the Elasticsearch index only contains crawled data, so I want to keep a reference inside it that lets me fetch the crawled details of a particular URL for analytics purposes. For example:

    My table structure in MySQL is:

    Table urls:

    id  url
    1   www.google.com

    The document I want in the Elasticsearch index is:

    Index url:

    {
      _id: "www.google.com",
      type: "doc",
      content: "Hello world",
      url_id: 1,
      . . .
    }

    Here url_id is the value of the id column for the crawled URL in the urls table.
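To make the goal concrete, the lookup this reference would enable can be sketched as below. This is only an illustration: the helper name `build_url_id_query` is hypothetical, the field name `url_id` is taken from the desired document above, and the body is a plain term query that would be sent to the index's `_search` endpoint.

```python
import json


def build_url_id_query(url_id):
    """Build a search body that matches documents whose url_id equals the given MySQL id.

    Hypothetical helper for illustration; url_id is the field from the desired document.
    """
    return {"query": {"term": {"url_id": url_id}}}


if __name__ == "__main__":
    # Body that would be POSTed to the url index's _search endpoint.
    print(json.dumps(build_url_id_query(1)))
```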

    I could create a separate index for each URL, but that is not ideal because I would end up with a large number of indices. So how can I achieve this after crawling? Do I have to modify the Elasticsearch indexer? I am using Nutch 1.12 and Elasticsearch 1.7.1. Any help would be greatly appreciated.

    Solution

    You should pass url_id as additional metadata in your seed list and use the urlmeta and index-metadata plugins, so that the key/value pair is passed on to the outlinks (if necessary) or is at least available for indexing.

    See the Nutch wiki for an explanation of how to index metatags.
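As a rough sketch of the first half of that advice, the seed file could be generated so that each line carries the row id as metadata. Nutch seed lines accept tab-separated `key=value` pairs after the URL. The function names and the hard-coded `rows` list are hypothetical stand-ins for the real query against the urls table, and prefixing `http://` is an assumption made because the example table stores a bare host name.

```python
from pathlib import Path


def build_seed_lines(rows, meta_key="url_id"):
    """Turn (id, url) rows into seed lines of the form 'url<TAB>url_id=<id>'."""
    lines = []
    for row_id, url in rows:
        # Assumption: the table may store bare hosts, so add a scheme when missing.
        if "://" not in url:
            url = "http://" + url
        lines.append(f"{url}\t{meta_key}={row_id}")
    return lines


def write_seed_file(rows, path="seed.txt"):
    """Write one seed line per row to the given file."""
    Path(path).write_text("\n".join(build_seed_lines(rows)) + "\n", encoding="utf-8")


if __name__ == "__main__":
    # Stand-in for the rows fetched from the MySQL urls table.
    rows = [(1, "www.google.com")]
    write_seed_file(rows)
```

On the Nutch side, the urlmeta and index-metadata plugins would also need to be listed in `plugin.includes`, with `url_id` named in their settings. As far as I recall these are the `urlmeta.tags` and `index.db.md` properties, but treat those names as an assumption and check them against the nutch-default.xml shipped with your version.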
