How to index a pdf file in Elasticsearch 5.0.0 with ingest-attachment plugin?


Question

I'm new to Elasticsearch, and I read at https://www.elastic.co/guide/en/elasticsearch/plugins/master/mapper-attachments.html that the mapper-attachments plugin is deprecated in Elasticsearch 5.0.0.

I am now trying to index a PDF file and upload the attachment with the new ingest-attachment plugin.

What I've tried so far is:

curl -H 'Content-Type: application/pdf' -XPOST localhost:9200/test/1 -d @/cygdrive/c/test/test.pdf

but I get the following error:

{"error":{"root_cause":[{"type":"mapper_parsing_exception","reason":"failed to parse"}],"type":"mapper_parsing_exception","reason":"failed to parse","caused_by":{"type":"not_x_content_exception","reason":"Compressor detection can only be called on some xcontent bytes or compressed xcontent bytes"}},"status":400}

I would expect that the pdf file will be indexed and uploaded. What am I doing wrong?

I also tested Elasticsearch 2.3.3 but the mapper-attachments plugin is not valid for this version and I don't want to use any older version of Elasticsearch.

Solution

You need to make sure you have created your ingest pipeline with:

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "indexed_chars" : -1
      }
    }
  ]
}
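
For readers who prefer to create the pipeline from a script rather than from the console, here is a minimal sketch using only the Python standard library. The function name make_pipeline_request and the default base URL http://localhost:9200 are assumptions for illustration; the pipeline body is the one shown above. The function only builds the request and does not send it.

```python
import json
import urllib.request

# Same pipeline definition as in the console request above.
PIPELINE = {
    "description": "Extract attachment information",
    "processors": [
        {"attachment": {"field": "data", "indexed_chars": -1}}
    ],
}


def make_pipeline_request(base_url: str = "http://localhost:9200",
                          name: str = "attachment") -> urllib.request.Request:
    """Build (but do not send) the PUT request that creates the pipeline.

    base_url and name are illustrative defaults, not values from the answer.
    """
    return urllib.request.Request(
        url=f"{base_url}/_ingest/pipeline/{name}",
        data=json.dumps(PIPELINE).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
```

Passing the result to urllib.request.urlopen would send it, which assumes a running Elasticsearch node with the ingest-attachment plugin installed.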

Then index your document with a PUT (not a POST) request that references the pipeline you've created.

PUT my_index/my_type/my_id?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
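
It can help to see what the "data" value in the request above actually holds. The short sketch below decodes that sample string with Python's standard base64 module; it turns out to be a small RTF document rather than a PDF, which illustrates that the field carries the base64 form of the file's raw bytes, whatever the file type.

```python
import base64

# Sample value of the "data" field from the request above.
SAMPLE = "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="

# Decode back to the original bytes of the attached file.
decoded = base64.b64decode(SAMPLE)
print(decoded)  # an RTF snippet containing "Lorem ipsum dolor sit amet"
```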

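In your example, it should be something like the following. Note that the URL needs an index, a type, and an ID (my_type here is a placeholder, as in the request above), and that test.json is a hypothetical file you prepare containing a JSON document of the form {"data": "<base64-encoded contents of test.pdf>"}:

curl -H 'Content-Type: application/json' -XPUT 'localhost:9200/test/my_type/1?pipeline=attachment' -d @/cygdrive/c/test/test.json

Remember that the PDF content must be base64 encoded and sent inside a JSON document. Posting the raw PDF bytes is what produces the not_x_content_exception in your error.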
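
The base64 requirement above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names build_attachment_body and write_body_file are made up for this example, and the field name "data" matches the "field" configured in the attachment processor.

```python
import base64
import json


def build_attachment_body(pdf_bytes: bytes, field: str = "data") -> str:
    """Return a JSON document whose `field` holds the base64-encoded bytes."""
    encoded = base64.b64encode(pdf_bytes).decode("ascii")
    return json.dumps({field: encoded})


def write_body_file(pdf_path: str, json_path: str) -> None:
    """Read the PDF at pdf_path and write the JSON request body to json_path."""
    with open(pdf_path, "rb") as src:
        body = build_attachment_body(src.read())
    with open(json_path, "w", encoding="ascii") as dst:
        dst.write(body)
```

Calling write_body_file with your PDF path and an output path of your choice (both hypothetical here) produces a file that curl can send with -d @<that file>, so the request body is valid JSON instead of raw PDF bytes.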

Hope it will help you.

Edit 1

Please make sure to read these; they helped me a lot:

Elastic Ingest

Ingest Plugin

Ingest Presentation

Edit 2

Also, you must have the ingest-attachment plugin installed.

./bin/elasticsearch-plugin install ingest-attachment

Edit 3

Before you create your ingest processor (attachment), create your index and its mapping with the fields you will use, and make sure the mapping includes the data field (the same name as the "field" in your attachment processor), so that ingest can process your PDF content and fill in your data field.

I set the indexed_chars option to -1 in the ingest processor, which removes the limit on the number of characters extracted, so you can index large PDF files.

Edit 4

The mapping should be something like this:

PUT my_index
{ 
    "mappings" : { 
        "my_type" : { 
            "properties" : { 
                "attachment.data" : { 
                    "type": "text", 
                    "analyzer" : "brazilian" 
                } 
            } 
        } 
    } 
}

In this case, I use the brazilian analyzer, but you can remove it or use your own.
