How to index a pdf file in Elasticsearch 5.0.0 with ingest-attachment plugin?

Question

I'm new to Elasticsearch, and I read at https://www.elastic.co/guide/en/elasticsearch/plugins/master/mapper-attachments.html that the mapper-attachments plugin is deprecated in Elasticsearch 5.0.0.

I am now trying to index a PDF file with the new ingest-attachment plugin and upload the attachment.

What I've tried so far is:

curl -H 'Content-Type: application/pdf' -XPOST localhost:9200/test/1 -d @/cygdrive/c/test/test.pdf

But I get the following error:

{"error":{"root_cause":[{"type":"mapper_parsing_exception","reason":"failed to parse"}],"type":"mapper_parsing_exception","reason":"failed to parse","caused_by":{"type":"not_x_content_exception","reason":"Compressor detection can only be called on some xcontent bytes or compressed xcontent bytes"}},"status":400}

I expected the PDF file to be indexed and uploaded. What am I doing wrong?

I also tested Elasticsearch 2.3.3, but the mapper-attachments plugin is not valid for that version, and I don't want to use any older version of Elasticsearch.

Answer

You need to make sure you have created your ingest pipeline with:

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "indexed_chars" : -1
      }
    }
  ]
}

Then you can make a PUT (not a POST) request to your index using the pipeline you've created:

PUT my_index/my_type/my_id?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
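
The data value in the request above is a short base64-encoded RTF snippet. As a quick sanity check, decoding it shows the raw bytes that the attachment processor receives. This is only an illustration; the variable names sample and decoded are made up for this sketch.

```shell
# The base64 string used in the "data" field of the example request.
sample='e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0='

# Decode it back to the original RTF source.
decoded=$(printf '%s' "$sample" | base64 --decode)

# Print the decoded RTF markup.
printf '%s\n' "$decoded"
```

The decoded text is RTF markup wrapping the sentence "Lorem ipsum dolor sit amet", which is the content the pipeline extracts.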

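In your example, it should be something like this (the body is JSON with the base64-encoded file in the data field; my_type is a placeholder type name):

curl -H 'Content-Type: application/json' -XPUT 'localhost:9200/test/my_type/1?pipeline=attachment' -d "{\"data\": \"$(base64 -w 0 /cygdrive/c/test/test.pdf)\"}"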

Remember that the PDF content must be base64-encoded and sent as the value of the data field in a JSON body; the raw PDF file cannot be posted directly.
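
For larger PDFs it may be more convenient to write the JSON body to a temporary file instead of passing it on the command line. The following is a minimal sketch under some assumptions: the function names build_attachment_body and index_pdf are hypothetical, the attachment pipeline already exists, and the target URL (for example localhost:9200/test/my_type/1) is a placeholder you would adapt.

```shell
# Sketch only: function names and the target URL are illustrative assumptions.

# Print a JSON body of the form {"data": "<base64 of file>"} for the file in $1.
build_attachment_body() {
    printf '{"data": "'
    # Strip the line wrapping that some base64 implementations add.
    base64 < "$1" | tr -d '\n'
    printf '"}'
}

# Index one file through the "attachment" pipeline.
# $1 = path to the PDF, $2 = document URL such as localhost:9200/test/my_type/1
index_pdf() {
    body_file=$(mktemp) || return 1
    build_attachment_body "$1" > "$body_file"
    # Reading the body from a file avoids command-line length limits.
    curl -s -H 'Content-Type: application/json' \
         -XPUT "$2?pipeline=attachment" \
         --data-binary "@$body_file"
    status=$?
    rm -f "$body_file"
    return $status
}
```

With a running node, a call could look like: index_pdf /cygdrive/c/test/test.pdf 'localhost:9200/test/my_type/1'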

Hope it helps.

Edit 1: Please make sure to read these; they helped me a lot:

Elastic ingest

Ingest plugin

Ingest demo

Edit 2

Also, you must have the ingest-attachment plugin installed:

./bin/elasticsearch-plugin install ingest-attachment

Edit 3

Before you create your ingest processor (attachment), create your index and its mapping with the fields you will use, and make sure the mapping includes the data field (the same name as the "field" in your attachment processor), so that ingest can process it and populate your document with the PDF content.

I set the indexed_chars option in the ingest processor to -1 so that you can index large PDF files. By default the processor extracts only the first 100000 characters; -1 removes that limit.

Edit 4

The mapping should be something like this:

PUT my_index
{ 
    "mappings" : { 
        "my_type" : { 
            "properties" : { 
                "attachment.data" : { 
                    "type": "text", 
                    "analyzer" : "brazilian" 
                } 
            } 
        } 
    } 
}

In this case, I use the brazilian analyzer, but you can remove it or use your own.
