Elasticsearch Parse Exception error when attempting to index PDF

Views: 201 Published: 2017/8/6 23:30:39 Tags: pdf base64 elasticsearch apache-tika osx-server

Problem description

I'm just getting started with elasticsearch. Our requirement has us needing to index thousands of PDF files and I'm having a hard time getting just ONE of them to index successfully.

Installed the Attachment Type plugin and got response: Installed mapper-attachments.

Followed the Attachment Type in Action tutorial but the process hangs and I don't know how to interpret the error message. Also tried the gist which hangs in the same place.

$ curl -X POST "localhost:9200/test/attachment/" -d json.file 
{"error":"ElasticSearchParseException[Failed to derive xcontent from (offset=0, length=9): [106, 115, 111, 110, 46, 102, 105, 108, 101]]","status":400}

More details:

The json.file contains an embedded Base64 PDF file (as per instructions). The first line of the file appears correct (to me anyway): {"file":"JVBERi0xLjQNJeLjz9MNCjE1OCAwIG9iaiA8...

I'm not sure if maybe the json.file is invalid or if maybe elasticsearch just isn't set up to parse PDFs properly?!?

Encoding - Here's how we're encoding the PDF into json.file (as per tutorial):

coded=`cat fn6742.pdf | perl -MMIME::Base64 -ne 'print encode_base64($_)'`
json="{\"file\":\"${coded}\"}"
echo "$json" > json.file

also tried:

coded=`openssl base64 -in fn6742.pdf`

log:

[2012-06-07 12:32:16,742][DEBUG][action.index             ] [Bailey, Paul] [test][0], node[AHLHFKBWSsuPnTIRVhNcuw], [P], s[STARTED]: Failed to execute [index {[test][attachment][DauMB-vtTIaYGyKD4P8Y_w], source[json.file]}]
org.elasticsearch.ElasticSearchParseException: Failed to derive xcontent from (offset=0, length=9): [106, 115, 111, 110, 46, 102, 105, 108, 101]
    at org.elasticsearch.common.xcontent.XContentFactory.xContent(XContentFactory.java:147)
    at org.elasticsearch.common.xcontent.XContentHelper.createParser(XContentHelper.java:50)
    at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:451)
    at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:437)
    at org.elasticsearch.index.shard.service.InternalIndexShard.prepareCreate(InternalIndexShard.java:290)
    at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:210)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction.java:532)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:430)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:680)

Hoping someone can help me see what I'm missing or did wrong?

Solution

The following error points to the source of the problem.

Failed to derive xcontent from (offset=0, length=9): [106, 115, 111, 110, 46, 102, 105, 108, 101]

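The UTF-8 codes [106, 115, 111, ...] show that you are trying to index the string "json.file" instead of the contents of the file. The length of 9 in the message matches the nine characters of that name.

To index the contents of the file, simply add the character "@" in front of the file name. This tells curl to read the request body from that file.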
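To check this reading of the error yourself, the byte codes can be turned back into text. The following is a minimal sketch in plain shell; the variable names `codes` and `decoded` are chosen here for illustration and are not part of the original question.

```shell
# Byte values copied from the ElasticSearchParseException message.
codes="106 115 111 110 46 102 105 108 101"

decoded=""
for c in $codes; do
  # Convert the decimal value to a 3-digit octal escape, then let
  # printf expand that escape into the corresponding character.
  decoded="$decoded$(printf "\\$(printf '%03o' "$c")")"
done

echo "$decoded"
```

The result is the literal file name, which is what curl sent as the request body.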

curl -X POST "localhost:9200/test/attachment/" -d @json.file
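For completeness, here is one way the JSON body might be prepared before posting it. This is only a sketch under assumptions: `sample.pdf` is a hypothetical placeholder created on the spot so the example is self-contained, and a `base64` command-line utility is assumed to be available. It encodes the whole file in one pass and removes line breaks so that the Base64 value is a single JSON string.

```shell
# Hypothetical stand-in for a real PDF, so the sketch runs on its own.
printf '%%PDF-1.4 placeholder' > sample.pdf

# Encode the whole file in one pass and drop the line breaks that
# base64 tools may insert, so the value fits in one JSON string.
coded=$(base64 < sample.pdf | tr -d '\n')

# Write the JSON document without a trailing newline.
printf '{"file":"%s"}' "$coded" > json.file

# Then post it with the "@" prefix (needs a running node):
# curl -X POST "localhost:9200/test/attachment/" -d @json.file
```

Note that curl's -d option strips carriage returns and newlines when it reads from a file; --data-binary sends the file unmodified if that matters for your payload.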
