Solr ExtractingRequestHandler giving empty content for PDF documents

Published: 2020/5/25 4:43:34. Tags: pdf, solr, apache-tika, solr-cell

Problem description

I am using ExtractingRequestHandler in Solr to extract document content and index it. It works fine for all Microsoft documents, but for PDFs the extracted content is empty. I have also tried extractOnly=true with curl, and that returns an empty body as well.

I have used Tika independently on the same documents, and it extracts the content just fine. The difference is that when running it independently I use the BodyContentHandler that comes with Tika instead of the SolrContentHandler that Solr uses. Has anybody seen this?

I would really rather let Solr handle it than me using Tika to extract content outside of Solr.
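
For anyone trying to reproduce the extractOnly check described above, here is a minimal sketch in Python using only the standard library. It assumes the handler is registered at the conventional /update/extract path and that Solr runs at http://localhost:8983/solr with a core named mycore; the core name, the sample file name, and both function names are hypothetical and not taken from the question. With extractOnly=true Solr returns what Tika extracted without indexing anything, so an empty body here points at the extraction step rather than at the schema or field mapping.

```python
import urllib.parse
import urllib.request


def build_extract_url(base_url, core, extract_only=True, extract_format="text"):
    """Build the request URL for Solr's extracting handler.

    Assumes the handler is mapped to /update/extract in solrconfig.xml.
    """
    params = {"wt": "json"}
    if extract_only:
        # Return the extracted content instead of indexing the document.
        params["extractOnly"] = "true"
        # "text" asks for plain text; Solr's default for extractOnly is XHTML.
        params["extractFormat"] = extract_format
    query = urllib.parse.urlencode(params)
    return f"{base_url.rstrip('/')}/{core}/update/extract?{query}"


def post_for_extraction(pdf_path, base_url="http://localhost:8983/solr", core="mycore"):
    """POST the raw PDF bytes to Solr and return the response body as text.

    The base URL and core name are assumptions; adjust them to your setup.
    """
    url = build_extract_url(base_url, core)
    with open(pdf_path, "rb") as handle:
        payload = handle.read()
    request = urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/pdf"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Only builds and prints the URL; call post_for_extraction("sample.pdf")
    # against a running Solr instance to see the extracted content.
    print(build_extract_url("http://localhost:8983/solr", "mycore"))
```

Comparing this response with the output of standalone Tika on the same file helps narrow down whether the difference comes from the Tika version bundled with Solr or from the content handler.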
