Solr ExtractingRequestHandler giving empty content for PDF documents

Views: 26 | Posted: 2021/11/14 23:46:12 | Tags: pdf, solr, apache-tika, solr-cell


Problem description

I am using ExtractingRequestHandler in Solr to extract document content and index it. It works fine for all Microsoft documents, but for PDFs the extracted content is empty. I have also tried extractOnly=true with curl, and that returns an empty body as well.
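For reference, the extractOnly check described above could be reproduced with something like the following. This is only a minimal sketch: the Solr base URL, the core name `mycore`, and the handler path `/update/extract` are assumptions rather than details from the post, and the helper `non_metadata_text` assumes the JSON response holds the extracted text as a top-level string value next to a `*_metadata` entry, which may differ between Solr versions.

```python
import json
import urllib.parse
import urllib.request

# Assumptions (not from the original post): Solr location and core name.
SOLR_BASE = "http://localhost:8983/solr"
CORE = "mycore"


def build_extract_url(base=SOLR_BASE, core=CORE):
    """Build an extract-only URL so Solr returns the Tika output without indexing."""
    params = {
        "extractOnly": "true",    # return the extraction result, do not index
        "extractFormat": "text",  # plain text instead of the default XHTML
        "wt": "json",
    }
    return f"{base}/{core}/update/extract?{urllib.parse.urlencode(params)}"


def extract_only(pdf_path, url=None):
    """POST the raw PDF bytes to the handler and return the decoded JSON response."""
    url = url or build_extract_url()
    with open(pdf_path, "rb") as fh:
        req = urllib.request.Request(
            url,
            data=fh.read(),
            headers={"Content-Type": "application/pdf"},
            method="POST",
        )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def non_metadata_text(response):
    """Join top-level string values whose key does not end in '_metadata'.

    Assumed response layout; an empty result corresponds to the symptom
    described in the question.
    """
    return "".join(
        value
        for key, value in response.items()
        if isinstance(value, str) and not key.endswith("_metadata")
    ).strip()
```

Calling `non_metadata_text(extract_only("sample.pdf"))` (with `sample.pdf` as a placeholder file name) on a PDF and on a Word document would show whether the empty body is specific to PDFs.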

I have run Tika standalone on the same documents, and it extracts the content just fine. The difference is that in the standalone case I use the BodyContentHandler that ships with Tika, whereas Solr uses its own SolrContentHandler. Has anybody seen this?

I would much rather let Solr handle the extraction than use Tika to extract the content outside of Solr.
