Solr query on a PDF file is not returning highlighted content


Question

I set up Solr 6.5.1 on my Debian server today, but I am having trouble getting the PDF text content back. Searching works: the document shows up when I query for, say, my name, "juan". However, the highlighted snippets (the str entries) do not appear with each result as they are supposed to.

This is the example query:

This is the result:

<response>
    <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">1</int>
        <lst name="params">
            <str name="hl.snippets">20</str>
            <str name="q">juan</str>
            <str name="hl">true</str>
            <str name="fl">title</str>
            <str name="hl.usePhraseHighlighter">true</str>
            <str name="hl.fl">content</str>
            <str name="wt">xml</str>
        </lst>
    </lst>
    <result name="response" numFound="1" start="0">
        <doc>
            <arr name="title">
                <str>CV_Juan_Jara_ultimo</str>
            </arr>
        </doc>
    </result>
    <lst name="highlighting">
        <lst name="/solr-6.5.1/mydocs/CV_Juan_Jara_ultimo.pdf"/>
    </lst>
</response>
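
The params echoed in the responseHeader above correspond to a request along the following lines. This is a minimal Python sketch, not the exact request that was sent: it assumes Solr's standard /select handler, the host and port from the curl commands in this post, and the core name ex from the bin/post command. The function name `build_highlight_query` is made up for illustration.

```python
from urllib.parse import urlencode

# Assumed base URL: localhost:8983 as in the curl commands, core "ex" as in
# the bin/post command, and the standard /select request handler.
SOLR_SELECT = "http://localhost:8983/solr/ex/select"


def build_highlight_query(term, hl_field="content"):
    """Build a select URL carrying the parameters echoed in the responseHeader."""
    params = [
        ("q", term),
        ("fl", "title"),
        ("hl", "true"),
        ("hl.fl", hl_field),  # field the highlighter should read
        ("hl.snippets", "20"),
        ("hl.usePhraseHighlighter", "true"),
        ("wt", "xml"),
    ]
    return SOLR_SELECT + "?" + urlencode(params)


print(build_highlight_query("juan"))
```

Passing a different `hl_field` (for example `_text_`) points the highlighter at another field without changing the rest of the request.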

Additionally, the log shows all of the PDF text, so I assume it was indexed correctly (I indexed the PDF with the command: bin/post -c ex mydocs/CV_Juan_Jara_ultimo.pdf).
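
Whether the text was indexed and whether highlight snippets come back are separate questions, and the second one can be read directly from the response. The sketch below, using only Python's standard library, lists the documents whose entry in the highlighting section is empty, as in the result above. The helper name `empty_highlight_ids` is hypothetical.

```python
import xml.etree.ElementTree as ET


def empty_highlight_ids(response_xml):
    """Return the ids of documents whose highlighting entry has no snippets."""
    root = ET.fromstring(response_xml)
    highlighting = root.find("lst[@name='highlighting']")
    if highlighting is None:
        return []  # highlighting was not requested or not returned
    # Each child <lst> is one document; no child elements means no snippets.
    return [doc.get("name") for doc in highlighting.findall("lst") if len(doc) == 0]


# Trimmed copy of the highlighting section from the response above.
sample = """<response>
  <lst name="highlighting">
    <lst name="/solr-6.5.1/mydocs/CV_Juan_Jara_ultimo.pdf"/>
  </lst>
</response>"""

print(empty_highlight_ids(sample))
```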

I added the "content" field to the schema using curl:

curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-field" : {
     "name":"text",
     "type":"text_general",
     "indexed":"true",
     "stored":"false",
     "multiValued":"true"
     }
}' localhost:8983/solr/ex/schema

Do you know what is wrong?

All I want to do is search for a topic in my PDF and then get all the results highlighted like this:

Recommended answer

SOLVED: the solution that finally worked for me was to replace the _text_ field in the schema with this curl command:

curl -X POST -H 'Content-type:application/json' --data-binary '{
 "replace-field" : {
 "name":"_text_",
 "type":"text_general",
 "indexed":"true",
 "stored":"true",
 "multiValued":"true"
 }
}' http://localhost:8983/solr/ex/schema

This is because the _text_ field comes with "stored":"false" by default, and Solr can only build highlight snippets from fields whose text is stored.
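
To confirm how a field is currently defined, the Schema API can be asked for that single field. The sketch below is a hedged illustration rather than a definitive check: it assumes the same host, port and core (ex) as the curl command above, and the helper names `field_url`, `is_stored` and `fetch_is_stored` are invented for this example. The `showDefaults=true` parameter asks Solr to include properties inherited from the field type; if "stored" is still absent, the helper returns None instead of guessing.

```python
import json
from urllib.request import urlopen

# Assumed: same host, port and core ("ex") as the curl command above.
CORE_URL = "http://localhost:8983/solr/ex"


def field_url(field_name, core_url=CORE_URL):
    """URL of the Schema API entry for one field."""
    return f"{core_url}/schema/fields/{field_name}?showDefaults=true&wt=json"


def is_stored(field_response):
    """Read "stored" from a parsed Schema API field response.

    Returns True or False, or None when the property is not reported.
    Accepts both JSON booleans and the strings "true"/"false".
    """
    value = field_response.get("field", {}).get("stored")
    if value is None:
        return None
    return str(value).lower() == "true"


def fetch_is_stored(field_name, core_url=CORE_URL):
    """Query a running Solr instance (needs network access to the server)."""
    with urlopen(field_url(field_name, core_url)) as resp:
        return is_stored(json.load(resp))
```

With Solr running, `fetch_is_stored("_text_")` would be expected to change from False to True once the replace-field command has been applied.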

NOTE: Remember to index all of your files into the core again if you indexed them before replacing this schema field.
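
Re-indexing can reuse the bin/post command from the question. As a small sketch, assuming the PDFs live in a single directory such as mydocs and the core is ex, the helper below (the name `post_commands` is hypothetical) builds one command per PDF and only prints them. Running them, for example with `subprocess.run(cmd, check=True)` from the Solr install directory, is left to the reader.

```python
from pathlib import Path


def post_commands(doc_dir, core="ex", post_tool="bin/post"):
    """Build one `bin/post -c <core> <file>` argument list per PDF in doc_dir."""
    pdfs = sorted(Path(doc_dir).glob("*.pdf"))
    return [[post_tool, "-c", core, str(pdf)] for pdf in pdfs]


# Print the commands instead of executing them.
for cmd in post_commands("mydocs"):
    print(" ".join(cmd))
```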
