SOLR/LUCENE Experts, please help me design a simple keyword search from PDF index?


Question




I dabbled with Solr but couldn't figure out a way to tailor it to my requirement.

What I have :

A bunch of PDF files.
A set of keywords.

What I am trying to achieve :

Index the PDF files (solrcell - done)
Search for a keyword (works ok)
Tailor the output to spit out the names of the PDF files, an excerpt where the keyword occurred (No clue/idea how to)

Tried manipulating ResponseHandler/Schema.xml/Solrconfig.xml to no avail.

Lucene/solr experts, do you think what I am trying to achieve is possible?

I put my existing code on GitHub at https://github.com/ThinkCode/solr_search (it is mostly Solr's default example with minor modifications to the fields; all the content is stored in one content field).

Notable changes in schema.xml:

<solrQueryParser defaultOperator="AND"/>

<field name="id" type="string" indexed="true" stored="true" required="true"/>

<field name="content" type="text_general" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>

<dynamicField name="*" type="string" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>

<copyField source="*" dest="content"/>

Current Output :

(query) http://localhost:8983/solr/select/?q=Java+Servlet&version=2.2&start=0&rows=10&indent=on

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">13</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="start">0</str>
      <str name="q">Java Servlet</str>
      <str name="version">2.2</str>
      <str name="rows">10</str>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <arr name="content_type"><str>application/pdf</str></arr>
      <str name="id">tutorial.pdf</str>
      <str name="subject">Solr</str>
      <arr name="title"><str>Solr tutorial</str></arr>
    </doc>
  </result>
</response>

What I am looking for is 'extracted fragment (line) where the keyword was found'.

In the query above, I searched for 'Java Servlet' and it returned the document. I would like the context 'Solr can run in any Java Servlet Container of your choice' to be returned in the output XML.

Solution

To get snippets of text around the matched keywords, see http://wiki.apache.org/solr/HighlightingParameters
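As a rough sketch of what that looks like in practice, the snippet below builds a select URL with highlighting switched on and pulls the snippets out of the XML response. It assumes the endpoint from the question and that the stored `content` field is the one to highlight; the function names and the values chosen for `hl.snippets` and `hl.fragsize` are illustrative, not taken from the question.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Endpoint taken from the question; adjust for your own install.
SOLR_SELECT = "http://localhost:8983/solr/select/"


def build_highlight_url(keywords, field="content", snippets=3, fragsize=150):
    """Return a select URL that asks Solr for highlighted snippets."""
    params = {
        "q": keywords,
        "fl": "id",               # only return the id in the normal result list
        "hl": "true",             # turn highlighting on
        "hl.fl": field,           # field to take snippets from (must be stored)
        "hl.snippets": snippets,  # max snippets per document (assumed value)
        "hl.fragsize": fragsize,  # approximate snippet length in characters (assumed value)
        "rows": 10,
        "indent": "on",
    }
    return SOLR_SELECT + "?" + urlencode(params)


def parse_snippets(xml_text):
    """Map each document id to its list of highlighted snippets."""
    root = ET.fromstring(xml_text)
    result = {}
    # Highlights come back in a separate <lst name="highlighting"> section,
    # keyed by the document's unique key (here: the id field).
    for doc in root.findall("./lst[@name='highlighting']/lst"):
        result[doc.get("name")] = [s.text for s in doc.iter("str")]
    return result


if __name__ == "__main__":
    print(build_highlight_url("Java Servlet"))
```

By default the matched terms come back wrapped in `<em>` tags inside each snippet; `hl.simple.pre` and `hl.simple.post` change that markup if needed.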

To get the filename of the indexed PDF as part of the response, simply add a field with that information (it should be a string field, non-indexed, stored). Of course, you have to populate this new field at index-time.
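One way to populate such a field at index time, sketched under some assumptions: the Solr Cell handler is mounted at `/update/extract` as in the default example, and the schema has a stored string field, here hypothetically called `filename`. Solr Cell's `literal.<fieldname>` request parameters set a field to a fixed value for the document being extracted. The helper names are made up for illustration.

```python
import os
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed location of the Solr Cell (ExtractingRequestHandler) endpoint.
SOLR_EXTRACT = "http://localhost:8983/solr/update/extract"

# Assumed schema.xml entry for the hypothetical field:
#   <field name="filename" type="string" indexed="false" stored="true"/>


def build_extract_url(pdf_path, field="filename"):
    """Return an extract URL that stores the PDF's file name in `field`."""
    name = os.path.basename(pdf_path)
    params = {
        "literal.id": name,         # unique key, as in the question's output
        "literal." + field: name,   # fixed value for the stored filename field
        "commit": "true",
    }
    return SOLR_EXTRACT + "?" + urlencode(params)


def index_pdf(pdf_path):
    """POST the raw PDF bytes to Solr Cell; returns the HTTP status code."""
    with open(pdf_path, "rb") as fh:
        body = fh.read()
    req = Request(
        build_extract_url(pdf_path),
        data=body,
        headers={"Content-Type": "application/pdf"},
    )
    with urlopen(req) as resp:
        return resp.status
```

With the field stored, adding it to `fl` (for example `fl=id,filename`) returns it alongside each hit.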
