Retrieving extracted text with Apache Solr
Question
I'm new to Apache Solr, and I want to use it to index PDF files. So far I have managed to get it up and running, and I can now search the PDF files I have added.
However, I also need to be able to retrieve the extracted text from the search results.
I found an XML snippet in the default solrconfig.xml that concerns exactly that:
<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler" startup="lazy">
<lst name="defaults">
<!-- All the main content goes into "text"... if you need to return
the extracted text or do highlighting, use a stored field. -->
<str name="fmap.content">text</str>
<str name="lowernames">true</str>
<str name="uprefix">ignored_</str>
<!-- capture link hrefs but ignore div attributes -->
<str name="captureAttr">true</str>
<str name="fmap.a">links</str>
<str name="fmap.div">ignored_</str>
</lst>
</requestHandler>
From what I understand from this article (http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika), I think I have to add a new field to schema.xml (e.g. "content") that has stored="true" and indexed="true". However, I'm not sure exactly how to do this.
Any help is appreciated, thanks.
Recommended answer
Add a schema.xml that looks like this:
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="whatever" version="1.2">
<types>
<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
<fieldType name="int" class="solr.TrieIntField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
<fieldType name="float" class="solr.TrieFloatField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
<fieldType name="long" class="solr.TrieLongField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
<fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
<fieldType name="date" class="solr.TrieDateField" omitNorms="true" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<charFilter class="solr.MappingCharFilterFactory" mapping="../../mapping-ISOLatin1Accent.txt"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<charFilter class="solr.MappingCharFilterFactory" mapping="../../mapping-ISOLatin1Accent.txt"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
</types>
<fields>
<field name="internal_id" type="string" indexed="true" stored="true"/>
<field name="cat" type="int" indexed="true" stored="true"/>
<field name="desc" type="text" indexed="true" stored="true"/>
</fields>
<uniqueKey>internal_id</uniqueKey>
<defaultSearchField>desc</defaultSearchField>
<solrQueryParser defaultOperator="OR"/>
<similarity class="org.apache.lucene.search.DefaultSimilarity"/>
</schema>
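If a field is declared with stored="true", it shows up in the results by default. For extracted PDF text, this means the field that fmap.content maps to ("text" in the solrconfig.xml snippet above) has to be a stored field in schema.xml, as the comment in that snippet points out.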
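As a rough sketch of how this could apply to the extracting handler from the question: the handler maps the extracted body to a field named "text" (fmap.content) and prefixes unknown metadata fields with "ignored_" (uprefix), so the schema would need matching definitions. The "ignored" type and the "ignored_*" dynamic field below are assumptions modeled on Solr's example schema and are not part of the answer above; adjust the names to your own setup.

```xml
<!-- Sketch only: possible additions to the schema.xml shown in the answer. -->

<!-- Inside <types>: a type that neither indexes nor stores its values
     (assumed here, as in Solr's example schema) -->
<fieldType name="ignored" class="solr.StrField" indexed="false" stored="false" multiValued="true"/>

<!-- Inside <fields>: stored target for fmap.content=text, so the extracted
     body can be returned in results and used for highlighting -->
<field name="text" type="text" indexed="true" stored="true"/>

<!-- Inside <fields>: catches metadata fields renamed by uprefix=ignored_ -->
<dynamicField name="ignored_*" type="ignored" multiValued="true"/>
```

With something like this in place, a query such as q=text:solr&fl=internal_id,text should return the extracted text with each hit. Because internal_id is the uniqueKey in this schema, each upload to /update/extract would also need to supply it, for example through the handler's literal.internal_id request parameter.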