Retrieving extracted text with Apache Solr

Views: 291 Published: 2016/5/21 13:22:58 Tags: apache cell solr apache-tika

Question

I'm new to Apache Solr, and I want to use it for indexing PDF files. I managed to get it up and running, and I can now search for added PDF files.

However, I need to be able to retrieve the searched text from the results.

I found an XML snippet in the default solrconfig.xml concerning exactly that:

<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler" startup="lazy">
  <lst name="defaults">
    <!-- All the main content goes into "text"... if you need to return
         the extracted text or do highlighting, use a stored field. -->
    <str name="fmap.content">text</str>
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>

    <!-- capture link hrefs but ignore div attributes -->
    <str name="captureAttr">true</str>
    <str name="fmap.a">links</str>
    <str name="fmap.div">ignored_</str>
  </lst>
</requestHandler>

From what I gather from this article (http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika), I think I have to add a new field to schema.xml (e.g. "content") that has stored="true" and indexed="true". However, I'm not sure exactly how to do that.

Any help appreciated, thx.

Recommended answer

Add a schema.xml that looks like this:

<?xml version="1.0" encoding="UTF-8" ?>

<schema name="whatever" version="1.2">
    <types>
        <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
        <fieldType name="int" class="solr.TrieIntField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
        <fieldType name="float" class="solr.TrieFloatField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
        <fieldType name="long" class="solr.TrieLongField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
        <fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
        <fieldType name="date" class="solr.TrieDateField" omitNorms="true" precisionStep="0" positionIncrementGap="0"/>
        <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
            <analyzer type="index">
                <charFilter class="solr.HTMLStripCharFilterFactory"/>
                <charFilter class="solr.MappingCharFilterFactory" mapping="../../mapping-ISOLatin1Accent.txt"/>
                <tokenizer class="solr.StandardTokenizerFactory"/>
                <filter class="solr.StandardFilterFactory"/>
                <filter class="solr.LowerCaseFilterFactory"/>
            </analyzer>
            <analyzer type="query">
                <charFilter class="solr.HTMLStripCharFilterFactory"/>
                <charFilter class="solr.MappingCharFilterFactory" mapping="../../mapping-ISOLatin1Accent.txt"/>
                <tokenizer class="solr.StandardTokenizerFactory"/>
                <filter class="solr.StandardFilterFactory"/>
                <filter class="solr.LowerCaseFilterFactory"/>
            </analyzer>
        </fieldType>
    </types>
    <fields>
        <field name="internal_id" type="string" indexed="true" stored="true"/>
        <field name="cat" type="int" indexed="true" stored="true"/>
        <field name="desc" type="text" indexed="true" stored="true"/>
    </fields>
    <uniqueKey>internal_id</uniqueKey>
    <defaultSearchField>desc</defaultSearchField>
    <solrQueryParser defaultOperator="OR"/>
    <similarity class="org.apache.lucene.search.DefaultSimilarity"/>
</schema>

If a field is declared with stored="true", it will show up in the results by default.
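
To tie this back to the extraction handler from the question, here is a minimal sketch. It assumes a field named "content" (the name suggested in the question; it is not part of the schema above) that reuses the "text" field type defined above. Adjust both names to your own setup.

```xml
<!-- schema.xml, inside <fields>: a stored field to hold the extracted body text.
     The name "content" is an assumption taken from the question. -->
<field name="content" type="text" indexed="true" stored="true"/>

<!-- solrconfig.xml, inside <lst name="defaults"> of the /update/extract handler:
     map the extracted content to the stored field instead of "text". -->
<str name="fmap.content">content</str>
```

After re-indexing the PDF files, the extracted text should come back in the "content" field of each result. The fl request parameter (for example fl=internal_id,content) can limit which stored fields are returned. Note that the handler defaults in the question also send unknown fields to names prefixed with "ignored_" and map links to a "links" field, so the schema would likely need matching field or dynamicField definitions for those as well.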
