Solr ExtractingRequestHandler extracting "rect" in links


Question


I am using the Solr ExtractingRequestHandler to extract and index HTML content. My issue concerns the extracted links field it produces: the returned content has "rect" inserted where it does not exist in the HTML source.


I have my Solr Cell configuration in solrconfig.xml as follows:

  <requestHandler name="/upate/extract" 
              startup="lazy"
              class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
  <str name="lowernames">true</str>
  <!-- capture link hrefs but ignore div attributes -->
  <str name="captureAttr">true</str>
  <str name="fmap.div">ignored_</str>
</lst>


And my Solr schema.xml has the following entries:

   <field name="content_type" type="string" indexed="true" stored="true" multiValued="true"/>
   <field name="links" type="string" indexed="true" stored="true" multiValued="true"/>
   <field name="meta" type="string" indexed="true" stored="true" multiValued="true"/>
   <field name="content_encoding" type="string" indexed="false" stored="true" multiValued="false"/>
   <field name="content" type="text_general" indexed="false" stored="true" multiValued="true"/>


I post the following HTML to Solr Cell:

<!DOCTYPE html>
<html>
<body>
  <h1>Heading1</h1><a href="http://www.google.com">Link to Google</a><a href=
  "http://www.google.com">Link to Google2</a><a href="http://www.google.com">Link to
  Google3</a><a href="http://www.google.com">Link to Google</a>

  <p>Paragraph1</p>
</body>
</html>
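
For reference, here is a minimal sketch of sending a document like this to the extract handler over HTTP, using Python and the requests library; the Solr URL, core name (mycore), and document id are placeholders rather than my actual setup:

import requests

# Placeholder Solr URL / core name; adjust to your own installation.
URL = "http://localhost:8983/solr/mycore/update/extract"

HTML = b"""<!DOCTYPE html>
<html><body>
  <h1>Heading1</h1><a href="http://www.google.com">Link to Google</a>
  <p>Paragraph1</p>
</body></html>"""

# literal.id supplies the document id; commit=true makes the document searchable immediately.
resp = requests.post(
    URL,
    params={"literal.id": "row69", "commit": "true"},
    files={"file": ("test.html", HTML, "text/html")},
)
resp.raise_for_status()
print(resp.text)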


Solr has the following indexed:

  {
    "meta": [
      "Content-Encoding",
      "ISO-8859-1",
      "ignored_hbaseindexer_mime_type",
      "text/html",
      "Content-Type",
      "text/html; charset=ISO-8859-1"
    ],
    "links": [
      "rect",
      "http://www.google.com",
      "rect",
      "http://www.google.com",
      "rect",
      "http://www.google.com",
      "rect",
      "http://www.google.com"
    ],
    "content_encoding": "ISO-8859-1",
    "content_type": [
      "text/html; charset=ISO-8859-1"
    ],
    "content": [
      "             Heading1  Link to Google  Link to Google2  Link to Google3  Link to Google  Paragraph1   "
    ],
    "id": "row69",
    "_version_": 1461665607851180000
  }


Notice the "rect" between every link. Why is solr cell or tika inserting these? I am not defining a tika config file to use. Do i need to configure tika?

Answer


Although this is an old question, I also encountered this issue while indexing HTML documents via Solr 8.7.0, using:

<requestHandler name="/update/extract" 
    class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
    ....

HTML:

<p>My website is <a href="https://buriedtruth.com/">BuriedTruth.com</a>.</p>

Result:

My website is rect https://buriedtruth.com/ BuriedTruth.com .

[ I post/index at the Linux command line: solr restart; sleep 1; post -c gettingstarted /mnt/Vancouver/programming/datasci/solr/test/solr_test9.html ]


I grepped (ripgrep: rg --color=always -w -e 'rect' . |less) the Solr code for that word, but found nothing, so the source of rect http... in indexed URLs eludes me.


My solution was to add a regex processor to my solrconfig.xml:

  <updateRequestProcessorChain name="add-unknown-fields-to-the-schema" default="${update.autoCreateFields:true}"
           processor="uuid,remove-blank,field-name-mutating,parse-boolean,parse-long,parse-double,parse-date,add-schema-fields">
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.DistributedUpdateProcessorFactory"/>
    <!-- ======================================== -->
    <!-- https://lucene.apache.org/solr/7_4_0/solr-core/org/apache/solr/update/processor/RegexReplaceProcessorFactory.html -->
    <!-- Solr bug? URLs parse as "rect https..."  Managed-schema (Admin UI): defined p as text_general -->
    <!-- but did not parse. Looking at content | title: text_general copied to string, so added  -->
    <!-- copyfield of p (text_general) as p_str ... regex below now works! -->
    <processor class="solr.RegexReplaceProcessorFactory">
      <str name="fieldName">content</str>
      <str name="fieldName">title</str>
      <str name="fieldName">p</str>
      <!-- Case-sensitive (and only one pattern:replacement allowed, so add as many copies -->
      <!-- of this processor as needed): -->
      <str name="pattern">rect http</str>
      <str name="replacement">http</str>
      <bool name="literalReplacement">true</bool>
    </processor>
    <!-- ======================================== -->
    <!-- This needs to be last (may need to clear documents and reindex to see changes, e.g. Solr Admin UI): -->
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>
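
As a quick sanity check of what this processor does (a sketch only, not Solr code): pattern is an ordinary regular expression applied to the values of each listed field, and with literalReplacement set to true the replacement string is used verbatim rather than interpreted for group references. The client-side equivalent in Python:

import re

# A stored value as produced by Solr Cell / Tika before the update chain runs:
raw = "My website is rect https://buriedtruth.com/ BuriedTruth.com ."

# Equivalent of pattern "rect http" -> replacement "http" (literal replacement):
print(re.sub("rect http", "http", raw))
# -> My website is https://buriedtruth.com/ BuriedTruth.com .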


As alluded to in my comments in that processor, I am extracting <p>-formatted HTML content to a p field (field: p | type: text_general).


That content, however, was not being processed by the RegexReplaceProcessorFactory processor.


In the Solr Admin UI I noted that title and content were copied as strings (e.g. field: content | type: text_general | copied to: content_str), so I made a copy field (p >> p_str), which resolved the regex issue.


For completeness, here are the relevant parts of my solrconfig.xml related to HTML document indexing:

  <lib dir="${solr.install.dir:../../..}/contrib/extraction/lib" regex=".*\.jar" />
  <lib dir="${solr.install.dir:../../..}/dist/" regex="solr-cell-\d.*\.jar" />

  <!-- https://lucene.472066.n3.nabble.com/Prons-an-Cons-of-Startup-Lazy-a-Handler-td4059111.html -->
  <!-- startup="lazy" -->

  <requestHandler name="/update/extract"
                  class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
    <lst name="defaults">
      <str name="lowernames">true</str>
      <str name="uprefix">ignored_</str>
      <str name="capture">div</str>
      <str name="fmap.div">div</str>
      <str name="capture">p</str>
      <str name="fmap.p">p</str>
    </lst>
  </requestHandler>


... noting again that I added fields to the managed-schema via the Solr Admin UI.

Result:

My website is https://buriedtruth.com/ BuriedTruth.com .


  <field name="p" type="text_general" uninvertible="true" indexed="true" stored="true"/>
  <copyField source="p" dest="p_str"/>
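
Those field additions can also be made from the command line through the Schema API rather than the Admin UI; a sketch, with mycore again a placeholder core name:

import requests

# The Schema API lives at /solr/<core>/schema; "mycore" is a placeholder.
URL = "http://localhost:8983/solr/mycore/schema"

payload = {
    "add-field": {"name": "p", "type": "text_general",
                  "indexed": True, "stored": True},
    "add-copy-field": {"source": "p", "dest": "p_str"},
}

resp = requests.post(URL, json=payload)
resp.raise_for_status()
print(resp.json())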

See also:

  • re: configuring the ExtractingRequestHandler in solrconfig.xml: https://lucene.apache.org/solr/guide/8_6/uploading-data-with-solr-cell-using-apache-tika.html#configuring-the-extractingrequesthandler-in-solrconfig-xml


  • My answer here (which deals with peculiarities associated with the <updateRequestProcessorChain /> above) when switching from Solr's managed-schema to the classic schema.xml.
