Solr ExtractingRequestHandler extracting "rect" in links


Question

I am utilizing the solr ExtractingRequestHandler to extract and index HTML content. My issue concerns the extracted links section that it produces: the returned content has "rect" values inserted that do not exist in the HTML source.

I have my solrconfig cell configuration as follows:

  <requestHandler name="/upate/extract" 
              startup="lazy"
              class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
  <str name="lowernames">true</str>
  <!-- capture link hrefs but ignore div attributes -->
  <str name="captureAttr">true</str>
  <str name="fmap.div">ignored_</str>
</lst>

And my solr schema.xml has the following entries:

   <field name="content_type" type="string" indexed="true" stored="true" multiValued="true"/>
   <field name="links" type="string" indexed="true" stored="true" multiValued="true"/>
   <field name="meta" type="string" indexed="true" stored="true" multiValued="true"/>
   <field name="content_encoding" type="string" indexed="false" stored="true" multiValued="false"/>
   <field name="content" type="text_general" indexed="false" stored="true" multiValued="true"/>

I post the following HTML to Solr Cell:

<!DOCTYPE html>
<html>
<body>
  <h1>Heading1</h1><a href="http://www.google.com">Link to Google</a><a href=
  "http://www.google.com">Link to Google2</a><a href="http://www.google.com">Link to
  Google3</a><a href="http://www.google.com">Link to Google</a>

  <p>Paragraph1</p>
</body>
</html>
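For reference, a document like this can be pushed to the extract handler with a plain curl file upload, roughly as follows (the collection name and file name are placeholders; literal.id supplies the document id):

  curl "http://localhost:8983/solr/mycollection/update/extract?literal.id=row69&commit=true" \
       -F "myfile=@test.html"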

Solr has the following indexed:

  {
    "meta": [
      "Content-Encoding",
      "ISO-8859-1",
      "ignored_hbaseindexer_mime_type",
      "text/html",
      "Content-Type",
      "text/html; charset=ISO-8859-1"
    ],
    "links": [
      "rect",
      "http://www.google.com",
      "rect",
      "http://www.google.com",
      "rect",
      "http://www.google.com",
      "rect",
      "http://www.google.com"
    ],
    "content_encoding": "ISO-8859-1",
    "content_type": [
      "text/html; charset=ISO-8859-1"
    ],
    "content": [
      "             Heading1  Link to Google  Link to Google2  Link to Google3  Link to Google  Paragraph1   "
    ],
    "id": "row69",
    "_version_": 1461665607851180000
  }

Notice the "rect" between every link. Why is Solr Cell or Tika inserting these? I am not defining a Tika config file to use. Do I need to configure Tika?

Answer

Although this is an old question, I also encountered this issue while indexing HTML documents via Solr 8.7.0.

<requestHandler name="/update/extract" 
    class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
    ....

HTML:

<p>My website is <a href="https://buriedtruth.com/">BuriedTruth.com</a>.</p>

Result:

My website is rect https://buriedtruth.com/ BuriedTruth.com .

[I am posting/indexing at the Linux command line: solr restart; sleep 1; post -c gettingstarted /mnt/Vancouver/programming/datasci/solr/test/solr_test9.html;]

I grepped (ripgrep: rg --color=always -w -e 'rect' . | less) the Solr code for that word, but found nothing, so the source of rect http... in indexed URLs eludes me.

My solution was to add a regex processor to my solrconfig.xml:

  <updateRequestProcessorChain name="add-unknown-fields-to-the-schema" default="${update.autoCreateFields:true}"
           processor="uuid,remove-blank,field-name-mutating,parse-boolean,parse-long,parse-double,parse-date,add-schema-fields">
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.DistributedUpdateProcessorFactory"/>
    <!-- ======================================== -->
    <!-- https://lucene.apache.org/solr/7_4_0/solr-core/org/apache/solr/update/processor/RegexReplaceProcessorFactory.html -->
    <!-- Solr bug? URLs parse as "rect https..."  Managed-schema (Admin UI): defined p as text_general -->
    <!-- but did not parse. Looking at content | title: text_general copied to string, so added  -->
    <!-- copyfield of p (text_general) as p_str ... regex below now works! -->
    <processor class="solr.RegexReplaceProcessorFactory">
      <str name="fieldName">content</str>
      <str name="fieldName">title</str>
      <str name="fieldName">p</str>
      <!-- Case-sensitive (and only one pattern:replacement allowed, so use as many copies): -->
      <!-- of this processor as needed: -->
      <str name="pattern">rect http</str>
      <str name="replacement">http</str>
      <bool name="literalReplacement">true</bool>
    </processor>
    <!-- ======================================== -->
    <!-- This needs to be last (may need to clear documents and reindex to see changes, e.g. Solr Admin UI): -->
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>
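With that chain in place, clearing and re-indexing the documents and then querying the affected fields should show the values without the rect prefix; a minimal check along these lines (collection name is a placeholder):

  curl "http://localhost:8983/solr/mycollection/select?q=*:*&fl=id,title,content,p"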

As alluded to in my comments in that processor, I am extracting <p />-formatted HTML content to a p field (field: p | type: text_general).

That content did not parse with the RegexReplaceProcessorFactory processor.

In the Solr Admin UI I noted that title and content were copied as strings (e.g.: field: content | type: text_general | copied to: content_str), so I made a copy field (p >> p_str) that resolved the regex issue.

For completeness, here are the relevant parts of my solrconfig.xml related to HTML document indexing,

  <lib dir="${solr.install.dir:../../..}/contrib/extraction/lib" regex=".*\.jar" />
  <lib dir="${solr.install.dir:../../..}/dist/" regex="solr-cell-\d.*\.jar" />

  <!-- https://lucene.472066.n3.nabble.com/Prons-an-Cons-of-Startup-Lazy-a-Handler-td4059111.html -->
                  <!-- startup="lazy" -->

  <requestHandler name="/update/extract"
                  class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
    <lst name="defaults">
      <str name="lowernames">true</str>
      <str name="uprefix">ignored_</str>
      <str name="capture">div</str>
      <str name="fmap.div">div</str>
      <str name="capture">p</str>
      <str name="fmap.p">p</str>
    </lst>
  </requestHandler>

... noting again that I added fields to the managed-schema via the Solr Admin UI.

Result:

My website is https://buriedtruth.com/ BuriedTruth.com .


  <field name="p" type="text_general" uninvertible="true" indexed="true" stored="true"/>
  <copyField source="p" dest="p_str"/>

See also:

  • re: <requestHandler name="/update/extract" ... :

My answer here (which deals with peculiarities associated with the <updateRequestProcessorChain />, above) when switching from Solr's managed-schema to the classic schema.xml.
