Solr ExtractingRequestHandler extracting "rect" in links
Problem description
I am using the Solr ExtractingRequestHandler to extract and index HTML content. My issue is with the extracted links it produces: the returned content has "rect" inserted in places where it does not exist in the HTML source.
My Solr Cell configuration in solrconfig.xml is as follows:
<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="lowernames">true</str>
    <!-- capture link hrefs but ignore div attributes -->
    <str name="captureAttr">true</str>
    <str name="fmap.div">ignored_</str>
  </lst>
</requestHandler>
My Solr schema.xml has the following entries:
<field name="content_type" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="links" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="meta" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="content_encoding" type="string" indexed="false" stored="true" multiValued="false"/>
<field name="content" type="text_general" indexed="false" stored="true" multiValued="true"/>
I post the following HTML to Solr Cell:
<!DOCTYPE html>
<html>
<body>
<h1>Heading1</h1><a href="http://www.google.com">Link to Google</a><a href=
"http://www.google.com">Link to Google2</a><a href="http://www.google.com">Link to
Google3</a><a href="http://www.google.com">Link to Google</a>
<p>Paragraph1</p>
</body>
</html>
Solr indexes the following:
{
"meta": [
"Content-Encoding",
"ISO-8859-1",
"ignored_hbaseindexer_mime_type",
"text/html",
"Content-Type",
"text/html; charset=ISO-8859-1"
],
"links": [
"rect",
"http://www.google.com",
"rect",
"http://www.google.com",
"rect",
"http://www.google.com",
"rect",
"http://www.google.com"
],
"content_encoding": "ISO-8859-1",
"content_type": [
"text/html; charset=ISO-8859-1"
],
"content": [
" Heading1 Link to Google Link to Google2 Link to Google3 Link to Google Paragraph1 "
],
"id": "row69",
"_version_": 1461665607851180000
}
Notice the "rect" before every link. Why is Solr Cell or Tika inserting these? I am not defining a Tika config file to use. Do I need to configure Tika?
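To make the pattern concrete, here is a minimal Python sketch (not part of the original setup) that takes the `links` values from the indexed document above and separates the real URLs from the stray "rect" entries. The function name `split_links` and its `marker` parameter are hypothetical; this only illustrates the symptom on the client side and does not explain or fix what Solr Cell or Tika is doing.

```python
# Values copied from the "links" field of the indexed document above.
INDEXED_LINKS = [
    "rect", "http://www.google.com",
    "rect", "http://www.google.com",
    "rect", "http://www.google.com",
    "rect", "http://www.google.com",
]


def split_links(values, marker="rect"):
    """Return (links, stray): real link values and the unexpected marker entries."""
    links = [value for value in values if value != marker]
    stray = [value for value in values if value == marker]
    return links, stray


if __name__ == "__main__":
    links, stray = split_links(INDEXED_LINKS)
    print(len(links), "links,", len(stray), "stray entries")
```

Every stray value is exactly the string "rect", and there is one per anchor tag in the posted HTML.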
Recommended answer
Although this is an old question, I encountered the same issue while indexing HTML documents with Solr 8.7.0.
<requestHandler name="/update/extract"
class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
....
HTML:
<p>My website is <a href="https://buriedtruth.com/">BuriedTruth.com</a>.</p>
Result:
My website is rect https://buriedtruth.com/ BuriedTruth.com .
[I am posting/indexing from the Linux command line: solr restart; sleep 1; post -c gettingstarted /mnt/Vancouver/programming/datasci/solr/test/solr_test9.html]
I grepped the Solr source for that word (ripgrep: rg --color=always -w -e 'rect' . | less) but found nothing, so the source of "rect http..." in indexed URLs eludes me.
My solution was to add a regex processor to my solrconfig.xml:
<updateRequestProcessorChain name="add-unknown-fields-to-the-schema" default="${update.autoCreateFields:true}"
processor="uuid,remove-blank,field-name-mutating,parse-boolean,parse-long,parse-double,parse-date,add-schema-fields">
<processor class="solr.LogUpdateProcessorFactory"/>
<processor class="solr.DistributedUpdateProcessorFactory"/>
<!-- ======================================== -->
<!-- https://lucene.apache.org/solr/7_4_0/solr-core/org/apache/solr/update/processor/RegexReplaceProcessorFactory.html -->
<!-- Solr bug? URLs parse as "rect https..." Managed-schema (Admin UI): defined p as text_general -->
<!-- but did not parse. Looking at content | title: text_general copied to string, so added -->
<!-- copyfield of p (text_general) as p_str ... regex below now works! -->
<processor class="solr.RegexReplaceProcessorFactory">
<str name="fieldName">content</str>
<str name="fieldName">title</str>
<str name="fieldName">p</str>
<!-- The pattern is case-sensitive. Only one pattern/replacement pair is allowed per processor, -->
<!-- so add as many copies of this processor as needed. -->
<str name="pattern">rect http</str>
<str name="replacement">http</str>
<bool name="literalReplacement">true</bool>
</processor>
<!-- ======================================== -->
<!-- This needs to be last (may need to clear documents and reindex to see changes, e.g. Solr Admin UI): -->
<processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
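The effect of that processor on a field value can be sketched outside Solr. The following Python snippet is only an approximation, assuming the processor applies the pattern as a regular expression to the whole field value and inserts the replacement verbatim; the function name `strip_rect` is hypothetical and nothing here calls Solr.

```python
import re

# Pattern and replacement taken from the <processor> configuration above.
PATTERN = re.compile(r"rect http")  # case-sensitive, like the configured pattern
REPLACEMENT = "http"


def strip_rect(value: str) -> str:
    """Replace every 'rect http' in value with 'http'."""
    # A callable replacement stops Python from interpreting backslashes or
    # group references, roughly mirroring literalReplacement=true.
    return PATTERN.sub(lambda _match: REPLACEMENT, value)


if __name__ == "__main__":
    before = "My website is rect https://buriedtruth.com/ BuriedTruth.com ."
    print(strip_rect(before))
```

Note that the pattern has no word boundary, so text such as "direct http://..." would also be rewritten. If that matters for your content, a stricter pattern such as `\brect http` may be worth testing.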
As alluded to in my comments in that processor, I am extracting <p />-formatted HTML content to a p field (field: p | type: text_general).
At first, the RegexReplaceProcessorFactory processor had no effect on that content.
In the Solr Admin UI I noticed that title and content were copied to string fields (e.g. field: content | type: text_general | copied to: content_str), so I added a copy field (p >> p_str), which resolved the regex issue.
For completeness, here are the relevant parts of my solrconfig.xml related to HTML document indexing:
<lib dir="${solr.install.dir:../../..}/contrib/extraction/lib" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../..}/dist/" regex="solr-cell-\d.*\.jar" />
<!-- https://lucene.472066.n3.nabble.com/Prons-an-Cons-of-Startup-Lazy-a-Handler-td4059111.html -->
<!-- startup="lazy" -->
<requestHandler name="/update/extract"
class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
<lst name="defaults">
<str name="lowernames">true</str>
<str name="uprefix">ignored_</str>
<str name="capture">div</str>
<str name="fmap.div">div</str>
<str name="capture">p</str>
<str name="fmap.p">p</str>
</lst>
</requestHandler>
... noting again that I added the fields to the managed-schema via the Solr Admin UI.
Result:
My website is https://buriedtruth.com/ BuriedTruth.com .
<field name="p" type="text_general" uninvertible="true" indexed="true" stored="true"/>
<copyField source="p" dest="p_str"/>
See also:
- re: <requestHandler name="/update/extract" ... :
My answer here (which deals with peculiarities associated with the <updateRequestProcessorChain />, above) when switching from Solr's managed-schema to the classic schema.xml