How to index HTML content while keeping positions (XPath, CSS selector, etc.)


Question

I want to create a full-text search index for HTML content (more specifically, EPUB chapters in XHTML format). For example:

...
<p>Lorem ipsum <b>dolor</b> sit amet, consectetur adipiscing elit.</p>
...

The problem is that I need the position of the matched text (such as an XPath) along with the search results, because I have to position the reader software at the right place. I need something like a highlighting feature, but instead of highlighted text it should return the where-to-highlight position of each match. So if I search for "dolor", it should return something like this:

matches:[
...
  {"match": "dolor", "xpath": "//*[@id='lipsum']/p[1]/b"}
...
]

The standard scenario (which I found everywhere), stripping the HTML with a filter, then tokenizing, and so on, does not apply here, because it loses the position information in the first step.

Any suggestions? Is that even possible with Solr or Elasticsearch? Thanks!

Recommended answer

Your question is about getting XPaths as the result of highlighting in an XHTML document.

I am not aware of a ready-made solution in Solr or Elasticsearch. There is something very similar in the eXtensible Text Framework (XTF), which is built on (an old version of) Lucene. In XTF you can get the highlighting as tags in the original XML file, so it should be easy to write an XSL transformation that generates the corresponding XPaths.
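To illustrate the last step, here is a minimal sketch in Python (used here in place of XSLT) that turns highlight tags into XPaths. The element name `hit`, the helper names and the sample markup are assumptions made for the example; the tags your highlighter actually emits may be named differently. It also assumes that a hit wraps only text and is never the root element.

```python
import xml.etree.ElementTree as ET


def positional_xpath(elem, parents):
    """Build an absolute positional XPath such as /div[1]/p[1]/b[1]."""
    steps = []
    while elem is not None:
        parent = parents.get(elem)
        if parent is None:
            index = 1  # the root element has no siblings
        else:
            # XPath positions are 1-based and counted among same-tag siblings.
            same_tag = [child for child in parent if child.tag == elem.tag]
            index = same_tag.index(elem) + 1
        steps.append(f"{elem.tag}[{index}]")
        elem = parent
    return "/" + "/".join(reversed(steps))


def hit_xpaths(highlighted_xml, hit_tag="hit"):
    """Return the matched text and the XPath of the element enclosing each hit.

    hit_tag is a hypothetical name for the highlight element.
    Real XHTML is namespaced, so tags would then look like
    '{http://www.w3.org/1999/xhtml}p' and need extra handling.
    """
    root = ET.fromstring(highlighted_xml)
    # ElementTree has no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    return [
        {
            "match": "".join(hit.itertext()),
            # Use the parent's path: the hit tag is absent from the original file.
            "xpath": positional_xpath(parents[hit], parents),
        }
        for hit in root.iter(hit_tag)
    ]


sample = '<div id="lipsum"><p>Lorem ipsum <b><hit>dolor</hit></b> sit amet.</p></div>'
print(hit_xpaths(sample))
```

Because the hit elements carry their own tag name, they do not shift the positional indices of the surrounding elements, so the resulting paths remain valid in the original document without highlighting.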

In short, the main idea would be to split the EPUB book into overlapping chunks and store the XML structure as special characters in the indexed and stored field. With the highlighting information you can then reconstruct the original XML structure to find your XPaths.
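The mapping from match offsets back to structure can be sketched as follows. This is a simplified variant, not the scheme described above: instead of encoding the structure as special characters inside the field, it keeps a side table of character offsets next to the plain text. A plain substring search stands in for the search engine, and all function names are hypothetical. A match that spans several elements is reported at the element where it starts.

```python
import bisect
import xml.etree.ElementTree as ET


def flatten_with_offsets(xml_text):
    """Return (plain_text, starts, xpaths).

    starts[i] is the character offset in plain_text where the text run
    owned by the element at xpaths[i] begins.
    """
    root = ET.fromstring(xml_text)
    parts, starts, xpaths = [], [], []
    pos = 0

    def add(text, xpath):
        nonlocal pos
        if text:
            starts.append(pos)
            xpaths.append(xpath)
            parts.append(text)
            pos += len(text)

    def walk(elem, xpath):
        add(elem.text, xpath)
        counts = {}
        for child in elem:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            walk(child, f"{xpath}/{child.tag}[{counts[child.tag]}]")
            add(child.tail, xpath)  # tail text belongs to the parent element

    walk(root, f"/{root.tag}[1]")
    return "".join(parts), starts, xpaths


def find_matches(xml_text, term):
    """Locate a non-empty term and report the XPath of each occurrence."""
    text, starts, xpaths = flatten_with_offsets(xml_text)
    matches = []
    offset = text.find(term)
    while offset != -1:
        # Last text run that starts at or before the match offset.
        run = bisect.bisect_right(starts, offset) - 1
        matches.append(
            {"match": term, "xpath": xpaths[run], "offset": offset - starts[run]}
        )
        offset = text.find(term, offset + 1)
    return matches


sample = (
    '<div id="lipsum"><p>Lorem ipsum <b>dolor</b> sit amet, '
    "consectetur adipiscing elit.</p></div>"
)
print(find_matches(sample, "dolor"))
```

The `offset` value is the position of the match inside the text run that belongs to the reported element, which a reader application could use to place the highlight within that element.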
