How to index HTML content while keeping positions (as XPath, CSS selector, etc.)

Views: 143 Published: 2017/8/7 3:36:24 elasticsearch solr lucene


Problem description

I want to create a full-text search index for HTML content (more specifically, EPUB chapters in XHTML format). Like this:

...
<p>Lorem ipsum <b>dolor</b> sit amet, consectetur adipiscing elit.</p>
...

The problem is that I need the position of the matched text (such as an XPath) returned along with the search results, because I have to position the reader software at the right place. I need something like the highlight feature, but instead of returning highlighted text it should return the where-to-highlight position of each match. So if I search for "dolor", it should give back something like this:

matches:[
...
  {"match":"dolor", "xpath":"//*[@id='lipsum']/p[1]/b"}
...
]
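To make the desired result shape concrete, here is a minimal sketch that walks a parsed chapter and emits one record per element whose own text contains the search term. It is only an illustration, not a Solr or Elasticsearch feature: the wrapping `div` with `id="lipsum"`, the helper name `find_matches`, and the namespace-free markup are assumptions made for the example (real EPUB XHTML carries an XML namespace, which would show up in the tag names). It also builds absolute positional paths such as `/div[1]/p[1]/b[1]` rather than the id-based form shown above.

```python
import xml.etree.ElementTree as ET

# Assumed sample: the question's paragraph wrapped in a hypothetical div#lipsum.
XHTML = (
    '<div id="lipsum">'
    '<p>Lorem ipsum <b>dolor</b> sit amet, consectetur adipiscing elit.</p>'
    '</div>'
)

def find_matches(root, term):
    """Return one {"match", "xpath"} record per element whose own text contains term."""
    matches = []

    def walk(elem, path):
        # elem.text is the text directly inside elem, before its first child.
        own_text = [elem.text or ""]
        # child.tail is text that follows a child but still belongs to elem.
        own_text += [child.tail or "" for child in elem]
        if any(term in chunk for chunk in own_text):
            matches.append({"match": term, "xpath": path})
        # Count same-tag siblings to build 1-based positional steps like p[1].
        seen = {}
        for child in elem:
            seen[child.tag] = seen.get(child.tag, 0) + 1
            walk(child, f"{path}/{child.tag}[{seen[child.tag]}]")

    walk(root, f"/{root.tag}[1]")
    return matches

root = ET.fromstring(XHTML)
matches = find_matches(root, "dolor")
print(matches)
```

This does a plain substring scan per element, so it has none of the analysis (stemming, tokenization, ranking) that a real search index provides; it only shows how a match can be tied to a path.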

The standard approach (which I found everywhere), i.e. stripping the HTML markup with a filter, then tokenizing, and so on, does not apply here, because it loses the position information in the very first step.
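One way to picture the missing piece is a side table built while the markup is stripped: the plain text goes to the indexer, and a list of character offsets records which element each chunk of text came from. The sketch below is an assumption-laden illustration of that idea, not something Solr or Elasticsearch does out of the box; `flatten_with_positions` and `xpath_at` are hypothetical helper names, the sample markup is namespace-free, and it presumes the search side can report character offsets of matches in the stripped text.

```python
import bisect
import xml.etree.ElementTree as ET

def flatten_with_positions(root):
    """Return (plain_text, starts, paths): chunk i starts at starts[i] and came from paths[i]."""
    parts, starts, paths = [], [], []
    offset = 0

    def emit(text, path):
        nonlocal offset
        if text:
            parts.append(text)
            starts.append(offset)
            paths.append(path)
            offset += len(text)

    def walk(elem, path):
        emit(elem.text, path)
        seen = {}
        for child in elem:
            seen[child.tag] = seen.get(child.tag, 0) + 1
            walk(child, f"{path}/{child.tag}[{seen[child.tag]}]")
            emit(child.tail, path)  # tail text belongs to the parent element

    walk(root, f"/{root.tag}[1]")
    return "".join(parts), starts, paths

def xpath_at(offset, starts, paths):
    """Map a character offset in the plain text back to its source element."""
    return paths[bisect.bisect_right(starts, offset) - 1]

# Assumed sample: a shortened version of the question's paragraph.
root = ET.fromstring('<div id="lipsum"><p>Lorem ipsum <b>dolor</b> sit amet.</p></div>')
text, starts, paths = flatten_with_positions(root)
hit = text.find("dolor")  # stand-in for an offset reported by the search engine
location = xpath_at(hit, starts, paths)
print(text, hit, location)
```

The side table would have to be stored next to each indexed chapter (or rebuilt from the XHTML at query time), which is the trade-off of keeping the index itself unaware of the markup.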

Any suggestions? Is this even possible with Solr or Elasticsearch? Thanks!
