Can Solr retain the formatting of the HTML documents fed to it in its results?


Question

How do I preserve the original formatting of an HTML document in the results returned by Solr?

I am trying to add search functionality to one of my company's websites. It holds millions of documents that do not share a common format, so formatting each document individually is impractical.

I am using a Solr 4.1 nightly build from the Apache site, which has built-in support for Solr Cell and Tika, so I do not need to configure them separately.

Does Solr Cell or Tika retain this formatting anywhere?

If the formatting is not retained, I will need to fetch each document from its physical file location using Solr's resourcename field and then apply highlighting and Solr's other ready-made functionality myself, but that process is too tedious.

What can I use as a request handler if I have to use "HTMLStripCharFilterFactory", as suggested by Jayendra in the answer? Also, can I still extract metadata tags in that case?

Can anyone guide me on this?

Thank you for all your support!

Recommended answer

Solr Cell with Tika does not preserve the original formatting of the document.
You get only the extracted text of the documents fed to Solr through Tika.

Otherwise, you have to feed the HTML document to Solr as a normal field and apply the HTMLStripCharFilterFactory char filter, so that both copies are kept: the original markup and the stripped text.

With stored=true, Solr keeps the original document, HTML markup included, and returns it in results.
For searching (indexed=true), however, only the text content is matched, not the HTML elements.
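The approach described in this answer might look roughly like the following schema.xml fragment. This is a minimal sketch, not a tested configuration: the field type name text_html and the field name html_content are hypothetical, and the tokenizer and lowercase filter are assumed choices that the answer does not specify.

```xml
<!-- Hypothetical field type: strips HTML markup at analysis time only -->
<fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Removes tags before tokenizing, so only text content is indexed -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <!-- Assumed tokenizer and filter; pick whatever suits your content -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field: stored="true" keeps the raw HTML as submitted,
     indexed="true" makes the stripped text searchable -->
<field name="html_content" type="text_html" indexed="true" stored="true"/>
```

The analysis chain affects only the indexed terms. The stored value returned in search results is the original HTML string exactly as it was sent to Solr.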
