Can Solr retain the formatting of the HTML documents that were fed to it in its results?

Question

How do I maintain the original formatting of an HTML document in the results returned by Solr?

I am trying to provide search functionality on one of my company's websites. The site has millions of documents, and they do not share a common format, so it is hard to format each document individually.

I am using a Solr 4.1 nightly build from the Apache site, which has built-in support for Solr Cell and Tika, i.e. I do not need to configure them separately.

Do Solr Cell or Tika retain this formatting anywhere?

If the formatting is not retained, I will need to fetch each document from its physical file location using Solr's resourcename field and then apply highlighting and the other ready-made Solr functionality myself, but that process is too tedious.

If I have to use HTMLStripCharFilterFactory, as Jayendra suggests in the answer, what can I use as the request handler? Can I still extract metadata tags in that case?

Can someone guide me?

Thank you for all your support!

Recommended answer

Solr Cell with Tika does not maintain the original formatting of the document.
You get only the text extracted from the documents fed to Solr through Tika.

Otherwise, you have to feed the HTML document in as a normal Solr field and apply the HTMLStripCharFilterFactory char filter to maintain both copies.

With stored=true, Solr keeps the original document, HTML markup included, in the field.
With indexed=true, however, search runs only against the text content and not against the HTML elements.
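
As a rough illustration of the setup described above, a schema.xml fragment might look like the following. This is only a sketch: the field type name text_html and the field name html_content are made up for this example and do not come from the question, and the tokenizer and lowercase filter are just one common choice of analysis chain.

```xml
<!-- Hypothetical field type: strips HTML markup before tokenizing,
     so only the text content ends up in the index. -->
<fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Removes tags and decodes entities at analysis time only -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field: stored="true" keeps the raw HTML exactly as submitted,
     indexed="true" makes the stripped text searchable. -->
<field name="html_content" type="text_html" indexed="true" stored="true"/>
```

Because char filters run only during analysis, the stored value returned in search results should still be the original HTML, while queries match against the stripped text.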
