How to store custom token attributes in a Lucene index

Question

I want to create a Lucene analyzer for RDF nodes. An RDF node can be one of several types (URI, bnode, plain literal, plain literal with a language tag, typed literal with a datatype). While analyzing a term, I want to create an RDFNodeTypeAttribute, a LanguageAttribute and a DatatypeAttribute to store, respectively, the type of the RDF node, the language of the literal and its datatype. My question is how these attributes can be stored in the Lucene index. Do I have to write a custom Codec? Do I have to use the PayloadAttribute? Once they are stored in the index, how can I leverage these attributes in my search? Thank you for your help.

Recommended answer

I could not fully understand your requirements, but you would use a Codec if you are not happy with the way a Lucene index is encoded and decoded. A Codec gives you the flexibility to supply your own PostingsFormat, SegmentInfosFormat, LiveDocsFormat, etc. For example, the postings format of the default Lucene codec stores, for every term, all the doc IDs it occurs in, how many times it occurs in each doc, at what positions, and so on, in a particular format. If you want this information to be stored in a different format, you would need a Codec.
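To make the postings description concrete, here is a toy in-memory sketch of that kind of structure: for every term, the doc IDs it occurs in, and for each doc the positions (and therefore the frequency). It does not use Lucene and says nothing about Lucene's actual on-disk encoding; the class name `ToyPostings` and all of its methods are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical illustration only; this is not a Lucene class.
final class ToyPostings {
    // term -> (docId -> positions of the term inside that doc)
    private final Map<String, Map<Integer, List<Integer>>> index = new TreeMap<>();

    // Splits the text on whitespace and records the position of every token.
    void addDocument(int docId, String text) {
        String[] tokens = text.trim().toLowerCase(Locale.ROOT).split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            if (tokens[pos].isEmpty()) {
                continue; // happens only for an empty document
            }
            index.computeIfAbsent(tokens[pos], t -> new TreeMap<>())
                 .computeIfAbsent(docId, d -> new ArrayList<>())
                 .add(pos);
        }
    }

    // All doc IDs in which the term occurs.
    Set<Integer> docs(String term) {
        Map<Integer, List<Integer>> postings = index.get(term);
        return postings == null ? Collections.emptySet() : postings.keySet();
    }

    // Positions of the term in one doc (empty if absent).
    List<Integer> positions(String term, int docId) {
        Map<Integer, List<Integer>> postings = index.get(term);
        if (postings == null || !postings.containsKey(docId)) {
            return Collections.emptyList();
        }
        return postings.get(docId);
    }

    // How many times the term occurs in one doc.
    int frequency(String term, int docId) {
        return positions(term, docId).size();
    }

    public static void main(String[] args) {
        ToyPostings idx = new ToyPostings();
        idx.addDocument(1, "rdf node rdf");
        idx.addDocument(2, "node");
        System.out.println(idx.docs("node"));         // [1, 2]
        System.out.println(idx.frequency("rdf", 1));  // 2
        System.out.println(idx.positions("rdf", 1));  // [0, 2]
    }
}
```

A custom Codec would only be needed to change how information like this is laid out in the index, not to attach extra data to a term occurrence.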

I do not think you need to write a Codec or a PostingsFormat for this. Writing your own Analyzer and Similarity classes should probably be sufficient. If you give more information about your problem, I can think about it further.

A payload lives at the term level, and its typical use case is to store metadata for every term. Facts such as "this term is written in bold" or "this term is a noun" are metadata for the term and should be stored in a payload. Payloads are typically used when scoring documents, where they matter for giving a term some weight.
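As a rough sketch of what such per-term metadata could look like for the RDF case in the question, the helper below packs a node type plus an optional language tag or datatype URI into a byte array and reads it back. Everything here (`RdfPayload`, `NodeType`, the one-byte-plus-UTF-8 layout) is a hypothetical assumption for illustration, not something from Lucene or from the original post. In Lucene, a custom TokenFilter could wrap bytes like these in a `BytesRef` and pass them to `PayloadAttribute.setPayload`; that wiring is not shown here.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; the names and the byte layout are assumptions for illustration.
final class RdfPayload {
    // The node kinds listed in the question.
    enum NodeType { URI, BNODE, PLAIN_LITERAL, LANG_LITERAL, TYPED_LITERAL }

    // Layout: [1 byte: node type ordinal][UTF-8 bytes: language tag or datatype URI, may be empty]
    static byte[] encode(NodeType type, String qualifier) {
        byte[] q = qualifier == null
                ? new byte[0]
                : qualifier.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[1 + q.length];
        out[0] = (byte) type.ordinal();
        System.arraycopy(q, 0, out, 1, q.length);
        return out;
    }

    // Reads the node type back from the first byte.
    static NodeType decodeType(byte[] payload) {
        return NodeType.values()[payload[0]];
    }

    // Reads the language tag or datatype URI back (empty string if none was stored).
    static String decodeQualifier(byte[] payload) {
        return new String(payload, 1, payload.length - 1, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] p = encode(NodeType.LANG_LITERAL, "en");
        System.out.println(decodeType(p) + " " + decodeQualifier(p)); // LANG_LITERAL en
    }
}
```

Keeping the node type in a single leading byte keeps the payload small for URIs and bnodes, which carry no qualifier at all.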

Though RDF is metadata for a web resource, you are probably talking about indexing the RDF itself. Even if the RDF is part of a web document you are indexing, attaching the RDF info to every term in the web document would not be a viable approach, as there are better ways to assign weights to a document.
