Adding a multi-valued string field to a Lucene Document, do commas matter?


Question

I'm building a Lucene index and adding documents.

I have a field that is multi-valued; for this example I'll use Categories.

An item can have many categories. For example, Jeans can fall under Clothing, Pants, Men's, Women's, etc.

When adding the field to a document, do commas make a difference? Will Lucene simply ignore them? If I change the commas to spaces, will there be a difference? Does this automatically make the field multi-valued?

String categoriesForItem = getCategories(); // returns "category1, category2, cat3" from a DB call
categoriesForItem = categoriesForItem.replaceAll(",", " ").trim(); // not sure whether to remove the commas
doc.add(new StringField("categories", categoriesForItem, Field.Store.YES)); // doc is a Document

Am I doing this correctly? Or is there another way to create multi-valued fields?

Thanks for any help/suggestions.

Recommended answer

This would be a better way to index multi-valued fields per document:

String categoriesForItem = getCategories(); // returns "category1, category2, cat3" from a DB call
String[] categories = categoriesForItem.split(",");
for (String cat : categories) {
    // trim() drops the space left after each comma, so the indexed term is "category2", not " category2"
    doc.add(new StringField("categories", cat.trim(), Field.Store.YES)); // doc is a Document
}
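Since a StringField indexes its whole value as a single term, any whitespace left over from splitting on commas ends up inside the term. The following is a minimal plain-Java sketch of the split-and-trim step only, with no Lucene dependency. The class name `CategorySplitter` and the method `splitCategories` are hypothetical names introduced for illustration; each returned value would then be added as its own field, as in the loop above.

```java
import java.util.ArrayList;
import java.util.List;

public class CategorySplitter {

    // Splits a comma-separated string into clean values:
    // surrounding whitespace is trimmed and empty entries are skipped.
    static List<String> splitCategories(String raw) {
        List<String> result = new ArrayList<>();
        for (String part : raw.split(",")) {
            String value = part.trim();
            if (!value.isEmpty()) {
                result.add(value);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [category1, category2, cat3]
        System.out.println(splitCategories("category1, category2, cat3"));
    }
}
```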

Whenever multiple fields with the same name appear in one document, both the inverted index and term vectors logically append the tokens of each field to one another, in the order the fields were added.

Also, during analysis, consecutive values of the same field can be separated by a position increment gap. The analyzer controls this gap; in Lucene it is read through Analyzer.getPositionIncrementGap(String fieldName), whose default implementation returns 0, so it has to be set to a positive value to take effect. Let me explain why this is needed.

Your field "categories" in document D1 has two values: "foo bar" and "foo baz". Now, if you were to run the phrase query "bar foo", D1 should not come up. This is ensured by adding an extra increment between two values of the same field.

If you concatenate the field values yourself and rely on the analyzer to split them into multiple values, "bar foo" would return D1, which would be incorrect.
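The effect of the gap can be illustrated without Lucene. The sketch below is a simplified model and not Lucene's implementation: it assigns positions to whitespace-separated tokens across the values of a field, adds a configurable gap between values, and treats a phrase as matching only when its terms sit at consecutive positions. The class `PositionGapDemo`, its methods, and the gap value of 100 are all assumptions chosen for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PositionGapDemo {

    // Assigns a position to every token; 'gap' extra positions are
    // skipped between consecutive values of the same field.
    static Map<Integer, String> positions(List<String> values, int gap) {
        Map<Integer, String> byPosition = new TreeMap<>();
        int next = 0;
        for (int i = 0; i < values.size(); i++) {
            if (i > 0) {
                next += gap;
            }
            for (String term : values.get(i).trim().split("\\s+")) {
                byPosition.put(next++, term);
            }
        }
        return byPosition;
    }

    // Exact phrase match: the phrase terms must occupy consecutive positions.
    static boolean matchesPhrase(Map<Integer, String> byPosition, String phrase) {
        String[] terms = phrase.trim().split("\\s+");
        for (Integer start : byPosition.keySet()) {
            boolean all = true;
            for (int i = 0; i < terms.length; i++) {
                if (!terms[i].equals(byPosition.get(start + i))) {
                    all = false;
                    break;
                }
            }
            if (all) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> values = List.of("foo bar", "foo baz");
        // No gap: "bar" and "foo" are adjacent across the value boundary.
        System.out.println(matchesPhrase(positions(values, 0), "bar foo"));   // true
        // With a gap: the boundary is no longer adjacent.
        System.out.println(matchesPhrase(positions(values, 100), "bar foo")); // false
        // Phrases inside a single value still match.
        System.out.println(matchesPhrase(positions(values, 100), "foo baz")); // true
    }
}
```

With a gap of 0 the model behaves like the concatenated string "foo bar foo baz", which is why the cross-value phrase matches in that case.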
