Impact of repeat value across multiple fields in Lucene


Question

What would be the impact of indexing the same value into multiple fields of a Lucene index?

The idea is that someone's first name is part of both their name and their general details, so I would want to index that value into multiple fields. For example, I might index Ted Bloggs as follows:

Field        |    Value
-------------|---------
firstName    | Ted
lastName     | Bloggs
name         | Ted
name         | Bloggs
general      | Ted
general      | Bloggs
all          | Ted
all          | Bloggs

By doing this I can easily form categories of fields. However, I'm worried it may have an adverse impact on performance and/or disk usage.
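The expansion in the table can be sketched in plain Java without touching Lucene at all. This is only an illustration: the class `FieldCategories`, the `TARGETS` mapping, and the `expand` helper are hypothetical names, not part of any Lucene API. In a real indexer each resulting pair would typically become one field added to the document.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FieldCategories {
    // Hypothetical mapping: source attribute -> index fields that receive a copy of its value.
    static final Map<String, List<String>> TARGETS = new LinkedHashMap<>();
    static {
        TARGETS.put("firstName", List.of("firstName", "name", "general", "all"));
        TARGETS.put("lastName", List.of("lastName", "name", "general", "all"));
    }

    // Expand a record into (field, value) pairs, one pair per target field.
    // Attributes without a mapping are indexed only under their own name.
    static List<String[]> expand(Map<String, String> record) {
        List<String[]> pairs = new ArrayList<>();
        for (Map.Entry<String, String> e : record.entrySet()) {
            List<String> fields = TARGETS.getOrDefault(e.getKey(), List.of(e.getKey()));
            for (String field : fields) {
                pairs.add(new String[] {field, e.getValue()});
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        Map<String, String> ted = new LinkedHashMap<>();
        ted.put("firstName", "Ted");
        ted.put("lastName", "Bloggs");
        // Print every (field, value) pair that would be indexed for this person.
        for (String[] pair : expand(ted)) {
            System.out.println(pair[0] + " = " + pair[1]);
        }
    }
}
```

Keeping the category mapping in one place means a new category (or a new source attribute) is a one-line change, and the number of copies written per value is easy to see.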

Can anyone advise?

Recommended answer

@aishwarya is right, but to expand on it a little bit more:

From the docs:


This file is sorted by Term. Terms are ordered first lexicographically (by UTF16 character code) by the term's field name, and within that lexicographically (by UTF16 character code) by the term's text.

Each term is stored once per field, so if you repeat every term in five fields, the storage for those terms will be about five times bigger. However, the size of the term dictionary is roughly logarithmic with respect to the size of the raw data, so I doubt you will have a problem.
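To make the "once per field" point concrete, here is a toy model of a term dictionary keyed by field name and then by term text. It is a sketch only, not Lucene's actual data structure: `TermDictionaryModel`, `add`, `size`, and `indexPerson` are hypothetical names. It relies on Java's `String` ordering, which compares UTF-16 code units and so matches the ordering described in the quoted documentation.

```java
import java.util.TreeMap;
import java.util.TreeSet;

public class TermDictionaryModel {
    // field name -> distinct term texts; both levels are kept in sorted order.
    private final TreeMap<String, TreeSet<String>> terms = new TreeMap<>();

    // Record a term; a (field, text) pair that already exists is not stored again.
    void add(String field, String text) {
        terms.computeIfAbsent(field, f -> new TreeSet<>()).add(text);
    }

    // Number of distinct (field, text) entries in the dictionary.
    int size() {
        int count = 0;
        for (TreeSet<String> texts : terms.values()) {
            count += texts.size();
        }
        return count;
    }

    // Hypothetical helper mirroring the field layout from the question.
    void indexPerson(String first, String last) {
        add("firstName", first);
        add("lastName", last);
        for (String field : new String[] {"name", "general", "all"}) {
            add(field, first);
            add(field, last);
        }
    }

    public static void main(String[] args) {
        TermDictionaryModel dict = new TermDictionaryModel();
        dict.indexPerson("Ted", "Bloggs");
        System.out.println("entries after first person: " + dict.size());
        // "Ted" is already present in every field it goes into, so only "Smith" adds entries.
        dict.indexPerson("Ted", "Smith");
        System.out.println("entries after second person: " + dict.size());
    }
}
```

The dictionary grows with the number of distinct (field, term) pairs, not with the number of documents, which is why repeating values across fields multiplies the dictionary by a constant factor rather than changing how it scales.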

The performance penalty will be essentially non-existent (Lucene caches where each field starts), except insofar as the extra data forces other data out of memory. For most search infrastructures, you'll probably have an index of under a few GB, which will easily fit in memory anyway.
