Lucene encoding, Java


Question

I have a question about encoding in Lucene (Java).

How does Lucene handle character encoding? What is the default, and how can I set it?

Or does Lucene not care about the encoding at all, so that it is only a matter of how I add a string to a document in the indexing phase (Java code below) and how I then search the index?

In other words, do I have to worry about whether the input text is in UTF-8, and whether the queries are in UTF-8 as well?

Document doc = new Document();
doc.add(new TextField(tagName, object.getName(), Field.Store.YES));

Thanks for your help.

Recommended answer

Lucene stores terms as UTF-8 bytes (see Lucene's BytesRef class). Java's String, by contrast, presents text as UTF-16. To bridge the two, BytesRef gives you a constructor that converts UTF-16 to UTF-8, so a Java String can be used without any issues.
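The UTF-16 to UTF-8 conversion described above can be tried out with the JDK alone. The sketch below does not use Lucene; the class name `Utf8RoundTrip` and its methods are made up for illustration. In Lucene itself, the analogous steps would be the `BytesRef(CharSequence)` constructor and `BytesRef.utf8ToString()`.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of Lucene.
final class Utf8RoundTrip {

    // Encode a Java String (UTF-16 code units) into UTF-8 bytes.
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Decode UTF-8 bytes back into a Java String.
    static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String name = "caf\u00e9"; // "café": the last character is non-ASCII
        byte[] utf8 = toUtf8(name);

        System.out.println(name.length()); // 4 UTF-16 code units
        System.out.println(utf8.length);   // 5 bytes, because é takes two bytes in UTF-8
        System.out.println(fromUtf8(utf8).equals(name)); // true: the round trip is lossless
    }
}
```

The point of the example is that the String itself carries no "input encoding": once the characters are correct in the String, the conversion to UTF-8 is mechanical.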

For example, the TextField in your code takes a String as its field value. If you use some other type of Field that takes a byte[], you need to make sure those bytes are UTF-8.
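If you do end up handing over raw bytes, one way to check that they really are UTF-8 is a strict decode with the JDK's `CharsetDecoder`. This is only a sketch; `Utf8Check` and `isValidUtf8` are hypothetical names, and nothing here is a Lucene API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of Lucene.
final class Utf8Check {

    // Returns true only if the whole array decodes as well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)      // fail instead of substituting
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "\u00e9".getBytes(StandardCharsets.UTF_8);        // é as UTF-8: C3 A9
        byte[] latin1 = "\u00e9".getBytes(StandardCharsets.ISO_8859_1); // é as Latin-1: E9

        System.out.println(isValidUtf8(utf8));   // true
        System.out.println(isValidUtf8(latin1)); // false: a lone E9 is not valid UTF-8
    }
}
```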

While querying, Lucene always gives you terms as UTF-8 bytes, but you can convert them to a Java String with a method provided by the same class. Once you have the String, you can re-encode the text into any other character set you need.
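Going from UTF-8 bytes to a String, and from there to another character set, might look like the following JDK-only sketch. The class `Transcode` is a hypothetical name; with Lucene you would obtain the String via `BytesRef.utf8ToString()` rather than `new String(...)`.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of Lucene.
final class Transcode {

    // Decode UTF-8 bytes to a String first, then encode into the target charset.
    static byte[] transcode(byte[] utf8, Charset target) {
        String text = new String(utf8, StandardCharsets.UTF_8);
        return text.getBytes(target);
    }

    public static void main(String[] args) {
        byte[] utf8 = "\u00e9".getBytes(StandardCharsets.UTF_8); // é as two bytes: C3 A9

        byte[] latin1 = transcode(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(latin1.length); // 1: é is the single byte E9 in ISO-8859-1

        // Reading the UTF-8 bytes directly as ISO-8859-1 skips the decode step
        // and yields two wrong characters ("Ã©") instead of "é".
        String garbled = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(garbled.equals("\u00c3\u00a9")); // true
    }
}
```

Decoding with the charset the bytes were written in, and only then encoding to the target, is what keeps the characters intact.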

You have to take care of character encoding yourself: as long as you get the characters right in a Java String, you should be fine. For example, if the data you are indexing comes from an XML file in a different character set, or is read from a database in a different character set, you have to make sure that the JVM doing the indexing reads these sources with the correct character set.
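As a rough illustration of reading a source properly, the sketch below decodes an input stream with an explicitly named charset instead of relying on the platform default. `SourceDecoding` and `readAll` are hypothetical names, and the in-memory stream stands in for a real file or database column that is assumed to be ISO-8859-1.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of Lucene.
final class SourceDecoding {

    // Read the whole stream into a String, decoding with the source's own charset.
    static String readAll(InputStream in, Charset sourceCharset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, sourceCharset)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "José" encoded as ISO-8859-1: the last byte is E9.
        byte[] latin1 = {'J', 'o', 's', (byte) 0xE9};

        String right = readAll(new ByteArrayInputStream(latin1), StandardCharsets.ISO_8859_1);
        String wrong = readAll(new ByteArrayInputStream(latin1), StandardCharsets.UTF_8);

        System.out.println(right.equals("Jos\u00e9")); // true: characters are correct
        System.out.println(wrong.equals("Jos\u00e9")); // false: wrong charset corrupts the text
    }
}
```

A String obtained this way can then be passed to a field such as the TextField in the question; the mistake to avoid is decoding the source with the wrong charset before the text ever reaches Lucene.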
