Lucene encoding, Java
Question
I have questions about character encoding in Lucene (Java).
How does Lucene handle encoding? What is the default, and how can I set it?
Or does Lucene not care about the encoding at all, so that it only matters how I add a string to a document in the indexing phase (Java code below) and how I then search the index?
In other words, do I have to worry about whether the input text is in UTF-8 and whether the queries are also in UTF-8?
Document doc = new Document();
doc.add(new TextField(tagName, object.getName(), Field.Store.YES));
Thanks for your help.
Recommended answer
Lucene stores terms as UTF-8 bytes (see Lucene's BytesRef class). A Java String is a sequence of UTF-16 code units. BytesRef provides a constructor that converts UTF-16 text to UTF-8, so you can pass Java Strings to Lucene without any issues.
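To make the UTF-16 to UTF-8 step concrete, here is a minimal sketch that uses only the JDK. It is not Lucene's own code; the class and method names are hypothetical, and it only illustrates the kind of conversion that happens between a Java String and UTF-8 bytes.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative and not part of Lucene.
class Utf8RoundTrip {
    // Encode a Java String (UTF-16 code units) as UTF-8 bytes.
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Decode UTF-8 bytes back into a Java String.
    static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String name = "Z\u00fcrich"; // "Zürich", written with an escape so the source file encoding does not matter
        byte[] utf8 = toUtf8(name);
        System.out.println(name.length()); // 6 UTF-16 code units
        System.out.println(utf8.length);   // 7 bytes, because U+00FC needs two bytes in UTF-8
        System.out.println(fromUtf8(utf8).equals(name)); // true
    }
}
```

The String itself carries no "encoding setting"; the encoding only appears at the moment the characters are turned into bytes.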
For example, the TextField used in your code takes a String as the field value. If you use some other kind of field that takes a byte[] holding text, you need to make sure those bytes are UTF-8.
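If you do hand text to a field as raw bytes, one way to guard against wrongly encoded input is to check that the bytes decode cleanly as UTF-8 first. The following is a sketch using only the JDK; the helper name isValidUtf8 is hypothetical and not a Lucene API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of Lucene.
class Utf8Check {
    // Returns true only if the whole array is well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)      // fail instead of substituting U+FFFD
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(isValidUtf8(utf8));   // true
        System.out.println(isValidUtf8(latin1)); // false: a lone 0xE9 byte is not valid UTF-8
    }
}
```

A check like this catches many mistakes, but not all: some byte sequences are valid in several character sets, so knowing the real source encoding is still the reliable fix.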
When you read terms back at query time, Lucene always gives you UTF-8 bytes, but you can convert them to a Java String with a method on the same class (BytesRef.utf8ToString()). Once you have the String, you can re-encode the text into any other character set you need.
You have to take care of the character encoding of your inputs yourself: as long as the characters are correct once they are in a Java String, you should be fine. For example, the data you are indexing may come from an XML file in a different character set, or from a database that uses a different character set. You have to make sure these sources are decoded correctly in the JVM that does the indexing.
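As a sketch of what "reading the source properly" means, the example below decodes the same bytes twice: once with the charset they were actually written in, and once with a wrong one. The class name, the readAll helper, and the ISO-8859-1 sample data are assumptions made for illustration only.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Lucene.
class SourceDecoding {
    // Read the whole stream into a String, decoding with the given charset.
    static String readAll(InputStream in, Charset sourceCharset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, sourceCharset)) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Pretend this came from a file or database that stores ISO-8859-1.
        byte[] raw = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);

        String right = readAll(new ByteArrayInputStream(raw), StandardCharsets.ISO_8859_1);
        String wrong = readAll(new ByteArrayInputStream(raw), StandardCharsets.UTF_8);

        System.out.println(right.equals("caf\u00e9")); // true: safe to index
        System.out.println(wrong.equals("caf\u00e9")); // false: the last character was corrupted during decoding
    }
}
```

Only the correctly decoded String should be passed to a field such as TextField; the same care applies to the query text, so that index and query agree on the characters.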