Lucene encoding, Java

Views: 85 Published: 2020/5/4 7:57:21 Tags: java utf-8 lucene


Question

I have a question about encoding in Lucene (Java).

How does Lucene handle character encoding? What is the default, and how can I set it?

Or does Lucene not care about the encoding, so that it only matters how a string is added to a document in the indexing phase (Java code below) and how the index is searched afterwards?

In other words, do I have to worry about whether the input text and the query are both in UTF-8?

Document doc = new Document();
doc.add(new TextField(tagName, object.getName(), Field.Store.YES));

Thanks for your help.

Recommended answer

Lucene stores terms as UTF-8 bytes (see Lucene's BytesRef class). Java's String uses UTF-16 internally, so BytesRef provides a constructor that converts UTF-16 to UTF-8. Java Strings can therefore be used without any issues.
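The UTF-16 to UTF-8 conversion described above can be illustrated with the JDK alone. The following is a minimal sketch, not Lucene's own code: the class and method names (Utf8RoundTrip, toUtf8, fromUtf8) are hypothetical, and the standard library's charset support stands in for what BytesRef does when it is built from a String.

```java
import java.nio.charset.StandardCharsets;

class Utf8RoundTrip {
    // Encode a Java String (UTF-16 internally) as UTF-8 bytes.
    static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Decode UTF-8 bytes back into a Java String.
    static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String term = "caf\u00e9"; // "café"
        byte[] utf8 = toUtf8(term);
        System.out.println(term.length());               // 4 UTF-16 code units
        System.out.println(utf8.length);                 // 5 bytes: é needs two bytes in UTF-8
        System.out.println(fromUtf8(utf8).equals(term)); // true: the round trip is lossless
    }
}
```

The point of the sketch is that the byte representation is an internal detail: the same characters come back out as long as both directions use UTF-8.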

For example, the TextField used in your code takes a String as the field value. If you use some other type of Field that takes a byte[], you need to make sure the bytes are UTF-8.
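If the bytes you hold are in some other encoding, one way to get UTF-8 bytes is to decode them with their real charset and re-encode. This is a sketch using only the JDK; the class name TranscodeToUtf8 and the method transcode are hypothetical, and ISO-8859-1 is just an assumed source charset for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class TranscodeToUtf8 {
    // Decode the bytes with the charset they were written in, then re-encode as UTF-8.
    static byte[] transcode(byte[] input, Charset sourceCharset) {
        String decoded = new String(input, sourceCharset);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] latin1 = {(byte) 0xE9}; // "é" encoded as ISO-8859-1 (assumed source charset)
        byte[] utf8 = transcode(latin1, StandardCharsets.ISO_8859_1);
        System.out.println(utf8.length); // 2: é is the byte pair 0xC3 0xA9 in UTF-8
    }
}
```

The resulting array is what a byte[]-based field would need; passing the original ISO-8859-1 bytes instead would store a malformed UTF-8 sequence.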

When querying, Lucene always gives you terms as UTF-8 bytes, but you can convert them to a Java String with a method provided in the same class (BytesRef). If you need the text in another character set, you can convert it from there.

You have to take care of character encoding yourself: as long as the characters are correct in the Java String, you should be fine. For example, if the data you are indexing comes from an XML file in a different character set, or is read from a database that uses a different character set, you have to make sure these sources are decoded correctly in the JVM that does the indexing.
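Reading a source correctly usually means naming its charset explicitly instead of relying on the JVM default. The sketch below uses only the JDK; ExplicitCharsetRead and readAll are hypothetical names, and an in-memory ISO-8859-1 byte array stands in for the XML file or database column.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

class ExplicitCharsetRead {
    // Read the whole stream as text, decoding with the charset the source was written in.
    static String readAll(InputStream in, Charset charset) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            return reader.lines().collect(Collectors.joining("\n"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for an external source that was written as ISO-8859-1.
        byte[] source = "Z\u00fcrich".getBytes(StandardCharsets.ISO_8859_1);

        String right = readAll(new ByteArrayInputStream(source), StandardCharsets.ISO_8859_1);
        String wrong = readAll(new ByteArrayInputStream(source), StandardCharsets.UTF_8);

        System.out.println(right.equals("Z\u00fcrich")); // true
        System.out.println(wrong.equals("Z\u00fcrich")); // false: 0xFC is malformed UTF-8 and gets replaced
    }
}
```

Once the String is right, it can be handed to a field such as TextField as in the question, and Lucene handles the UTF-8 conversion from there.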
