UTF-8 Encoding Problem in Java Text Output

Views: 560 | Published: 2019/1/8 19:03:42 | Tags: java, utf-8, nlp


Problem Description

I've been testing various solutions for a Khmer Unicode word breaker. Khmer does not put spaces between words, which makes spell checking and grammar checking difficult, as well as converting from legacy Khmer into Khmer Unicode.

I was given some source code, now online ( http://www.whitemagicsoftware.com/software/java/wordsplit/ ), that seems promising. The author was kind enough to share the source, but he is very busy writing a book and is unable to troubleshoot.

I am testing the code on a very small scale, and I am having trouble with the output.

Here is the input:

ជាដែលនឹងបានមាន

Here is the resulting output:

ជារ��លនឹងបានមាន,ជា រ��ល នឹង បាន មាន

The words are actually split correctly, but one word is garbled. The output should look like this:

ជាដែលនឹងបានមាន, ជា ដែល នឹង បាន មាន

Does anyone have any insight into why the output is garbled?

Here's the code, with a very small Khmer lexicon and the words to be split: http://www.sbbic.org/khmerwordsplit.zip

Here is how to run it:

java -jar wordsplit.jar khmerlexicon.csv khmercolumns.txt >> results.txt

I am very grateful to the Stack Overflow community for all the help you have provided with this project so far. I hope a solution will be found soon!

Recommended Answer

I noticed that it works correctly when the system encoding is configured as UTF-8:

java -Dfile.encoding=UTF-8 -jar wordsplit.jar khmerlexicon.csv khmercolumns.txt >> results.txt
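To see which encoding a JVM falls back to when none is specified, a small check along these lines may help. The class name DefaultCharsetCheck and the method describe are made up for illustration; Charset.defaultCharset() and the file.encoding system property are standard Java. Note that on Java 18 and later the default charset is UTF-8 unless overridden, so the flag above mainly matters on older JVMs.

```java
import java.nio.charset.Charset;

public class DefaultCharsetCheck {
    // Returns a one-line summary of the encoding the JVM uses
    // when a reader or writer is created without an explicit charset.
    public static String describe() {
        return "default charset = " + Charset.defaultCharset().name()
                + ", file.encoding = " + System.getProperty("file.encoding");
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running it once with and once without -Dfile.encoding=UTF-8 shows whether the flag changes the default on that machine.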

Perhaps the input file is assumed to be in the system encoding. Read BalusC's post mentioned in the comments to see how to perform input/output independently of the system encoding.
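As a rough illustration of input/output that does not depend on the system encoding, the sketch below names UTF-8 explicitly for every read and write. It is not the wordsplit code, and how wordsplit actually opens its files is not known here; the class Utf8Io and its methods are hypothetical names, while the Files, StandardCharsets, and PrintStream calls are standard Java.

```java
import java.io.IOException;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class Utf8Io {
    // Reads all lines of a file as UTF-8, ignoring the JVM default charset.
    public static List<String> readUtf8(Path path) throws IOException {
        return Files.readAllLines(path, StandardCharsets.UTF_8);
    }

    // Writes the given lines to a file as UTF-8, ignoring the JVM default charset.
    public static void writeUtf8(Path path, List<String> lines) throws IOException {
        Files.write(path, lines, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Wrap standard output so that redirected output (>> results.txt)
        // is encoded as UTF-8 as well.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        for (String line : readUtf8(Paths.get(args[0]))) {
            out.println(line);
        }
    }
}
```

With the charset stated at each step, the same bytes are read and written whether or not -Dfile.encoding=UTF-8 is passed on the command line.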
