UTF-8 Encoding Problem in Java Text Output


Problem Description

I've been testing various solutions for a Khmer Unicode word breaker. (Khmer does not put spaces between words, which makes spell checking and grammar checking difficult, as well as converting from legacy Khmer into Khmer Unicode.)

I was given some source code which is now online ( http://www.whitemagicsoftware.com/software/java/wordsplit/ ) that seems promising. The author was kind enough to give the source, but he is very busy writing a book and is unable to troubleshoot.

I am testing the code on a very small scale, and I am having trouble with the output.

Here is the input:

ជាដែលនឹងបានមាន

Here is the resulting output:

ជារ���លនឹងបានមាន,ជា រ���ល នឹង បាន មាន

The words are actually split correctly, but one word is jumbled. The output should look like this:


ជាដែលនឹងបានមាន, ជា ដែល នឹង បាន មាន

Does anyone have an insight as to why the output is garbled?

Here's the code with a very small Khmer lexicon and words to be split: http://www.sbbic.org/khmerwordsplit.zip

Here is how to run it:

java -jar wordsplit.jar khmerlexicon.csv khmercolumns.txt >> results.txt

I am very grateful to the stackoverflow community for all the help you have provided with this project so far - I hope a solution is soon to be found!

Recommended Answer

I noticed that it works correctly when the system encoding is configured as UTF-8:

java -Dfile.encoding=UTF-8 -jar wordsplit.jar khmerlexicon.csv khmercolumns.txt >> results.txt
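
To illustrate why the default encoding can matter, here is a minimal sketch of one plausible mechanism. It rests on assumptions the question does not confirm: that the default charset on the asker's machine was windows-1252, and that the program changes the case of the text somewhere along the way. The class and method names (DefaultCharsetDemo, roundTripThrough) are hypothetical and do not come from the wordsplit source. The sketch decodes the UTF-8 bytes of one Khmer word with the wrong charset, lowercases the result, writes it back with the same wrong charset, and then reads the bytes as UTF-8 again.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class DefaultCharsetDemo {

    // Simulates text that is stored as UTF-8 but read and written with another charset,
    // with a case conversion in between (a hypothetical processing step).
    static String roundTripThrough(String text, Charset assumedCharset) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        // Decoding with the wrong charset turns every byte into an unrelated character.
        String misdecoded = new String(utf8Bytes, assumedCharset);
        // Some of those unrelated characters have lowercase forms, so this changes bytes.
        String lowered = misdecoded.toLowerCase(Locale.ROOT);
        byte[] writtenBytes = lowered.getBytes(assumedCharset);
        // A viewer that expects UTF-8 interprets the altered bytes like this.
        return new String(writtenBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println("Default charset: " + Charset.defaultCharset());

        String word = "\u178A\u17C2\u179B"; // the Khmer word that was garbled in the question
        String damaged = roundTripThrough(word, Charset.forName("windows-1252"));
        String intact = roundTripThrough(word, StandardCharsets.UTF_8);

        System.out.println("Survives windows-1252 round trip: " + word.equals(damaged));
        System.out.println("Survives UTF-8 round trip: " + word.equals(intact));
    }
}
```

Khmer letters have no upper or lower case, so the same steps leave the word untouched when the charset really is UTF-8. Whether wordsplit.jar fails in exactly this way would need to be checked against its source.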

Perhaps the program assumes that the input file is in the system encoding. Read BalusC's post mentioned in the comments to see how to perform input/output independently of the system encoding.
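
As a rough sketch of what encoding-independent I/O can look like in Java, the example below names the charset explicitly on both the reading and the writing side instead of relying on the platform default. The class and method names (ExplicitUtf8Io, readUtf8Lines, utf8Stream) are made up for illustration and are not part of the wordsplit source. It assumes that the input file really is UTF-8 and that the output is meant to be UTF-8 too.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class ExplicitUtf8Io {

    // Reads all lines of a file, decoding the bytes as UTF-8 regardless of the platform default.
    static List<String> readUtf8Lines(Path file) throws IOException {
        return Files.readAllLines(file, StandardCharsets.UTF_8);
    }

    // Wraps any output stream (for example System.out) so that text is encoded as UTF-8.
    static PrintStream utf8Stream(OutputStream out) throws UnsupportedEncodingException {
        return new PrintStream(out, true, "UTF-8");
    }

    public static void main(String[] args) throws IOException {
        // Usage: java ExplicitUtf8Io <input file>
        PrintStream stdout = utf8Stream(System.out);
        for (String line : readUtf8Lines(Paths.get(args[0]))) {
            stdout.println(line);
        }
        stdout.flush();
    }
}
```

With this approach the result of redirecting the output to a file no longer depends on the -Dfile.encoding setting. Files.readAllLines throws an exception if the file contains bytes that are not valid UTF-8, which makes a wrongly encoded input visible instead of silently garbled.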
