File encoded as UCS-2 Little Endian reports 2x too many lines to Java

Question

I was processing several txt files with a simple Java program, and the first step of my process is counting the lines of each file:

int count = 0;
BufferedReader br = new BufferedReader(new FileReader(myFile)); // myFile is the txt file in question; FileReader decodes with the JVM's default charset
while (br.readLine() != null) {
    count++;
}

For one of my files, Java was counting exactly twice as many lines as there really were! This was confusing me greatly at first. I opened each file in Notepad++ and could see that the mis-counting file ended every line in exactly the same way as the other files, with a CR and LF. I did a little more poking around and noticed that all my "ok" files were ANSI encoded, and the one problem file was encoded as UCS-2 Little Endian (which I know nothing about). I got these files elsewhere, so I have no idea why the one was encoded that way, but of course switching it to ANSI fixed the issue.

But now curiosity remains. Why was the encoding causing a double line count report?

Thanks!

Recommended answer

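Simple: if you read UCS-2 (or UTF-16) text with the wrong encoding (e.g. ANSI, or any other 8-bit encoding), each 16-bit code unit is decoded as two separate characters, so for ASCII-range text every second character is 0x0. In little-endian order the CR-LF bytes are 0D 00 0A 00, so the pair reaches the reader as CR-0-LF-0. Because the CR and the LF are no longer adjacent, they are seen as two line changes (one for CR and one for LF).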
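
To see the effect without touching a real file, here is a minimal sketch that decodes the same UTF-16LE bytes twice, once with an 8-bit charset and once with the correct one. The class name LineCountDemo, the helper countLines and the two-line sample text are made up for illustration; ISO_8859_1 stands in for "ANSI" here, on the assumption that any single-byte charset behaves the same way for these bytes.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class LineCountDemo {

    // Hypothetical helper: counts lines with the same readLine() loop as in
    // the question, but with the charset passed in explicitly.
    static int countLines(byte[] data, Charset charset) throws IOException {
        int count = 0;
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(data), charset))) {
            while (br.readLine() != null) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Two lines of text, each ended by CR LF, encoded as UTF-16LE (no BOM).
        byte[] data = "a\r\nb\r\n".getBytes(StandardCharsets.UTF_16LE);

        // Wrong charset: the 0x00 bytes become NUL characters that separate CR from LF.
        System.out.println("8-bit decoding:    " + countLines(data, StandardCharsets.ISO_8859_1));

        // Correct charset: CR LF is decoded as an adjacent pair again.
        System.out.println("UTF-16LE decoding: " + countLines(data, StandardCharsets.UTF_16LE));
    }
}
```

For a file on disk, the same idea would probably look like `Files.newBufferedReader(path, StandardCharsets.UTF_16)` in place of `new FileReader(myFile)`. `UTF_16` picks the byte order from the byte order mark, which Notepad++ normally writes for files it labels UCS-2 Little Endian; if the file has no BOM, `UTF_16LE` would be the charset to try.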
