Curly quotes causing Java Scanner hasNextLine() to be false -- why?


Problem description


I've been having an issue getting java.util.Scanner to read a text file I saved in Notepad, even though it works fine with others. Basically, when it tries to read the problem file, it comes up completely empty-handed -- hasNextLine() is false, the buffer is empty, etc. I narrowed it down to the fact that it won't even read the first line if there is a curly quote anywhere in the file. No exceptions are thrown. Note that a BufferedReader on the same file doesn't have a problem.

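// Fragment: needs java.io.* and java.util.Scanner imported.
try {
    // First pass: count the lines with a Scanner (platform default charset).
    int count = 0;
    Scanner scanner = new Scanner(new File("C:/myfile.txt"));

    while (scanner.hasNextLine()) {
        count++;
        scanner.nextLine();
    }

    scanner.close();
    System.out.print(count);

    // Second pass: count the lines of the same file with a BufferedReader.
    count = 0;
    BufferedReader reader = new BufferedReader(new FileReader("C:/myfile.txt"));

    while (reader.readLine() != null) {
        count++;
    }

    reader.close();
    System.out.print(count);
}
catch (IOException e) {
    e.printStackTrace();
}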

The above code, reading a file that contains nothing but a single curly quote, prints out "01". Searches on Google led me to try this:

Scanner scanner = new Scanner(new File("C:/myfile.txt"), "ISO-8859-1");

This makes it work (i.e., it prints out "11"). I also noticed that if I go into Notepad and do a Save As..., the default encoding at the bottom is "ANSI". If I change this to "UTF-8" and save the file, then the scanner (without an encoding) also works. If I tell the scanner "UTF-8", then understandably it only works if I save as UTF-8, but "ISO-8859-1" seems to make it work even if I save it as "ANSI".

So, I know it has something to do with file encoding, but the problem is I don't understand anything about file encoding. My knowledge of what "ISO-8859-1" means is extremely vague; why does that make it work no matter how I save the file? Why does BufferedReader work regardless?

EDIT:

The links/comments below really helped point me in the right direction! I think I've got it figured out.

First of all, in Notepad:

  • "ANSI" is CP1252
  • "Unicode" is UTF-16LE
  • "UTF-8" is... well, UTF-8

In hexadecimal, a curly apostrophe is represented as:

  • CP1252: 92
  • UTF-16LE: 19 20
  • UTF-8: E2 80 99
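
The byte values in the list above can be reproduced with a short sketch. The class and method names here (CurlyApostropheBytes, hex, encode) are made up for illustration, and "windows-1252" is used as Java's name for what Notepad calls "ANSI" on a Western-European system, which is an assumption about the asker's machine.

```java
import java.nio.charset.Charset;

class CurlyApostropheBytes {
    // Formats a byte array as space-separated uppercase hex, e.g. "E2 80 99".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encodes the curly apostrophe (U+2019) in the named charset and returns its hex bytes.
    static String encode(String charsetName) {
        return hex("\u2019".getBytes(Charset.forName(charsetName)));
    }

    public static void main(String[] args) {
        System.out.println("CP1252:   " + encode("windows-1252")); // 92
        System.out.println("UTF-16LE: " + encode("UTF-16LE"));     // 19 20
        System.out.println("UTF-8:    " + encode("UTF-8"));        // E2 80 99
    }
}
```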

The default encoding Java uses on my system, according to Charset.defaultCharset(), is UTF-8. So when I saved the file in UTF-8, the scanner knew what to expect. When I saved the file in CP1252, however, it choked once it hit that "92", because a lone 0x92 byte is not a valid way to represent a character in UTF-8. It works fine as long as there aren't any such characters in the file -- the bytes for "hello world" happen to be the same in both CP1252 and UTF-8, so they don't cause a problem.

A scanner decoding as UTF-8 doesn't work with a UTF-16 file, because it doesn't know what to do with the byte order mark ("FF FE"), regardless of what characters are in the file.
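
To check the two claims above without involving Scanner at all, here is a minimal sketch using a strict UTF-8 decoder. The class and method names are hypothetical. It relies on the fact that a freshly created CharsetDecoder reports malformed input rather than replacing it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class StrictUtf8Check {
    // Returns true if the bytes are well-formed UTF-8.
    // A new decoder's default action on malformed input is to report it as an exception.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Returns true if the text encodes to identical bytes in CP1252 and UTF-8.
    static boolean sameBytesInBoth(String text) {
        byte[] cp1252 = text.getBytes(Charset.forName("windows-1252"));
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(cp1252, utf8);
    }

    public static void main(String[] args) {
        // Plain ASCII text is byte-for-byte the same in both encodings.
        System.out.println(sameBytesInBoth("hello world")); // true
        // The CP1252 curly apostrophe on its own is not valid UTF-8.
        System.out.println(isValidUtf8(new byte[] {(byte) 0x92})); // false
        // Start of a UTF-16LE file: BOM FF FE, then 'h' as 68 00.
        System.out.println(isValidUtf8(new byte[] {(byte) 0xFF, (byte) 0xFE, 0x68, 0x00})); // false
        // The UTF-8 form of the curly apostrophe decodes fine.
        System.out.println(isValidUtf8(new byte[] {(byte) 0xE2, (byte) 0x80, (byte) 0x99})); // true
    }
}
```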

On the other hand, when I set the scanner to CP1252 or ISO-8859-1, it's much more tolerant. It doesn't necessarily interpret the characters correctly, mind you, but there's nothing that prevents it from recognizing lines in the file and looping through.
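
A small sketch of why the single-byte encodings are tolerant but not necessarily correct: both accept byte 0x92, yet they map it to different characters. The class and method names are invented for this example.

```java
import java.nio.charset.Charset;

class SingleByteDecoding {
    // Decodes a single byte with the named charset and returns the resulting code point.
    static int decodeByte(int value, String charsetName) {
        String decoded = new String(new byte[] {(byte) value}, Charset.forName(charsetName));
        return decoded.codePointAt(0);
    }

    public static void main(String[] args) {
        // CP1252 maps 0x92 to the curly apostrophe the author typed.
        System.out.printf("windows-1252: U+%04X%n", decodeByte(0x92, "windows-1252")); // U+2019
        // ISO-8859-1 maps every byte to the code point of the same value: a C1 control character here.
        System.out.printf("ISO-8859-1:   U+%04X%n", decodeByte(0x92, "ISO-8859-1"));   // U+0092
        // The line feed byte is the same in both, so line counting still works.
        System.out.printf("line feed:    U+%04X%n", decodeByte(0x0A, "ISO-8859-1"));   // U+000A
    }
}
```

So with ISO-8859-1 the loop runs, but the apostrophe comes back as an invisible control character rather than U+2019.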

As for why Scanner has a problem but the FileReader/BufferedReader does not, I am going to guess that it's because the scanner needs to tokenize the file, i.e., interpret the characters so it can identify whitespace and other patterns, so it chokes when there's something unrecognizable. The reader doesn't need to do that. All it needs to identify are the line breaks.
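
One way to test that guess is to ask the Scanner what went wrong. Scanner records an IOException raised by its underlying source instead of throwing it, and exposes it through ioException(). The sketch below is an illustration, not the asker's code: the class name, helper names and temporary file are hypothetical, and the charset is passed explicitly so the result does not depend on the platform default. On the OpenJDK builds I know of, I would expect the UTF-8 Scanner to count 0 lines and report a decoding error, while the reader (which substitutes a replacement character for undecodable bytes) and the ISO-8859-1 Scanner each count 1.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Scanner;

class ScannerVersusReader {
    // Writes the bytes to a temporary file (a stand-in for the asker's C:/myfile.txt).
    static Path tempFile(byte[] bytes) {
        try {
            Path path = Files.createTempFile("curly-quote", ".txt");
            Files.write(path, bytes);
            path.toFile().deleteOnExit();
            return path;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Counts lines with a Scanner opened on the file with an explicit charset.
    static int scannerLines(Path path, String charsetName) {
        try (Scanner scanner = new Scanner(path.toFile(), charsetName)) {
            int count = 0;
            while (scanner.hasNextLine()) {
                count++;
                scanner.nextLine();
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns the exception the Scanner recorded while reading, or null if there was none.
    static IOException scannerError(Path path, String charsetName) {
        try (Scanner scanner = new Scanner(path.toFile(), charsetName)) {
            scanner.hasNextLine(); // forces a read attempt
            return scanner.ioException();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Counts lines with a BufferedReader over an InputStreamReader in the same charset.
    static int readerLines(Path path, String charsetName) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new FileInputStream(path.toFile()), Charset.forName(charsetName)))) {
            int count = 0;
            while (reader.readLine() != null) {
                count++;
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path file = tempFile(new byte[] {(byte) 0x92}); // one CP1252 curly apostrophe
        System.out.println("Scanner, UTF-8:      " + scannerLines(file, "UTF-8"));
        System.out.println("Scanner error:       " + scannerError(file, "UTF-8"));
        System.out.println("Reader, UTF-8:       " + readerLines(file, "UTF-8"));
        System.out.println("Scanner, ISO-8859-1: " + scannerLines(file, "ISO-8859-1"));
    }
}
```

If that holds on your JVM, the difference is less about tokenizing and more about how each class handles bytes it cannot decode: checking scanner.ioException() after an unexpectedly empty read is a cheap diagnostic.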

Solution

If you don't specify an encoding when you create the scanner, it uses the platform default charset (whatever Charset.defaultCharset() returns). It does not work out the encoding from a byte order mark (BOM), which is an optional marker in the first few bytes of a file. On Windows the default has traditionally been cp-1252, but it can differ, so it is worth checking. Notepad's "ANSI" option saves your text file as cp-1252, which is similar to, but not the same as, ISO-8859-1. See this link for more details:

http://www.i18nqa.com/debug/table-iso8859-1-vs-windows-1252.html

When you save it as UTF-8, Notepad may also place the UTF-8 BOM (EF BB BF) at the beginning of the file. The scanner does not use the BOM to choose an encoding; the file reads correctly when the charset in use is UTF-8 because all of its bytes are valid UTF-8, and the BOM simply shows up as a leading U+FEFF character.

If you want to look more into BOMs, look them up on Wikipedia; the article is quite good. You can also download PSPad and open the text file in hex mode to see the individual bytes. Hope that helps :)
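
If you would rather inspect the first bytes from Java than from a hex editor, a minimal sketch follows. BomSniffer and its methods are hypothetical names, not part of the JDK, and the check only covers the three BOMs relevant to Notepad's save options (it does not distinguish UTF-32LE, whose BOM also starts with FF FE).

```java
class BomSniffer {
    // Returns true if data begins with the given byte values.
    static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Guesses a charset name from a leading BOM, or returns null if no known BOM is present.
    static String charsetFromBom(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return "UTF-8";
        }
        if (startsWith(head, 0xFF, 0xFE)) {
            return "UTF-16LE";
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return "UTF-16BE";
        }
        return null;
    }

    public static void main(String[] args) {
        // UTF-8 BOM followed by the curly apostrophe.
        byte[] utf8WithBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, (byte) 0xE2, (byte) 0x80, (byte) 0x99};
        // A CP1252 ("ANSI") file has no BOM at all.
        byte[] ansi = {(byte) 0x92};
        System.out.println(charsetFromBom(utf8WithBom)); // UTF-8
        System.out.println(charsetFromBom(ansi));        // null
    }
}
```

The detected name could then be passed as the second argument to the Scanner constructor; when no BOM is found, the encoding still has to be known or guessed some other way.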
