Cannot find ZERO WIDTH NO-BREAK SPACE when reading file

Question

I've run into a problem when trying to parse a JSON string that I read from a file. The zero width no-break space character (Unicode U+FEFF) is at the beginning of my string when I read it in, and I cannot get rid of it. I don't want to use a regex, because there may be other hidden characters with different code points.

This is what I have:

StringBuilder content = new StringBuilder();
try {
    BufferedReader br = new BufferedReader(new FileReader("src/test/resources/getStuff.json"));
    String currentLine;
    while ((currentLine = br.readLine()) != null) {
        content.append(currentLine);
    }
    br.close();
} catch (Exception e) {
    Assert.fail();
}

And this is the start of the JSON file (it's too long to copy and paste the whole thing, but I have confirmed it is valid):

{"result":{"data":{"request":{"year":null,"timestamp":1413398641246,...

Here's what I've tried so far:

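  • Copying the JSON file into Notepad++ and showing all characters
  • Copying the file into Notepad++ and converting it to UTF-8 without BOM, and to ISO 8859-1
  • Opening the JSON file in other text editors (such as Sublime) and saving it as UTF-8
  • Copying the JSON file to a txt file and reading that in instead
  • Using Scanner instead of BufferedReader
  • In IntelliJ, trying View -> Active Editor -> Show Whitespaces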

How can I read this file in without having the zero width no-break space character at the beginning of the string?

Recommended answer

0xEF 0xBB 0xBF is the UTF-8 BOM, 0xFE 0xFF is the UTF-16BE BOM, and 0xFF 0xFE is the UTF-16LE BOM. If 0xFEFF exists at the front of your String, it means you created a UTF-encoded text file with a BOM. A UTF-16 BOM could appear as-is as 0xFEFF, whereas a UTF-8 BOM would only appear as 0xFEFF if the BOM itself were being decoded from UTF-8 to UTF-16 (meaning the reader detected the BOM but did not skip it). In fact, it is known that Java does not handle UTF-8 BOMs (see bugs JDK-4508058 and JDK-6378911).
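To make the UTF-8 case concrete, here is a minimal sketch of how the three BOM bytes turn into a single U+FEFF character when decoded. The class name BomDecodeDemo and the helper decodeUtf8 are made up for illustration; the sample bytes stand in for the start of a file saved with a BOM.

```java
import java.nio.charset.StandardCharsets;

class BomDecodeDemo {
    // Decodes raw bytes as UTF-8. The JDK's UTF-8 decoder does not strip
    // a leading BOM, so EF BB BF comes out as the single char U+FEFF.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical file content: UTF-8 BOM followed by "{}".
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '{', '}'};
        String s = decodeUtf8(withBom);
        System.out.println(s.length());                       // 3
        System.out.println(Integer.toHexString(s.charAt(0))); // feff
    }
}
```

The string has three chars rather than two, and the invisible first one is what trips up the JSON parser.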

If you read the FileReader documentation, it says:

The constructors of this class assume that the default character encoding and the default byte-buffer size are appropriate. To specify these values yourself, construct an InputStreamReader on a FileInputStream.
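Following that advice, a minimal sketch using only the JDK might look like the following. It assumes the file is UTF-8 (with or without a BOM) and drops a leading U+FEFF by a plain character comparison rather than a regex. The class name ExplicitCharsetRead and the method readAll are hypothetical names introduced for this example.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

class ExplicitCharsetRead {
    // Reads the whole file as UTF-8 (an assumption about the file's encoding)
    // and removes a leading U+FEFF if the decoder passed one through.
    static String readAll(String path) throws IOException {
        StringBuilder content = new StringBuilder();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream(path), StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                content.append(line); // line terminators are dropped, as in the question's code
            }
        }
        if (content.length() > 0 && content.charAt(0) == '\uFEFF') {
            content.deleteCharAt(0);
        }
        return content.toString();
    }
}
```

This only covers the UTF-8 case. A UTF-16 file would need a different charset passed to the InputStreamReader.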

You need to read the file content using a reader that recognizes charsets, preferably one that will read the BOM for you and adjust itself internally as needed. In the worst case, you could open the file yourself, read the first few bytes to detect whether a BOM is present, and then construct a reader using the appropriate charset to read the rest of the file. Here is an example using org.apache.commons.io.input.BOMInputStream that does exactly that:

(from https://stackoverflow.com/a/13988345/65863)

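// Requires Apache Commons IO: org.apache.commons.io.ByteOrderMark and
// org.apache.commons.io.input.BOMInputStream.
String defaultEncoding = "UTF-8";
InputStream inputStream = new FileInputStream(someFileWithPossibleUtf8Bom);
try {
    // With no BOM types specified, BOMInputStream detects (and strips) only the UTF-8 BOM.
    BOMInputStream bOMInputStream = new BOMInputStream(inputStream);
    ByteOrderMark bom = bOMInputStream.getBOM();
    // Fall back to the default encoding when the file has no BOM.
    String charsetName = bom == null ? defaultEncoding : bom.getCharsetName();
    InputStreamReader reader = new InputStreamReader(new BufferedInputStream(bOMInputStream), charsetName);
    // use reader
} finally {
    inputStream.close();
}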
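If adding Commons IO is not an option, the "open the file yourself and read the first few bytes" fallback can be sketched with the JDK alone. This is a rough illustration, not a complete detector: the class name ManualBomReader, the method open, and the fallback parameter are made up for this example, and only the three BOMs listed in the answer are recognized (UTF-32 BOMs are not).

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ManualBomReader {
    // Peeks at up to three bytes, picks a charset from the BOM if one is
    // present, and pushes back every byte that was not part of the BOM.
    static Reader open(InputStream raw, Charset fallback) throws IOException {
        PushbackInputStream in = new PushbackInputStream(raw, 3);
        byte[] head = new byte[3];
        int n = 0;
        while (n < head.length) {
            int r = in.read(head, n, head.length - n);
            if (r < 0) break; // stream is shorter than three bytes
            n += r;
        }
        Charset charset = fallback;
        int bomLength = 0;
        if (n >= 3 && (head[0] & 0xFF) == 0xEF && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
            charset = StandardCharsets.UTF_8;
            bomLength = 3;
        } else if (n >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            charset = StandardCharsets.UTF_16BE;
            bomLength = 2;
        } else if (n >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            charset = StandardCharsets.UTF_16LE;
            bomLength = 2;
        }
        if (n > bomLength) {
            in.unread(head, bomLength, n - bomLength); // keep the non-BOM bytes
        }
        return new InputStreamReader(in, charset);
    }
}
```

A caller would wrap the result in a BufferedReader as in the question, for example new BufferedReader(ManualBomReader.open(new FileInputStream(path), StandardCharsets.UTF_8)), and close it when done.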
