Java: Detect non-displayable chars for a given Character Encoding


Question

I'm currently working on an application that validates and parses CSV files. The CSV files have to be encoded in UTF-8, although sometimes we receive files in the wrong encoding. The files most likely contain the special characters of the German alphabet (Ä, Ö, Ü, ß), as most of the text in them is in German.

For the validator part, I need to make sure that the file is UTF-8 encoded. As long as no special characters are present, parsing is unlikely to be a problem.

What I have tried so far is to read the file as bytes and use some libraries to detect (or guess) the encoding. I tried most of the options from this blog post: http://fredeaker.blogspot.com/2007/01/character-encoding-detection.html

But none of the libraries I tried returned the correct encoding, and therefore I couldn't parse the special characters.

Now to my question: for a given character encoding such as UTF-8, is there a way to detect characters that are not encoded correctly? Basically, these are the characters that are displayed in the (Eclipse) console as question marks.

Or is there any other way to determine the character encoding correctly? I just need to know whether it's UTF-8 or not.

Thank you all in advance for your help! :)

Best regards, Robert

Recommended answer

Byte sequences that cannot be decoded correctly are replaced with the "replacement character", \uFFFD, which is displayed like this: �. However, if the output device doesn't support that character, it is likely to show a question mark (?) instead.

So, after decoding the UTF-8 data into String objects, search for occurrences of \uFFFD.
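A minimal sketch of that check, assuming the whole file has already been read into a byte array. The class and method names here are hypothetical and only serve the illustration. Note one limitation: a file that legitimately contains U+FFFD as a correctly encoded character would also be flagged.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of any library.
final class ReplacementCharCheck {

    /**
     * Decodes the bytes as UTF-8 and reports whether the replacement
     * character U+FFFD appears in the decoded text. The String constructor
     * substitutes U+FFFD for byte sequences it cannot decode.
     */
    static boolean containsReplacementChar(byte[] data) {
        String text = new String(data, StandardCharsets.UTF_8);
        return text.indexOf('\uFFFD') >= 0;
    }
}
```

For example, the word "Äpfel" encoded as ISO-8859-1 starts with the byte 0xC4, which is not valid UTF-8 when followed by an ASCII letter, so the check should report a replacement character for those bytes and none for the same word encoded as UTF-8.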

Alternatively, if you set up an InputStreamReader with an instance of CharsetDecoder that you create yourself, you get a lot more control. For example, you can specify that an exception should be raised if any byte sequence cannot be decoded. Or you can ignore such sequences. Or you can specify a different character as the replacement character.
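The strict variant could look roughly like the sketch below. It configures a decoder with CodingErrorAction.REPORT so that undecodable input raises a CharacterCodingException instead of being silently replaced. The class and method names are hypothetical; the other two behaviours mentioned above correspond to CodingErrorAction.IGNORE and to CodingErrorAction.REPLACE combined with CharsetDecoder.replaceWith.

```java
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of any library.
final class StrictUtf8 {

    /** Creates a UTF-8 decoder that reports errors instead of replacing them. */
    private static CharsetDecoder newStrictDecoder() {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
    }

    /**
     * Wraps a stream in a reader whose read calls throw a
     * CharacterCodingException (an IOException) on invalid UTF-8.
     */
    static Reader strictReader(InputStream in) {
        return new InputStreamReader(in, newStrictDecoder());
    }

    /** Returns true only if the complete byte array is well-formed UTF-8. */
    static boolean isValidUtf8(byte[] data) {
        try {
            newStrictDecoder().decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

For the validator described in the question, isValidUtf8 answers the "is it UTF-8 or not" part directly, while strictReader lets the CSV parser fail at the first bad byte sequence while reading. Keep in mind that passing this check only means the bytes are well-formed UTF-8; pure ASCII files pass as well.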
