Convert file from Cp1252 to UTF-8 in Java


Question

A user uploads a file with the character encoding Cp1252.

Since my MySQL table columns use the collation utf8_bin, I try to convert the file to UTF-8 before loading the data into the table with the LOAD DATA INFILE command.

Java source code:

OutputStream output = new FileOutputStream(destpath);
InputStream input = new FileInputStream(filepath);
BufferedReader reader = new BufferedReader(new InputStreamReader(input, "windows-1252"));
BufferedWriter writ = new BufferedWriter(new OutputStreamWriter(output, "UTF8"));
String in;
while ((in = reader.readLine()) != null) {
    writ.write(in);
    writ.newLine();
}
writ.flush();
writ.close();

It seems that the characters are not converted correctly. The converted file has � and box symbols in multiple places. How can I convert the file to UTF-8 efficiently? Thanks.

Recommended answer

One way of verifying the conversion process is to configure the charset decoder and encoder to bail out on errors instead of silently replacing the erroneous characters with special characters:

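import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Decoder that throws on undecodable input instead of substituting a replacement character
CharsetDecoder inDec = Charset.forName("windows-1252").newDecoder()
    .onMalformedInput(CodingErrorAction.REPORT)
    .onUnmappableCharacter(CodingErrorAction.REPORT);

// Encoder configured the same way for the output side
CharsetEncoder outEnc = StandardCharsets.UTF_8.newEncoder()
    .onMalformedInput(CodingErrorAction.REPORT)
    .onUnmappableCharacter(CodingErrorAction.REPORT);

// try-with-resources closes (and thereby flushes) all four streams, even on failure
try (FileInputStream is = new FileInputStream(filepath);
     BufferedReader reader = new BufferedReader(new InputStreamReader(is, inDec));
     FileOutputStream fw = new FileOutputStream(destpath);
     BufferedWriter out = new BufferedWriter(new OutputStreamWriter(fw, outEnc))) {

    for (String in; (in = reader.readLine()) != null; ) {
        out.write(in);
        out.newLine();
    }
}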
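To make the difference between the two error modes concrete, here is a minimal sketch that decodes a byte array both ways. The class name DecodeModes and its two methods are made up for this illustration; only the java.nio.charset calls come from the JDK.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper, named only for this illustration.
final class DecodeModes {

    /** Lenient: undecodable bytes are silently replaced (typically with U+FFFD). */
    static String lenient(byte[] bytes, Charset cs) {
        // The String constructor always substitutes the charset's replacement string.
        return new String(bytes, cs);
    }

    /** Strict: the first undecodable byte raises a CharacterCodingException. */
    static String strict(byte[] bytes, Charset cs) throws CharacterCodingException {
        return cs.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }
}
```

For well-formed input both methods return the same string. For broken input, lenient() is the likely source of � in the output, while strict() fails fast. Bytes such as 0x81, which windows-1252 leaves undefined, would be expected to trigger the exception in the strict variant, but that depends on the JDK's mapping table and is worth checking against your own data.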

Note that the output encoder is configured the same way only for symmetry: UTF-8 is capable of encoding every Unicode character, so the encoder should never encounter an unmappable character here. Keeping the configuration symmetric will help, however, once you want to use the same code for other conversions.
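As a rough sketch of what reusing the same code for other conversions could look like, the method below takes the source and target charsets as parameters. The class name Transcoder and the method convert are hypothetical. Unlike the loop above, it copies raw characters instead of lines, so the original line terminators are kept. That is a design choice of this sketch, not something the answer prescribes.

```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, named only for this illustration.
final class Transcoder {

    /** Re-encodes source into target, failing on any character that cannot be converted. */
    static void convert(Path source, Path target, Charset from, Charset to) throws IOException {
        try (Reader in = new InputStreamReader(Files.newInputStream(source),
                     from.newDecoder()
                         .onMalformedInput(CodingErrorAction.REPORT)
                         .onUnmappableCharacter(CodingErrorAction.REPORT));
             Writer out = new OutputStreamWriter(Files.newOutputStream(target),
                     to.newEncoder()
                       .onMalformedInput(CodingErrorAction.REPORT)
                       .onUnmappableCharacter(CodingErrorAction.REPORT))) {
            // Copy in chunks; line terminators pass through unchanged.
            char[] buffer = new char[8192];
            for (int n; (n = in.read(buffer)) != -1; ) {
                out.write(buffer, 0, n);
            }
        }
    }
}
```

With a target charset narrower than UTF-8 (ISO-8859-1, for instance), the strict encoder matters: a character that has no representation in the target makes convert() throw instead of writing a substitute.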

Further, note that this won't help if the input file is in a different encoding but misinterpreting its bytes still yields valid characters. One thing to consider is whether the input encoding "windows-1252" was actually meant to be the system's default encoding (and whether the two are really the same). If in doubt, you may use Charset.defaultCharset() instead of Charset.forName("windows-1252") when the intended conversion is really default → UTF-8. Be aware that since Java 18 the default charset is UTF-8 unless it is overridden, so on newer runtimes Charset.defaultCharset() may no longer reflect the platform's native encoding.
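The blind spot described above can be reproduced in a few lines. The sketch below is only an illustration, with the made-up names EncodingProbe and decodesCleanly. It shows that the UTF-8 bytes of "é" pass a strict windows-1252 decode without complaint and come out as "Ã©". Running the same strict check against UTF-8 first is one possible heuristic for spotting a file that is already UTF-8. It is not a reliable encoding detector.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper, named only for this illustration.
final class EncodingProbe {

    /** Returns true if the bytes decode under cs without any reported error. */
    static boolean decodesCleanly(byte[] bytes, Charset cs) {
        try {
            cs.newDecoder()
              .onMalformedInput(CodingErrorAction.REPORT)
              .onUnmappableCharacter(CodingErrorAction.REPORT)
              .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Decodes the bytes as windows-1252, whatever encoding they were really written in. */
    static String readAsCp1252(byte[] bytes) {
        return new String(bytes, Charset.forName("windows-1252"));
    }
}
```

A file that decodes cleanly as UTF-8 and contains non-ASCII bytes is more likely to be UTF-8 than Cp1252, because typical Cp1252 text with accented letters rarely forms valid UTF-8 sequences. Pure ASCII files pass both checks and need no conversion either way.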
