Convert a file from Cp1252 to UTF-8 in Java
Question
A user uploads a file with the character encoding Cp1252.
Since my MySQL table columns use the utf8_bin collation, I try to convert the file to UTF-8 before loading the data into the table with the LOAD DATA INFILE command.
Java source code:
OutputStream output = new FileOutputStream(destpath);
InputStream input = new FileInputStream(filepath);
BufferedReader reader = new BufferedReader(new InputStreamReader(input, "windows-1252"));
BufferedWriter writ = new BufferedWriter(new OutputStreamWriter(output, "UTF8"));
String in;
while ((in = reader.readLine()) != null) {
    writ.write(in);
    writ.newLine();
}
writ.flush();
writ.close();
It seems that the characters are not converted correctly. The converted Unicode file has � and box symbols in multiple places. How can I convert the file to UTF-8 efficiently? Thanks.
Recommended answer
One way of verifying the conversion process is to configure the charset decoder and encoder to bail out on errors instead of silently replacing the offending characters with substitute characters:
import java.io.*;
import java.nio.charset.*;

// Report bad input instead of silently substituting a replacement character
CharsetDecoder inDec = Charset.forName("windows-1252").newDecoder()
    .onMalformedInput(CodingErrorAction.REPORT)
    .onUnmappableCharacter(CodingErrorAction.REPORT);

CharsetEncoder outEnc = StandardCharsets.UTF_8.newEncoder()
    .onMalformedInput(CodingErrorAction.REPORT)
    .onUnmappableCharacter(CodingErrorAction.REPORT);

// try-with-resources closes (and thereby flushes) all four streams
try (FileInputStream is = new FileInputStream(filepath);
     BufferedReader reader = new BufferedReader(new InputStreamReader(is, inDec));
     FileOutputStream fw = new FileOutputStream(destpath);
     BufferedWriter out = new BufferedWriter(new OutputStreamWriter(fw, outEnc))) {
    for (String in; (in = reader.readLine()) != null; ) {
        out.write(in);
        out.newLine();
    }
}
Note that the output encoder is configured the same way only for symmetry. UTF-8 is capable of encoding every Unicode character, so the encoder will not report unmappable characters here. Keeping the configuration symmetric will help once you want to use the same code for other conversions.
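To illustrate why the symmetric configuration pays off, here is a minimal sketch of the same idea wrapped in a reusable method. The class name StrictTranscoder and the method convert are hypothetical names chosen for this example; they are not part of the original answer. The sketch also copies character buffers instead of lines, which is a deliberate change from the code above.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration; the names are assumptions, not from the answer.
final class StrictTranscoder {

    /**
     * Copies source to target, re-encoding from one charset to another.
     * Throws a CharacterCodingException (an IOException) on malformed or
     * unmappable data instead of substituting replacement characters.
     */
    static void convert(Path source, Charset from, Path target, Charset to) throws IOException {
        CharsetDecoder decoder = from.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        CharsetEncoder encoder = to.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);

        try (BufferedReader reader = new BufferedReader(
                     new InputStreamReader(Files.newInputStream(source), decoder));
             BufferedWriter writer = new BufferedWriter(
                     new OutputStreamWriter(Files.newOutputStream(target), encoder))) {
            // Copy raw character buffers so the original line terminators are preserved.
            char[] buffer = new char[8192];
            for (int n; (n = reader.read(buffer)) != -1; ) {
                writer.write(buffer, 0, n);
            }
        }
    }
}
```

With this helper, the conversion from the question would be a call such as StrictTranscoder.convert(in, Charset.forName("windows-1252"), out, StandardCharsets.UTF_8). If the target were a narrower charset such as US-ASCII, the strict encoder settings would start to matter: any character it cannot represent would raise an exception rather than be replaced. Copying buffers rather than lines keeps the file's line terminators unchanged, which may matter for the line terminator that LOAD DATA INFILE expects.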
Further, note that this won't help if the input file is actually in a different encoding but misinterpreting its bytes still yields valid characters. One thing to consider is whether the input encoding "windows-1252" was actually meant to be the system's default encoding (and whether the two are really the same). If in doubt, you may use Charset.defaultCharset() instead of Charset.forName("windows-1252") when the intended conversion is really default → UTF-8.
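The limitation described above can be shown with a small sketch. The class EncodingCheck and its methods are hypothetical names introduced for this example. The isValidUtf8 check is an added heuristic, not something the original answer proposes: a file that contains non-ASCII bytes and still decodes as strict UTF-8 is probably UTF-8 already, but this is an indication, not proof.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; the names are assumptions, not from the answer.
final class EncodingCheck {

    /** Decodes strictly: throws on malformed or unmappable bytes instead of substituting U+FFFD. */
    static String decodeStrict(byte[] bytes, Charset charset) throws CharacterCodingException {
        return charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    /** Heuristic (an assumption of this sketch): true if the bytes decode as strict UTF-8. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            decodeStrict(bytes, StandardCharsets.UTF_8);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

For instance, the UTF-8 bytes of "café" end in C3 A9. Passing them to decodeStrict with windows-1252 raises no exception, because both bytes are valid windows-1252 characters (U+00C3 and U+00A9), so the result is silently wrong text. In the other direction, the windows-1252 bytes of the same word end in a lone E9, which is not valid UTF-8, so isValidUtf8 returns false for them. Running such a check on the uploaded file before converting may help to spot a wrongly declared input encoding.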