How to parse a string that is in a different encoding from Java


Question

I have a string that I read in from a Word document. I think it is in "Cp1252" encoding. Java uses UTF8.

How do I search that string for those special Cp1252 characters and replace them with an appropriate UTF8 character?

Specifically, I want to replace the "En Dash" character with a plain "-".

The following code block takes projDateString, which comes from the Word document, and tries to do this:

    byte[] test = projDateString.getBytes("Cp1252");
    for (int i = 0; i < test.length; i++) {
        System.out.println("test[" + i + "] = " + Integer.toHexString(test[i]));
    }
    String projDateString2 = new String(test);
    projDateString2.replaceAll("\0x96", "\u2013");
    System.out.println("projDateString2: " + projDateString);

I am not sure I am setting up projDateString2 correctly. As you can see, the hex value of that dash is ffffff96 when I call getBytes on the string using Cp1252 encoding. If I call getBytes with UTF8, it comes in as 3 hex values instead of one.

This gives me the following output:

test[0] = 30
test[1] = 38
test[2] = 2f
test[3] = 32
test[4] = 30
test[5] = 31
test[6] = 30
test[7] = 20
test[8] = ffffff96
test[9] = 20
test[10] = 50
test[11] = 72
test[12] = 65
test[13] = 73
test[14] = 65
test[15] = 6e
test[16] = 74
projDateString2: 08/2010 ΓÇô Present


Recommended answer

You shouldn't be using String.getBytes() or new String(byte[]) without specifying the encoding - that's the problem. Those always use the platform default encoding - which is almost always the wrong choice.
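
To illustrate what an explicit encoding changes, here is a minimal sketch. The class name ExplicitCharsetDemo and the helper toHex are made up for this example. It encodes an en dash (U+2013) once as windows-1252, which Java also accepts under the alias Cp1252, and once as UTF-8. Each byte is masked with 0xff before formatting, which is why it prints 96 rather than the sign-extended ffffff96 seen in the question.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetDemo {
    // Formats each byte as two hex digits; "& 0xff" prevents sign extension
    // (a negative byte such as (byte) 0x96 would otherwise print as ffffff96).
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String dash = "\u2013"; // EN DASH
        Charset cp1252 = Charset.forName("windows-1252");

        // The encoding is always named, never left to the platform default.
        System.out.println(toHex(dash.getBytes(cp1252)));                 // 96
        System.out.println(toHex(dash.getBytes(StandardCharsets.UTF_8))); // e2 80 93
    }
}
```

The same character is one byte in Cp1252 and three bytes in UTF-8, which matches the "3 hex values instead of one" described in the question.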

You say you "have a string that I have read in from a Word document" - how did you read it in? How did it start off life?

If you have the bytes and you know the relevant encoding, you should use:

    String text = new String(bytes, encoding);

You should never have to deal with a string which has been created using the wrong encoding - if you get to that stage, you're almost bound to be risking information loss. Tackle the problem as early as you possibly can, rather than trying to fix the data up later on.
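
As a rough sketch of decoding at the earliest point, assume the raw bytes really are Cp1252. That is only the asker's guess, and the class name DecodeCp1252 and the method decode are invented here. The byte values are the ones listed in the question's output.

```java
import java.nio.charset.Charset;

public class DecodeCp1252 {
    // Decodes raw bytes that are assumed (not known) to be windows-1252 / Cp1252.
    static String decode(byte[] raw) {
        return new String(raw, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // Byte values copied from the hex dump in the question.
        byte[] raw = {
            0x30, 0x38, 0x2f, 0x32, 0x30, 0x31, 0x30, 0x20,
            (byte) 0x96, 0x20,
            0x50, 0x72, 0x65, 0x73, 0x65, 0x6e, 0x74
        };

        String text = decode(raw);

        // Byte 0x96 in windows-1252 maps to U+2013 (EN DASH).
        System.out.println(Integer.toHexString(text.charAt(8))); // 2013
        System.out.println(text.length());                       // 17
    }
}
```

Once the string has been decoded correctly it is ordinary Unicode text, so later code can look for '\u2013' directly and never needs to think about the byte 0x96 again.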

The next thing to understand is that the String class in Java is immutable. Calling replaceAll on a string won't change the existing string. It will instead return a new string with the replacements made.

So this statement:

    projDateString2.replaceAll("\0x96", "\u2013");

will never do what you want. Even if everything else is correct, you should be using:

    projDateString2 = projDateString2.replaceAll("\0x96", "\u2013");

(or something similar). I don't think that actually will do what you want anyway, but you need to be aware of it for when everything else is sorted out.
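
For what it's worth, here is a small sketch of the replacement the question asks for, assuming the string has already been decoded correctly so the dash is U+2013. The class name ReplaceDash and the method normalizeDash are hypothetical. One likely reason the original pattern cannot match: in a Java string literal, "\0x96" is the octal escape \0 (a NUL character) followed by the literal text x96, not the single character 0x96.

```java
public class ReplaceDash {
    // Returns a new string with every EN DASH (U+2013) replaced by a plain hyphen.
    // String.replace(char, char) does a literal replacement, so no regex is involved.
    static String normalizeDash(String s) {
        return s.replace('\u2013', '-');
    }

    public static void main(String[] args) {
        String projDateString = "08/2010 \u2013 Present";

        // Strings are immutable: the result of this call is discarded,
        // so projDateString still contains the en dash afterwards.
        projDateString.replace('\u2013', '-');
        System.out.println(projDateString.indexOf('\u2013')); // 8

        // Assigning the returned string is what actually keeps the change.
        projDateString = normalizeDash(projDateString);
        System.out.println(projDateString); // 08/2010 - Present
    }
}
```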
