I have UTF-8 - but still get "Invalid byte 1 of 1-byte UTF-8 sequence"


Problem description

I create an XML string on the fly (NOT reading from a file). Then I use Cocoon 3 to transform it via FOP into a PDF. Somewhere in the middle Xerces runs. When I use hardcoded content, everything works. As soon as I put a German umlaut into the database and enrich my XML with that data, I get:

Caused by: org.apache.cocoon.pipeline.ProcessingException: Can't parse the XML string.
    at org.apache.cocoon.sax.component.XMLGenerator$StringGenerator.execute(XMLGenerator.java:326)
    at org.apache.cocoon.sax.component.XMLGenerator.execute(XMLGenerator.java:104)
    at org.apache.cocoon.pipeline.AbstractPipeline.invokeStarter(AbstractPipeline.java:146)
    at org.apache.cocoon.pipeline.AbstractPipeline.execute(AbstractPipeline.java:76)
    at de.grobmeier.tab.webapp.modules.documents.InvoicePipeline.generateInvoice(InvoicePipeline.java:74)
    ... 87 more

Caused by: com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence.
    at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.invalidByte(UTF8Reader.java:684)
    at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(UTF8Reader.java:554)

I then debugged my app and found that my "Ä" (which comes from the database) has the byte value 196, which is C4 in hex. This is what I expected according to this table: http://www.utf8-zeichentabelle.de/

I do not know why my code fails.

I then tried to add a BOM manually, like this:

byte[] bom = new byte[3];
bom[0] = (byte) 0xEF;
bom[1] = (byte) 0xBB;
bom[2] = (byte) 0xBF;
String myString = new String(bom) + inputString;

I know this is not exactly good, but I tried it anyway, and of course it failed. I then tried adding an XML header in front:

<?xml version="1.0" encoding="UTF-8"?>

That failed too. Then I combined the two, and that failed as well.

Finally I tried something like this:

xmlInput = new String(xmlInput.getBytes("UTF8"), "UTF8");

That in fact does nothing, because it is already UTF-8. It still fails.

So... any ideas what I am doing wrong, and what Xerces is expecting from me?

Thanks,
Christian

Recommended answer

If your database contains only a single byte (with value 0xC4) for that character, then you aren't using UTF-8 encoding.

The character "LATIN CAPITAL LETTER A WITH DIAERESIS" has the code point U+00C4, but UTF-8 can't encode that in a single byte. If you check the third column, "UTF-8 (hex.)", on UTF8-zeichentabelle.de, you'll see that UTF-8 encodes it as 0xC3 0x84 (two bytes).
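To make the difference concrete, here is a minimal sketch that prints the bytes Java produces for "Ä" under two encodings. The class name UmlautBytes and the helper hex are made up for this illustration; only String.getBytes(Charset) and StandardCharsets come from the JDK.

```java
import java.nio.charset.StandardCharsets;

public class UmlautBytes {
    // Formats bytes as space-separated uppercase hex, e.g. "C3 84".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u00C4"; // LATIN CAPITAL LETTER A WITH DIAERESIS
        // One byte in Latin-1: this is the 0xC4 (196) seen in the debugger.
        System.out.println("ISO-8859-1: " + hex(s.getBytes(StandardCharsets.ISO_8859_1))); // C4
        // Two bytes in UTF-8, matching the "UTF-8 (hex.)" column of the table.
        System.out.println("UTF-8:      " + hex(s.getBytes(StandardCharsets.UTF_8)));      // C3 84
    }
}
```

A lone 0xC4 byte is therefore a sign of a Latin-1 style encoding (or of looking at the code point rather than the encoded bytes), not of UTF-8.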

Please read Joel's article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" for more info.

EDIT: Christian found the answer himself; it turned out to be a problem in the Cocoon 3 SAX component (I guess it's the alpha 3 version). If you pass XML as a String into the XMLGenerator class, something goes wrong during SAX parsing, causing this mess.

I looked up the code to find the actual problem in Cocoon-stax:

if (XMLGenerator.this.logger.isDebugEnabled()) {
    XMLGenerator.this.logger.debug("Using a string to produce SAX events.");
}
// getBytes() without an argument encodes with the platform default charset
XMLUtils.toSax(new ByteArrayInputStream(this.xmlString.getBytes()), XMLGenerator.this.getSAXConsumer());

As you can see, the call to getBytes() creates a byte array using the JRE's default encoding, which then fails to parse. This is because the XML declares itself to be UTF-8, whereas the String has been turned back into bytes using the default encoding, which is likely your Windows code page.
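The mismatch can be reproduced outside Cocoon with the plain JAXP SAX parser. The following is only a sketch: the class name CharsetMismatchDemo, the helper parses and the sample document are invented for illustration, and ISO-8859-1 stands in for whatever non-UTF-8 default charset the JRE happens to use (since Java 18 the default is UTF-8 unless overridden, so a bare getBytes() may not fail on a modern JRE).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class CharsetMismatchDemo {
    // Hypothetical sample document: declares UTF-8 and contains an umlaut.
    static final String XML =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?><name>\u00C4rger</name>";

    // Encodes XML with the given charset and reports whether the parser accepts the bytes.
    static boolean parses(Charset charset) {
        byte[] bytes = XML.getBytes(charset);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(bytes), new DefaultHandler());
            return true;
        } catch (IOException | SAXException e) {
            // Malformed byte sequences surface as one of these, depending on the parser.
            return false;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Bytes do not match the declared encoding: the parser rejects them.
        System.out.println("ISO-8859-1 bytes parse: " + parses(StandardCharsets.ISO_8859_1));
        // Bytes match the declaration: the parser accepts them.
        System.out.println("UTF-8 bytes parse:      " + parses(StandardCharsets.UTF_8));
    }
}
```

The String itself is identical in both runs; only the charset used to turn it into bytes differs, which is why hardcoded ASCII-only input never showed the problem.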

As a workaround, you can use the following:

new org.apache.cocoon.sax.component.XMLGenerator(xmlInput.getBytes("UTF-8"),
       "UTF-8");

This triggers the right internal actions (as Christian found out by experimenting with the API).
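Where the parser is under your own control, another option is to skip the byte round trip entirely and hand the parser a character stream. This is a sketch using plain JAXP, not the Cocoon API; the class name ParseFromString and the helper textOf are assumptions made for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class ParseFromString {
    // Parses XML held in a String and returns its concatenated character data.
    static String textOf(String xml) {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        try {
            // A Reader delivers already-decoded characters, so no charset is involved.
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (IOException | SAXException | ParserConfigurationException e) {
            throw new IllegalStateException("Could not parse XML string", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><name>\u00C4rger</name>";
        System.out.println(textOf(xml).equals("\u00C4rger")); // true
    }
}
```

With a character stream the parser has no bytes to decode, so the default charset of the JRE no longer matters.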

I've opened an issue in Apache's bug tracker.

EDIT 2: The issue has been fixed, and the fix will be included in an upcoming release.
