Java modified UTF-8 strings in Python

Question

I am interfacing with a Java application from Python, and I need to be able to construct byte sequences that contain UTF-8 strings. Java's DataInputStream.readUTF() uses a modified UTF-8 encoding, which Python does not support (not yet, at least).

Can anybody point me in the right direction for constructing Java modified UTF-8 strings in Python?

Update #1: To see a little more about Java's modified UTF-8, check out the readUTF method of the DataInput interface on line 550 here, or here in the Java SE docs.

Update #2: I am trying to interface with a third-party JBoss web app that uses this modified UTF-8 format to read strings from POST requests by calling DataInputStream.readUTF (sorry for any confusion regarding normal Java UTF-8 string handling).

Thanks in advance.

Recommended answer

For most text you can ignore the modified UTF-8 encoding (MUTF-8) and just treat it as UTF-8. On the Python side, you can handle it like this:


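  1. Convert the string to normal UTF-8 and store the bytes in a buffer.

  2. Write the 2-byte buffer length (the byte count, not the string length) as binary in big-endian order.

  3. Write the whole buffer.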

I've done this in PHP and Java didn't complain about my encoding at all (at least in Java 5).

MUTF-8 is mainly used for JNI and other systems with null-terminated strings. It differs from normal UTF-8 in two ways. First, U+0000 is encoded differently: normal UTF-8 uses 1 byte (0x00), whereas MUTF-8 uses 2 bytes (0xC0 0x80). Second, characters above U+FFFF are encoded as a surrogate pair of two 3-byte sequences rather than as one 4-byte sequence. U+0000 is a valid code point but rarely appears in real text, and DataInputStream.readUTF() doesn't enforce the 2-byte form, so it happily accepts either one. It does reject 4-byte sequences, however, so plain UTF-8 is a safe substitute only for strings with no characters outside the Basic Multilingual Plane. The 2-byte length prefix also limits the encoded string to 65535 bytes.
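If the strings might contain U+0000 or characters above U+FFFF, the differences described above can be handled explicitly. The following is a minimal sketch, not a vetted library: the names `encode_modified_utf8` and `write_utf` are made up for illustration, and the code assumes Python 3.4 or later (for `struct.iter_unpack` and the `surrogatepass` handler on UTF-16 codecs).

```python
import struct


def encode_modified_utf8(text: str) -> bytes:
    """Encode text as Java-style modified UTF-8 (payload only, no length prefix)."""
    out = bytearray()
    # Going through UTF-16 yields the same 16-bit code units a Java String holds,
    # so characters above U+FFFF arrive here as surrogate pairs.
    utf16 = text.encode("utf-16-be", "surrogatepass")
    for (unit,) in struct.iter_unpack("!H", utf16):
        if 0x0001 <= unit <= 0x007F:
            # 1 byte: ASCII except NUL
            out.append(unit)
        elif unit <= 0x07FF:
            # 2 bytes; this branch also turns U+0000 into C0 80
            out.append(0xC0 | (unit >> 6))
            out.append(0x80 | (unit & 0x3F))
        else:
            # 3 bytes; each surrogate half is encoded on its own
            out.append(0xE0 | (unit >> 12))
            out.append(0x80 | ((unit >> 6) & 0x3F))
            out.append(0x80 | (unit & 0x3F))
    return bytes(out)


def write_utf(text: str) -> bytes:
    """Return the 2-byte big-endian length prefix followed by the encoded bytes."""
    payload = encode_modified_utf8(text)
    if len(payload) > 0xFFFF:
        raise ValueError("encoded string exceeds 65535 bytes")
    return struct.pack("!H", len(payload)) + payload
```

For strings made only of non-NUL characters in the Basic Multilingual Plane, `encode_modified_utf8(text)` produces the same bytes as `text.encode('utf-8')`, which is why the simpler approach works in the common case.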

Edit: The Python code should look like this:

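import struct

def writeUTF(data, text):
    # Encode first: the length prefix counts bytes, not characters.
    utf8 = text.encode('utf-8')
    length = len(utf8)
    # 2-byte unsigned big-endian length, as DataInputStream.readUTF expects.
    data.append(struct.pack('!H', length))
    # The payload follows the length prefix unchanged.
    data.append(utf8)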
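To check the generated bytes without a running JVM, a small reader that follows the decoding rules documented for DataInput.readUTF can help. This is only a sketch under that assumption: `read_utf` is a hypothetical helper, not part of any library, and it raises `ValueError` where Java would throw `UTFDataFormatException`.

```python
import struct


def read_utf(buf: bytes, offset: int = 0) -> tuple:
    """Decode one length-prefixed string; return (text, next_offset)."""
    (length,) = struct.unpack_from("!H", buf, offset)
    pos = offset + 2
    end = pos + length
    if end > len(buf):
        raise ValueError("buffer is shorter than the declared length")
    units = []  # 16-bit code units, as in a Java String
    while pos < end:
        lead = buf[pos]
        if lead < 0x80:
            size, value = 1, lead          # 0xxxxxxx (a raw 0x00 is accepted)
        elif lead >> 5 == 0b110:
            size, value = 2, lead & 0x1F   # 110xxxxx 10xxxxxx
        elif lead >> 4 == 0b1110:
            size, value = 3, lead & 0x0F   # 1110xxxx 10xxxxxx 10xxxxxx
        else:
            # 10xxxxxx or 1111xxxx, which includes the lead byte of 4-byte UTF-8
            raise ValueError("malformed lead byte 0x%02x" % lead)
        if pos + size > end:
            raise ValueError("truncated byte group")
        for cont in buf[pos + 1:pos + size]:
            if cont & 0xC0 != 0x80:
                raise ValueError("malformed continuation byte")
            value = (value << 6) | (cont & 0x3F)
        units.append(value)
        pos += size
    # Reassemble the code units; valid surrogate pairs become one character.
    raw = struct.pack("!%dH" % len(units), *units)
    return raw.decode("utf-16-be", "surrogatepass"), end
```

Feeding it the output of writeUTF for a string containing an emoji raises `ValueError` on the 0xF0 lead byte, which illustrates the case where plain UTF-8 is not an acceptable substitute.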
