How to convert a C string (char array) into a Python string when there are non-ASCII characters in the string?

Question

I have embedded a Python interpreter in a C program. Suppose the C program reads some bytes from a file into a char array and learns (somehow) that the bytes represent text with a certain encoding (e.g., ISO 8859-1, Windows-1252, or UTF-8). How do I decode the contents of this char array into a Python string?

The Python string should in general be of type unicode; for instance, a 0x93 in Windows-1252 encoded input becomes u'\u201c'.

I have attempted to use PyString_Decode, but it always fails when there are non-ASCII characters in the string. Here is an example that fails:

#include <Python.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
     char c_string[] = { (char)0x93, 0 };
     PyObject *py_string;

     Py_Initialize();

     py_string = PyString_Decode(c_string, 1, "windows_1252", "replace");
     if (!py_string) {
          PyErr_Print();
          return 1;
     }
     return 0;
}

The error message is UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 0: ordinal not in range(128), which indicates that the ascii encoding is used even though we specify windows_1252 in the call to PyString_Decode.
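The two stages hidden in that error message can be illustrated without the Python API at all. The following is a minimal standalone sketch, not CPython's implementation: the helper names cp1252_to_codepoint and ascii_can_encode are hypothetical, and the table covers only the 0x80..0x9F range where Windows-1252 differs from ISO 8859-1. It shows that decoding 0x93 succeeds and yields U+201C, and that it is the later ASCII encoding step that cannot represent the result.

```c
/* Hypothetical helper: map one Windows-1252 byte to a Unicode code point.
   Returns -1 for the five byte values that Windows-1252 leaves unassigned. */
long cp1252_to_codepoint(unsigned char byte)
{
    /* Bytes 0x80..0x9F are the only ones that differ from ISO 8859-1. */
    static const long high[32] = {
        0x20AC, -1,     0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, -1,     0x017D, -1,
        -1,     0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, -1,     0x017E, 0x0178
    };

    if (byte >= 0x80 && byte <= 0x9F)
        return high[byte - 0x80];
    /* Everything else maps to the code point with the same numeric value. */
    return byte;
}

/* Hypothetical helper: an ASCII codec can only represent code points 0..127. */
int ascii_can_encode(long codepoint)
{
    return codepoint >= 0 && codepoint < 128;
}
```

Here cp1252_to_codepoint(0x93) gives 0x201C, and ascii_can_encode(0x201C) returns 0. This mirrors the message: the character named in the error is already the decoded u'\u201c', so the failure happens when something tries to encode it back to ASCII.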

The following code works around the problem by using PyString_FromString to create a Python string of the undecoded bytes, then calling its decode method:

#include <Python.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
     char c_string[] = { (char)0x93, 0 };
     PyObject *raw, *decoded;

     Py_Initialize();

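     /* Wrap the undecoded bytes in a Python str object. */
     raw = PyString_FromString(c_string);
     if (!raw) {
          PyErr_Print();
          return 1;
     }
     printf("Undecoded: ");
     PyObject_Print(raw, stdout, 0);
     printf("\n");
     /* Equivalent to raw.decode("windows_1252") in Python. */
     decoded = PyObject_CallMethod(raw, "decode", "s", "windows_1252");
     Py_DECREF(raw);
     if (!decoded) {
          PyErr_Print();
          return 1;
     }
     printf("Decoded: ");
     PyObject_Print(decoded, stdout, 0);
     printf("\n");
     Py_DECREF(decoded);
     return 0;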
}


Recommended answer

PyString_Decode does this:

PyObject *PyString_Decode(const char *s,
              Py_ssize_t size,
              const char *encoding,
              const char *errors)
{
    PyObject *v, *str;

    str = PyString_FromStringAndSize(s, size);
    if (str == NULL)
        return NULL;
    v = PyString_AsDecodedString(str, encoding, errors);
    Py_DECREF(str);
    return v;
}

In other words, it does basically what you're doing in your second example: it converts to a string, then decodes the string. The problem here arises from PyString_AsDecodedString, rather than PyString_AsDecodedObject. PyString_AsDecodedString calls PyString_AsDecodedObject, but then tries to convert the resulting unicode object into a string object with the default encoding (for you, it looks like that's ASCII). That's where it fails.

I believe you'll need to do two calls, but you can use PyString_AsDecodedObject rather than calling the Python "decode" method. Something like:

#include <Python.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
     char c_string[] = { (char)0x93, 0 };
     PyObject *py_string, *py_unicode;

     Py_Initialize();

     py_string = PyString_FromStringAndSize(c_string, 1);
     if (!py_string) {
          PyErr_Print();
          return 1;
     }
     py_unicode = PyString_AsDecodedObject(py_string, "windows_1252", "replace");
     Py_DECREF(py_string);
     if (!py_unicode) {
          PyErr_Print();
          return 1;
     }
     Py_DECREF(py_unicode);

     return 0;
}



I'm not entirely sure what the reasoning behind PyString_Decode working this way is. A very old thread on python-dev seems to indicate that it has something to do with chaining the output, but since the Python methods don't do the same, I'm not sure if that's still relevant.
