How to convert a C string (char array) into a Python string when there are non-ASCII characters in the string?


Problem description

I have embedded a Python interpreter in a C program. Suppose the C program reads some bytes from a file into a char array and learns (somehow) that the bytes represent text with a certain encoding (e.g., ISO 8859-1, Windows-1252, or UTF-8). How do I decode the contents of this char array into a Python string?

The Python string should in general be of type unicode. For instance, a 0x93 byte in Windows-1252 encoded input becomes u'\u201c'.
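To make the byte-to-character mapping concrete, here is a minimal sketch in plain C that does not use the Python API at all. The helper name `cp1252_decode` and the table `cp1252_high` are hypothetical names chosen for illustration; the table covers the 0x80..0x9F range, where Windows-1252 differs from ISO 8859-1, and substituting U+FFFD for unassigned bytes is an assumption modelled loosely on the "replace" error handler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Code points for Windows-1252 bytes 0x80..0x9F; 0 marks an unassigned byte. */
static const uint16_t cp1252_high[32] = {
    0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
    0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178
};

/* Hypothetical helper: decode n Windows-1252 bytes into Unicode code points.
 * Unassigned bytes become U+FFFD (an assumption, similar in spirit to the
 * "replace" error handler). Returns the number of code points written. */
size_t cp1252_decode(const unsigned char *in, size_t n, uint32_t *out)
{
    size_t i;

    assert(in != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        unsigned char b = in[i];
        if (b >= 0x80 && b <= 0x9F) {
            uint16_t cp = cp1252_high[b - 0x80];
            out[i] = cp ? cp : 0xFFFD;
        } else {
            /* 0x00..0x7F and 0xA0..0xFF map to the same code point. */
            out[i] = b;
        }
    }
    return n;
}
```

Passing the single byte 0x93 to this helper yields the code point 0x201C, the left double quotation mark, which lies outside the ASCII range.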

I have attempted to use PyString_Decode, but it always fails when there are non-ASCII characters in the string. Here is an example that fails:

#include <Python.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
     char c_string[] = { (char)0x93, 0 };
     PyObject *py_string;

     Py_Initialize();

     py_string = PyString_Decode(c_string, 1, "windows_1252", "replace");
     if (!py_string) {
          PyErr_Print();
          return 1;
     }
     return 0;
}

The error message is UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 0: ordinal not in range(128), which indicates that the ascii encoding is used even though we specify windows_1252 in the call to PyString_Decode.

The following code works around the problem by using PyString_FromString to create a Python string of the undecoded bytes, then calling its decode method:

#include <Python.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
     char c_string[] = { (char)0x93, 0 };
     PyObject *raw, *decoded;

     Py_Initialize();

     raw = PyString_FromString(c_string);
     printf("Undecoded: ");
     PyObject_Print(raw, stdout, 0);
     printf("\n");
     decoded = PyObject_CallMethod(raw, "decode", "s", "windows_1252");
     Py_DECREF(raw);
     printf("Decoded: ");
     PyObject_Print(decoded, stdout, 0);
     printf("\n");
     return 0;
}

Solution

PyString_Decode does this:

PyObject *PyString_Decode(const char *s,
                          Py_ssize_t size,
                          const char *encoding,
                          const char *errors)
{
    PyObject *v, *str;

    str = PyString_FromStringAndSize(s, size);
    if (str == NULL)
        return NULL;
    v = PyString_AsDecodedString(str, encoding, errors);
    Py_DECREF(str);
    return v;
}

In other words, it does basically what you're doing in your second example: it converts the bytes to a string, then decodes the string. The problem arises because it calls PyString_AsDecodedString rather than PyString_AsDecodedObject. PyString_AsDecodedString calls PyString_AsDecodedObject, but then tries to convert the resulting unicode object back into a string object using the default encoding (for you, it looks like that's ASCII). That's where it fails.
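The failing step can be modelled in plain C without the Python API. The sketch below is a toy model, not CPython code: `ascii_encode` is a hypothetical helper that takes code points that have already been decoded and tries to store them as ASCII bytes, which is roughly what the implicit conversion back to a string object amounts to when the default encoding is ASCII.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper modelling the implicit re-encoding step: try to store
 * code points as ASCII bytes. Returns -1 on success, otherwise the index of
 * the first code point that is not in range(128). */
long ascii_encode(const uint32_t *cps, size_t n, char *out)
{
    size_t i;

    assert(cps != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        if (cps[i] >= 128)
            return (long)i;   /* corresponds to "ordinal not in range(128)" */
        out[i] = (char)cps[i];
    }
    return -1;
}
```

For the code point 0x201C this model reports a failure at position 0, matching the position in the error message from the question, even though the decoding step before it succeeded.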

I believe you'll need to make two calls, but you can use PyString_AsDecodedObject rather than calling the Python "decode" method. Something like:

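#include <Python.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
     char c_string[] = { (char)0x93, 0 };
     PyObject *py_string, *py_unicode;

     Py_Initialize();

     /* Wrap the raw bytes in a Python str without interpreting them. */
     py_string = PyString_FromStringAndSize(c_string, 1);
     if (!py_string) {
          PyErr_Print();
          return 1;
     }
     /* Decode to a unicode object; nothing is re-encoded afterwards. */
     py_unicode = PyString_AsDecodedObject(py_string, "windows_1252", "replace");
     Py_DECREF(py_string);
     if (!py_unicode) {
          PyErr_Print();
          return 1;
     }

     Py_DECREF(py_unicode);
     Py_Finalize();
     return 0;
}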

I'm not entirely sure what the reasoning behind PyString_Decode working this way is. A very old thread on python-dev seems to indicate that it has something to do with chaining the output, but since the Python methods don't do the same, I'm not sure if that's still relevant.
