Python (2.6) cStringIO unicode support?

Question

I'm using the Python pycurl module to download content from various web pages. Since I also want to support potential unicode text, I've been avoiding the cStringIO.StringIO function, which according to the Python docs (cStringIO - Faster version of StringIO):

Unlike the StringIO module, this module is not able to accept Unicode strings that cannot be encoded as plain ASCII strings.

... does not support unicode strings. Actually, it states that it does not support unicode strings that cannot be converted to ASCII strings. Can someone please clarify this for me? Which strings can be converted and which cannot?

I've tested with the following code and it seems to work just fine with unicode:

import pycurl
import cStringIO

# In-memory buffer that receives the downloaded data
downloadedContent = cStringIO.StringIO()
curlHandle = pycurl.Curl()
curlHandle.setopt(pycurl.WRITEFUNCTION, downloadedContent.write)
curlHandle.setopt(pycurl.URL, 'http://www.ltg.ed.ac.uk/~richard/unicode-sample.html')

curlHandle.perform()
curlHandle.close()
content = downloadedContent.getvalue()

# Write the downloaded content to a file
fileHandle = open('unicode-test.txt', 'w')
for char in content:
    fileHandle.write(char)
fileHandle.close()

The file is written correctly. I can even print the whole content to the console, and all characters show up fine. So what puzzles me is: where does cStringIO fail? Is there any reason why I should not use it?

[Note: I'm using Python 2.6 and need to stick to this version]

Recommended answer

Any text that uses only ASCII code points (byte values 00-7F hexadecimal) can be converted to ASCII. Basically, any text containing characters outside that range, which covers most characters not commonly used in American English, is not ASCII.
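
As a rough illustration of which strings can be converted, the sketch below tries to encode a few sample strings as ASCII. The helper name is_ascii_encodable and the sample strings are made up for this example; they do not come from the question.

```python
# Hypothetical helper (not part of any library): reports whether a
# unicode string can be encoded as plain ASCII.
def is_ascii_encodable(text):
    try:
        text.encode('ascii')
    except UnicodeEncodeError:
        # At least one code point lies above U+007F.
        return False
    return True

# Assumed sample strings: plain English, an accented Latin letter,
# and two CJK characters.
samples = [u'hello world', u'caf\xe9', u'\u4f60\u597d']
results = [is_ascii_encodable(s) for s in samples]
print(results)  # [True, False, False]
```

Only the first string stays within 00-7F, so it is the only one that converts.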

In your example code, you are not converting the input to Unicode text; you are treating it as un-decoded bytes. The test page in question is encoded in UTF-8, and you never decode that to Unicode.
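
To make the distinction between un-decoded bytes and Unicode text concrete, here is a minimal sketch. The byte string is assumed sample data (the UTF-8 encoding of "cafe" with an acute accent on the e), not content from the test page.

```python
# Assumed sample data: UTF-8 bytes such as a download might return.
raw = b'caf\xc3\xa9'

# Decoding turns the byte string into a unicode string.
text = raw.decode('utf-8')

print(len(raw))            # 5 (bytes)
print(len(text))           # 4 (characters)
print(text == u'caf\xe9')  # True
```

The code in the question stops at the equivalent of raw: it passes the bytes through unchanged and never performs the decode step.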

If you were to decode the value to a Unicode string, you would not be able to store that string in a cStringIO object.
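
One possible way around this, sketched under assumptions: keep the buffer byte-oriented and encode decoded text back to UTF-8 before writing it. Because cStringIO exists only on Python 2, this sketch swaps in io.BytesIO (part of the io module, available since Python 2.6) as a stand-in byte buffer; that substitution and the sample data are assumptions, not something the answer prescribes.

```python
import io

# Assumed sample data, decoded to a unicode string.
text = b'caf\xc3\xa9'.decode('utf-8')

buf = io.BytesIO()  # stand-in for a byte-only buffer

# A byte buffer refuses decoded text outright.
try:
    buf.write(text)
    stored_unicode = True
except TypeError:
    stored_unicode = False

# Encoding first gives the buffer the bytes it expects.
buf.write(text.encode('utf-8'))
round_tripped = buf.getvalue().decode('utf-8')

print(stored_unicode)         # False
print(round_tripped == text)  # True
```

If the text has to stay decoded while it is buffered, io.StringIO or the pure-Python StringIO module are the usual alternatives.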

You may want to read up on the difference between Unicode and text encodings such as ASCII and UTF-8. I can recommend:

  • Joel Spolsky's minimum Unicode article
  • The Python Unicode HOWTO.
