Python unicode behaviour in Google App Engine


Problem description

I'm completely confused by GAE. I have a script that makes a POST request (using urlfetch from the Google App Engine API), and the response is a cp1251-encoded HTML page.

Then I decode it using .decode('cp1251') and parse it with lxml.

My code works totally fine on my local machine:

import re
import leaf #simple wrapper for lxml
weekdaysD={u'понедельник':1, u'вторник':2, u'среда':3, u'четверг':4, u'пятница':5, u'суббота':6}
document = leaf.parse(leaf.strip_symbols(leaf.strip_accents(html_in_cp1251.decode('cp1251'))))
table=document.get('table')
trs=table('tr') #leaf syntax
for tr in trs:
    tds=tr.xpath('td')
    for td in tds:
        if td.colspan=='3':
            curweek=re.findall('\w+(?=\-)', td.text)[0]               
            curday=weekdaysD[td.text.split(u',')[0]]

But when I deploy it to GAE, I get:

curday=weekdaysD[td.text.split(u',')[0]]
KeyError: u'\xd0\xb2\xd1\x82\xd0\xbe\xd1\x80\xd0\xbd\xd0\xb8\xd0\xba'

How can there be non-Unicode characters in there at all? And why does everything work locally? I've tried every variation of placing decode/encode calls in my code, and nothing helped. I've been stuck on this for a few days now.

UPD: Also, if I add this to my script on GAE:

print type(weekdaysD.keys()[0]), type(td.text.split(u',')[0]) 

It reports both as 'unicode', so I believe the HTML was decoded correctly. Could it be something with lxml on GAE?

Solution

Well, adding .encode('latin1').decode('utf-8', 'ignore') as a workaround did the trick. I wish I could explain why it behaves this way.
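For what it's worth, the key shown in the KeyError can be read as the UTF-8 bytes of u'вторник' stored one byte per code point, which would be consistent with both values reporting type unicode while still not matching. The sketch below only replays the workaround on that exact string; repair_mojibake is a hypothetical helper name, the dictionary is copied from the question, and where the mix-up is actually introduced on GAE remains unexplained.

```python
# -*- coding: utf-8 -*-
# Replays the workaround on the exact key from the traceback.
# Assumption: the text is UTF-8 that was decoded as if it were latin1.

weekdaysD = {u'понедельник': 1, u'вторник': 2, u'среда': 3,
             u'четверг': 4, u'пятница': 5, u'суббота': 6}

# The key reported by the KeyError on GAE. It is a unicode string,
# but every code point is below U+0100, i.e. it looks like raw bytes.
broken_key = u'\xd0\xb2\xd1\x82\xd0\xbe\xd1\x80\xd0\xbd\xd0\xb8\xd0\xba'


def repair_mojibake(text):
    """Hypothetical helper wrapping the workaround from the answer.

    latin1 maps code points 0-255 straight back to single bytes, so
    encoding recovers the original byte string, which is then decoded
    as UTF-8. 'ignore' silently drops any bytes that are not valid UTF-8.
    """
    return text.encode('latin1').decode('utf-8', 'ignore')


repaired_key = repair_mojibake(broken_key)

# Both strings have the same type, which is why the type() check in
# the question could not tell them apart; the lengths differ, though.
same_type = type(broken_key) is type(repaired_key)
lengths = (len(broken_key), len(repaired_key))

curday = weekdaysD[repaired_key]
```

In the question's loop this would amount to looking up weekdaysD[repair_mojibake(td.text.split(u',')[0])]. Note that .encode('latin1') raises UnicodeEncodeError on text that is already correct Cyrillic, so the helper is only safe where the input is known to be broken in this way.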
