Python unicode behaviour in Google App Engine

Views: 144 Posted: 2018/5/3 19:30:27 Tags: python google-app-engine unicode cp1251

Problem description

I got completely confused with GAE. I have a script that does a POST request (using urlfetch from the Google App Engine API), and as a response we get a cp1251-encoded HTML page.

Then I decode it using .decode('cp1251') and parse it with lxml.

My code works totally fine on my local machine:

    import re
    import leaf  # simple wrapper for lxml

    # Maps Russian weekday names to weekday numbers (Monday = 1 ... Saturday = 6)
    weekdaysD = {u'понедельник': 1, u'вторник': 2, u'среда': 3,
                 u'четверг': 4, u'пятница': 5, u'суббота': 6}

    document = leaf.parse(leaf.strip_symbols(leaf.strip_accents(html_in_cp1251.decode('cp1251'))))
    table = document.get('table')
    trs = table('tr')  # leaf syntax
    for tr in trs:
        tds = tr.xpath('td')
        for td in tds:
            if td.colspan == '3':
                curweek = re.findall(r'\w+(?=\-)', td.text)[0]
                curday = weekdaysD[td.text.split(u',')[0]]

But when I deploy it to GAE, I get:

    curday = weekdaysD[td.text.split(u',')[0]]
    KeyError: u'\xd0\xb2\xd1\x82\xd0\xbe\xd1\x80\xd0\xbd\xd0\xb8\xd0\xba'

How are non-Unicode characters there at all? And why is everything OK locally? I've tried placing every variation of decoding/encoding in my code, and nothing helped. I've been stuck for a few days now.

UPD: also, if I add this to my script on GAE:

    print type(weekdaysD.keys()[0]), type(td.text.split(u',')[0])

it reports both as 'unicode'. So I believe the HTML was decoded correctly. Could it be something with lxml on GAE?
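
A side note on the type check: two values can both be `unicode` and still hold different characters, so matching types do not show that the text was decoded correctly. The sketch below compares code points instead of types, using the key from the KeyError in the question. The variable names `expected` and `actual` are illustrative and not part of the original script.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# A key that exists in weekdaysD.
expected = u'вторник'
# The key reported in the KeyError on GAE, copied from the traceback.
actual = u'\xd0\xb2\xd1\x82\xd0\xbe\xd1\x80\xd0\xbd\xd0\xb8\xd0\xba'

# Both are text strings, so a type() comparison cannot tell them apart.
same_type = type(expected) is type(actual)

# The contents differ: Cyrillic code points on one side,
# code points below 0x100 on the other.
expected_points = [hex(ord(ch)) for ch in expected]
actual_points = [hex(ord(ch)) for ch in actual]

print(same_type, expected == actual)
print(len(expected), expected_points)
print(len(actual), actual_points)
```

Printing `repr()` of both values gives the same information at a glance and is usually the quickest check when a dictionary lookup on text fails unexpectedly.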
Solution
Well, the workaround of adding .encode('latin1').decode('utf-8', 'ignore') did the trick. I wish I could explain why it behaves so.
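
One possible reading of why the workaround helps, offered as an assumption and not a confirmed diagnosis: the code points in the failing key have the same values as the UTF-8 bytes of 'вторник', which suggests that UTF-8 bytes were read as Latin-1 somewhere between the decode step and `td.text`. Where that happens on GAE is not established here. The sketch below reproduces the round trip. `repair_mojibake` is a hypothetical helper name, and it uses strict decoding with a fallback in place of the 'ignore' error handler from the answer, so that text which is already correct passes through unchanged.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

weekdaysD = {u'понедельник': 1, u'вторник': 2, u'среда': 3,
             u'четверг': 4, u'пятница': 5, u'суббота': 6}


def repair_mojibake(text):
    """Undo a 'UTF-8 bytes read as Latin-1' mix-up (hypothetical helper).

    Latin-1 maps code points 0-255 to identical byte values, so encoding
    with it recovers the original bytes, which are then decoded as UTF-8.
    If either step fails, the text is returned unchanged.
    """
    try:
        return text.encode('latin1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# The key from the KeyError in the question.
broken = u'\xd0\xb2\xd1\x82\xd0\xbe\xd1\x80\xd0\xbd\xd0\xb8\xd0\xba'
fixed = repair_mojibake(broken)

print(fixed == u'вторник')
print(weekdaysD[fixed])
# Already-correct Cyrillic text cannot be encoded as Latin-1 and is left alone.
print(repair_mojibake(u'среда') == u'среда')
```

In the original script this would wrap the lookup key, for example `weekdaysD[repair_mojibake(td.text.split(u',')[0])]`. It treats the symptom only; finding the step that misreads the bytes would be the cleaner fix.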
