python raw_input odd behavior with accents containing strings

Question

I'm writing a program that asks the user for input that contains accents. The user input string is tested to see if it matches a string declared in the program. As you can see below, my code is not working:

# -*- coding: utf-8 -*-

testList = ['má']
myInput = raw_input('enter something here: ')

print myInput, repr(myInput)
print testList[0], repr(testList[0])
print myInput in testList

Output in Eclipse with PyDev:

enter something here: má
m√° 'm\xe2\x88\x9a\xc2\xb0'
má 'm\xc3\xa1'
False

Output in IDLE:

enter something here: má
má u'm\xe1'
má 'm\xc3\xa1'

Warning (from warnings module):
  File "/Users/ryanculkin/Desktop/delete.py", line 8
    print myInput in testList
UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
False

How can I get my code to print True when comparing the two strings?

Additionally, I note that the result of running this code on the same input is different depending on whether I use eclipse or IDLE. Why is this? My eventual goal is to put my program on the web; is there anything that I need to be aware of, since the result seems to be so volatile?

Recommended answer

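What you're running into is a comparison between a byte string and a Unicode string. raw_input normally gives you a byte string (IDLE is an exception: your u'm\xe1' output shows it handing back a unicode object), while the string you're comparing against is of the other type. Python 2 tries to convert them to a common type to compare, but this fails because the implicit conversion assumes ASCII and can't guess the real encoding of the byte string. The solution is to do the conversion explicitly.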
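A minimal sketch of the explicit conversion, assuming the input arrives as UTF-8 bytes. The names raw_bytes, expected and decoded are illustrative and not from the question; escapes are used instead of accented literals so the example does not depend on the source file's encoding.

```python
# Bytes as a UTF-8 terminal would deliver them for "m" + a-acute.
raw_bytes = b'm\xc3\xa1'

# The same text as a unicode string.
expected = u'm\xe1'

# Decode explicitly before comparing, instead of relying on
# Python 2's implicit (ASCII-based) conversion.
decoded = raw_bytes.decode('utf-8')

print(decoded == expected)
print(decoded in [expected])
```

Both comparisons succeed because the two sides are the same type by the time they are compared.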

As a rule, keep every string in your program as a unicode string: convert anything you read in as bytes to unicode straight away, and make every literal in your program a unicode literal unless it explicitly needs to be a byte string for some reason. This results in the "unicode sandwich", which will generally make your life easier.
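One way to picture the sandwich is sketched below: bytes are decoded on the way in, all logic runs on unicode, and the result is encoded again on the way out. The function name shout and the UTF-8 default are assumptions made for illustration only.

```python
def shout(raw_bytes, encoding='utf-8'):
    """Upper-case some encoded text, touching bytes only at the edges."""
    text = raw_bytes.decode(encoding)   # input edge: bytes -> unicode
    result = text.upper()               # middle: all logic on unicode
    return result.encode(encoding)      # output edge: unicode -> bytes


print(repr(shout(b'm\xc3\xa1')))
```

Because the upper-casing happens on decoded text, the accented character is handled as one character rather than as two unrelated bytes.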

For the literals, you either want to declare your strings as u'má', or have:

from __future__ import unicode_literals

near the top of your script to make un-prefixed strings unicode. Note that the repr of your literal, 'm\xc3\xa1', shows it is still a byte string, so this step has not been done yet.

To read a unicode string in, you need to realise that raw_input gives you a byte string, so you need to convert it using its .decode method. You need to pass .decode the encoding of your STDIN, which is available as sys.stdin.encoding (don't just assume that this is UTF-8: it often will be, but not always). So the whole line will be:

import sys

string = raw_input(...).decode(sys.stdin.encoding)
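The line above can be wrapped in a small helper that also copes with the two environments from the question. This is only a sketch: the name to_text is hypothetical, and the UTF-8 fallback for the case where sys.stdin.encoding is None (which can happen in Python 2 when input is piped) is an assumption, not something the interpreter guarantees.

```python
import sys


def to_text(value, encoding=None):
    """Return value as a unicode string, decoding only byte strings."""
    if isinstance(value, bytes):
        # Prefer the caller's encoding, then the console's, then UTF-8.
        # The final UTF-8 fallback is an assumption of this sketch.
        encoding = encoding or getattr(sys.stdin, 'encoding', None) or 'utf-8'
        return value.decode(encoding)
    # Already unicode (as IDLE returned in the question): leave untouched.
    return value
```

In Python 2 it would be called as to_text(raw_input('enter something here: ')). Skipping the decode for values that are already unicode avoids the implicit ASCII round trip that calling .decode on a unicode object would trigger.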

But by far the easiest way around this is to upgrade to Python 3 if you can. There, input() (which otherwise behaves like the Py2 raw_input) gives you a unicode string (it does the decoding for you, so you don't have to remember it), and unprefixed strings are unicode strings by default. All of this makes working with accented characters much easier: the logic you were trying would just work in Py3, since it does the right thing.

Note, however, that the mismatch you're seeing can still occur in Py3, but since it does the right thing by default, you have to work hard to run into it. If you did, the comparison would just be False, with no warning: Py3 never tries to implicitly convert between byte and unicode strings, so any byte string will always compare unequal to any unicode string, and trying to order them will throw an exception.
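The Python 3 behaviour described above can be checked with a short sketch. It is Python 3 only, the variable names are illustrative, and it assumes the interpreter is run without the -b option (which asks Python to warn on bytes/str comparisons).

```python
# Python 3 only.
byte_string = b'm\xc3\xa1'   # UTF-8 bytes
text_string = 'm\xe1'        # the same text as a str

# Equality never converts implicitly, so this is simply False.
print(byte_string == text_string)

# Ordering bytes against str is an error rather than a guess.
try:
    byte_string < text_string
    ordering_failed = False
except TypeError:
    ordering_failed = True
print(ordering_failed)

# Decoding explicitly makes the comparison meaningful again.
print(byte_string.decode('utf-8') == text_string)
```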
