Comparing special characters in Python


Question

I have a string whose value is 'Opérations'. My script reads a file and does some comparisons. When comparing strings, the string that I copied from the same source and pasted into my Python script does not equal the string that I get when reading that same file in the script. Printing both strings gives me 'Opérations'. However, when I encode them to UTF-8 I notice the difference:


  • b'Ope\xcc\x81rations'

  • b'Op\xc3\xa9rations'

My question is: what should I do to ensure that the special character in my Python script matches the one in the file's content when comparing such strings?

Recommended answer

Note:

You are talking about two types of strings: byte strings and Unicode strings. Each has a method to convert it to the other type. Unicode strings have an .encode() method that produces bytes, and byte strings have a .decode() method that produces Unicode. That is:


unicode.encode() ----> bytes

bytes.decode() ----> unicode
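As a minimal sketch of that round trip, assuming Python 3 (where the Unicode string type is named str rather than unicode), the two directions look like this. The variable names are illustrative only.

```python
# Illustrative text value; the accented letter is the single code point U+00E9.
text = 'Op\u00e9rations'

# str -> bytes: encode the text as UTF-8
data = text.encode('utf-8')

# bytes -> str: decode the UTF-8 bytes back into text
round_trip = data.decode('utf-8')

print(type(text).__name__, type(data).__name__)  # str bytes
print(data)                                       # b'Op\xc3\xa9rations'
print(round_trip == text)                         # True
```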

UTF-8 is easily the most popular encoding for storage and transmission of Unicode. It uses a variable number of bytes for each code point, from one to four. The higher the code point value, the more bytes it needs in UTF-8.
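To make the variable-width behaviour concrete, here is a small sketch (Python 3 assumed) that encodes a few hand-picked sample code points and counts the resulting bytes. The sample characters are my own choice, not part of the original answer.

```python
# Sample code points: ASCII letter, precomposed e-acute, combining acute
# accent, euro sign, and an emoji outside the Basic Multilingual Plane.
samples = ['A', '\u00e9', '\u0301', '\u20ac', '\U0001F600']

for ch in samples:
    encoded = ch.encode('utf-8')
    print('U+%04X -> %d byte(s)' % (ord(ch), len(encoded)))

# Number of UTF-8 bytes needed for each sample, in order
byte_counts = [len(ch.encode('utf-8')) for ch in samples]
print(byte_counts)  # [1, 2, 2, 3, 4]
```

Note that both U+00E9 and U+0301 take two bytes, which is why the two encodings of 'Opérations' in the question differ by exactly one byte.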

To the point:

If you define your strings both as byte strings and as Unicode strings, as follows:

a_byte = b'Ope\xcc\x81rations'
a_unicode = u'Ope\u0301rations'   # 'e' + U+0301, same text as a_byte.decode('utf-8')

b_byte = b'Op\xc3\xa9rations'
b_unicode = u'Op\xe9rations'      # U+00E9, same text as b_byte.decode('utf-8')

You will see:

print('a_byte length is:', len(a_byte.decode('utf-8')))
# print('a_unicode length is:', len(a_unicode.encode('utf-8')))

print('b_byte length is:', len(b_byte.decode('utf-8')))
# print('b_unicode length is:', len(b_unicode.encode('utf-8')))

Output:

a_byte length is: 11
b_byte length is: 10

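So you see they are not the same: the first string contains a plain 'e' followed by a combining acute accent (U+0301), while the second contains the single precomposed character 'é' (U+00E9).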
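One way to see where the two strings differ is to list the non-ASCII code points in each. The sketch below assumes Python 3 and uses the standard unicodedata module; the describe() helper is hypothetical, introduced here only for illustration.

```python
import unicodedata

# The two byte sequences from the question, decoded to text
a_text = b'Ope\xcc\x81rations'.decode('utf-8')
b_text = b'Op\xc3\xa9rations'.decode('utf-8')

def describe(text):
    """Return (code point, Unicode name) pairs for each non-ASCII character."""
    return [('U+%04X' % ord(ch), unicodedata.name(ch))
            for ch in text if ord(ch) > 127]

print(describe(a_text))  # [('U+0301', 'COMBINING ACUTE ACCENT')]
print(describe(b_text))  # [('U+00E9', 'LATIN SMALL LETTER E WITH ACUTE')]
print(a_text == b_text)  # False
```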

My solution:

Both strings look like Opérations when displayed as text, so to avoid confusion you can inspect them with repr():

print(repr(a_byte), repr(b_byte))

This prints:

b'Ope\xcc\x81rations' b'Op\xc3\xa9rations'

You can also normalize the Unicode before the comparison, as in @Daniel's answer:

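from unicodedata import normalize
from functools import partial

# Bytes as read from the file ('e' followed by a combining accent)
a_byte = b'Ope\xcc\x81rations'
norm = partial(normalize, 'NFC')
# Decode first, then compose to NFC so 'e' + U+0301 becomes U+00E9
your_string = norm(a_byte.decode('utf8'))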
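Putting this together, a minimal sketch of a comparison that is insensitive to the composed/decomposed difference could look like the following. It assumes Python 3; same_text() is a hypothetical helper name, and the two values stand in for the file content and the literal pasted into the script.

```python
import unicodedata

def same_text(left, right, form='NFC'):
    """Hypothetical helper: compare two str values after normalizing both."""
    return unicodedata.normalize(form, left) == unicodedata.normalize(form, right)

# Stand-ins for the text read from the file and the literal in the script
from_file = b'Ope\xcc\x81rations'.decode('utf-8')
in_script = b'Op\xc3\xa9rations'.decode('utf-8')

print(from_file == in_script)            # False
print(same_text(from_file, in_script))   # True
print(len(unicodedata.normalize('NFC', from_file)))  # 10
```

Normalizing both sides, rather than only the file content, means the comparison does not depend on which form your editor happened to save in the script. In Python 3, opening the file with open(path, encoding='utf-8') already yields str, so no explicit .decode() call is needed before normalizing.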
