Comparing special characters in Python

Views: 142 Posted: 2020/10/1 1:08:14 python python-3.x python-2.7 character-encoding


Problem description

I have a string whose value is 'Opérations'. My script reads a file and does some comparisons. When comparing strings, the string that I copied from the same source and pasted into my Python script does not equal the string that I get when reading the same file in my script. Printing both strings gives me 'Opérations'. However, when I encode them to UTF-8, I notice the difference:

b'Ope\xcc\x81rations'

b'Op\xc3\xa9rations'

My question is: what do I do to ensure that the special character in my Python script is the same as the one in the file content when comparing such strings?

Recommended answer

Note:

You are talking about two types of strings: byte strings and unicode strings. Each has a method that converts it to the other type. Unicode strings have an .encode() method that produces bytes, and byte strings have a .decode() method that produces unicode. That is:

unicode.encode() ----> bytes

and

bytes.decode() -----> unicode

UTF-8 is easily the most popular encoding for storage and transmission of Unicode. It uses a variable number of bytes for each code point, from one to four: the higher the code point value, the more bytes it needs in UTF-8.
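To make the encode/decode round trip and the variable byte count concrete, here is a minimal sketch in Python 3. The sample characters and the helper name utf8_length are illustrative choices, not part of the original answer.

```python
# A str is a sequence of code points; encode() turns it into bytes,
# and decode() turns those bytes back into the same str.
samples = ["e", "\u00e9", "\u0301", "\u20ac", "\U0001F600"]


def utf8_length(ch):
    """Return how many bytes the single code point `ch` needs in UTF-8."""
    return len(ch.encode("utf-8"))


for ch in samples:
    encoded = ch.encode("utf-8")
    # The round trip through bytes is lossless.
    assert encoded.decode("utf-8") == ch
    print("U+%04X needs %d byte(s): %r" % (ord(ch), utf8_length(ch), encoded))
```

Note that the precomposed é (U+00E9) and the combining acute accent (U+0301) each take two bytes, which is why the two byte strings in the question differ by one byte.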

To the point:

If you define your string both as byte strings and as unicode strings, as follows:

a_byte = b'Ope\xcc\x81rations'
a_unicode = u'Ope\u0301rations'  # 'e' followed by U+0301 COMBINING ACUTE ACCENT

b_byte = b'Op\xc3\xa9rations'
b_unicode = u'Op\xe9rations'  # U+00E9, the precomposed 'é'

you will see:

# Python 3 syntax
print('a_byte length is: ', len(a_byte.decode("utf-8")))
#print('a_unicode length is: ', len(a_unicode.encode("utf-8")))

print('b_byte length is: ', len(b_byte.decode("utf-8")))
#print('b_unicode length is: ', len(b_unicode.encode("utf-8")))

Output:

a_byte length is:  11
b_byte length is:  10

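So you see they are not the same: a_byte decodes to 11 code points because its é is an 'e' followed by the combining acute accent U+0301, while b_byte decodes to 10 because its é is the single precomposed code point U+00E9.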
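To see exactly which code points account for the different lengths, the decoded strings can be listed character by character. This is a small Python 3 sketch; the helper name describe and the variable names a_text and b_text are made up for illustration.

```python
import unicodedata

a_text = b'Ope\xcc\x81rations'.decode('utf-8')
b_text = b'Op\xc3\xa9rations'.decode('utf-8')


def describe(text):
    """Return a list of (code point, Unicode name) pairs for `text`."""
    return [("U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>"))
            for ch in text]


for label, text in (("a", a_text), ("b", b_text)):
    # Only the first four entries are shown; the rest is identical.
    print(label, len(text), describe(text)[:4])
```

The first string contains LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT, while the second contains the single code point LATIN SMALL LETTER E WITH ACUTE.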

My solution:

When displayed as text, a_byte and b_byte both show Opérations, which hides the difference. If you don't want to be confused, use repr():

print(repr(a_byte), repr(b_byte))

which prints (in Python 3):

b'Ope\xcc\x81rations' b'Op\xc3\xa9rations'

You can also normalize the unicode before comparison, as in @Daniel's answer, as follows:

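from unicodedata import normalize
from functools import partial

# Python 3: bytes as they might be read from the file (decomposed é)
a_byte = b'Ope\xcc\x81rations'
norm = partial(normalize, 'NFC')
# Decode to text, then compose 'e' + U+0301 into the single code point U+00E9
your_string = norm(a_byte.decode('utf8'))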
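Building on the normalization idea, a comparison that is safe against this mismatch might look like the following Python 3 sketch. The helper name same_text and the variable names are hypothetical; the key assumption is that both sides are decoded to str and normalized to the same form before comparing.

```python
import unicodedata


def same_text(left, right, form="NFC"):
    """Compare two str values after normalizing both to the same form."""
    return (unicodedata.normalize(form, left)
            == unicodedata.normalize(form, right))


decomposed = b'Ope\xcc\x81rations'.decode('utf-8')   # 'e' + U+0301
precomposed = b'Op\xc3\xa9rations'.decode('utf-8')   # U+00E9

print(decomposed == precomposed)           # False
print(same_text(decomposed, precomposed))  # True
```

Either NFC or NFD works for equality checks as long as the same form is applied to both operands; NFC is used as the default here only because it matches the snippet above.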
