Python Inconsistent Special Character Storage In String

Question

Version is Python 3.7. I've just found out that Python sometimes stores the character ñ in a string with more than one representation, and I'm completely at a loss as to why this happens or how to deal with it.

I'm not sure of the best way to show this issue, so I'm just going to show some code output.

I have two strings, s1 and s2, that both display as 'Dan Peña'.

They are both of type str.

I can run the code:

print(s1 == s2) # prints False
print(len(s1)) # prints 8
print(len(s2)) # prints 9
print(type(s1)) # prints <class 'str'>
print(type(s2)) # prints <class 'str'>
for i in range(len(s1)):
    print(s1[i] + ", " + s2[i])

The output of the loop is:

D, D
a, a
n, n
 ,  
P, P
e, e
ñ, n
a, ~

So, are there any Python methods for dealing with these inconsistencies, or at least some specification of when Python will use which representation?

It would also be nice to know why Python would choose to implement it this way.

One string is retrieved from a Django database, and the other is obtained by parsing a filename returned by a listdir call.

from os import listdir

from app.models import Model
from django.core.management.base import BaseCommand

class Command(BaseCommand):

    def handle(self, *args, **kwargs):
        load_dir = "load_dir_name"
        save_dir = "save_dir"

        files = listdir(load_dir)
        save_file_map = {file[:file.index("_thumbnail.jpg")]: f"{save_dir}/{file}" for file in files}
        for obj in Model.objects.all():
            s1 = obj.title
            save_file_path = save_file_map[s1]  # KeyError when the title contains ñ.

However, when I search through the save_file_map dict, I find a key that is exactly the same as s1 except that the ñ is encoded as the two characters n~ rather than the single character ñ.

Note that the files I load in the above code with listdir were named based on the obj.title field in the first place, so a file with that name should be guaranteed to exist in the load_dir directory.

Recommended answer

You'll want to normalize the strings so they use the same representation. Right now, one of them uses an n followed by a combining tilde (two code points), while the other uses a single precomposed code point representing an n with a tilde.
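To see the two representations directly, it can help to print the Unicode name of each code point. The snippet below is a minimal sketch: the two literals are stand-ins written with explicit escapes (assumed to match the strings in the question), and describe is a hypothetical helper, not part of any library.

```python
import unicodedata

# Stand-ins for the two strings from the question (assumed values).
composed = "Dan Pe\u00f1a"      # ñ stored as one precomposed code point
decomposed = "Dan Pen\u0303a"   # n followed by a combining tilde

def describe(s):
    """Return the Unicode name of every code point in s (hypothetical helper)."""
    return [unicodedata.name(ch) for ch in s]

print(len(composed), len(decomposed))  # 8 9
print(describe(composed)[6])           # LATIN SMALL LETTER N WITH TILDE
print(describe(decomposed)[6:8])       # ['LATIN SMALL LETTER N', 'COMBINING TILDE']
```

Both strings render the same on screen, which is why the difference only shows up in len() and in character-by-character comparison.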

unicodedata.normalize should do what you want. See the documentation for the unicodedata module in the Python standard library.

You'll want to call it like so: unicodedata.normalize('NFC', s1). 'NFC' tells unicodedata.normalize that you want the composed form for everything, e.g. the 1-char version of ñ. The docs list other forms besides 'NFC' (NFD, NFKC, and NFKD); which one you use is up to you, as long as both strings go through the same one.
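As a rough illustration of that call, the sketch below normalizes both strings and compares them again. The literals are assumed stand-ins for s1 and s2 from the question.

```python
import unicodedata

s1 = "Dan Pe\u00f1a"    # composed form, 8 code points
s2 = "Dan Pen\u0303a"   # decomposed form, 9 code points

# Normalize both sides to the same form before comparing.
nfc1 = unicodedata.normalize("NFC", s1)
nfc2 = unicodedata.normalize("NFC", s2)

print(s1 == s2)      # False
print(nfc1 == nfc2)  # True
print(len(nfc2))     # 8

# NFD goes the other way and splits ñ into n + combining tilde.
nfd1 = unicodedata.normalize("NFD", s1)
print(len(nfd1))     # 9
```

Either form makes the comparison succeed; what matters is that both strings are normalized with the same one.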

Now, at what point you normalize is up to you (I don't know how your app is structured). For example, you could normalize before inserting into the database, or normalize every time you read from the database.
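For the dictionary lookup in the question, one possible arrangement is to normalize at both ends: once when building the keys from filenames, and once when looking up a title. This is only a sketch under assumptions: nfc is a hypothetical helper, and the files list and title value stand in for listdir(load_dir) and obj.title.

```python
import unicodedata

def nfc(text):
    """Normalize text to NFC so equal-looking strings compare equal (hypothetical helper)."""
    return unicodedata.normalize("NFC", text)

suffix = "_thumbnail.jpg"
save_dir = "save_dir"

# Stand-in for listdir(load_dir); this name uses the decomposed ñ.
files = ["Dan Pen\u0303a_thumbnail.jpg"]

# Normalize only the key; the value keeps the filename exactly as listed.
save_file_map = {
    nfc(name[: -len(suffix)]): f"{save_dir}/{name}"
    for name in files
    if name.endswith(suffix)
}

title = "Dan Pe\u00f1a"  # stand-in for obj.title, using the composed ñ
save_file_path = save_file_map[nfc(title)]  # no KeyError now
print(save_file_path)
```

The map value deliberately keeps the original, un-normalized filename, since that is the name the listing returned.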
