Python: Comparing strings with accented characters does not work


Problem description

I'm quite new to Python. I am trying to remove files that appear on one list from another list. The lists were produced by redirecting ll -R on Mac and on Windows (but have gone through some processing since then, such as merging and sorting, using other Python scripts). Some file names have accents and special symbols. These strings, even though they are the same (they print the same and look the same in the files that contain the lists), are found to be not equal.

I found this thread about how to compare strings with special characters in Unicode: "Python String Comparison--Problems With Special/Unicode Characters". It is quite similar to my problem. I did some more reading on encodings and on how to change the encoding of strings. However, I tried every codec I could find in the codecs documentation (https://docs.python.org/2/library/codecs.html), and for all possible pairs of codecs the two strings are still not equal (see the program below; I tried both the decode and encode options).

When I go over the characters in the two strings one by one, the accented e appears as a single character (an accented e) in one file and as two characters (an e followed by a character that prints as a space) in the other.
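The one-character versus two-character observation above can be checked directly by listing the code points of each string. The following is a minimal Python 3 sketch; the sample words and the describe helper are hypothetical and are not taken from the original files. It assumes the difference is between a precomposed é (U+00E9) and a plain e followed by a combining acute accent (U+0301), which is a likely but unconfirmed reading of the symptom.

```python
import unicodedata

def describe(text):
    """Return one 'U+XXXX NAME' entry per code point in text (hypothetical helper)."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}" for ch in text]

# Hypothetical sample data: the same visible word written in two different ways.
composed = "caf\u00e9"      # precomposed accented e, a single code point
decomposed = "cafe\u0301"   # plain e followed by a combining acute accent

# The strings look identical when printed but compare as different.
print(composed == decomposed)
print(len(composed), len(decomposed))

# Listing the code points makes the difference visible.
for entry in describe(composed):
    print(entry)
print("---")
for entry in describe(decomposed):
    print(entry)
```

If the real file names show the same pattern, no choice of codec will make them equal, because both forms are valid text in the same encoding; only the sequence of code points differs.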

Any ideas would be appreciated.

I narrowed the two text files down to one line each, containing a single word (obviously with an accent). I uploaded the text files to Dropbox: testfilesindata and testmissingfiles (but I haven't tried downloading a fresh copy from Dropbox).

Thanks a lot!

PS. Sorry about messing with the links. I don't have reputation 10 ...

#!/usr/bin/python3

import sys

# Codec names from the codecs documentation, with stray padding removed.
codecs = [
    'ascii', 'big5', 'big5hkscs', 'cp037', 'cp424', 'cp437', 'cp500', 'cp720',
    'cp737', 'cp775', 'cp850', 'cp852', 'cp855', 'cp856', 'cp857', 'cp858',
    'cp860', 'cp861', 'cp862', 'cp863', 'cp864', 'cp865', 'cp866', 'cp869',
    'cp874', 'cp875', 'cp932', 'cp949', 'cp950', 'cp1006', 'cp1026', 'cp1140',
    'cp1250', 'cp1251', 'cp1252', 'cp1253', 'cp1254', 'cp1255', 'cp1256',
    'cp1257', 'cp1258', 'euc_jp', 'euc_jis_2004', 'euc_jisx0213', 'euc_kr',
    'gb2312', 'gbk', 'gb18030', 'hz', 'iso2022_jp', 'iso2022_jp_1',
    'iso2022_jp_2', 'iso2022_jp_2004', 'iso2022_jp_3', 'iso2022_jp_ext',
    'iso2022_kr', 'latin_1', 'iso8859_2', 'iso8859_3', 'iso8859_4',
    'iso8859_5', 'iso8859_6', 'iso8859_7', 'iso8859_8', 'iso8859_9',
    'iso8859_10', 'iso8859_13', 'iso8859_14', 'iso8859_15', 'iso8859_16',
    'johab', 'koi8_r', 'koi8_u', 'mac_cyrillic', 'mac_greek', 'mac_iceland',
    'mac_latin2', 'mac_roman', 'mac_turkish', 'ptcp154', 'shift_jis',
    'shift_jis_2004', 'shift_jisx0213', 'utf_32', 'utf_32_be', 'utf_32_le',
    'utf_16', 'utf_16_be', 'utf_16_le', 'utf_7', 'utf_8', 'utf_8_sig',
]

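# Read raw bytes so each codec can be tried with bytes.decode();
# in Python 3, str objects have no decode() method.
with open('testmissingfiles', 'rb') as file1:
    list1 = file1.readlines()
with open('testfilesindata', 'rb') as file2:
    list2 = file2.readlines()

# Strip the line ending (LF, or CRLF for a list produced on Windows).
word1 = list1[0].rstrip(b'\r\n')
word2 = list2[0].rstrip(b'\r\n')

# Try every pair of codecs, including the last entry in the list.
for codec1 in codecs:
    for codec2 in codecs:
        try:
            decoded1 = word1.decode(codec1)
            decoded2 = word2.decode(codec2)
        except (UnicodeError, LookupError):
            # The codec cannot decode these bytes or is not available.
            continue
        if decoded1 == decoded2:
            sys.stdout.write('Succeeded with ' + codec1 + ' & ' + codec2 + '\n')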


Recommended answer

Use unicodedata.normalize to normalize the two strings to the same normal form:

import unicodedata

encoded1 = unicodedata.normalize('NFC', word1.decode('utf8'))
encoded2 = unicodedata.normalize('NFC', word2.decode('utf8'))
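
Applied to the original task of removing the files on one list from another, the same idea might look like the Python 3 sketch below. The function names nfc and remove_listed and the sample lists are hypothetical. The sketch assumes the names have already been read as text (for example with open(path, encoding='utf-8')), so no decode call is needed.

```python
import unicodedata

def nfc(name):
    """Return name in Unicode Normalization Form C (composed form)."""
    return unicodedata.normalize("NFC", name)

def remove_listed(all_files, files_to_remove):
    """Return the entries of all_files whose NFC form is not in files_to_remove."""
    blocked = {nfc(name) for name in files_to_remove}
    return [name for name in all_files if nfc(name) not in blocked]

# Hypothetical sample lists: the first entry of each looks the same when printed,
# but one uses a precomposed accent and the other a combining accent.
files_in_data = ["caf\u00e9.txt", "notes.txt"]
missing_files = ["cafe\u0301.txt"]

print(files_in_data[0] == missing_files[0])            # raw comparison
print(nfc(files_in_data[0]) == nfc(missing_files[0]))  # comparison after NFC
print(remove_listed(files_in_data, missing_files))
```

Normalizing both sides with the same form is what matters here; NFD would work equally well for the comparison as long as it is applied to both lists.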
