Python string encodings and ==


Question

I am having some trouble with strings in Python not being == when I think they should be, and I believe it has something to do with the way they are encoded. Basically, I am parsing some comma-separated values that are stored in zip archives (GTFS feeds specifically, for those who are curious).

I'm using the ZipFile module in Python to open certain files in the zip archives and then comparing the text there to some known values. Here's an example file:

agency_id,agency_name,agency_url,agency_phone,agency_timezone,agency_lang
ARLC,Arlington Transit,http://www.arlingtontransit.com,703-228-7433,America/New_York,en

The code I'm using tries to identify the position of the string "agency_id" in the first line of the text so that I can use the corresponding value in any subsequent lines. Here's a snippet of the code:

from zipfile import ZipFile

# feed_name is the path to the GTFS zip archive
zipped_feed = ZipFile(feed_name, "r")
agency_file = zipped_feed.open("agency.txt", "r")

line_num = 0
agencyline = agency_file.readline()
while agencyline:
    if line_num == 0:
        # this is the header, all we care about is the agency_id
        lineparts = agencyline.split(",")
        position = -1
        counter = 0
        for part in lineparts:
            part = part.strip()
            if part == "agency_id":
                position = counter
            counter += 1  # advance once per column, inside the for loop
        line_num += 1
        agencyline = agency_file.readline()
    else:
        .....

This code works for some zip archives, but not for others. I did some research and tried printing repr(part), and I got '\xef\xbb\xbfagency_id' instead of 'agency_id'. Does anyone know what's going on here and how I can fix it? Thanks for all the help!

Recommended answer

That is a byte order mark (BOM). It signals the encoding of the file, and in the case of UTF-16 and UTF-32 it also indicates the endianness of the file. Here the bytes '\xef\xbb\xbf' are the UTF-8 encoding of the BOM character U+FEFF. You can either interpret it, or check for it and remove it from your string. To remove it you could do this:

import codecs

# Python 2: decode the byte string, then strip a leading BOM (U+FEFF) if present
part = unicode(part, "utf8").lstrip(codecs.BOM_UTF8.decode("utf8", "strict"))
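
The snippet above relies on `unicode()`, which exists only in Python 2. As a rough sketch for Python 3, and assuming the feed files are UTF-8, the standard `utf-8-sig` codec can take the "interpret it" route: it decodes UTF-8 and drops a leading BOM when one is present. The helper names `header_position` and `make_feed` are hypothetical, and the in-memory archive exists only to make the example self-contained.

```python
import csv
import io
import zipfile


def header_position(zip_bytes, member, column):
    """Return the index of `column` in the header row of `member`, or -1."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        with archive.open(member) as raw:
            # "utf-8-sig" decodes UTF-8 and skips a leading BOM if there is one.
            with io.TextIOWrapper(raw, encoding="utf-8-sig", newline="") as text:
                header = next(csv.reader(text), [])
    names = [name.strip() for name in header]
    return names.index(column) if column in names else -1


def make_feed(with_bom):
    """Build a tiny in-memory archive for demonstration (hypothetical helper)."""
    body = b"agency_id,agency_name\r\nARLC,Arlington Transit\r\n"
    if with_bom:
        body = b"\xef\xbb\xbf" + body  # the UTF-8 BOM seen in the question
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("agency.txt", body)
    return buffer.getvalue()


if __name__ == "__main__":
    for with_bom in (True, False):
        print(with_bom, header_position(make_feed(with_bom), "agency.txt", "agency_id"))
```

With this approach the header comparison should behave the same whether or not a particular archive was saved with a BOM, so no per-file special case is needed.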
