Converting double slash utf-8 encoding

Question

I cannot get this to work! I have a text file, produced by a save-game file parser, that contains a bunch of Chinese names written out as escaped UTF-8 bytes, like this in source.txt:

\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89

But no matter how I import it into Python (3 or 2), the best I get is this string:

\\xe6\\x89\\x8e\\xe5\\x8a\\xa0\\xe6\\x8b\\x89

As other threads have suggested, I have tried re-encoding the string as UTF-8 and then decoding it with unicode_escape, like so:

stringName.encode("utf-8").decode("unicode_escape")

But that messes up the original encoding and gives this string:

'æ\x89\x8eå\x8a\xa0æ\x8b\x89' (printing this string results in: æå æ )
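The mangled result has a mechanical explanation: unicode_escape turns each \xNN escape into the code point U+00NN, not into the byte 0xNN, so the output is a string of Latin-1 characters rather than UTF-8 bytes. A minimal sketch of one way to finish the conversion, assuming the name holds exactly the literal backslash text shown above (the variable names are illustrative only):

```python
# The text as it appears in source.txt: literal backslashes, "x", and hex digits.
raw = r"\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89"

# unicode_escape maps each \xNN escape to the code point U+00NN.
mojibake = raw.encode("utf-8").decode("unicode_escape")

# Encoding as Latin-1 maps U+00NN back to the byte 0xNN, which recovers
# the original UTF-8 byte sequence; that can then be decoded normally.
name = mojibake.encode("latin-1").decode("utf-8")
print(name)  # 扎加拉
```

The extra Latin-1 round trip is the only difference from the attempt above; it works here because Latin-1 maps code points 0 to 255 one-to-one onto byte values.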

Now, if I manually copy the original string, paste it into a bytes literal (prefixed with b), and decode that, I get the correct text. For example:

b'\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89'.decode("utf-8")

Which results in: 扎加拉

But I can't do this programmatically. I can't even get rid of the double slashes.

To be clear, source.txt contains single backslashes. I have tried importing it in many ways, but this is the most common:

with open('source.txt','r',encoding='utf-8') as f_open:
    source = f_open.read()
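A likely reason the backslashes look doubled is that the string is being inspected through its repr (for example by echoing the variable in the interactive interpreter), which escapes every backslash for display. A small sketch, using a stand-in string instead of the real file and assuming it holds single backslashes as described:

```python
# Stand-in for the text read from source.txt (assumed to hold single backslashes).
source = r"\xe6\x89\x8e"

print(source)        # the actual characters: single backslashes
print(repr(source))  # the repr: every backslash shown doubled
print(len(source))   # 12, because each \xNN is four separate characters
```

If this is what is happening, there is nothing to strip: the data already contains single backslashes, and the remaining task is converting the escape text into real bytes.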

Okay, so I clicked the answer below (I think), but here is what works:

from ast import literal_eval
decodedString = literal_eval("b'{}'".format(stringVariable)).decode('utf-8')

I can't use it on the whole file because of other encoding issues, but extracting each name as a string (stringVariable) and then doing that works! Thank you!

To be more clear, the original file is not made up only of these messed-up UTF-8 encodings. It uses them only for certain fields. For example, here is the beginning of the file:

{'m_cacheHandles': ['s2ma\x00\x00CN\x1f\x1b"\x8d\xdb\x1fr \\\xbf\xd4D\x05R\x87\x10\x0b\x0f9\x95\x9b\xe8\x16T\x81b\xe4\x08\x1e\xa8U\x11',
                's2ma\x00\x00CN\x1a\xd9L\x12n\xb9\x8aL\x1d\xe7\xb8\xe6\xf8\xaa\xa1S\xdb\xa5+\t\xd3\x82^\x0c\x89\xdb\xc5\x82\x8d\xb7\x0fv',
                's2ma\x00\x00CN\x92\xd8\x17D\xc1D\x1b\xf6(\xedj\xb7\xe9\xd1\x94\x85\xc8`\x91M\x8btZ\x91\xf65\x1f\xf9\xdc\xd4\xe6\xbb',
                's2ma\x00\x00CN\xa1\xe9\xab\xcd?\xd2PS\xc9\x03\xab\x13R\xa6\x85u7(K2\x9d\x08\xb8k+\xe2\xdeI\xc3\xab\x7fC',
                's2ma\x00\x00CNN\xa5\xe7\xaf\xa0\x84\xe5\xbc\xe9HX\xb93S*sj\xe3\xf8\xe7\x84`\xf1Ye\x15~\xb93\x1f\xc90',
                's2ma\x00\x00CN8\xc6\x13F\x19\x1f\x97AH\xfa\x81m\xac\xc9\xa6\xa8\x90s\xfdd\x06\rL]z\xbb\x15\xdcI\x93\xd3V'],
'm_campaignIndex': 0,
'm_defaultDifficulty': 7,
'm_description': '',
'm_difficulty': '',
'm_gameSpeed': 4,
'm_imageFilePath': '',
'm_isBlizzardMap': True,
'm_mapFileName': '',
'm_miniSave': False,
'm_modPaths': None,
'm_playerList': [{'m_color': {'m_a': 255, 'm_b': 255, 'm_g': 92,   'm_r': 36},
               'm_control': 2,
               'm_handicap': 0,
               'm_hero': '\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89',

None of the information before the 'm_hero': field is UTF-8. So ShadowRanger's solution works if the file is made up only of these fake UTF-8 encodings, but it doesn't work when I have already parsed m_hero as a string and try to convert that. Karin's solution does work for that.
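Because only some fields hold escaped UTF-8, one option is to convert those fields individually and leave the binary ones alone. The sketch below is an assumption-laden illustration, not either answerer's method: `decode_escaped_utf8` and the `excerpt` string are hypothetical, the field value is assumed to be plain ASCII escape text with no embedded single quotes, and it uses an encode/decode round trip instead of literal_eval.

```python
import re


def decode_escaped_utf8(text):
    """Turn literal \\xNN escape text into the UTF-8 string it spells out."""
    # unicode_escape yields code points U+00NN; Latin-1 maps them back to bytes.
    raw_bytes = text.encode("latin-1").decode("unicode_escape").encode("latin-1")
    return raw_bytes.decode("utf-8")


# Hypothetical excerpt of the parser output, containing literal backslashes.
excerpt = r"'m_handicap': 0, 'm_hero': '\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89',"

# Capture only the m_hero values; every other field is left untouched.
heroes = [decode_escaped_utf8(value)
          for value in re.findall(r"'m_hero': '([^']*)'", excerpt)]
print(heroes)  # ['扎加拉']
```

Applying the same function to a field such as m_cacheHandles would be expected to raise UnicodeDecodeError, since those bytes are not valid UTF-8, which is why the conversion is restricted to the name field.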

Recommended answer

I'm assuming you're using Python 3. In Python 2, strings are bytes by default, so it would just work for you. But in Python 3, strings are Unicode and are interpreted as Unicode, which makes this problem harder when a byte string has been read in as Unicode text.

This solution was inspired by mgilson's answer. We can literally evaluate your Unicode string as a byte string by using literal_eval:

from ast import literal_eval

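with open('source.txt', 'r', encoding='utf-8') as f_open:
    # Strip the trailing newline; a raw newline would break the bytes literal below.
    source = f_open.read().strip()

# Wrap the escaped text in b'...' and evaluate it as a bytes literal, then decode.
string = literal_eval("b'{}'".format(source)).decode('utf-8')
print(string)  # 扎加拉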
