Python 3.4 hex to Japanese Characters


Problem description

I am writing a script to pull information from my site, which contains Japanese characters. So far, the script successfully pulls the data off the site.

It is returned as a string:

"\xe5\xb9\xb4\xe3\x81\xab\xe4\xb8\x80\xe5\xba\xa6\xe3\x81\xae\xe6\x99\xb4\xe3\x82\x8c\xe5\xa7\xbf" 

Using an online hex-to-text tool, I get:

年に一度の晴れ姿

I know this phrase is correct, but my question is: how do I convert it in Python? When I run something like:

name = "\xe5\xb9\xb4\xe3\x81\xab\xe4\xb8\x80\xe5\xba\xa6\xe3\x81\xae\xe6\x99\xb4\xe3\x82\x8c\xe5\xa7\xbf"
print(name)

I get this:

å¹´ã«ä¸åº¦ã®æ´ã姿

I've tried:

name.decode("hex")

But it seems that Python 3.4 doesn't have str.decode(), so I tried to convert the string to a bytes object and decode it that way, which still failed.

Edit 1:

Follow-up question, if you don't mind: the solution Martijn Pieters gave works for this:

name = "\xe2\x80\x9c\xe5\xa4\x8f\xe7\xa5\xad\xe3\x82\x8a\xe3\x83\x87\xe3\x83\xbc\xe3\x83\x88\xe2\x80\x9d\xe7\xb5\xa2\xe7\x80\xac \xe7\xb5\xb5\xe9\x87\x8c"
name = name.encode('latin1') 
print(name.decode('Utf-8')) 

However, if I put what's inside the quotes for name into a file and do this:

with open('0N.txt',mode='r',encoding='utf-8') as f: 
    name = f.read() 
name = name.encode('latin1') 
print(name.decode('Utf-8')) 

It doesn't work... any ideas?

Solution

You are confusing the Python representation with the contents. What you are shown are the \xhh hex escapes that Python string literals use to keep the displayed value ASCII safe and reproducible.
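To make the distinction concrete, here is a small sketch (the variable names are illustrative and not taken from the question) showing that the \xhh escapes are only how Python displays a value, not characters stored in it:

```python
# A bytes object holding the UTF-8 encoding of the single character 年.
data = "年".encode("utf8")

# repr() shows non-ASCII bytes as \xhh escapes so the output stays ASCII safe.
shown = repr(data)
print(shown)        # b'\xe5\xb9\xb4'

# The escapes are display only: the object holds 3 bytes, not 12 characters.
print(len(data))    # 3
print(list(data))   # [229, 185, 180]
```

Each \xhh group in the displayed value stands for exactly one byte.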

You have UTF-8 data here:

>>> name = b"\xe5\xb9\xb4\xe3\x81\xab\xe4\xb8\x80\xe5\xba\xa6\xe3\x81\xae\xe6\x99\xb4\xe3\x82\x8c\xe5\xa7\xbf"
>>> name.decode('utf8')
'\u5e74\u306b\u4e00\u5ea6\u306e\u6674\u308c\u59ff'
>>> print(name.decode('utf8'))
年に一度の晴れ姿

Note that I used a bytes literal there, written as b'...'. If your data is not a bytes object, you have a Mojibake and need to encode it to bytes first:

name.encode('latin1').decode('utf8')

Latin-1 maps code points one-to-one to bytes, so it is usually a safe bet for such data. It could be that you have a Mojibake produced by a different codec; that depends on how you retrieved the data.
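As a minimal sketch of that round trip, assuming the Mojibake really was produced by decoding UTF-8 bytes as Latin-1 (the names mojibake, raw and fixed are illustrative):

```python
# The string from the question: each character stands for one UTF-8 byte.
mojibake = "\xe5\xb9\xb4\xe3\x81\xab\xe4\xb8\x80\xe5\xba\xa6\xe3\x81\xae\xe6\x99\xb4\xe3\x82\x8c\xe5\xa7\xbf"

# Latin-1 turns code points U+0000..U+00FF back into the same byte values.
raw = mojibake.encode("latin1")

# Those bytes are valid UTF-8, so decoding recovers the intended text.
fixed = raw.decode("utf8")
print(fixed)

# Going the other way reproduces how the Mojibake arose in the first place.
assert fixed.encode("utf8").decode("latin1") == mojibake
```

If the encode step raises UnicodeEncodeError, the string contains code points above U+00FF, which suggests a different codec was involved.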

If you used open() to read the data from a file, you either specified the wrong encoding or relied on your platform default. Use open(filename, encoding='utf8') to remedy that.
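The effect of the encoding argument can be sketched with a temporary file standing in for the real data file (the file and variable names here are assumptions made for illustration):

```python
import os
import tempfile

text = "年に一度の晴れ姿"

# Write the text as UTF-8 bytes to a temporary file (a stand-in for the real file).
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(text.encode("utf8"))
    path = tmp.name

try:
    # Wrong codec: every UTF-8 byte becomes its own Latin-1 character (Mojibake).
    with open(path, encoding="latin1") as f:
        wrong = f.read()

    # Right codec: the bytes are decoded back into the original characters.
    with open(path, encoding="utf8") as f:
        right = f.read()
finally:
    os.remove(path)

print(len(wrong), len(right))  # 24 8
```

The same 24 bytes are on disk in both cases; only the codec used to interpret them differs.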

If you used the requests library to load this from a website, take into account that the response.text attribute uses latin-1 as the default codec if a) the site didn't specify a codec and b) the response has a text/* mime-type. If this is sourced from HTML, the codec is usually declared inside the HTML document itself (in its head section) rather than in the HTTP headers. Use a library like BeautifulSoup to handle HTML (passing it the raw bytes from response.content) and it'll detect such information for you.
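The following is a rough, standard-library-only sketch of the idea, not what requests or BeautifulSoup actually do internally. The content variable stands in for response.content, and sniff_charset is a hypothetical helper invented for this illustration; in real code, passing the raw bytes to BeautifulSoup is the more robust choice.

```python
import re

# Stand-in for response.content: raw bytes of a UTF-8 page that declares its
# codec in a <meta> tag but (we assume) not in the HTTP Content-Type header.
content = (
    '<html><head><meta charset="utf-8"></head>'
    "<body>年に一度の晴れ姿</body></html>"
).encode("utf8")

# What a Latin-1 default would produce: Mojibake.
as_latin1 = content.decode("latin1")


def sniff_charset(raw, default="utf8"):
    """Hypothetical helper: return the charset named in a <meta> tag, if any."""
    match = re.search(rb'<meta[^>]*charset=["\']?([\w-]+)', raw, re.IGNORECASE)
    return match.group(1).decode("ascii") if match else default


# Decode the raw bytes with the codec the document itself declares.
text = content.decode(sniff_charset(content))
print("年に一度の晴れ姿" in text)       # True
print("年に一度の晴れ姿" in as_latin1)  # False
```

Working from the raw bytes keeps the decoding decision in your hands instead of inheriting a default that may not match the page.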

If all else fails, the ftfy library may still be able to fix a Mojibake; it uses specially constructed codecs to reverse common errors.
