Python library to translate multi-byte characters into 7-bit ASCII in Python


Problem description

Is there a Python library that translates multi-byte non-ASCII characters into some reasonable form of 7-bit displayable ASCII? The goal is to avoid hard-coding the charmap as given in the answer to "Translating multi-byte characters into 7-bit ASCII in Python".

EDIT: I am currently using Python 2.7.11 or greater, not yet Python 3, but answers giving Python 3 solutions are welcome and would be helpful.

The reason is this: if I maintain the translation table by hand, I will miss some characters.

My script is:

#!/usr/bin/env python2
# -*- mode: python; -*-

import os
import re
import requests

url = "https://system76.com/laptops/kudu"

#
# Load the text from request as a true unicode string:
#
r = requests.get(url)
r.encoding = "UTF-8"
data = r.text  # ok, data is a true unicode string

# translate offending characters in unicode:

charmap = {
    0x2014: u'-',   # em dash
    0x201D: u'"',   # right double quotation mark
    # etc.
}
data = data.translate(charmap)
tdata = data.encode('ascii')

The error I get is:

./simple_wget
Traceback (most recent call last):
  File "./simple_wget.py", line 25, in <module>
    tdata = data.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 10166: ordinal not in range(128)

Keeping the charmap up to date for newly discovered characters will be a never-ending battle. Is there a Python library that provides this charmap so I don't have to hard-code it in this manner?
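To make the "I will miss some" problem concrete, here is a minimal sketch that lists every non-ASCII character left in a string, with its Unicode name and how often it occurs. The helper name report_non_ascii and the sample string are hypothetical, chosen only for illustration; the sketch uses nothing beyond the standard library and is written so that it should run on Python 3 as well as Python 2.7.

```python
import unicodedata
from collections import Counter


def report_non_ascii(text):
    """Return (char, count, name) for each non-ASCII character, most common first."""
    counts = Counter(ch for ch in text if ord(ch) > 127)
    return [
        (ch, n, unicodedata.name(ch, "<unnamed>"))
        for ch, n in counts.most_common()
    ]


# Hypothetical sample text standing in for a downloaded page.
sample = u"Kudu \u2014 17.3\u201d display \u2013 from $999 \u2014 now"

for ch, n, name in report_non_ascii(sample):
    print("U+%04X x%d %s" % (ord(ch), n, name))
```

Running something like this over a page before calling translate() shows which code points a hand-written charmap still lacks, but it does not remove the need to keep adding entries.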

Solution

(Note: This answer pertains to Python 2.7.11+.)

The answer at https://stackoverflow.com/a/1701378/257924 refers to the Unidecode package, which is what I was looking for. In using that package, I also discovered the ultimate source of my confusion, which is explained in depth at https://pythonhosted.org/kitchen/unicode-frustrations.html#frustration-3-inconsistent-treatment-of-output and specifically in this section:

Frustration #3: Inconsistent treatment of output

Alright, since the python community is moving to using unicode strings everywhere, we might as well convert everything to unicode strings and use that by default, right? Sounds good most of the time but there’s at least one huge caveat to be aware of. Anytime you output text to the terminal or to a file, the text has to be converted into a byte str. Python will try to implicitly convert from unicode to byte str... but it will throw an exception if the bytes are non-ASCII:
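The quoted passage describes Python 2 behaviour. As a rough Python 3 analogue (an assumption on my part, since Python 3 no longer converts implicitly), the same kind of failure appears when text is written to a stream whose encoding is ASCII. The sketch below uses an in-memory stream in place of a real terminal or file; the sample text is made up.

```python
import io

text = u"caf\u00e9 \u2013 menu"

# A text stream whose encoding is ASCII, standing in for a terminal or a
# file opened under a C/POSIX locale.
stream = io.TextIOWrapper(io.BytesIO(), encoding="ascii")

try:
    stream.write(text)
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    print("write failed:", exc)

# Encoding explicitly with a codec that can represent the text succeeds.
encoded = text.encode("utf-8")
print(repr(encoded))
```

The point is the same in both Python versions: the exception comes from the encoding step at the output boundary, not from the string itself.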

The following is my demonstration script for Unidecode. The characters listed in the names variable are the ones I need translated into something readable, not removed, for the types of web pages I am analyzing.

#!/usr/bin/env python2
# -*- mode: python; coding: utf-8 -*-
# The above coding is needed to avoid this error: SyntaxError: Non-ASCII character '\xe2' in file ./unicodedata_normalize_test.py on line 9, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

import os
import re
import unicodedata
from unidecode import unidecode

names = [
    'HYPHEN-MINUS',
    'EM DASH',
    'EN DASH',
    'MINUS SIGN',
    'APOSTROPHE',
    'LEFT SINGLE QUOTATION MARK',
    'RIGHT SINGLE QUOTATION MARK',
    'LATIN SMALL LETTER A WITH ACUTE',
]

for name in names:
    character = unicodedata.lookup(name)
    unidecoded = unidecode(character)
    print
    print 'name      ',name
    print 'character ',character
    print 'unidecoded',unidecoded

Sample output of the above script is:

censored@censored:~$ unidecode_test

name       HYPHEN-MINUS
character  -
unidecoded -

name       EM DASH
character  —
unidecoded --

name       EN DASH
character  –
unidecoded -

name       MINUS SIGN
character  −
unidecoded -

name       APOSTROPHE
character  '
unidecoded '

name       LEFT SINGLE QUOTATION MARK
character  ‘
unidecoded '

name       RIGHT SINGLE QUOTATION MARK
character  ’
unidecoded '

name       LATIN SMALL LETTER A WITH ACUTE
character  á
unidecoded a
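For comparison, here is a sketch of a standard-library-only approach that is often suggested: decompose with unicodedata.normalize and drop whatever cannot be encoded as ASCII. The function name strip_to_ascii is hypothetical. It is not what Unidecode does; it is shown only to illustrate why it does not meet the "readable, not removed" requirement.

```python
import unicodedata


def strip_to_ascii(text):
    """Decompose characters (NFKD), then silently drop anything non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


# An accented letter decomposes into a base letter plus a combining mark,
# so the base letter survives.
accented = strip_to_ascii(u"\u00e1")

# The em dash and the right single quotation mark have no decomposition,
# so they are removed rather than replaced.
punctuated = strip_to_ascii(u"a\u2014b\u2019s")

print(repr(accented))
print(repr(punctuated))
```

Accented letters survive as their base letters, but dashes and curly quotes vanish entirely, whereas the Unidecode output above replaces them with ASCII look-alikes.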

The following more elaborate script loads several web pages with many unicode characters. See the comments in the script below:

#!/usr/bin/env python2
# -*- mode: python; coding: utf-8 -*-

import os
import re
import subprocess
import requests
from unidecode import unidecode

urls = [
    'https://system76.com/laptops/kudu',
    'https://stackoverflow.com/a/38249916/257924',
    'https://www.peterbe.com/plog/unicode-to-ascii',
    'https://stackoverflow.com/questions/227459/ascii-value-of-a-character-in-python?rq=1#comment35813354_227472',
    # Uncomment the following to show that this script works without throwing exceptions, but at the expense of a huge amount of diff output:
    ###'https://en.wikipedia.org/wiki/List_of_Unicode_characters',
]

# The following variable settings represent what just works without throwing exceptions.
# Setting re_encode to False and not_encode to True results in the write function throwing an exception of
#
#    Traceback (most recent call last):
#      File "./simple_wget.py", line 52, in <module>
#        file_fp.write(data[ext])
#    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 33511: ordinal not in range(128)
#
# This is the crux of my confusion and is explained by https://pythonhosted.org/kitchen/unicode-frustrations.html#frustration-3-inconsistent-treatment-of-output
# So this is why we set re_encode to True and not_encode to False below:
force_utf_8 = False
re_encode = True
not_encode = False
do_unidecode = True

for url in urls:
    #
    # Load the text from request as a true unicode string:
    #
    r = requests.get(url)
    print "\n\n\n"
    print "url:",url
    print "current encoding:",r.encoding

    data = {}

    if force_utf_8:
        # The next two lines do not work. They cause the write to fail:
        r.encoding = "UTF-8"
        data['old'] = r.text  # ok, data is a true unicode string

    if re_encode:
        data['old'] = r.text.encode(r.encoding)

    if not_encode:
        data['old'] = r.text

    if do_unidecode:
        # translate offending characters in unicode:
        data['new'] = unidecode(r.text)

    html_base = re.sub(r'[^a-zA-Z0-9_-]+', '__', url)
    diff_cmd = "diff "
    for ext in [ 'old', 'new' ]:
        if ext in data:
            print "ext:",ext
            html_file = "{}.{}.html".format(html_base, ext)
            with open(html_file, 'w') as file_fp:
                file_fp.write(data[ext])
                print "Wrote",html_file
            diff_cmd = diff_cmd + " " + html_file

    if 'old' in data and 'new' in data:
        print 'Executing:',diff_cmd
        subprocess.call(diff_cmd, shell=True)

The gist at https://gist.github.com/bgoodr/1f085ef942fb71ba6af2cd7268f480f7 shows the output of the above script: it runs the Linux diff command on the "old" and "new" HTML files so that the translations are visible. Text in languages such as German will be mistranslated, but that is fine for my purpose, which is a lossy translation of the various single-quote, double-quote, and dash-like characters.
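The script above avoids the write error by encoding the text before handing it to write(). An alternative sketch, which is my own suggestion rather than part of the original script, is to open the file with an explicit encoding so that the file object does the conversion. io.open accepts an encoding argument on Python 2.7 and is the same function as the built-in open on Python 3. The helper name write_text, the sample text, and the temporary file path are hypothetical.

```python
import io
import os
import tempfile


def write_text(path, text, encoding="utf-8"):
    """Write a unicode string to path, encoding it explicitly on the way out."""
    with io.open(path, "w", encoding=encoding) as fp:
        fp.write(text)


# Hypothetical page content containing non-ASCII characters.
page = u"r\u00e9sum\u00e9 \u2013 17.3\u201d"

path = os.path.join(tempfile.mkdtemp(), "page.old.html")
write_text(path, page)

# Read the raw bytes back to confirm the text round-trips through UTF-8.
with io.open(path, "rb") as fp:
    raw = fp.read()

print(raw.decode("utf-8") == page)
```

With this approach the "old" file keeps the original characters as UTF-8 bytes, and the "new" file, being pure ASCII after unidecode, can be written the same way.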
