Python library to translate multi-byte characters into 7-bit ASCII in Python


Question

Is there a Python library that provides translation of multi-byte non-ASCII characters into some reasonable form of 7-bit displayable ASCII? This is intended to avoid hard-coding the charmap as given in the answer to Translating multi-byte characters into 7-bit ASCII in Python.

I am currently using Python 2.7.11 or greater and not yet Python 3, but answers giving Python 3 solutions will be considered and found helpful.

The reason is this: As I do the translation manually, I will miss some:

My script is:

#!/usr/bin/env python
# -*- mode: python; -*-

import os
import re
import requests

url = "https://system76.com/laptops/kudu"

#
# Load the text from request as a true unicode string:
#
r = requests.get(url)
r.encoding = "UTF-8"
data = r.text  # ok, data is a true unicode string

# translate offending characters in unicode:

charmap = {
    0x2014: u'-',   # em dash
    0x201D: u'"',   # right double quotation mark
    # etc.
}
data = data.translate(charmap)
tdata = data.encode('ascii')

The error I get is:

./simple_wget
Traceback (most recent call last):
  File "./simple_wget.py", line 25, in <module>
    tdata = data.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 10166: ordinal not in range(128)

This will be a never-ending battle to update the charmap for newly discovered characters. Is there a Python library that provides this charmap so I don't have to hardcode it in this manner?
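For reference, the `unicode.translate` call in the script above expects a mapping from code points (ints) to replacement strings. A minimal sketch of that approach, shown in Python 3 syntax (the code points and replacements here are illustrative, not exhaustive — which is exactly the maintenance problem being asked about):

```python
# -*- coding: utf-8 -*-
# A hand-maintained charmap: keys are Unicode code points (ints),
# values are ASCII replacement strings. The same mapping works with
# unicode.translate in Python 2 and str.translate in Python 3.
charmap = {
    0x2013: u'-',    # EN DASH
    0x2014: u'--',   # EM DASH
    0x2018: u"'",    # LEFT SINGLE QUOTATION MARK
    0x2019: u"'",    # RIGHT SINGLE QUOTATION MARK
    0x201C: u'"',    # LEFT DOUBLE QUOTATION MARK
    0x201D: u'"',    # RIGHT DOUBLE QUOTATION MARK
}

text = u'\u201cHello\u201d \u2013 it\u2019s fine'
ascii_text = text.translate(charmap).encode('ascii')
print(ascii_text)  # in Python 3 this prints b'"Hello" - it\'s fine'
```

Any character not in the mapping passes through unchanged, which is why the later `encode('ascii')` still raises on unmapped characters.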

Answer

(Note: This answer pertains to Python 2.7.11+.)

The answer at https://stackoverflow.com/a/1701378/257924 refers to the Unidecode package and is what I was looking for. In using that package, I also discovered the ultimate source of my confusion which is elaborated in-depth at https://pythonhosted.org/kitchen/unicode-frustrations.html#frustration-3-inconsistent-treatment-of-output and specifically this section:

Frustration #3: Inconsistent treatment of output

Alright, since the python community is moving to using unicode strings everywhere, we might as well convert everything to unicode strings and use that by default, right? Sounds good most of the time but there’s at least one huge caveat to be aware of. Anytime you output text to the terminal or to a file, the text has to be converted into a byte str. Python will try to implicitly convert from unicode to byte str... but it will throw an exception if the bytes are non-ASCII:
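The failure mode described in that quote can be reproduced, and avoided, by making the conversion explicit and choosing an error policy rather than relying on the implicit coercion. A small sketch in Python 3 syntax (the sample text is made up):

```python
text = u'caf\u00e9 \u2014 naive'

# Explicit conversion with a chosen error policy, instead of relying on
# the implicit unicode -> byte str coercion described above:
replaced = text.encode('ascii', 'replace')  # each non-ASCII char becomes '?'
ignored = text.encode('ascii', 'ignore')    # non-ASCII chars are dropped

print(replaced)
print(ignored)

# The strict (default) policy is what raises UnicodeEncodeError:
try:
    text.encode('ascii')
except UnicodeEncodeError as exc:
    print('strict policy raises:', exc.reason)
```

Both `'replace'` and `'ignore'` are lossy; they silence the exception but discard information, which is why a real transliteration (the charmap, or Unidecode below) is preferable for readable output.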

The following is my demonstration script using it. The characters listed in the names variable are ones I do need translated into something readable, not removed, for the types of web pages I am analyzing.

#!/usr/bin/env python
# -*- mode: python; coding: utf-8 -*-
# The above coding declaration is needed to avoid this error: SyntaxError: Non-ASCII character '\xe2' in file ./unicodedata_normalize_test.py on line 9, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

import os
import re
import unicodedata
from unidecode import unidecode

names = [
    'HYPHEN-MINUS',
    'EM DASH',
    'EN DASH',
    'MINUS SIGN',
    'APOSTROPHE',
    'LEFT SINGLE QUOTATION MARK',
    'RIGHT SINGLE QUOTATION MARK',
    'LATIN SMALL LETTER A WITH ACUTE',
]

for name in names:
    character = unicodedata.lookup(name)
    unidecoded = unidecode(character)
    print
    print 'name      ',name
    print 'character ',character
    print 'unidecoded',unidecoded

Sample output of the above script is:

censored@censored:~$ unidecode_test

name       HYPHEN-MINUS
character  -
unidecoded -

name       EM DASH
character  —
unidecoded --

name       EN DASH
character  –
unidecoded -

name       MINUS SIGN
character  −
unidecoded -

name       APOSTROPHE
character  '
unidecoded '

name       LEFT SINGLE QUOTATION MARK
character  ‘
unidecoded '

name       RIGHT SINGLE QUOTATION MARK
character  ’
unidecoded '

name       LATIN SMALL LETTER A WITH ACUTE
character  á
unidecoded a
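For cases where installing Unidecode is not an option, the standard library's `unicodedata.normalize` gets part of the way there: NFKD decomposition splits an accented letter into a base letter plus a combining mark, which can then be dropped. This only covers decomposable characters — dashes and curly quotes have no ASCII decomposition — so this is a partial, stdlib-only sketch, not a replacement for Unidecode (shown in Python 3 syntax; the helper name is my own):

```python
import unicodedata

def to_ascii_lossy(text):
    # NFKD turns e.g. LATIN SMALL LETTER A WITH ACUTE into 'a' plus
    # COMBINING ACUTE ACCENT; encoding with errors='ignore' then drops
    # the combining mark (and any other remaining non-ASCII character).
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(to_ascii_lossy(u'caf\u00e9'))      # accents are handled: 'cafe'
print(to_ascii_lossy(u'a \u2014 dash'))  # but the em dash is silently
                                         # dropped, not turned into '--'
                                         # as Unidecode does
```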

The following more elaborate script loads several web pages with many unicode characters. See the comments in the script below:

#!/usr/bin/env python
# -*- mode: python; coding: utf-8 -*-

import os
import re
import subprocess
import requests
from unidecode import unidecode

urls = [
    'https://system76.com/laptops/kudu',
    'https://stackoverflow.com/a/38249916/257924',
    'https://www.peterbe.com/plog/unicode-to-ascii',
    'https://stackoverflow.com/questions/227459/ascii-value-of-a-character-in-python?rq=1#comment35813354_227472',
    # Uncomment the following to show that this script works without throwing exceptions, but at the expense of a huge amount of diff output:
    ###'https://en.wikipedia.org/wiki/List_of_Unicode_characters',
]

# The following variable settings represent what just works without throwing exceptions.
# Setting re_encode to False and not_encode to True results in the write function throwing an exception of
#
#    Traceback (most recent call last):
#      File "./simple_wget.py", line 52, in <module>
#        file_fp.write(data[ext])
#    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 33511: ordinal not in range(128)
#
# This is the crux of my confusion and is explained by https://pythonhosted.org/kitchen/unicode-frustrations.html#frustration-3-inconsistent-treatment-of-output
# So this is why we set re_encode to True and not_encode to False below:
force_utf_8 = False
re_encode = True
not_encode = False
do_unidecode = True

for url in urls:
    #
    # Load the text from request as a true unicode string:
    #
    r = requests.get(url)
    print "\n\n\n"
    print "url:",url
    print "current encoding:",r.encoding

    data = {}

    if force_utf_8:
        # The next two lines do not work. They cause the write to fail:
        r.encoding = "UTF-8"
        data['old'] = r.text  # ok, data is a true unicode string

    if re_encode:
        data['old'] = r.text.encode(r.encoding)

    if not_encode:
        data['old'] = r.text

    if do_unidecode:
        # translate offending characters in unicode:
        data['new'] = unidecode(r.text)

    html_base = re.sub(r'[^a-zA-Z0-9_-]+', '__', url)
    diff_cmd = "diff "
    for ext in [ 'old', 'new' ]:
        if ext in data:
            print "ext:",ext
            html_file = "{}.{}.html".format(html_base, ext)
            with open(html_file, 'w') as file_fp:
                file_fp.write(data[ext])
                print "Wrote",html_file
            diff_cmd = diff_cmd + " " + html_file

    if 'old' in data and 'new' in data:
        print 'Executing:',diff_cmd
        subprocess.call(diff_cmd, shell=True)

A gist shows the output of the above script: the execution of the Linux diff command on the "old" and "new" html files, so as to see the translations. There will be mistranslation of languages like German, but that is fine for my purposes of getting some lossy translation of single- and double-quote types of characters and dash-like characters.
