Translating multi-byte characters into 7-bit ASCII in Python

Problem description

I'm downloading and parsing a web page via a Python script. I need it to be encoded into 7-bit ASCII for further processing. I am using the requests library (http://docs.python-requests.org/en/master/) in a virtualenv based upon whatever Ubuntu 16.04 LTS has.

I would like the requests package, or some package, to handle the translation into ASCII, without requiring me to do further translation of encoded characters, because I know I am going to miss some characters. Details are as follows:

My current Python script, shown below, uses an encoding of ISO-8859-1 in an attempt to force the result data to be converted to 7-bit ASCII, with some partial success. But, I have set the result encoding and also encode the text when it comes out. That seems odd, and in fact, downright wrong. But even if I live with that, I have the main issue which is as follows:

Even after the encoding, I see dashes encoded in what seems to be in some non-ASCII character set. It is as if the dash characters slipped through the requests encoding. The script below hacks around this by searching for and replacing the multi-byte dash encoding with an ASCII dash character. This is not a big deal if it is one multi-byte character, but suspect that there are other characters that will need to be translated in other web pages I wish to process. Do I simply need to use some other encoding other than 'ISO-8859-1' with the requests object?

Here is my script (using Python 2.7.11 on Ubuntu 16.04 LTS on x86_64):

 #!/usr/bin/env python

 import sys
 import os
 import string
 import re
 import requests

 url = "https://system76.com/laptops/kudu"

 r = requests.get(url)

 #
 # Why do I have to BOTH set r.encoding AND call r.text.encode
 # in order to avoid the errors?:
 #
 encoding = 'ISO-8859-1'
 r.encoding = encoding
 data = r.text.encode(encoding)

 #
 # Split the lines out, find the offending line,
 # and translate the multi-byte characters:
 #
 lines = data.splitlines()
 for line in lines:
     m = re.search(r'2.6 up to 3.5 GHz', line)
     if m:
         print "line:      {}".format(line)
         m = re.search(r'\xe2\x80\x93', line)
         # The '-' in the next line is an ASCII dash character:
         fixed_line = re.sub(r'\xe2\x80\x93', '-', line)
         print "fixed_line {}".format(fixed_line)

Invoking simple_wget.py within the virtualenv shows:

theuser@thesystem:~$ simple_wget.py
line:                           <td>2.6 up to 3.5 GHz – 6 MB cache – 4 cores – 8 threads</td>
fixed_line                      <td>2.6 up to 3.5 GHz - 6 MB cache - 4 cores - 8 threads</td>

Passing that output through od -cb shows the octal values ("342 200 223") of the dash characters, which correspond to the r'\xe2\x80\x93' in the script above:

theuser@thesystem:~$ simple_wget.py | od -cb
0000000   l   i   n   e   :                          \t  \t  \t  \t  \t
        154 151 156 145 072 040 040 040 040 040 040 011 011 011 011 011
0000020  \t   <   t   d   >   2   .   6       u   p       t   o       3
        011 074 164 144 076 062 056 066 040 165 160 040 164 157 040 063
0000040   .   5       G   H   z     342 200 223       6       M   B    
        056 065 040 107 110 172 040 342 200 223 040 066 040 115 102 040
0000060   c   a   c   h   e     342 200 223       4       c   o   r   e
        143 141 143 150 145 040 342 200 223 040 064 040 143 157 162 145
0000100   s     342 200 223       8       t   h   r   e   a   d   s   <
        163 040 342 200 223 040 070 040 164 150 162 145 141 144 163 074
0000120   /   t   d   >  \n   f   i   x   e   d   _   l   i   n   e    
        057 164 144 076 012 146 151 170 145 144 137 154 151 156 145 040
0000140  \t  \t  \t  \t  \t  \t   <   t   d   >   2   .   6       u   p
        011 011 011 011 011 011 074 164 144 076 062 056 066 040 165 160
0000160       t   o       3   .   5       G   H   z       -       6    
        040 164 157 040 063 056 065 040 107 110 172 040 055 040 066 040
0000200   M   B       c   a   c   h   e       -       4       c   o   r
        115 102 040 143 141 143 150 145 040 055 040 064 040 143 157 162
0000220   e   s       -       8       t   h   r   e   a   d   s   <   /
        145 163 040 055 040 070 040 164 150 162 145 141 144 163 074 057
0000240   t   d   >  \n
        164 144 076 012
0000244
theuser@thesystem:~$

Things I have tried:

https://stackoverflow.com/a/19645137/257924 implies using an encoding of ascii, but it chokes inside the requests library. Changing the script to be:

#encoding = 'ISO-8859-1'
encoding = 'ascii' # try https://stackoverflow.com/a/19645137/257924
r.encoding = encoding
data = r.text.encode(encoding)

yields:

theuser@thesystem:~$ ./simple_wget
Traceback (most recent call last):
  File "./simple_wget.py", line 18, in <module>
    data = r.text.encode(encoding)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 10166-10168: ordinal not in range(128)

Changing the last line above to

data = r.text.encode(encoding, "ignore")

results in the dashes just being removed, not translated, which is not what I want.

This also does not work at all:

encoding = 'ISO-8859-1'
r.encoding = encoding
data = r.text.encode(encoding)

charmap = {
    0x2014: u'-',   # em dash
    0x201D: u'"',   # comma quotation mark, double
    # etc.
}
data = data.translate(charmap)

because it gives this error:

Traceback (most recent call last):
  File "./simple_wget.py", line 30, in <module>
    data = tmp2.translate(charmap)
TypeError: expected a string or other character buffer object

which is, as far as I can understand from https://stackoverflow.com/a/10385520/257924, due to "data" not being a unicode string. A 256-character translation table is not going to do what I need anyhow. And besides that is overkill: something inside Python should translate these multi-byte characters without requiring hack code at my script level.

By the way, I'm not interested in multi-lingual page translation. All pages translated are expected to be in US or British English.

Recommended answer

Python has everything you need to cleanly process non-ASCII characters, provided you declare the proper encoding. Your input is UTF-8 encoded, not ISO-8859-1: '\xe2\x80\x93' is the UTF-8 encoding of the EN DASH character, Unicode U+2013.
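One way to check this claim is to decode the three bytes from the od output with both codecs. The snippet below is a minimal sketch in Python 3 syntax (the question uses Python 2.7, where the same decode and encode calls exist); the variable names are illustrative only.

```python
# The three bytes that od reported as octal 342 200 223.
raw = b"\xe2\x80\x93"

# Decoded as UTF-8 they form a single character, U+2013 EN DASH.
as_utf8 = raw.decode("utf-8")
print(len(as_utf8), hex(ord(as_utf8)))        # 1 0x2013

# Decoded as ISO-8859-1 every byte becomes its own character,
# and encoding back with the same codec returns the original bytes.
as_latin1 = raw.decode("iso-8859-1")
print(len(as_latin1))                          # 3
print(as_latin1.encode("iso-8859-1") == raw)   # True
```

This would explain why setting r.encoding to ISO-8859-1 and then encoding with the same codec seemed to do nothing to the dashes: the round trip hands back the page's UTF-8 bytes unchanged.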

So you should:

  • load the text from request as a true unicode string:

url = "https://system76.com/laptops/kudu"

r = requests.get(url)
r.encoding = "UTF-8"
data = r.text  # ok, data is a true unicode string

  • translate the offending characters in the unicode string:

    charmap = {
        0x2013: u'-',   # en dash (the character found in this page)
        0x2014: u'-',   # em dash
        0x201D: u'"',   # comma quotation mark, double
        # etc.
    }
    data = data.translate(charmap)
    

    It will work now, because the translate map is different for byte and unicode strings. For byte strings, the translation table must be a string of length 256, whereas for unicode strings it must be a mapping of Unicode ordinals to Unicode ordinals, Unicode strings or None (ref: Python Standard Library Reference Manual).

    then you can safely encode data to an ascii byte string:

    tdata = data.encode('ascii')
    

    The above command will throw an exception if any untranslated non-ASCII characters remain in the data unicode string. You can treat that as a check that everything has been successfully converted.
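
The steps above can be combined into one helper. The following is a minimal sketch in Python 3 syntax rather than the Python 2.7 of the question. The helper name to_ascii, the contents of CHARMAP, and the sample string are assumptions made for illustration; real pages will probably need more entries in the map. Instead of letting encode raise on the first problem, the helper reports every character that is still outside ASCII, so the map can be extended in one pass.

```python
# Hypothetical mapping; extend it with whatever the pages contain.
CHARMAP = {
    0x2013: "-",    # en dash
    0x2014: "-",    # em dash
    0x2018: "'",    # left single quotation mark
    0x2019: "'",    # right single quotation mark
    0x201C: '"',    # left double quotation mark
    0x201D: '"',    # right double quotation mark
    0x00A0: " ",    # no-break space
}


def to_ascii(text, charmap=CHARMAP):
    """Translate known characters, then encode strictly to ASCII.

    Raises ValueError listing every character that has no mapping.
    """
    translated = text.translate(charmap)
    # Collect each distinct character that is still outside 7-bit ASCII.
    leftovers = sorted({ch for ch in translated if ord(ch) > 127})
    if leftovers:
        names = ", ".join("U+%04X" % ord(ch) for ch in leftovers)
        raise ValueError("no ASCII mapping for: " + names)
    return translated.encode("ascii")


sample = "2.6 up to 3.5 GHz \u2013 6 MB cache \u2013 4 cores"
print(to_ascii(sample))   # b'2.6 up to 3.5 GHz - 6 MB cache - 4 cores'
```

Under Python 2.7 the same approach should apply to unicode strings, provided the values in the map are written as u'...' literals.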
