Translating multi-byte characters into 7-bit ASCII in Python


Question

I'm downloading and parsing a web page via a Python script. I need it to be encoded into 7-bit ASCII for further processing. I am using the requests library (http://docs.python-requests.org/en/master/) in a virtualenv based upon whatever Ubuntu 16.04 LTS has.

I would like the requests package, or some package, to handle the translation into ASCII, without requiring me to do further translation of encoded characters, because I know I am going to miss some characters. Details are as follows:

My current Python script, shown below, uses an encoding of ISO-8859-1 in an attempt to force the result data to be converted to 7-bit ASCII, with some partial success. But, I have set the result encoding and also encode the text when it comes out. That seems odd, and in fact, downright wrong. But even if I live with that, I have the main issue which is as follows:

Even after the encoding, I see dashes encoded in what seems to be some non-ASCII character set. It is as if the dash characters slipped through the requests encoding. The script below hacks around this by searching for and replacing the multi-byte dash encoding with an ASCII dash character. This is not a big deal if it is one multi-byte character, but I suspect that there are other characters that will need to be translated in other web pages I wish to process. Do I simply need to use an encoding other than 'ISO-8859-1' with the requests object?

Here is my script (using Python 2.7.11 on Ubuntu 16.04 LTS on x86_64):

 #!/usr/bin/env python

 import sys
 import os
 import string
 import re
 import requests

 url = "https://system76.com/laptops/kudu"

 r = requests.get(url)

 #
 # Why do I have to BOTH set r.encoding AND call r.text.encode
 # in order to avoid the errors?:
 #
 encoding = 'ISO-8859-1'
 r.encoding = encoding
 data = r.text.encode(encoding)

 #
 # Split the lines out, find the offending line,
 # and translate the multi-byte characters:
 #
 lines = data.splitlines()
 for line in lines:
     m = re.search(r'2.6 up to 3.5 GHz', line)
     if m:
         print "line:      {}".format(line)
         m = re.search(r'\xe2\x80\x93', line)
         # The '-' in the next line is an ASCII dash character:
         fixed_line = re.sub(r'\xe2\x80\x93', '-', line)
         print "fixed_line {}".format(fixed_line)

Invoking simple_wget.py within the virtualenv shows:

theuser@thesystem:~$ simple_wget.py
line:                           <td>2.6 up to 3.5 GHz – 6 MB cache – 4 cores – 8 threads</td>
fixed_line                      <td>2.6 up to 3.5 GHz - 6 MB cache - 4 cores - 8 threads</td>

Passing that output through od -cb to see the octal values ("342 200 223") of the dash characters corresponding to the r'\xe2\x80\x93' in the script above:

theuser@thesystem:~$ simple_wget.py | od -cb
0000000   l   i   n   e   :                          \t  \t  \t  \t  \t
        154 151 156 145 072 040 040 040 040 040 040 011 011 011 011 011
0000020  \t   <   t   d   >   2   .   6       u   p       t   o       3
        011 074 164 144 076 062 056 066 040 165 160 040 164 157 040 063
0000040   .   5       G   H   z     342 200 223       6       M   B    
        056 065 040 107 110 172 040 342 200 223 040 066 040 115 102 040
0000060   c   a   c   h   e     342 200 223       4       c   o   r   e
        143 141 143 150 145 040 342 200 223 040 064 040 143 157 162 145
0000100   s     342 200 223       8       t   h   r   e   a   d   s   <
        163 040 342 200 223 040 070 040 164 150 162 145 141 144 163 074
0000120   /   t   d   >  \n   f   i   x   e   d   _   l   i   n   e    
        057 164 144 076 012 146 151 170 145 144 137 154 151 156 145 040
0000140  \t  \t  \t  \t  \t  \t   <   t   d   >   2   .   6       u   p
        011 011 011 011 011 011 074 164 144 076 062 056 066 040 165 160
0000160       t   o       3   .   5       G   H   z       -       6    
        040 164 157 040 063 056 065 040 107 110 172 040 055 040 066 040
0000200   M   B       c   a   c   h   e       -       4       c   o   r
        115 102 040 143 141 143 150 145 040 055 040 064 040 143 157 162
0000220   e   s       -       8       t   h   r   e   a   d   s   <   /
        145 163 040 055 040 070 040 164 150 162 145 141 144 163 074 057
0000240   t   d   >  \n
        164 144 076 012
0000244
theuser@thesystem:~$
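Those octal triples can be checked directly. A minimal Python 3 sketch (the byte values are taken from the od dump above):

```python
# The en dash's UTF-8 bytes, printed in octal the way od -cb shows them.
dash = b'\xe2\x80\x93'
print([oct(b) for b in dash])  # ['0o342', '0o200', '0o223']
```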

Things I've tried:

https://stackoverflow.com/a/19645137/257924 implies using an encoding of ascii, but it chokes inside the requests library. Changing the script to be:

#encoding = 'ISO-8859-1'
encoding = 'ascii' # try https://stackoverflow.com/a/19645137/257924
r.encoding = encoding
data = r.text.encode(encoding)

yields:

theuser@thesystem:~$ ./simple_wget
Traceback (most recent call last):
  File "./simple_wget.py", line 18, in <module>
    data = r.text.encode(encoding)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 10166-10168: ordinal not in range(128)

Changing the last line above to be

data = r.text.encode(encoding, "ignore")

results in the dashes just being removed, not translated, which is not what I want.
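That behavior of the 'ignore' error handler can be seen in isolation. A Python 3 sketch (where str is already unicode, so no r.encoding step is needed; under Python 2 the same u'' literals behave alike):

```python
text = u'3.5 GHz \u2013 6 MB'              # \u2013 is the en dash
print(text.encode('ascii', 'ignore'))      # dash dropped: b'3.5 GHz  6 MB'
print(text.encode('ascii', 'replace'))     # dash becomes b'?', still not b'-'
```

Neither handler produces the ASCII dash; only an explicit translation step can.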

And this also does not work at all:

encoding = 'ISO-8859-1'
r.encoding = encoding
data = r.text.encode(encoding)

charmap = {
    0x2014: u'-',   # em dash
    0x201D: u'"',   # right double quotation mark
    # etc.
}
data = data.translate(charmap)

because it gives this error:

Traceback (most recent call last):
  File "./simple_wget.py", line 30, in <module>
    data = tmp2.translate(charmap)
TypeError: expected a string or other character buffer object

which is, as far as I can understand from https://stackoverflow.com/a/10385520/257924, due to "data" not being a unicode string. A 256-character translation table is not going to do what I need anyhow. And besides, that is overkill: something inside Python should translate these multi-byte characters without requiring hack code at my script level.
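The two translate signatures can be compared side by side. A Python 3 sketch (bytes.maketrans builds the 256-byte table the byte-string form requires):

```python
# bytes.translate needs a full 256-byte table...
table = bytes.maketrans(b'-', b'_')
print(b'a-b'.translate(table))            # b'a_b'

# ...whereas str.translate takes a dict of ordinal -> replacement,
# which is why passing a dict to a byte string raises TypeError.
charmap = {0x2013: u'-'}
print(u'a\u2013b'.translate(charmap))     # 'a-b'
```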

By the way, I'm not interested in multi-lingual page translation. All pages translated are expected to be in US or British English.

Solution

Python has everything you need to cleanly process non-ASCII characters... provided you declare the proper encoding. Your input is UTF-8 encoded, not ISO-8859-1, because r'\xe2\x80\x93' is the UTF-8 encoding of the EN DASH character, Unicode U+2013.
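That diagnosis is easy to verify on the three bytes themselves; a minimal Python 3 check:

```python
import unicodedata

ch = b'\xe2\x80\x93'.decode('utf-8')          # decode the bytes as UTF-8
print(hex(ord(ch)), unicodedata.name(ch))     # 0x2013 EN DASH

# Decoded as ISO-8859-1 instead, the same bytes become three separate
# characters, which is why the dash "slips through" the requests encoding.
print(len(b'\xe2\x80\x93'.decode('iso-8859-1')))  # 3
```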

So you should:

  • load the text from the request as a true unicode string:

    url = "https://system76.com/laptops/kudu"
    
    r = requests.get(url)
    r.encoding = "UTF-8"
    data = r.text  # ok, data is a true unicode string
    

  • translate offending characters in unicode:

    charmap = {
        0x2013: u'-',   # en dash (the character seen in this page)
        0x2014: u'-',   # em dash
        0x201D: u'"',   # right double quotation mark
        # etc.
    }
    data = data.translate(charmap)
    

    It will work now, because the translate map is different for byte and unicode strings. For byte strings, the translation table must be a string of length 256, whereas for unicode strings it must be a mapping of Unicode ordinals to Unicode ordinals, Unicode strings or None (ref: Python Standard Library Reference Manual).

  • then you can safely encode data to an ascii byte string:

    tdata = data.encode('ascii')
    

    The above command will throw an exception if any untranslated non-ASCII characters remain in the data unicode string. You can use that as a check that everything has been successfully converted.
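The three steps above can be sketched together on a sample string, with no network needed. In the real script the data would come from r.text after setting r.encoding to "UTF-8"; the sample text and extra charmap entries here are illustrative:

```python
# Sketch of the whole pipeline on a sample string; `data` stands in
# for r.text, which requests decodes using r.encoding = "UTF-8".
data = u'2.6 up to 3.5 GHz \u2013 6 MB \u2014 \u201cfast\u201d'

charmap = {
    0x2013: u'-',   # en dash (the character seen in this page)
    0x2014: u'-',   # em dash
    0x201C: u'"',   # left double quotation mark
    0x201D: u'"',   # right double quotation mark
}
data = data.translate(charmap)
tdata = data.encode('ascii')   # raises UnicodeEncodeError if anything is left
print(tdata)                   # b'2.6 up to 3.5 GHz - 6 MB - "fast"'
```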
