Python HTMLParser: UnicodeDecodeError

Views: 189 Published: 2016/11/19 15:02:16 Tags: python character-encoding html-parsing

Question

I'm using HTMLParser to parse pages I pull down with urllib, and I'm getting UnicodeDecodeError exceptions when I pass some of them to HTMLParser.

I tried using chardet to detect the encoding and convert the page to ASCII or UTF-8 (the docs don't seem to say which it should be). Lossiness is acceptable, but while the decode/encode lines work just fine, I always get the error after self.feed().

The information is there if I just print it out.

from HTMLParser import HTMLParser
import urllib
import chardet

class search_youtube(HTMLParser):

    def __init__(self, search_terms):
        HTMLParser.__init__(self)
        self.track_ids = []
        for search in search_terms:
            self.__in_result = False
            search = urllib.quote_plus(search)
            query = 'http://youtube.com/results?search_query='
            page = urllib.urlopen(query + search).read()
            try:
                self.feed(page)
            except UnicodeDecodeError:
                encoding = chardet.detect(page)['encoding']
                if encoding != 'unicode':
                    page = page.decode(encoding)
                    page = page.encode('ascii', 'ignore')
                self.feed(page)
                print 'success'

searches = ['telepopmusik breathe']
results = search_youtube(searches)
print results.track_ids

Here is the output:

Traceback (most recent call last):
  File "test.py", line 27, in <module>
    results = search_youtube(searches)
  File "test.py", line 23, in __init__
    self.feed(page)
  File "/usr/lib/python2.6/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/usr/lib/python2.6/HTMLParser.py", line 148, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.6/HTMLParser.py", line 252, in parse_starttag
    attrvalue = self.unescape(attrvalue)
  File "/usr/lib/python2.6/HTMLParser.py", line 390, in unescape
    return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
  File "/usr/lib/python2.6/re.py", line 151, in sub
    return _compile(pattern, 0).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)

Recommended answer

It is UTF-8, indeed. This works:

from HTMLParser import HTMLParser
import urllib

class search_youtube(HTMLParser):

    def __init__(self, search_terms):
        HTMLParser.__init__(self)
        self.track_ids = []
        for search in search_terms:
            self.__in_result = False
            search = urllib.quote_plus(search)
            query = 'http://youtube.com/results?search_query='
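            connection = urllib.urlopen(query + search)
            # Read the charset from the Content-Type header; fall back to
            # UTF-8 if the server does not declare one.
            encoding = connection.headers.getparam('charset') or 'utf-8'
            # Decode to unicode before feeding the parser.
            page = connection.read().decode(encoding)
            self.feed(page)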
            print 'success'

searches = ['telepopmusik breathe']
results = search_youtube(searches)
print results.track_ids
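
For context on the error message: 0xc3 is the lead byte of many two-byte UTF-8 sequences (for example, "é" is 0xC3 0xA9), so the traceback is consistent with UTF-8 bytes being pushed through the ascii codec. In Python 2 that most likely happens implicitly, when HTMLParser's unescape() combines unicode entity replacements with the undecoded byte string. Here is a minimal sketch of the same failure, written for Python 3 (an assumption, since the code above is Python 2), where the decode has to be spelled out:

```python
# Assumption: Python 3. "é" encoded as UTF-8 is the two bytes 0xC3 0xA9.
raw = "é".encode("utf-8")

try:
    # Decoding UTF-8 bytes with the ascii codec fails on the first
    # byte above 0x7F, which here is 0xC3.
    raw.decode("ascii")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

# Decoding with the codec the bytes were actually written in succeeds.
text = raw.decode("utf-8")

print(error_message)
print(text)
```

Decoding the page once, up front, with the right codec means the parser only ever sees text, so no implicit ascii decode is attempted later.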

You don't need chardet. YouTube are not morons; they actually send the correct encoding in the Content-Type header.
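
If you are on Python 3, the same idea might look roughly like the sketch below. It is not the code from the answer: `charset_from_content_type`, `LinkCollector`, `parse_links`, the sample header, and the sample HTML are all hypothetical names and data made up for illustration, and the UTF-8 fallback is an assumption for servers that declare no charset. It runs offline on a hard-coded body so it can be tried without hitting YouTube.

```python
from email.message import Message
from html.parser import HTMLParser


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset parameter of a Content-Type value, or `default`."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns None when no charset is declared.
    return msg.get_content_charset() or default


class LinkCollector(HTMLParser):
    """Hypothetical parser that records the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def parse_links(body, content_type):
    """Decode raw response bytes using the header charset, then parse."""
    encoding = charset_from_content_type(content_type)
    text = body.decode(encoding, errors="replace")
    parser = LinkCollector()
    parser.feed(text)
    parser.close()
    return parser.hrefs


# Made-up response data standing in for a real HTTP response.
sample_body = (
    '<a href="/watch?v=abc&amp;t=1" title="Télépopmusik">x</a>'
).encode("utf-8")
sample_links = parse_links(sample_body, "text/html; charset=UTF-8")
print(sample_links)
```

With a live response from `urllib.request.urlopen(url)`, the header object already provides `get_content_charset()`, so `response.headers.get_content_charset()` can replace the helper.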
