malformed start tag error - Python, BeautifulSoup, and Sipie - Ubuntu 10.04

Views: 337 | Posted: 2016/8/5 19:03:31 | Tags: python, beautifulsoup

Question

I just installed python, mplayer, beautifulsoup and sipie to run Sirius on my Ubuntu 10.04 machine. I followed some docs that seem straightforward, but am encountering some issues. I'm not that familiar with Python, so this may be out of my league.

I was able to get everything installed, but then running sipie gives this:

/usr/bin/Sipie/Sipie/Config.py:12: DeprecationWarning: the md5 module is deprecated; use hashlib instead
  import md5
Traceback (most recent call last):
  File "/usr/bin/Sipie/sipie.py", line 22, in <module>
    Sipie.cliPlayer()
  File "/usr/bin/Sipie/Sipie/cliPlayer.py", line 74, in cliPlayer
    completer = Completer(sipie.getStreams())
  File "/usr/bin/Sipie/Sipie/Factory.py", line 374, in getStreams
    streams = self.tryGetStreams()
  File "/usr/bin/Sipie/Sipie/Factory.py", line 298, in tryGetStreams
    soup = BeautifulSoup(data)
  File "/usr/local/lib/python2.6/dist-packages/BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.py", line 1499, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "/usr/local/lib/python2.6/dist-packages/BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.py", line 1230, in __init__
    self._feed(isHTML=isHTML)
  File "/usr/local/lib/python2.6/dist-packages/BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.py", line 1263, in _feed
    self.builder.feed(markup)
  File "/usr/lib/python2.6/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/usr/lib/python2.6/HTMLParser.py", line 148, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.6/HTMLParser.py", line 226, in parse_starttag
    endpos = self.check_for_whole_start_tag(i)
  File "/usr/lib/python2.6/HTMLParser.py", line 301, in check_for_whole_start_tag
    self.error("malformed start tag")
  File "/usr/lib/python2.6/HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: malformed start tag, at line 100, column 3

I looked through these files and the line numbers, but since I am unfamiliar with Python, it doesn't make much sense. Any advice on what to do next?

Recommended answer

The issues you are encountering are pretty common, and they stem specifically from malformed HTML. In my case, there was an HTML element which had double-quoted an attribute's value. I actually ran into this issue today, and came across your post while looking into it. I was FINALLY able to resolve it by parsing the HTML with html5lib before handing it off to BeautifulSoup 4.
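If you want to see which tag the parser is choking on, it can help to print the markup around the position given in the error (line 100, column 3 in the traceback above). Here is a rough sketch of that idea. The helper name `show_context` and the sample markup are made up for illustration and are not part of Sipie or BeautifulSoup. It is written in Python 3 syntax, and it assumes the line number is 1-based and the column is a 0-based offset, which is how `HTMLParser.getpos()` reports positions.

```python
def show_context(markup, line, column, radius=1):
    """Return the lines around (line, column) with a caret under the column.

    `line` is 1-based; `column` is treated as a 0-based offset (assumption).
    """
    lines = markup.splitlines()
    first = max(line - 1 - radius, 0)
    last = min(line + radius, len(lines))
    out = []
    for number in range(first, last):
        # Seven-character prefix: four digits for the line number, then " | ".
        out.append("%4d | %s" % (number + 1, lines[number]))
        if number == line - 1:
            # Put a caret under the reported column of the offending line.
            out.append(" " * (7 + column) + "^")
    return "\n".join(out)


# Hypothetical markup with a doubled quote around an attribute value.
sample = '<html>\n<body>\n  <div class=""main"">text</div>\n</body>\n</html>'
print(show_context(sample, 3, 2))
```

In Sipie's case, the string to inspect would be whatever `data` holds in `Factory.py` just before the `BeautifulSoup(data)` call shown in the traceback.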

First, you need:

sudo easy_install bs4
sudo apt-get install python-html5lib

Then, run this example code:

from bs4 import BeautifulSoup
import html5lib
from html5lib import sanitizer
from html5lib import treebuilders
import urllib

url = 'http://the-url-to-scrape'
fp = urllib.urlopen(url)

# Create an html5lib parser. Not sure if the sanitizer is required.
parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("beautifulsoup"), tokenizer=sanitizer.HTMLSanitizer)
# Load the source file's HTML into html5lib
html5lib_object = parser.parse(fp)
# In theory we shouldn't need to convert this to a string before passing to BS. Didn't work passing directly to BS for me however.
html_string = str(html5lib_object)

# Load the string into BeautifulSoup for parsing.
soup = BeautifulSoup(html_string)

for content in soup.findAll('div'):
    print content

If you have any questions about this code or need a little more specific guidance, just let me know. :)
