BeautifulSoup 3.1 parser breaks far too easily
Question
I'm having trouble parsing some dodgy HTML with BeautifulSoup. It turns out that the HTMLParser used in newer versions is less tolerant than the SGMLParser used previously.
Does BeautifulSoup have some kind of debug mode? I'm trying to figure out how to stop it borking on some nasty HTML I'm loading from a crabby website:
<HTML>
<HEAD>
<TITLE>Title</TITLE>
<HTTP-EQUIV="PRAGMA" CONTENT="NO-CACHE">
</HEAD>
<BODY>
...
...
</BODY>
</HTML>
BeautifulSoup gives up after the <HTTP-EQUIV...> tag:
In [1]: print BeautifulSoup(c).prettify()
<html>
<head>
<title>
Title
</title>
</head>
</html>
The problem is clearly the HTTP-EQUIV tag, which is really a very malformed <META HTTP-EQUIV="PRAGMA" CONTENT="NO-CACHE"> tag. Evidently, I need to specify this as self-closing, but no matter what I specify I can't fix it:
In [2]: print BeautifulSoup(c,selfClosingTags=['http-equiv',
'http-equiv="pragma"']).prettify()
<html>
<head>
<title>
Title
</title>
</head>
</html>
Is there a verbose debug mode in which BeautifulSoup will tell me what it is doing, so I can figure out what it is treating as the tag name in this case?
Recommended answer
Your problem must be something else; it works fine for me:
In [1]: import BeautifulSoup
In [2]: c = """<HTML>
...: <HEAD>
...: <TITLE>Title</TITLE>
...: <HTTP-EQUIV="PRAGMA" CONTENT="NO-CACHE">
...: </HEAD>
...: <BODY>
...: ...
...: ...
...: </BODY>
...: </HTML>
...: """
In [3]: print BeautifulSoup.BeautifulSoup(c).prettify()
<html>
<head>
<title>
Title
</title>
<http-equiv>
</http-equiv>
</head>
<body>
...
...
</body>
</html>
In [4]:
This is Python 2.5.2 with BeautifulSoup 3.0.7a. Maybe it behaves differently in older or newer versions? This is exactly the kind of soup BeautifulSoup handles so beautifully, so I doubt that has changed at some point... Is there something else about the structure that you haven't mentioned in the question?
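If the parser version does turn out to matter, one way to see what a parser treats as the tag name is to log the start tags it reports, and then to repair the malformed tag before parsing. The following is only a minimal sketch under stated assumptions: it uses Python 3's standard-library html.parser rather than the Python 2 HTMLParser that BeautifulSoup 3.1 builds on, so the tokens it reports may differ from what BeautifulSoup sees. TagLogger, logged_tags and repair_http_equiv are hypothetical names, and the repair assumes the page really meant a META tag, as the question suggests.

```python
import re
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record every start tag the parser reports (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.seen = []  # (tag_name, attrs) pairs in document order

    def handle_starttag(self, tag, attrs):
        self.seen.append((tag, attrs))


def logged_tags(markup):
    """Return the start-tag names the stdlib parser reports for markup."""
    logger = TagLogger()
    logger.feed(markup)
    logger.close()
    return [tag for tag, _ in logger.seen]


def repair_http_equiv(markup):
    """Rewrite '<HTTP-EQUIV=' as '<META HTTP-EQUIV='.

    Assumption: the malformed tag was meant to be a META tag.
    Tags that already start with '<META' are left untouched.
    """
    return re.sub(r"<(HTTP-EQUIV\s*=)", r"<META \1", markup, flags=re.IGNORECASE)


# The markup from the question, unchanged.
SAMPLE = """<HTML>
<HEAD>
<TITLE>Title</TITLE>
<HTTP-EQUIV="PRAGMA" CONTENT="NO-CACHE">
</HEAD>
<BODY>
...
...
</BODY>
</HTML>
"""

if __name__ == "__main__":
    # Tag names as seen in the raw markup, then after the repair.
    print(logged_tags(SAMPLE))
    print(logged_tags(repair_http_equiv(SAMPLE)))
```

Comparing the two printed lists shows which name the parser assigned to the broken tag and confirms that the repaired markup yields an ordinary meta tag. Whether BeautifulSoup 3.1 then parses the rest of the page is not something this sketch verifies.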