BeautifulSoup 3.1 parser breaks far too easily

Question

I'm having trouble parsing some dodgy HTML with BeautifulSoup. It turns out that the HTMLParser used in newer versions is less tolerant than the SGMLParser used previously.

Does BeautifulSoup have some kind of debug mode? I'm trying to figure out how to stop it borking on some nasty HTML I'm loading from a crabby website:

<HTML>
    <HEAD>
        <TITLE>Title</TITLE>
        <HTTP-EQUIV="PRAGMA" CONTENT="NO-CACHE">
    </HEAD>
    <BODY>
        ...
        ...
    </BODY>
</HTML>

BeautifulSoup gives up after the <HTTP-EQUIV...> tag:

In [1]: print BeautifulSoup(c).prettify()
<html>
 <head>
  <title>
   Title
  </title>
 </head>
</html>

The problem is clearly the HTTP-EQUIV tag, which is really a very malformed <META HTTP-EQUIV="PRAGMA" CONTENT="NO-CACHE"> tag. Evidently, I need to specify this as self-closing, but no matter what I specify I can't fix it:

In [2]: print BeautifulSoup(c,selfClosingTags=['http-equiv',
                            'http-equiv="pragma"']).prettify()
<html>
 <head>
  <title>
   Title
  </title>
 </head>
</html>

Is there a verbose debug mode in which BeautifulSoup will tell me what it is doing, so I can figure out what it is treating as the tag name in this case?
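One way to see what a parser takes as the tag name is to log the events it emits. The sketch below is not a BeautifulSoup debug mode. It assumes Python 3 and uses the standard library's `html.parser.HTMLParser` directly, so its tokenization may differ from whatever BeautifulSoup 3.1 does internally. The names `TagTracer` and `trace_tags` are made up for illustration.

```python
from html.parser import HTMLParser


class TagTracer(HTMLParser):
    """Record every start and end tag the stdlib parser reports."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # `tag` is the name as the parser understood it (lowercased).
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def trace_tags(markup):
    """Feed markup to the tracer and return the list of tag events."""
    tracer = TagTracer()
    tracer.feed(markup)
    tracer.close()
    return tracer.events


# The markup from the question, including the malformed tag.
SAMPLE = """<HTML>
    <HEAD>
        <TITLE>Title</TITLE>
        <HTTP-EQUIV="PRAGMA" CONTENT="NO-CACHE">
    </HEAD>
    <BODY>
        ...
        ...
    </BODY>
</HTML>
"""

if __name__ == "__main__":
    for event in trace_tags(SAMPLE):
        print(event)
```

Printing the events shows which string the parser treats as the name of the malformed tag and whether it carries on to `<BODY>` afterwards.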

Recommended answer

Your problem must be something else; it works fine for me:

In [1]: import BeautifulSoup

In [2]: c = """<HTML>
   ...:     <HEAD>
   ...:         <TITLE>Title</TITLE>
   ...:         <HTTP-EQUIV="PRAGMA" CONTENT="NO-CACHE">
   ...:     </HEAD>
   ...:     <BODY>
   ...:         ...
   ...:         ...
   ...:     </BODY>
   ...: </HTML>
   ...: """

In [3]: print BeautifulSoup.BeautifulSoup(c).prettify()
<html>
 <head>
  <title>
   Title
  </title>
  <http-equiv>
  </http-equiv>
 </head>
 <body>
  ...
        ...
 </body>
</html>


In [4]:

This is Python 2.5.2 with BeautifulSoup 3.0.7a; maybe the behavior differs in older or newer versions? This is exactly the kind of soup BeautifulSoup handles so beautifully, so I doubt that was changed at some point. Is there something else in the structure that you haven't mentioned in the question?
