BeautifulSoup 3.1 parser breaks far too easily

Views: 221 | Published: 2016/8/5 19:08:07 | Tags: python, html, parsing, beautifulsoup

Question

I'm having trouble parsing some dodgy HTML with BeautifulSoup. It turns out that the HTMLParser used in newer versions is less tolerant than the SGMLParser used previously.

Does BeautifulSoup have some kind of debug mode? I'm trying to figure out how to stop it borking on some nasty HTML I'm loading from a crabby website:

<HTML>
    <HEAD>
        <TITLE>Title</TITLE>
        <HTTP-EQUIV="PRAGMA" CONTENT="NO-CACHE">
    </HEAD>
    <BODY>
        ...
        ...
    </BODY>
</HTML>

BeautifulSoup gives up after the <HTTP-EQUIV...> tag:

In [1]: print BeautifulSoup(c).prettify()
<html>
 <head>
  <title>
   Title
  </title>
 </head>
</html>

The problem is clearly the HTTP-EQUIV tag, which is really a very malformed <META HTTP-EQUIV="PRAGMA" CONTENT="NO-CACHE"> tag. Evidently, I need to specify this as self-closing, but no matter what I specify I can't fix it:

In [2]: print BeautifulSoup(c,selfClosingTags=['http-equiv',
                            'http-equiv="pragma"']).prettify()
<html>
 <head>
  <title>
   Title
  </title>
 </head>
</html>

Is there a verbose debug mode in which BeautifulSoup will tell me what it is doing, so I can figure out what it is treating as the tag name in this case?
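
The thread does not name a built-in verbose mode. As a rough stand-in, the sketch below logs the tag names a parser reports for the snippet above. It assumes Python 3 and the standard library's html.parser, which is not the parser class BeautifulSoup 3.x used on Python 2, so the names it reports may differ from what BeautifulSoup saw. TagLogger and log_tags are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record every start/end tag event so the tag names can be inspected."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # 'tag' is whatever the parser decided the tag name is.
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def log_tags(html):
    """Feed html to a TagLogger and return the recorded events."""
    parser = TagLogger()
    parser.feed(html)
    parser.close()
    return parser.events


# The markup from the question, with the elided body left empty.
sample = """<HTML>
    <HEAD>
        <TITLE>Title</TITLE>
        <HTTP-EQUIV="PRAGMA" CONTENT="NO-CACHE">
    </HEAD>
    <BODY>
    </BODY>
</HTML>"""

for event in log_tags(sample):
    print(event)
```

The name recorded for the malformed tag is what this particular parser treats as the tag name, which is the kind of information the question is after.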

Recommended answer

Your problem must be something else; it works fine for me:

In [1]: import BeautifulSoup

In [2]: c = """<HTML>
   ...:     <HEAD>
   ...:         <TITLE>Title</TITLE>
   ...:         <HTTP-EQUIV="PRAGMA" CONTENT="NO-CACHE">
   ...:     </HEAD>
   ...:     <BODY>
   ...:         ...
   ...:         ...
   ...:     </BODY>
   ...: </HTML>
   ...: """

In [3]: print BeautifulSoup.BeautifulSoup(c).prettify()
<html>
 <head>
  <title>
   Title
  </title>
  <http-equiv>
  </http-equiv>
 </head>
 <body>
  ...
        ...
 </body>
</html>


In [4]:

This is Python 2.5.2 with BeautifulSoup 3.0.7a; maybe it's different in older or newer versions? This is exactly the kind of soup BeautifulSoup handles so beautifully, so I doubt it's been changed at some point... Is there something else in the structure that you haven't mentioned in the question?
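
If a newer release does stop at the tag, as the question reports, one possible workaround (not part of the original answer) is to repair the markup before handing it to the parser. The sketch below assumes the only defect is a META tag that lost its tag name, as the question describes; repair_meta_tags is a hypothetical helper, and the regular expression is an illustration rather than a general HTML fixer.

```python
import re

# Matches a "<" that is directly followed by HTTP-EQUIV= (any case),
# i.e. a META tag whose tag name is missing.
_MISSING_META = re.compile(r"<\s*(?=HTTP-EQUIV\s*=)", re.IGNORECASE)


def repair_meta_tags(html):
    """Insert the missing META tag name in front of a bare HTTP-EQUIV attribute."""
    return _MISSING_META.sub("<META ", html)


broken = '<HTTP-EQUIV="PRAGMA" CONTENT="NO-CACHE">'
print(repair_meta_tags(broken))
```

The repaired string can then be passed to BeautifulSoup as usual. Tags that already start with META are left untouched, because the pattern only matches when HTTP-EQUIV follows the opening angle bracket directly.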
