How do I remove a spurious tag in BeautifulSoup


Problem description

I'm pulling text from the Presidential debates. I got to one that has an issue: it errantly turns every mention of the word "debate" into a <debate> tag. Go ahead, search for "Welcome back to the Republican presidential"; notice an obvious word missing?

Cool, so BeautifulSoup does a superb job of cleaning up messy HTML and adding closing tags where they should have been. But in this case, that mucks me up, because <debate> is now a child of a <p>, and the closing </debate> is added allllll the way at the end, thus nesting the rest of the debate inside that tag.

How do I tell BeautifulSoup to either ignore or remove <debate>? Or alternatively, how do I add a closing tag immediately after it? I've tried unwrap, but by the time I can call it, BS has already placed the closing tag at the end, and thus made the following paragraphs children rather than siblings.

Here is my setup:

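from bs4 import BeautifulSoup
from urllib.request import urlopen  # Python 3 location of urlopen

bad_debate = 'http://www.presidency.ucsb.edu/ws/index.php?pid=111395'
file = urlopen(bad_debate)
# No parser is named here, so bs4 picks the best one installed
soup = BeautifulSoup(file)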

My hunch is I need to insert something between the URL call and BeautifulSoup, but for the life of me I can't figure out how to modify the file contents.

Recommended answer

The html5lib parser does a better job than lxml or html.parser of handling the debate element in this case:

soup = BeautifulSoup(file, "html5lib")

Here is how it handles the part of the debate mentioned above:

<p>
    <b>
     BARTIROMO:
    </b>
    Welcome back to the Republican presidential
    <debate>
     here in North Charleston. Right back to the questions. [
     <i>
      applause
     </i>
     ]
    </debate>
</p>
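
Following the question's hunch about modifying the contents between the URL call and BeautifulSoup, another option is to clean the raw HTML string before parsing. The sketch below is only an illustration, not part of the original answer: the helper name replace_spurious_tag is hypothetical, and it assumes that each <debate> in the source stands in for the literal word "debate", as the question describes. It uses only the standard library re module.

```python
import re


def replace_spurious_tag(html, tag="debate", replacement="debate"):
    """Replace each opening <tag> with plain text and drop any closing </tag>.

    Hypothetical helper: assumes the spurious tag carries no attributes and
    that every occurrence stands in for the word given as `replacement`.
    """
    name = re.escape(tag)
    # Opening tag such as <debate> or <DEBATE >, matched case-insensitively.
    opening = re.compile(r"<%s\s*>" % name, re.IGNORECASE)
    # Closing tag, in case the source contains any.
    closing = re.compile(r"</%s\s*>" % name, re.IGNORECASE)
    # A function is used as the replacement so the text is inserted literally.
    html = opening.sub(lambda match: replacement, html)
    return closing.sub("", html)


sample = (
    "<p><b>BARTIROMO:</b> Welcome back to the Republican presidential "
    "<debate> here in North Charleston.</p>"
)
cleaned = replace_spurious_tag(sample)
print(cleaned)
```

With the setup from the question, the response body would be read and decoded to a string first, passed through this helper, and the result handed to BeautifulSoup in place of the file object. The encoding to decode with is an assumption that depends on the page.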
