How do I remove a spurious tag in BeautifulSoup

Views: 209 · Posted: 2016/8/5 19:11:09 · Tags: python beautifulsoup urllib


Question

I'm pulling text from the Presidential debates. I got to one that has an issue: it errantly turns every mention of the word "debate" into a tag, <debate>. Go ahead, search for "Welcome back to the Republican presidential"; notice an obvious word missing?

Cool, so BeautifulSoup does a superb job of cleaning up messy HTML and adding closing tags where they should have been. But in this case, that mucks me up, because <debate> is now a child of a <p> and the closing </debate> is added allllll the way at the end, which nests the rest of the debate inside that tag.

How do I tell BeautifulSoup to either ignore or remove <debate>? Or alternatively, how do I add a closing tag immediately after it? I've tried unwrap, but by the time I can call it, BS has already put the closing tag at the end, making the following paragraphs children rather than siblings.

Here is how I have it set up:

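from bs4 import BeautifulSoup
from urllib.request import urlopen  # Python 3; under Python 2 this was urllib.urlopen

bad_debate = 'http://www.presidency.ucsb.edu/ws/index.php?pid=111395'
file = urlopen(bad_debate)
# No parser is named here, so bs4 falls back to whichever installed parser it prefers
soup = BeautifulSoup(file)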

My hunch is that I need to insert something between the URL call and BeautifulSoup, but for the life of me I can't figure out how to modify the file contents.
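
One way to act on that hunch, offered as a rough sketch rather than anything proposed in the thread: read the page into a string, swap the stray tag back into the plain word, and only then hand the text to BeautifulSoup. The helper name `replace_spurious_tag` and the sample sentence are made up for illustration, and the sketch assumes the page really does contain a literal <debate> wherever the word should be.

```python
import re


def replace_spurious_tag(html, tag="debate"):
    """Turn each stray <tag> back into the bare word and drop any </tag>.

    Hypothetical helper: assumes the tag name is also the missing word.
    """
    opening = re.compile(r"<\s*" + re.escape(tag) + r"\s*>", re.IGNORECASE)
    closing = re.compile(r"<\s*/\s*" + re.escape(tag) + r"\s*>", re.IGNORECASE)
    html = closing.sub("", html)
    return opening.sub(tag, html)


# Stand-in for the downloaded page text
sample = "<p>Welcome back to the Republican presidential <debate> here in North Charleston.</p>"
cleaned = replace_spurious_tag(sample)
print(cleaned)
```

With the real page, the string would come from something like file.read() decoded to text (the page's encoding is not stated in the question, so that part is left open), and the cleaned result would be passed to BeautifulSoup in place of file.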

Recommended answer

The html5lib parser does a better job (than lxml or html.parser) of handling the debate element in this case:

soup = BeautifulSoup(file, "html5lib")

Here is how it handles the mentioned part of the debate:

<p>
    <b>
     BARTIROMO:
    </b>
    Welcome back to the Republican presidential
    <debate>
     here in North Charleston. Right back to the questions. [
     <i>
      applause
     </i>
     ]
    </debate>
</p>
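
Since html5lib keeps <debate> inside its own paragraph, unwrap becomes usable again. The following is a minimal sketch, not part of the original answer: it assumes the html5lib package is installed, uses a shortened stand-in for the page, and puts the word "debate" back before unwrapping on the assumption that the tag replaced exactly that word.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the real page; the second paragraph is invented
html = (
    "<p><b>BARTIROMO:</b> Welcome back to the Republican presidential "
    "<debate> here in North Charleston.</p>"
    "<p>Right back to the questions.</p>"
)

soup = BeautifulSoup(html, "html5lib")

for tag in soup.find_all("debate"):
    tag.insert_before("debate")  # restore the word the tag swallowed
    tag.unwrap()                 # drop the tag itself, keep its children

paragraphs = [p.get_text() for p in soup.find_all("p")]
for text in paragraphs:
    print(text)
```

The two paragraphs stay siblings here because html5lib closes <debate> when it reaches the end of the enclosing <p>, which is the behaviour shown in the output above.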
