如何使用 BeautifulSoup 从 HTML 中删除评论标签? [英] How can I strip comment tags from HTML using BeautifulSoup?

查看:26
本文介绍了如何使用 BeautifulSoup 从 HTML 中删除评论标签?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我一直在玩 BeautifulSoup,这很棒.我的最终目标是尝试从页面中获取文本.我只是想从正文中获取文本,使用特殊情况从 .您可以使用 markupMassage 正则表达式来覆盖它——直接来自文档:

重新导入,复制myMassage = [(re.compile('<!-([^-])'), lambda 匹配:'<!--' + match.group(1))]myNewMassage = copy.copy(BeautifulSoup.MARKUP_MASSAGE)myNewMassage.extend(myMassage)BeautifulSoup(badString, markupMassage=myNewMassage)# Foo<!--此评论格式错误.-->Bar<br/>Baz

I have been playing with BeautifulSoup, which is great. My end goal is to try and just get the text from a page. I am just trying to get the text from the body, with a special case to get the title and/or alt attributes from <a> or <img> tags.

So far I have this EDITED & UPDATED CURRENT CODE:

soup = BeautifulSoup(page)
comments = soup.findAll(text=lambda text:isinstance(text, Comment))
[comment.extract() for comment in comments]
page = ''.join(soup.findAll(text=True))
page = ' '.join(page.split())
print page

1) What do you suggest the best way for my special case to NOT exclude those attributes from the two tags I listed above? If it is too complex to do this, it isn't as important as doing #2.

2) I would like to strip<!-- --> tags and everything in between them. How would I go about that?

QUESTION EDIT @jathanism: Here are some comment tags that I have tried to strip, but remain, even when I use your example

<!-- Begin function popUp(URL) { day = new Date(); id = day.getTime(); eval("page" + id + " = window.open(URL, '" + id + "', 'toolbar=0,scrollbars=0,location=0,statusbar=0,menubar=0,resizable=0,width=300,height=330,left = 774,top = 518');"); } // End -->
<!-- var MenuBar1 = new Spry.Widget.MenuBar("MenuBar1", {imgDown:"SpryAssets/SpryMenuBarDownHover.gif", imgRight:"SpryAssets/SpryMenuBarRightHover.gif"}); //--> <!-- var MenuBar1 = new Spry.Widget.MenuBar("MenuBar1", {imgDown:"SpryAssets/SpryMenuBarDownHover.gif", imgRight:"SpryAssets/SpryMenuBarRightHover.gif"}); //--> <!-- var whichlink=0 var whichimage=0 var blenddelay=(ie)? document.images.slide.filters[0].duration*1000 : 0 function slideit(){ if (!document.images) return if (ie) document.images.slide.filters[0].apply() document.images.slide.src=imageholder[whichimage].src if (ie) document.images.slide.filters[0].play() whichlink=whichimage whichimage=(whichimage<slideimages.length-1)? whichimage+1 : 0 setTimeout("slideit()",slidespeed+blenddelay) } slideit() //-->

解决方案

I am still trying to figure out why it doesn't find and strip tags like this: <!-- //-->. Those backslashes cause certain tags to be overlooked.

This may be a problem with the underlying SGML parser: see http://www.crummy.com/software/BeautifulSoup/documentation.html#Sanitizing%20Bad%20Data%20with%20Regexps. You can override it by using a markupMassage regex -- straight from the docs:

import re, copy

myMassage = [(re.compile('<!-([^-])'), lambda match: '<!--' + match.group(1))]
myNewMassage = copy.copy(BeautifulSoup.MARKUP_MASSAGE)
myNewMassage.extend(myMassage)

BeautifulSoup(badString, markupMassage=myNewMassage)
# Foo<!--This comment is malformed.-->Bar<br />Baz

这篇关于如何使用 BeautifulSoup 从 HTML 中删除评论标签?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆