How can I strip comment tags from HTML using BeautifulSoup?
Question
I have been playing with BeautifulSoup, which is great. My end goal is to get just the text from a page. I only want the text from the body, with a special case to get the title and/or alt attributes from <a> or <img> tags.

So far I have this (EDITED & UPDATED CURRENT CODE):

    from BeautifulSoup import BeautifulSoup, Comment  # BeautifulSoup 3

    soup = BeautifulSoup(page)
    # Find every comment node and remove it from the tree.
    comments = soup.findAll(text=lambda text: isinstance(text, Comment))
    for comment in comments:
        comment.extract()
    # Join the remaining text nodes, then collapse runs of whitespace.
    page = ''.join(soup.findAll(text=True))
    page = ' '.join(page.split())
    print page

1) What is the best way to handle my special case, so that those attributes are NOT excluded from the two tags I listed above? If this is too complex, it isn't as important as #2.
2) I would like to strip <!-- --> tags and everything in between them. How would I go about that?

QUESTION EDIT
@jathanism: Here are some comment tags that I have tried to strip, but they remain, even when I use your example:

    <!-- Begin function popUp(URL) { day = new Date(); id = day.getTime(); eval("page" + id + " = window.open(URL, '" + id + "', 'toolbar=0,scrollbars=0,location=0,statusbar=0,menubar=0,resizable=0,width=300,height=330,left = 774,top = 518');"); } // End -->

    <!-- var MenuBar1 = new Spry.Widget.MenuBar("MenuBar1", {imgDown:"SpryAssets/SpryMenuBarDownHover.gif", imgRight:"SpryAssets/SpryMenuBarRightHover.gif"}); //-->

    <!-- var MenuBar1 = new Spry.Widget.MenuBar("MenuBar1", {imgDown:"SpryAssets/SpryMenuBarDownHover.gif", imgRight:"SpryAssets/SpryMenuBarRightHover.gif"}); //-->

    <!-- var whichlink=0 var whichimage=0 var blenddelay=(ie)? document.images.slide.filters[0].duration*1000 : 0 function slideit(){ if (!document.images) return if (ie) document.images.slide.filters[0].apply() document.images.slide.src=imageholder[whichimage].src if (ie) document.images.slide.filters[0].play() whichlink=whichimage whichimage=(whichimage<slideimages.length-1)? whichimage+1 : 0 setTimeout("slideit()",slidespeed+blenddelay) } slideit() //-->

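The two goals in the question (drop comments, but keep title and alt text from <a> and <img>) can be sketched together as follows. This is not BeautifulSoup: it uses Python 3's standard-library html.parser as a stand-in so that it runs without extra packages. TextExtractor and extract_text are hypothetical names, and skipping script and style contents is an assumption, made because blocks like the ones quoted above look like inline scripts.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes plus title/alt attributes from <a> and <img>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            # Assumption: script/style contents are not wanted as page text.
            self._skip_depth += 1
        elif tag in ("a", "img"):
            # Special case from the question: keep title and alt text.
            for name, value in attrs:
                if name in ("title", "alt") and value:
                    self.parts.append(value)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

    # handle_comment is not overridden: the base implementation does
    # nothing, so <!-- ... --> blocks never reach self.parts.


def extract_text(html):
    """Return the collected text with runs of whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())
```

With this sketch, attribute text is emitted at the position of the opening tag, so a link's title appears before the link text.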
Solution

> I am still trying to figure out why it doesn't find and strip tags like this: <!-- //-->. Those backslashes cause certain tags to be overlooked.

This may be a problem with the underlying SGML parser: see http://www.crummy.com/software/BeautifulSoup/documentation.html#Sanitizing%20Bad%20Data%20with%20Regexps. You can override it by using a markupMassage regex, straight from the docs:

    import re, copy
    from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3

    # Rewrite a malformed "<!-" comment opener as "<!--" before parsing.
    myMassage = [(re.compile('<!-([^-])'), lambda match: '<!--' + match.group(1))]
    myNewMassage = copy.copy(BeautifulSoup.MARKUP_MASSAGE)
    myNewMassage.extend(myMassage)
    BeautifulSoup(badString, markupMassage=myNewMassage)
    # Foo<!--This comment is malformed.-->Bar<br />Baz
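
To see what that massage rule does on its own, the sketch below applies the same regular expression with plain re, outside BeautifulSoup. MALFORMED_OPENER and repair_comment_openers are hypothetical names used only for illustration, and the sample string is modeled on the output shown in the last line of the answer's code.

```python
import re

# Pattern taken from the answer: "<!-" followed by one character that is not "-".
MALFORMED_OPENER = re.compile('<!-([^-])')


def repair_comment_openers(markup):
    """Rewrite malformed "<!-x" comment openers as "<!--x"."""
    return MALFORMED_OPENER.sub(lambda m: '<!--' + m.group(1), markup)


bad = "Foo<!-This comment is malformed.-->Bar<br/>Baz"
print(repair_comment_openers(bad))
# prints: Foo<!--This comment is malformed.-->Bar<br/>Baz
```

A well-formed opener is left untouched, because the character after "<!-" is a hyphen and so does not match [^-]. This means the rule only helps with comments whose opener is actually malformed.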