python [lxml] - cleaning out html tags
Question
import sys
from lxml.html.clean import clean_html, Cleaner

def clean(text):
    try:
        cleaner = Cleaner(scripts=True, embedded=True, meta=True, page_structure=True,
                          links=True, style=True,
                          remove_tags=['a', 'li', 'td'])
        print(len(cleaner.clean_html(text)) - len(text))
        return cleaner.clean_html(text)
    except Exception:
        print('Error in clean_html')
        print(sys.exc_info())
        return text
I put together the above (ugly) code as my initial foray into Python land. I'm trying to use the lxml Cleaner to clean out a couple of HTML pages, so in the end I am just left with the text and nothing else - but try as I might, the above doesn't appear to work as such. I'm still left with a substantial amount of markup (and it doesn't appear to be broken HTML), and in particular links, which aren't getting removed despite the remove_tags argument and links=True.
Any idea what's going on? Perhaps I'm barking up the wrong tree with lxml? I thought this was the way to go for HTML parsing in Python.
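For reference, a minimal sketch of how Cleaner's remove_tags and kill_tags behave (the snippet below is made up for illustration): remove_tags strips the tags themselves but keeps their text content, while kill_tags drops the element together with everything inside it, which is often what "remove the links" actually means.

```python
from lxml.html.clean import Cleaner

snippet = '<div><p>keep me</p><a href="http://example.com">link text</a></div>'

# remove_tags drops the <a> markup but keeps "link text"
removed = Cleaner(remove_tags=['a']).clean_html(snippet)

# kill_tags drops the whole <a> element, content included
killed = Cleaner(kill_tags=['a']).clean_html(snippet)

print(removed)
print(killed)
```

So if the goal is to get rid of links entirely, kill_tags=['a'] is the knob to reach for rather than remove_tags=['a'].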
Answer

Solution:
import lxml.html

document = lxml.html.document_fromstring(html_string)
# internally does: etree.XPath("string()")(document)
print(document.text_content())
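A self-contained illustration of that approach (the html_string here is made up for the example):

```python
import lxml.html

html_string = '<html><body><p>Hello <b>world</b></p><a href="#">a link</a></body></html>'
document = lxml.html.document_fromstring(html_string)

# text_content() concatenates every text node in the tree, dropping all tags.
# Note it also includes the text inside <script> and <style>, so clean those
# out first (e.g. with Cleaner) if the page contains them.
text = document.text_content()
print(text)
```

Note that text_content() joins adjacent text nodes with no separator, so words from neighbouring elements can run together.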
But this one helped me - it concatenates the text nodes the way I needed:
from lxml import etree

print("\n".join(etree.XPath("//text()")(document)))
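The difference from text_content() is that the //text() XPath returns each text node as a separate string, so you control the separator yourself (the example document below is made up):

```python
import lxml.html
from lxml import etree

document = lxml.html.document_fromstring('<div><p>First</p><p>Second</p></div>')

# //text() yields one string per text node in document order
parts = etree.XPath("//text()")(document)
print("\n".join(parts))
```

Joining on "\n" keeps text from different elements on separate lines instead of running them together.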