python [lxml]-清除html标签 [英] python [lxml] - cleaning out html tags

查看:538
本文介绍了python [lxml]-清除html标签的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

from lxml.html.clean import clean_html, Cleaner
    def clean(text):
        try:        
            cleaner = Cleaner(scripts=True, embedded=True, meta=True, page_structure=True, links=True, style=True,
                      remove_tags = ['a', 'li', 'td'])
            print (len(cleaner.clean_html(text))- len(text))
            return cleaner.clean_html(text) 
        except:
            print 'Error in clean_html'
            print sys.exc_info()
            return text

我将上面的(丑陋的)代码放在一起作为我对python land的最初尝试.我正在尝试使用lxml清洁器来清理几个html页面,所以最后我只剩下了文本,没有别的什么了,但是我尝试了一下,以上似乎没有这样的效果,我是尽管我在remove_tagslinks=True

I put together the above (ugly) code as my initial forays into python land. I'm trying to use lxml cleaner to clean out a couple of html pages, so in the end i am just left with the text and nothing else - but try as i might, the above doesnt appear to work as such, i'm still left with a substial amount of markup (and it doesnt appear to be broken html), and particularly links, which aren't getting removed, despite the args i use in remove_tags and links=True

有什么想法吗,也许我用lxml弄错了树?我以为这是在python中进行html解析的方法?

any idea whats going on, perhaps im barking up the wrong tree with lxml ? i thought this was the way to go with html parsing in python?

推荐答案

解决方案:

   import lxml.html
   document = lxml.html.document_fromstring(html_string)
   # internally does: etree.XPath("string()")(document)
   print document.text_content()

但是这个帮助了我-将我所需的方式串联起来:

but this one helped me - concatenation the way I needed:

   from lxml import etree
   print "\n".join(etree.XPath("//text()")(document))

这篇关于python [lxml]-清除html标签的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆