python [lxml] - cleaning out html tags

Views: 538 | Published: 2020/5/4 8:20:46 | Tags: python, parsing, lxml

Question

    import sys

    from lxml.html.clean import Cleaner

    def clean(text):
        """Strip scripts, styles, and selected tags from an HTML string."""
        try:
            cleaner = Cleaner(scripts=True, embedded=True, meta=True,
                              page_structure=True, links=True, style=True,
                              remove_tags=['a', 'li', 'td'])
            cleaned = cleaner.clean_html(text)
            # Print the change in length caused by cleaning.
            print(len(cleaned) - len(text))
            return cleaned
        except Exception:
            print('Error in clean_html')
            print(sys.exc_info())
            return text

I put together the above (ugly) code as my initial foray into Python land. I'm trying to use the lxml Cleaner to clean up a couple of HTML pages so that in the end I am left with just the text and nothing else. But try as I might, the above doesn't appear to work as such: I'm still left with a substantial amount of markup (and it doesn't appear to be broken HTML), particularly links, which aren't getting removed despite the args I pass in remove_tags and links=True.

Any idea what's going on? Perhaps I'm barking up the wrong tree with lxml? I thought this was the way to go for HTML parsing in Python.

Recommended answer

Solution:

   import lxml.html

   # html_string holds the HTML source to be cleaned
   document = lxml.html.document_fromstring(html_string)
   # internally does: etree.XPath("string()")(document)
   print(document.text_content())
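
To illustrate what text_content() returns, here is a minimal sketch on a made-up snippet (the html_string value is an assumption chosen for illustration, not from the question). It suggests that the text of adjacent elements is concatenated with no separator between them:

```python
import lxml.html

# Hypothetical input: two list items with no whitespace between them.
html_string = "<ul><li>one</li><li>two</li></ul>"
document = lxml.html.document_fromstring(html_string)

# text_content() concatenates all text nodes in document order.
text = document.text_content()
print(text)
```

On this input the two list items run together as a single word, which may or may not be what you want when the markup carried the only visual separation.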

However, this variant helped me more, because it joins the text nodes the way I needed:

   from lxml import etree
   # join every text node in the document, one per line
   print("\n".join(etree.XPath("//text()")(document)))
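
Putting the pieces together, the following is one possible sketch rather than a definitive recipe. The helper name extract_text, its separator parameter, and the sample markup are all hypothetical. It also assumes you want to discard the contents of script and style elements and skip whitespace-only text nodes, which the answer above does not do:

```python
import lxml.html


def extract_text(html_string, separator="\n"):
    """Return the text of an HTML string, one non-empty text node per line."""
    document = lxml.html.document_fromstring(html_string)
    # Assumption: script and style contents are not wanted as text, so drop them.
    for element in document.xpath("//script | //style"):
        element.drop_tree()
    # Collect every text node, trim whitespace, and skip the empty ones.
    pieces = (text.strip() for text in document.xpath("//text()"))
    return separator.join(piece for piece in pieces if piece)


# Hypothetical sample page for illustration only.
sample = (
    "<html><body><p>Hello <a href='#'>world</a></p>"
    "<script>var x = 1;</script>"
    "<ul><li>one</li><li>two</li></ul></body></html>"
)
print(extract_text(sample))
```

Because each text node becomes its own line, link text inside a paragraph is split off from the surrounding sentence. Passing a different separator (for example a single space) is one way to adjust that.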
