Parsing HTML documents using lxml in Python

Views: 89  Published: 2021/5/30 21:52:57  Tags: python lxml

Question

I just downloaded lxml to parse broken HTML documents. I have been reading through the lxml documentation, but I could not find how to retrieve just the text of a given HTML document using lxml. I would be obliged if someone could help me with this.

Recommended answer

This is simple:

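from lxml import html

# Get your document contents here, from a file or wherever
html_document = ...

# Parse the (possibly broken) HTML into an element tree
tree = html.fromstring(html_document)
# Collect all the text in the document, with the markup stripped
text_document = tree.text_content()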
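
To make the snippet above concrete, here is a minimal sketch that runs end to end. The `broken_html` string is a made-up example of malformed markup (unclosed tags, no closing `</body>` or `</html>`), and the whitespace clean-up at the end is an optional extra, not part of the original answer.

```python
from lxml import html

# Hypothetical broken markup: unclosed <p> and <b> tags, and no </body> or </html>.
broken_html = "<html><body><h1>Title</h1><p>First <b>bold paragraph<p>Second paragraph"

# lxml's HTML parser recovers from the missing closing tags instead of raising.
tree = html.fromstring(broken_html)

# text_content() returns the text of the element and all of its descendants.
text = tree.text_content()

# Optional: collapse runs of whitespace so the result is easier to read.
clean_text = " ".join(text.split())
print(clean_text)
```

Note that `text_content()` simply concatenates text nodes, so text from adjacent elements can end up joined without a separating space.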

If you only want the content of specific blocks (e.g. the body block), you can access them using XPath expressions:

# xpath() returns a list of matching elements (empty if nothing matches)
body_tags = tree.xpath('//body')
if body_tags:
    body = body_tags[0]
    text_document = body.text_content()
else:
    text_document = ''
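
As a rough illustration of the difference between the two approaches, the sketch below wraps the XPath lookup in a small helper and compares its result with the text of the whole document. The `sample` string and the `body_text` function name are assumptions made up for this example.

```python
from lxml import html

# Hypothetical sample document, used only for illustration.
sample = (
    "<html><head><title>Page title</title></head>"
    "<body><p>Body text</p></body></html>"
)


def body_text(document: str) -> str:
    """Return the text inside <body>, or '' if no <body> element is found."""
    tree = html.fromstring(document)
    body_tags = tree.xpath('//body')
    if not body_tags:
        return ''
    return body_tags[0].text_content()


# Text of the whole document, which also includes the <title> text.
full_text = html.fromstring(sample).text_content()
# Text of the <body> element only.
only_body = body_text(sample)

print(repr(full_text))
print(repr(only_body))
```

Restricting the lookup to `//body` is one way to keep text from the `<head>` section, such as the page title, out of the result.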
