Parse HTML via XPath


Question

In .NET, I found this great library, HtmlAgilityPack, that lets you easily parse non-well-formed HTML using XPath. I've used it for a couple of years in my .NET sites, but I've had to settle for more painful libraries in my Python, Ruby and other projects. Is anyone aware of similar libraries for other languages?

Solution

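In Python, ElementTidy parses tag soup and produces an element tree, which can then be queried with XPath expressions. The elements end up in the XHTML namespace, so the tag name in the path has to be namespace-qualified: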

>>> from elementtidy.TidyHTMLTreeBuilder import TidyHTMLTreeBuilder as TB
>>> tb = TB()
>>> tb.feed("<p>Hello world")
>>> e = tb.close()
>>> e.find(".//{http://www.w3.org/1999/xhtml}p")
<Element {http://www.w3.org/1999/xhtml}p at 264eb8>
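For readers who want to try the same idea (tag soup in, element tree out, XPath-style query) without installing anything, here is a rough sketch using only the standard library. It is not ElementTidy: `SoupTreeBuilder` is a hypothetical helper made up for this illustration, it does not apply HTML's implicit end-tag rules (an unclosed `<li>` simply nests the next one inside it), and it wraps everything in a synthetic `document` root. `xml.etree.ElementTree` also supports only a subset of XPath.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never take a closing tag in HTML.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class SoupTreeBuilder(HTMLParser):
    """Hypothetical helper: turns HTMLParser events into an ElementTree."""

    def __init__(self):
        super().__init__()
        self._builder = ET.TreeBuilder()
        self._open = []                       # stack of open tag names
        self._builder.start("document", {})   # synthetic root (assumption)

    def handle_starttag(self, tag, attrs):
        # Valueless attributes arrive as None; store them as "".
        self._builder.start(tag, {k: v or "" for k, v in attrs})
        if tag in VOID:
            self._builder.end(tag)
        else:
            self._open.append(tag)

    def handle_endtag(self, tag):
        if tag not in self._open:
            return                            # stray closing tag: ignore
        # Close unclosed descendants first, then the tag itself.
        while self._open:
            name = self._open.pop()
            self._builder.end(name)
            if name == tag:
                break

    def handle_data(self, data):
        self._builder.data(data)

    def close(self):
        super().close()                       # flush any buffered text
        while self._open:                     # close whatever is still open
            self._builder.end(self._open.pop())
        self._builder.end("document")
        return self._builder.close()          # root Element


parser = SoupTreeBuilder()
parser.feed("<ul><li class=a>one<li class=b>two</ul><p>Hello world")
root = parser.close()

print([li.text for li in root.findall(".//li")])   # ['one', 'two']
print(root.find(".//li[@class='b']").text)         # two
print(root.find(".//p").text)                      # Hello world
```

No namespace is added here, so the path is just `.//p` rather than the namespace-qualified form above. If you need full XPath 1.0 rather than the ElementTree subset, a third-party library such as lxml is the usual choice.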
