How to parse malformed HTML in python, using standard libraries


Question

There are so many HTML and XML libraries built into Python that it's hard to believe there's no support for real-world HTML parsing.

I've found plenty of great third-party libraries for this task, but this question is about the Python standard library.

Requirements:


  • Use only Python standard library components (any 2.x version)
  • DOM support
  • Handle HTML entities (&nbsp;)
  • Handle partial documents (like: Hello, <i>World</i>!)

Bonus points:


  • XPATH support
  • Handle unclosed/malformed tags. (<big>does anyone here know <html ???

Here's my 90% solution, as requested. It works for the limited set of HTML I've tried, but as everyone can plainly see, it isn't exactly robust. Since I arrived at it by staring at the docs for 15 minutes and writing one line of code, I thought I could consult the stackoverflow community for a similar but better solution...

from xml.etree.ElementTree import fromstring
# Wrap the fragment so partial documents parse, and rewrite &nbsp; as a
# numeric character reference so ElementTree's XML parser accepts it.
DOM = fromstring("<html>%s</html>" % html.replace('&nbsp;', '&#160;'))
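
For context, the resulting ElementTree element covers the DOM and XPath bullets to a degree, since it supports ElementTree's limited XPath subset via find/findtext/findall. A minimal sketch continuing from the snippet above; the html fragment here is invented for illustration:

html = 'Hello, <i>World</i>&nbsp;!'  # hypothetical input fragment
DOM = fromstring("<html>%s</html>" % html.replace('&nbsp;', '&#160;'))
print DOM.findtext('.//i')  # -> 'World', via ElementTree's limited XPath subset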


Answer

Parsing HTML reliably is a relatively modern development (weird though that may seem). As a result there is definitely nothing in the standard library. HTMLParser may appear to be a way to handle HTML, but it's not -- it fails on lots of very common HTML, and though you can work around those failures there will always be another case you haven't thought of (if you actually succeed at handling every failure you'll have basically recreated BeautifulSoup).
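
To make the gap concrete: HTMLParser is a purely event-driven tokenizer, so it never builds a DOM, and it hands entity references back to you unexpanded. A minimal sketch of its interface (Python 2, matching the question's constraint; the module is html.parser in Python 3, and the Dumper class is just for illustration):

from HTMLParser import HTMLParser  # html.parser in Python 3

class Dumper(HTMLParser):
    # HTMLParser reports events; assembling a tree is left entirely to you.
    def handle_starttag(self, tag, attrs):
        print 'start:', tag
    def handle_data(self, data):
        print 'text:', data
    def handle_entityref(self, name):
        print 'entity:', name  # e.g. 'nbsp' -- expanding it is your job

Dumper().feed('Hello, <i>World</i>&nbsp;!')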

There are really only 3 reasonable ways to parse HTML (as it is found on the web): lxml.html, BeautifulSoup, and html5lib. lxml is the fastest by far, but can be a bit tricky to install (and impossible in an environment like App Engine). html5lib is based on how HTML 5 specifies parsing; though similar in practice to the other two, it is perhaps more "correct" in how it parses broken HTML (they all parse pretty-good HTML the same). They all do a respectable job at parsing broken HTML. BeautifulSoup can be convenient though I find its API unnecessarily quirky.
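
For a rough side-by-side, here is how each of the three might handle the question's malformed fragment. This is a sketch assuming all three libraries are installed; the imports reflect the Python 2 era of the question (today BeautifulSoup is imported from bs4):

broken = '<big>does anyone here know <html ???'

import lxml.html
print lxml.html.fromstring(broken).text_content()  # lxml repairs the tree itself

from BeautifulSoup import BeautifulSoup  # 'from bs4 import BeautifulSoup' today
print BeautifulSoup(broken).find('big')

import html5lib
tree = html5lib.parse(broken)  # follows the HTML5 parsing algorithm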

