Python HTML parsing that actually works


Question

I'm trying to parse some HTML in Python. Some methods used to work, but nowadays there's nothing I can actually use without workarounds.


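  • BeautifulSoup has problems now that SGMLParser has gone away

  • html5lib cannot parse half of what's "out there"

  • lxml tries to be "too correct" for typical HTML (attributes and tags cannot contain unknown namespaces, or an exception is thrown, which means almost no page using Facebook Connect can be parsed)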

What other options are there these days? (If they support XPath, that would be great.)

Recommended answer

Make sure that you use the html module when you parse HTML with lxml:

>>> from lxml import html
>>> doc = """<html>
... <head>
...   <title> Meh
... </head>
... <body>
... Look at this interesting use of <p>
... rather than using <br /> tags as line breaks <p>
... </body>"""
>>> html.document_fromstring(doc)
<Element html at ...>

The errors and exceptions will melt away, and you'll be left with an amazingly fast parser that often copes with HTML soup better than BeautifulSoup.
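Since the question also asks about XPath, here is a minimal sketch of querying a tree parsed with lxml.html. The markup, the variable names, and the particular queries are made up for illustration and are not part of the original answer; the unclosed <p> and <li> tags are there only to show that sloppy input still yields a usable tree.

```python
from lxml import html

# Hypothetical sloppy markup: <p> and <li> are never closed,
# and the closing </html> tag is missing.
doc = """<html>
<head><title>Meh</title></head>
<body>
<p>First paragraph, never closed
<p>Second paragraph with a <a href="/one">link</a>
<ul>
  <li><a href="/two">two</a>
  <li><a href="/three">three</a>
</ul>
</body>"""

# Parse with the forgiving HTML parser rather than the strict XML one.
root = html.document_fromstring(doc)

# Elements returned by lxml.html support XPath directly.
hrefs = root.xpath("//a/@href")          # attribute values, in document order
item_count = len(root.xpath("//ul/li"))  # list items recovered by the parser
title = root.xpath("string(//title)")    # text content of <title>

print(hrefs, item_count, title)
```

The same xpath() method is available on any element in the tree, so a query can be scoped to a subtree by calling it on that element with a relative path such as ".//a".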
