Fast and effective way to parse broken HTML?


Question





I'm working on large projects which require fast HTML parsing, including recovery for broken HTML pages.

Currently lxml is my choice. I know it provides an interface for libxml2's recovery mode too, but I'm not really happy with the results. For some specific HTML pages I found that BeautifulSoup produces noticeably better results (example: http://fortune.com/2015/11/10/vw-scandal-volkswagen-gift-cards/ - this one has a broken <header> tag which lxml/libxml2 couldn't correct). However, the problem is that BeautifulSoup is extremely slow.

As far as I can see, modern browsers like Chrome and Firefox parse HTML very quickly and handle broken HTML really well. Like lxml, Chrome's parser is built on top of libxml2 and libxslt, but with a more effective algorithm for handling broken HTML. I was hoping there would be standalone repos exported from Chromium so that I could use them, but I haven't found anything similar yet.

Does anyone know of a good lib, or at least a workaround (utilizing parts of currently known parsers)? Thanks a lot!

Solution

BeautifulSoup does a really good job making the broken HTML soup beautiful. You can make the parsing faster by letting it use lxml.html under the hood:

If you’re not using lxml as the underlying parser, my advice is to start. Beautiful Soup parses documents significantly faster using lxml than using html.parser or html5lib.

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "lxml")
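
A minimal end-to-end sketch, assuming the requests library is used for fetching and taking the page from the question as input:

import requests
from bs4 import BeautifulSoup

html = requests.get("http://fortune.com/2015/11/10/vw-scandal-volkswagen-gift-cards/").text
soup = BeautifulSoup(html, "lxml")

# the broken <header> tag from the question should now be part of the tree
print(soup.find("header"))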

Another optimization might be SoupStrainer - parsing only the desired part of an HTML document - but I'm not sure whether it's applicable in your use case.
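
For illustration, a sketch that restricts parsing to <a> tags only (the tag choice here is just an assumption; pick whatever part of the document you actually need):

from bs4 import BeautifulSoup, SoupStrainer

only_links = SoupStrainer("a")
soup = BeautifulSoup(html, "lxml", parse_only=only_links)  # everything outside <a> tags is skipped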

You can also speed things up by installing the cchardet library:

You can speed up encoding detection significantly by installing the cchardet library.
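
Beautiful Soup picks cchardet up automatically once it is installed; no code changes are needed. A quick way to confirm it is available (just a sketch):

try:
    import cchardet  # used transparently by Beautiful Soup's encoding detection
    print("cchardet installed - encoding detection will be fast")
except ImportError:
    print("cchardet missing - falling back to slower detection")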

Documentation reference.


As far as I can see, modern browsers like Chrome and Firefox parse HTML very quickly and handle broken HTML really well.

I understand that this is a huge overhead, but just to add another option: you can fire up Chrome via selenium, navigate to the desired address (or open a local HTML file), and dump the HTML back from .page_source:

from selenium import webdriver

url = "http://fortune.com/2015/11/10/vw-scandal-volkswagen-gift-cards/"  # page from the question

driver = webdriver.Chrome()
driver.get(url)

# a delay or an explicit wait may be needed here for dynamically loaded content

print(driver.page_source)

driver.close()
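
If an explicit wait is needed, here is a sketch using selenium's WebDriverWait; waiting for a <header> element is just an assumption based on the question's example page:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# block for up to 10 seconds until a <header> element is present in the DOM
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, "header"))
)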

