Parsing HTML with Python 2.7 - HTMLParser, SGMLParser, or Beautiful Soup?

Question

I want to do some screen-scraping with Python 2.7, and I have no context for the differences between HTMLParser, SGMLParser, or Beautiful Soup.

Are these all trying to solve the same problem, or do they exist for different reasons? Which is simplest, which is most robust, and which (if any) is the default choice?

Also, please let me know if I have overlooked a significant option.

Edit: I should mention that I'm not particularly experienced in HTML parsing, and I'm particularly interested in which option will get me moving the quickest, with the goal of parsing HTML on one particular site.

Recommended answer

I am using and would recommend lxml and pyquery for parsing HTML. I had to write a web scraping bot a few months ago, and of all the popular alternatives I tried, including HTMLParser and BeautifulSoup, I went with lxml and the syntactic sugar of pyquery. I haven't tried SGMLParser, though.
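For context on the standard-library option mentioned above, here is a minimal sketch of what HTMLParser-style code tends to look like. It is event-driven: you subclass the parser and react to tags as they stream past. The `LinkCollector` class, `extract_links` helper, and sample markup are made up for illustration, and the import shim is only there so the same sketch runs on both Python 2.7 and Python 3.

```python
try:
    from html.parser import HTMLParser  # Python 3 location
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7 location


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects href values from <a> tags."""

    def __init__(self):
        # Call the base initializer directly; on 2.7 HTMLParser is an
        # old-style class, so super() would not work there.
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html):
    """Feed the markup through the parser and return the collected hrefs."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Illustrative markup, not taken from any real site.
SAMPLE_HTML = '<ul><li><a href="/a">A</a></li><li><a href="/b">B</a></li></ul>'
links = extract_links(SAMPLE_HTML)
```

This works without installing anything, but you have to track state yourself (for example, "am I inside the div I care about?"), which is part of why a tree-based library can feel quicker to get going with.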

From what I've seen, lxml is more or less the most feature-rich library, and its underlying C core is quite performant compared to the alternatives. As for pyquery, I really liked its jQuery-inspired syntax, which makes navigating the DOM more enjoyable.
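To give a rough idea of the lxml side, here is a small sketch that parses a snippet into a tree and selects elements with XPath. The markup, the `post` class name, and the `post_links` helper are assumptions made up for the example, and lxml must be installed separately.

```python
import lxml.html

# Illustrative markup; the structure and class names are assumptions.
SAMPLE_HTML = """
<html><body>
  <div class="post"><a href="/first">First post</a></div>
  <div class="post"><a href="/second">Second post</a></div>
  <div class="ad"><a href="/buy">Buy now</a></div>
</body></html>
"""


def post_links(html):
    """Return (href, link text) pairs for links inside div.post elements."""
    doc = lxml.html.fromstring(html)
    results = []
    # XPath: every <a> that is a direct child of a <div class="post">.
    for anchor in doc.xpath('//div[@class="post"]/a'):
        results.append((anchor.get("href"), anchor.text_content().strip()))
    return results


pairs = post_links(SAMPLE_HTML)
```

With pyquery layered on top, the same selection would look roughly like `PyQuery(html)('div.post a')`, which is the jQuery-style shorthand referred to above.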

Here are some resources you might find useful in case you decide to give it a try:

  • lxml home page
  • pyquery home page
  • BeautifulSoup vs lxml benchmark
  • Win installer for pyquery built against Python 2.7 - I had a hard time setting up pyquery :)

Well, that's my 2c :) I hope this helps.
