Parsing HTML with Python 2.7 - HTMLParser, SGMLParser, or Beautiful Soup?


Question

I want to do some screen-scraping with Python 2.7, and I have no context for the differences between HTMLParser, SGMLParser, and Beautiful Soup.

Are these all trying to solve the same problem, or do they exist for different reasons? Which is simplest, which is most robust, and which (if any) is the default choice?

Also, please let me know if I have overlooked a significant option.

I should mention that I'm not particularly experienced in HTML parsing, and I'm mainly interested in which will get me moving the quickest, with the goal of parsing HTML on one particular site.

Recommended answer

I am using and would recommend lxml and pyquery for parsing HTML. I had to write a web scraping bot a few months ago, and of all the popular alternatives I tried, including HTMLParser and BeautifulSoup, I went with lxml and the syntactic sugar of pyquery. I haven't tried SGMLParser, though.
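For comparison, here is a rough sketch of what the standard-library HTMLParser route looks like. It is event-driven: you subclass it and react to tags as they stream past, so collecting even a list of links takes some bookkeeping. The `LinkCollector` class and the `SAMPLE_HTML` string are made up for illustration, and the import fallback is only there so the same snippet runs on Python 2.7 and Python 3.

```python
try:
    from html.parser import HTMLParser  # Python 3 module name
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7 module name


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> tag it sees."""

    def __init__(self):
        # HTMLParser is an old-style class on Python 2, so call __init__ directly.
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current start tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


# Made-up markup standing in for a page from the site being scraped.
SAMPLE_HTML = (
    '<ul>'
    '<li><a href="/first">First</a></li>'
    '<li><a href="/second">Second</a></li>'
    '</ul>'
)

collector = LinkCollector()
collector.feed(SAMPLE_HTML)
collector.close()
print(collector.links)
```

This works without installing anything, but there is no tree to query afterwards: anything beyond simple tag matching means tracking state by hand inside the handler methods.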

From what I've seen, lxml is more or less the most feature-rich library, and its underlying C core is quite performant compared to the alternatives. As for pyquery, I really like its jQuery-inspired syntax, which makes navigating the DOM more enjoyable.
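To give a feel for the tree-based style, here is a minimal sketch using lxml's `lxml.html.fromstring` and XPath. It assumes lxml is installed; the `SAMPLE_HTML` markup, the `results` id and the `item` class are invented for the example and do not come from any real site.

```python
import lxml.html

# Made-up markup standing in for a page from the site being scraped.
SAMPLE_HTML = """
<div id="results">
  <a class="item" href="/first">First</a>
  <a class="item" href="/second">Second</a>
  <a href="/other">Other</a>
</div>
"""

# Parse the markup into an element tree.
doc = lxml.html.fromstring(SAMPLE_HTML)

# Select only the links with class="item" that sit directly under the results div.
item_links = doc.xpath('//div[@id="results"]/a[@class="item"]')

# Pull the attribute and the visible text out of each matched element.
hrefs = [a.get("href") for a in item_links]
titles = [a.text_content().strip() for a in item_links]

print(hrefs)
print(titles)
```

pyquery sits on top of lxml and lets you write the same kind of lookup as a CSS selector in the jQuery style, which is the syntactic sugar mentioned above.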

Here are some resources you might find useful in case you decide to give it a try:

  • lxml home page
  • pyquery home page
  • BeautifulSoup vs lxml benchmark
  • Win installer for pyquery built against Python 2.7 - I had a hard time setting up pyquery :)

Well, that's my 2c :) I hope this helps.
