Parsing Source Code (Python) Approach: Beautiful Soup, lxml, html5lib difference?


Problem Description


I have a large HTML source file I would like to parse (~200,000 lines), and I'm fairly certain there is some poor formatting throughout. I've been researching parsers, and it seems Beautiful Soup, lxml, and html5lib are the most popular. From reading this website, it seems lxml is the most commonly used and fastest, while Beautiful Soup is slower but handles more errors and variation.

I'm a little confused by the Beautiful Soup documentation, http://www.crummy.com/software/BeautifulSoup/bs4/doc/, and commands like BeautifulSoup(markup, "lxml") or BeautifulSoup(markup, "html5lib"). In such instances, is it using both Beautiful Soup and html5lib/lxml? Speed is not really an issue here, but accuracy is. The end goal is to get the source code using urllib2 and retrieve all the text data from the file, as if I were to just copy/paste the webpage.

P.S. Is there any way to parse the file without returning any whitespace that was not present in the webpage view?

Solution

My understanding (having used BeautifulSoup for a handful of things) is that it is a wrapper for parsers like lxml or html5lib. Using whichever parser is specified (I believe the default is html.parser, Python's built-in parser), BeautifulSoup creates a tree of tag elements that makes it quite easy to navigate and search the HTML for useful data contained within tags. If you really just need the text from the webpage and not more specific data from specific HTML tags, you might only need a code snippet similar to this:

from bs4 import BeautifulSoup
import urllib2

# Fetch the page and parse it with BeautifulSoup's default parser
soup = BeautifulSoup(urllib2.urlopen("http://www.google.com"))
print soup.get_text()

get_text isn't that great with complex webpages (it occasionally picks up stray JavaScript or CSS), but if you get the hang of how to use BeautifulSoup, it shouldn't be hard to extract only the text you want.
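One common workaround for the stray JavaScript/CSS problem (and for the extra-whitespace issue from the P.S.) is to remove the script and style elements from the tree before extracting text, then use stripped_strings instead of plain get_text. A minimal sketch, assuming the bs4 package is installed and using illustrative markup:

```python
from bs4 import BeautifulSoup

# Illustrative markup with script/style noise that get_text would otherwise pick up
html = """<html><head><style>body { color: red; }</style></head>
<body><script>var x = 1;</script><p>Hello, <b>world</b>!</p></body></html>"""

soup = BeautifulSoup(html, "html.parser")

# Remove <script> and <style> elements so their contents don't appear in the text
for tag in soup(["script", "style"]):
    tag.decompose()

# stripped_strings yields each text fragment with surrounding whitespace removed,
# which avoids the blank runs that plain get_text() can return
text = " ".join(soup.stripped_strings)
print(text)
```

Calling soup(["script", "style"]) is shorthand for soup.find_all on those tag names, and decompose removes each match from the tree entirely.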

For your purposes it seems like you don't need to worry about getting one of those other parsers (html5lib or lxml) to use with BeautifulSoup. BeautifulSoup can deal with some sloppiness on its own, and if it can't, it will give an obvious error about "malformed HTML" or something of the sort, which would be an indication to install html5lib or lxml.
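To make the wrapper relationship concrete: the second argument to the BeautifulSoup constructor only selects which engine tokenizes the raw HTML, while Beautiful Soup itself builds and navigates the tree. A sketch using the standard-library backend, with the optional third-party engines shown commented out since they require separate installation:

```python
from bs4 import BeautifulSoup

# Deliberately sloppy markup: unclosed <p>, mis-nested <b>/<i>
markup = "<p>Some <b>sloppy <i>markup</i></b> here"

# Beautiful Soup is the wrapper; the second argument picks the parser engine.
soup = BeautifulSoup(markup, "html.parser")   # stdlib, always available
# soup = BeautifulSoup(markup, "lxml")        # fastest; requires lxml installed
# soup = BeautifulSoup(markup, "html5lib")    # most browser-like repairs; requires html5lib

# Each engine may repair bad nesting differently, but the navigation API
# you get back is the same regardless of which one was used
print(soup.get_text())
```

Whichever engine is chosen, the resulting soup object exposes the same find, find_all, and get_text interface, which is why the answer above treats the parser choice as an implementation detail.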

