Parsing Source Code (Python) Approach: Beautiful Soup, lxml, html5lib difference?


Question


I have a large chunk of HTML source code I would like to parse (~200,000 lines), and I'm fairly certain there is some poor formatting throughout. I've been researching parsers, and it seems Beautiful Soup, lxml, and html5lib are the most popular. From reading this site, lxml appears to be the most commonly used and fastest, while Beautiful Soup is slower but handles more errors and variation.


I'm a little confused by the Beautiful Soup documentation, http://www.crummy.com/software/BeautifulSoup/bs4/doc/, and commands like BeautifulSoup(markup, "lxml") or BeautifulSoup(markup, "html5lib"). In such instances, is it using both Beautiful Soup and html5lib/lxml? Speed is not really an issue here, but accuracy is. The end goal is to get the source code using urllib2, and retrieve all the text data from the file as if I were to just copy/paste the webpage.


P.S. Is there any way to parse the file without returning any whitespace that was not present in the webpage view?

Answer


My understanding (having used BeautifulSoup for a handful of things) is that it is a wrapper for parsers like lxml or html5lib. Using whichever parser is specified (I believe the default is HTMLParser, the default parser for Python), BeautifulSoup creates a tree of tag elements that makes it quite easy to navigate and search the HTML for useful data contained within tags. If you really just need the text from the webpages, and not more specific data from specific HTML tags, you might only need a code snippet similar to this:

from bs4 import BeautifulSoup
import urllib2

# Fetch the page and parse it with BeautifulSoup's default parser
soup = BeautifulSoup(urllib2.urlopen("http://www.google.com"))
text = soup.get_text()


get_text isn't that great with complex webpages (it occasionally picks up stray JavaScript or CSS), but if you get the hang of how to use BeautifulSoup, it shouldn't be hard to get only the text you want.
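One common way to avoid that JavaScript/CSS noise is to remove the script and style elements from the tree before calling get_text. A minimal sketch (Python 3 here; the sample markup and variable names are illustrative, not from the original post):

```python
from bs4 import BeautifulSoup

# A small stand-in for a fetched page, with script/style noise mixed in
html = """
<html><head><style>body { color: red; }</style></head>
<body><p>Hello</p><script>var x = 1;</script><p>World</p></body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Remove <script> and <style> elements so get_text() returns only visible text
for tag in soup(["script", "style"]):
    tag.decompose()

# stripped_strings also drops the whitespace-only runs between tags
text = " ".join(soup.stripped_strings)
print(text)  # Hello World
```

Using stripped_strings instead of a bare get_text() also addresses the P.S. above, since it skips whitespace that only exists in the source formatting.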


For your purposes it seems like you don't need to worry about getting one of those other parsers to use with BeautifulSoup (html5lib or lxml). BeautifulSoup can deal with some sloppiness on its own, and if it can't, it will give an obvious error about "malformed HTML" or something of the sort, and that would be an indication to install html5lib or lxml.
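A minimal sketch of that idea, assuming a Python 3 environment: bs4 raises FeatureNotFound when you request a parser that isn't installed, so you can prefer lxml and fall back to the stdlib parser. (The markup string is illustrative.)

```python
from bs4 import BeautifulSoup, FeatureNotFound

markup = "<p>Hello<p>World"

# Prefer lxml for speed; fall back to the stdlib parser if lxml isn't installed.
# The second argument to BeautifulSoup names the underlying parser backend.
try:
    soup = BeautifulSoup(markup, "lxml")
except FeatureNotFound:
    soup = BeautifulSoup(markup, "html.parser")

print(soup.get_text())
```

Either way, the extracted text is the same here; the parser choice mainly matters for how badly malformed HTML gets repaired into a tree.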
