Set lxml as default BeautifulSoup parser


Problem description


I'm working on a web scraping project and have run into problems with speed. To try to fix it, I want to use lxml instead of html.parser as BeautifulSoup's parser. I've been able to do this:

soup = bs4.BeautifulSoup(html, 'lxml')

but I don't want to have to repeatedly type 'lxml' every time I call BeautifulSoup. Is there a way I can set which parser to use once at the beginning of my program?

Solution

According to the "Specifying the parser to use" page of the Beautiful Soup documentation:

The first argument to the BeautifulSoup constructor is a string or an open filehandle–the markup you want parsed. The second argument is how you’d like the markup parsed.

If you don’t specify anything, you’ll get the best HTML parser that’s installed. Beautiful Soup ranks lxml’s parser as being the best, then html5lib’s, then Python’s built-in parser.

In other words, just installing lxml in the same Python environment makes it the default parser.
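To see which parser Beautiful Soup actually picks in a given environment, one option is to build a soup without naming a parser and inspect the tree builder it chose. This is only a sketch: the helper name default_parser_name is hypothetical, and it assumes beautifulsoup4 is installed.

```python
import warnings

from bs4 import BeautifulSoup


def default_parser_name() -> str:
    """Return the name of the parser Beautiful Soup picks when none is given."""
    with warnings.catch_warnings():
        # Recent bs4 versions warn when they have to guess the parser;
        # silence that warning for this probe only.
        warnings.simplefilter("ignore")
        soup = BeautifulSoup("<p>probe</p>")
    # builder.NAME is a short label such as "lxml", "html5lib" or "html.parser".
    return soup.builder.NAME


print(default_parser_name())
```

If this prints anything other than lxml, lxml is not installed in the environment the script runs in, or is not importable there.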

Note, though, that explicitly stating the parser is considered best practice. Parsers differ in ways that can result in subtle errors, which are difficult to debug if you let BeautifulSoup choose the best parser by itself. You would also have to remember that lxml needs to be installed. If it is not installed, you would not even notice: BeautifulSoup would just use the next available parser without raising any errors.
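One way to keep the parser explicit without retyping 'lxml' at every call is to bind it once near the top of the program. The sketch below uses functools.partial from the standard library; the name make_soup is hypothetical, and the code assumes both beautifulsoup4 and lxml are installed.

```python
from functools import partial

from bs4 import BeautifulSoup

# Hypothetical helper: BeautifulSoup with the parser fixed to lxml.
# "features" is the keyword name of the constructor's second argument.
make_soup = partial(BeautifulSoup, features="lxml")

html = "<html><body><p>Hello</p></body></html>"
soup = make_soup(html)
print(soup.p.get_text())
```

Because the parser is named explicitly, a missing lxml installation raises bs4.FeatureNotFound instead of silently falling back to another parser.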

If you still don't want to specify the parser explicitly, at least leave a note in the project's README/documentation for your future self and for others who will use your code, and list lxml in your project requirements alongside beautifulsoup4.

Besides: "Explicit is better than implicit."
