Set lxml as the default BeautifulSoup parser
Problem description
I'm working on a web scraping project and have run into problems with speed. To try to fix it, I want to use lxml instead of html.parser as BeautifulSoup's parser. I've been able to do this:
soup = bs4.BeautifulSoup(html, 'lxml')
but I don't want to have to type 'lxml' every time I call BeautifulSoup. Is there a way to set which parser to use once, at the beginning of my program?
Solution
According to the "Specifying the parser to use" section of the documentation:
The first argument to the BeautifulSoup constructor is a string or an open filehandle–the markup you want parsed. The second argument is how you’d like the markup parsed.
If you don’t specify anything, you’ll get the best HTML parser that’s installed. Beautiful Soup ranks lxml’s parser as being the best, then html5lib’s, then Python’s built-in parser.
In other words, just installing lxml in the same Python environment makes it the default parser.
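To see which parser BeautifulSoup picks when none is given, you can inspect the tree builder attached to the resulting object. This is only a minimal sketch: the helper name default_parser_name is made up for illustration, and it assumes the builder object exposes a NAME attribute, as it does in the bs4 releases I am aware of.

```python
import warnings

import bs4


def default_parser_name(markup="<p>probe</p>"):
    """Return the name of the parser bs4 chooses when none is specified.

    Hypothetical helper, not part of bs4 itself.
    """
    with warnings.catch_warnings():
        # Recent bs4 versions emit a warning when the parser is guessed;
        # silence it here because guessing is exactly what we want to observe.
        warnings.simplefilter("ignore")
        soup = bs4.BeautifulSoup(markup)
    # Assumption: the chosen tree builder carries its name in `NAME`.
    return soup.builder.NAME


print(default_parser_name())
```

Running this before and after installing lxml should show the default changing, which is a quick way to confirm the environment is set up as expected.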
Note, though, that explicitly specifying a parser is considered best practice. Differences between parsers can result in subtle errors that would be difficult to debug if you let BeautifulSoup choose the best parser by itself. You would also have to remember to have lxml installed. And if it is not installed, you might not even notice: BeautifulSoup would just fall back to the next available parser without raising an error.
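One way to guard against that silent fallback is to check at startup that the parser's module can actually be imported. The sketch below uses only the standard library. The function name resolve_parser and its fallback behaviour are assumptions made for illustration, not something bs4 provides. It also relies on the module name matching the parser name, which holds for lxml and html5lib.

```python
import importlib.util
import warnings


def resolve_parser(preferred="lxml", fallback=None):
    """Return `preferred` if its module is installed.

    If it is missing, warn and return `fallback` when one is given,
    otherwise raise ImportError. Hypothetical helper for illustration.
    """
    if importlib.util.find_spec(preferred) is not None:
        return preferred
    if fallback is not None:
        warnings.warn(
            f"{preferred!r} is not installed; falling back to {fallback!r}",
            RuntimeWarning,
        )
        return fallback
    raise ImportError(f"{preferred!r} is not installed")


# Decide once, at program start, and make any fallback visible.
PARSER = resolve_parser("lxml", fallback="html.parser")
print(PARSER)
```

The resulting PARSER value can then be passed as the second argument to every BeautifulSoup call, so a missing lxml shows up as a warning or an error rather than as slower or subtly different parsing.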
If you still don't want to specify the parser explicitly, at least leave a note in the project's README/documentation for your future self and for others who use your code, and list lxml in your project requirements alongside beautifulsoup4.
Besides: "Explicit is better than implicit."
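If the goal is simply to avoid retyping 'lxml', one option that stays explicit is to bind the parser once with functools.partial and call the resulting function everywhere else. This is a sketch rather than part of the quoted documentation. The name make_soup is made up, and it assumes the constructor's second parameter is called features, as it is in the bs4 versions I am aware of.

```python
from functools import partial

import bs4

# Bind the parser once, near the top of the program.
# `make_soup` is a hypothetical name; `features` is the keyword form of
# the second positional argument used in bs4.BeautifulSoup(html, 'lxml').
make_soup = partial(bs4.BeautifulSoup, features="lxml")

soup = make_soup("<p>Hello</p>")
print(soup.p.string)
```

Because the parser is named explicitly, a missing lxml installation should surface as an error at the first call instead of a silent switch to another parser.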