lxml / BeautifulSoup parser warning
Question
Using Python 3, I'm trying to parse ugly HTML (which is not under my control) by using lxml with BeautifulSoup, as explained here: http://lxml.de/elementsoup.html
Specifically, I want to work with lxml, but with BeautifulSoup doing the parsing because, like I said, it's ugly HTML and lxml will reject it on its own.
The link above says: "All you need to do is pass it to the fromstring() function:"
from lxml.html.soupparser import fromstring
root = fromstring(tag_soup)
This is what I'm doing:
import requests
from lxml.html.soupparser import fromstring
URL = 'http://some-place-on-the-internet.com'
html_goo = requests.get(URL).text
root = fromstring(html_goo)
It works in the sense that I can manipulate the HTML just fine after that. My problem is that every time I run the script, I get this annoying warning:
/usr/lib/python3/dist-packages/bs4/__init__.py:166: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
To get rid of this warning, change this:
BeautifulSoup([your markup])
to this:
BeautifulSoup([your markup], "html.parser")
markup_type=markup_type))
My problem is perhaps obvious: I'm not instantiating BeautifulSoup myself. I've tried adding the proposed parameter to the fromstring function, but that just gives me the error: TypeError: 'str' object is not callable. Searches online have proven fruitless so far.
I'd like to get rid of that warning message. Help appreciated, thanks in advance.
Recommended answer
I had to read lxml's and BeautifulSoup's source code to figure this out.
I'm posting my own answer here in case someone else needs it in the future.
The fromstring function in question is defined like so:
def fromstring(data, beautifulsoup=None, makeelement=None, **bsargs):
The **bsargs arguments end up being forwarded to the BeautifulSoup constructor, which is called like so (in another function, _parse):
tree = beautifulsoup(source, **bsargs)
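To make the forwarding concrete, here is a minimal sketch that uses stand-ins only. FakeSoup, parse_like and fromstring_like are hypothetical names invented for illustration; they copy the shape of the calls quoted above and are not the real lxml or bs4 code.

```python
# Hypothetical stand-ins; none of this is the real lxml or bs4 code.
class FakeSoup:
    """Records what it was constructed with."""

    def __init__(self, markup="", features=None, **kwargs):
        self.markup = markup
        self.features = features


def parse_like(source, beautifulsoup=None, **bsargs):
    # Same shape as the quoted line: tree = beautifulsoup(source, **bsargs)
    if beautifulsoup is None:
        beautifulsoup = FakeSoup
    return beautifulsoup(source, **bsargs)


def fromstring_like(data, beautifulsoup=None, makeelement=None, **bsargs):
    # Any extra keyword argument lands in bsargs and is passed on untouched.
    return parse_like(data, beautifulsoup, **bsargs)


tree = fromstring_like("<p>tag soup", features="html.parser")
print(tree.features)
```

The keyword given to the outer function reaches the constructor unchanged, which is the behaviour the rest of this answer relies on.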
The BeautifulSoup constructor is defined like so:
def __init__(self, markup="", features=None, builder=None,
parse_only=None, from_encoding=None, exclude_encodings=None,
**kwargs):
Now, back to the warning in the question, which recommends that the argument "html.parser" be added to BeautifulSoup's constructor. According to the signature above, that is the argument named features.
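One way to check which parameter the warning's second positional argument maps to is to bind the call against the signature. The sketch below uses a hypothetical function, constructor_like, that only copies the parameter list quoted above; it is not the real BeautifulSoup class.

```python
import inspect


def constructor_like(self, markup="", features=None, builder=None,
                     parse_only=None, from_encoding=None,
                     exclude_encodings=None, **kwargs):
    """Hypothetical stand-in with the same parameter list as the quoted __init__."""


signature = inspect.signature(constructor_like)

# Mimic BeautifulSoup([your markup], "html.parser"); None stands in for self.
bound = signature.bind(None, "<p>tag soup", "html.parser")
print(bound.arguments["features"])
```

The string binds to features, so passing features="html.parser" by keyword is equivalent to the positional form shown in the warning.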
Since the fromstring function passes named arguments on to BeautifulSoup's constructor, we can specify the parser by passing it to fromstring as a keyword argument, like so:
root = fromstring(html_goo, features='html.parser')
The warning disappears.
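The fromstring signature also suggests a likely cause of the TypeError from the question, assuming the parser name was passed as a second positional argument (the question does not say exactly how it was added). In that case the string binds to the beautifulsoup parameter, which is later called as if it were a class. The sketch below reproduces this with hypothetical stand-ins, fake_soup and fromstring_like, not the real library code.

```python
def fake_soup(source, **kwargs):
    # Hypothetical placeholder for the real BeautifulSoup class.
    return (source, kwargs)


def fromstring_like(data, beautifulsoup=None, makeelement=None, **bsargs):
    # Hypothetical stand-in: calls whatever `beautifulsoup` holds.
    if beautifulsoup is None:
        beautifulsoup = fake_soup
    return beautifulsoup(data, **bsargs)


try:
    # Positional: "html.parser" binds to `beautifulsoup`, not to `features`.
    fromstring_like("<p>tag soup", "html.parser")
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# Keyword: the value travels through **bsargs instead.
result = fromstring_like("<p>tag soup", features="html.parser")
```

Under that assumption, only the keyword form reaches the constructor, which matches the fix shown above.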