lxml / BeautifulSoup parser warning

Problem description

Using Python 3, I'm trying to parse ugly HTML (which is not under my control) by using lxml with BeautifulSoup as explained here: http://lxml.de/elementsoup.html

Specifically, I want to work with lxml, but have BeautifulSoup do the parsing because, as I said, the HTML is ugly and lxml will reject it on its own.

The link above says: "All you need to do is pass it to the fromstring() function:"

from lxml.html.soupparser import fromstring
root = fromstring(tag_soup)

This is what I'm doing:

import requests
from lxml.html.soupparser import fromstring

URL = 'http://some-place-on-the-internet.com'
html_goo = requests.get(URL).text
root = fromstring(html_goo)

It works in the sense that I can manipulate the HTML just fine after that. My problem is that every time I run the script, I receive this annoying warning:

/usr/lib/python3/dist-packages/bs4/__init__.py:166: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

To get rid of this warning, change this:

 BeautifulSoup([your markup])

to this:

 BeautifulSoup([your markup], "html.parser")

  markup_type=markup_type))

My problem is perhaps obvious: I'm not instantiating BeautifulSoup myself. I've tried adding the proposed parameter to the fromstring function, but that just gives me the error: TypeError: 'str' object is not callable. Searches online have proven fruitless so far.

I'd like to get rid of that warning message. Help appreciated, thanks in advance.

Recommended answer

I had to read lxml's and BeautifulSoup's source code to figure this out.

I'm posting my own answer here, in case someone else may need it in the future.

The fromstring function in question is defined like so:

def fromstring(data, beautifulsoup=None, makeelement=None, **bsargs):

The **bsargs arguments end up being forwarded to the BeautifulSoup constructor, which is called like so (in another function, _parse):

tree = beautifulsoup(source, **bsargs)
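To make the forwarding concrete, here is a minimal sketch that mimics the call chain with stand-ins. StandInSoup and the simplified fromstring and _parse below are hypothetical and are not the real lxml or BeautifulSoup code; only the parameter layout is taken from the signatures quoted in this answer. It also suggests why passing "html.parser" positionally fails with the TypeError from the question: the second positional slot of fromstring is beautifulsoup, so the string ends up being called.

```python
import warnings


class StandInSoup:
    """Hypothetical stand-in for the BeautifulSoup constructor (not the real class)."""

    def __init__(self, markup="", features=None, **kwargs):
        # Imitates the behaviour described in the question: warn when no parser is named.
        if features is None:
            warnings.warn("No parser was explicitly specified", UserWarning)
        self.markup = markup
        self.features = features


def _parse(source, beautifulsoup, **bsargs):
    # Simplified assumption: fall back to the stand-in class when nothing is passed.
    if beautifulsoup is None:
        beautifulsoup = StandInSoup
    # Same call shape as the line quoted above.
    return beautifulsoup(source, **bsargs)


def fromstring(data, beautifulsoup=None, makeelement=None, **bsargs):
    # Simplified: returns the soup object itself instead of building an lxml tree.
    return _parse(data, beautifulsoup, **bsargs)


# Keyword argument: collected into **bsargs and forwarded to the constructor.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    soup = fromstring("<p>hi</p>", features="html.parser")
forwarded_features = soup.features
warning_count = len(caught)

# Positional argument: binds to the `beautifulsoup` parameter, so the string gets called.
try:
    fromstring("<p>hi</p>", "html.parser")
    positional_error = None
except TypeError as exc:
    positional_error = str(exc)

print(forwarded_features, warning_count, positional_error)
```

Only keyword arguments that fromstring does not name itself land in **bsargs, which is why the parser has to be passed by name.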

The BeautifulSoup constructor is defined like so:

def __init__(self, markup="", features=None, builder=None,
             parse_only=None, from_encoding=None, exclude_encodings=None,
             **kwargs):

Now, back to the warning in the question, which recommends adding the argument "html.parser" to BeautifulSoup's constructor. According to the signature above, that is the second parameter, named features.

Since the fromstring function will pass on named arguments to BeautifulSoup's constructor, we can specify the parser by naming the argument to the fromstring function, like so:

root = fromstring(clean, features='html.parser')

The warning disappears.
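For anyone who wants to check this on their own system, the following is a small sketch that parses a malformed snippet and records any warnings raised during the call. The sample markup is made up for illustration, and the import is guarded so the script still runs when lxml or BeautifulSoup is not installed.

```python
import warnings

# Made-up sample of malformed HTML (unclosed <p> and <b>), purely for illustration.
html_goo = "<html><body><p>Unclosed paragraph<br><b>bold text</body>"

try:
    from lxml.html.soupparser import fromstring
except ImportError:
    fromstring = None  # lxml or BeautifulSoup is not installed

root = None
caught = []
if fromstring is not None:
    # Record every warning raised while parsing instead of printing it.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        root = fromstring(html_goo, features="html.parser")
    print(root.tag)
    print([str(w.message) for w in caught])
```

If the list of recorded messages does not contain the "No parser was explicitly specified" text, the features argument reached BeautifulSoup as intended.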
