How do I write a BeautifulSoup strainer that only parses objects with certain text between the tags?

Question

I'm using Django and Python 3.7. I want more efficient parsing, so I have been reading about SoupStrainer objects. I created a custom one to help me parse only the elements I need ...

def my_custom_strainer(self, elem, attrs):
    for attr in attrs:
        print("attr:" + attr + "=" + attrs[attr])
    if elem == 'div' and 'class' in attrs and attrs['class'] == "score":
        return True
    elif elem == "span" and elem.text == re.compile("my text"):
        return True

article_stat_page_strainer = SoupStrainer(self.my_custom_strainer)
soup = BeautifulSoup(html, features="html.parser", parse_only=article_stat_page_strainer)

One of the conditions is I only want to parse "span" elements whose text matches a certain pattern. Hence the

elem == "span" and elem.text == re.compile("my text")

clause. However, this results in an

AttributeError: 'str' object has no attribute 'text'

error when I try and run the above. What's the proper way to write my strainer?

Answer

TL;DR: No, this is not currently easy to do in BeautifulSoup; it would require modifying the BeautifulSoup and SoupStrainer objects.

Explanation:

The problem is that the function passed to the strainer is called from the handle_starttag() method. As you might guess, at that point only the contents of the opening tag are available (e.g. the element name and attrs).

https://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/bs4/__init__.py#L524

if (self.parse_only and len(self.tagStack) <= 1
    and (self.parse_only.text
         or not self.parse_only.search_tag(name, attrs))):
    return None

As you can see, if your strainer function returns False, the element is discarded immediately, with no chance to take its inner text into consideration (unfortunately).
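To see why the text is not available at that point, here is a minimal sketch using the standard library's html.parser.HTMLParser, which BeautifulSoup's "html.parser" builder is built on. The EventLogger class and the sample markup are made up for illustration and are not part of BeautifulSoup; the sketch only shows the order in which the callbacks fire.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records each parser callback in the order it fires."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Only the tag name and its attributes are known here;
        # the text inside the element has not been read yet.
        self.events.append(("start", tag, dict(attrs)))

    def handle_data(self, data):
        # The inner text arrives in a separate, later callback.
        self.events.append(("data", data))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


logger = EventLogger()
logger.feed('<span class="note">my text</span>')
logger.close()

for event in logger.events:
    print(event)
```

The start-tag event is recorded before the data event, so any decision made while handling the start tag cannot depend on the element's text.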

On the other hand, if you pass a text argument to the strainer:

SoupStrainer(text="my text")

it will search the text inside the tags, but that search has no context about the enclosing element or its attributes. You can see the irony :/

Combining the two will simply find nothing. You can't even access the parent element the way you can in a find function, as shown here: https://gist.github.com/RichardBronosky/4060082

So currently strainers are only good for filtering on elements and attrs. You would need to change a lot of Beautiful Soup code to get text-based filtering working.

If you really need this, I suggest subclassing BeautifulSoup and SoupStrainer and modifying their behavior.
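A simpler alternative that avoids subclassing, and that is not part of the original answer, is to split the work into two passes: strain on tag names at parse time, then apply the attribute and text conditions to the reduced tree with find_all(). The sketch below assumes a bs4 version that supports the string argument to find_all() (4.4.0 or later); the sample HTML is made up for illustration.

```python
import re

from bs4 import BeautifulSoup, SoupStrainer

# Made-up sample document for illustration only.
html = """
<html><body>
  <div class="score">42</div>
  <div class="other">skip me</div>
  <span>contains my text here</span>
  <span>something else</span>
  <p>never parsed</p>
</body></html>
"""

# Pass 1: strain on tag names only, which is what a strainer
# can reliably decide from the opening tag.
strainer = SoupStrainer(["div", "span"])
soup = BeautifulSoup(html, features="html.parser", parse_only=strainer)

# Pass 2: apply the attribute and text conditions to the smaller tree.
score_divs = soup.find_all("div", class_="score")
matching_spans = soup.find_all("span", string=re.compile("my text"))

print(score_divs)
print(matching_spans)
```

This still builds every div and span before filtering, so it saves less work than a text-aware strainer would, but it keeps the parse restricted to the relevant tag names.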
