Python Webscraping beautifulsoup avoid repetition in find_all()


Problem description


I am working on web scraping in Python using beautifulsoup. I am trying to extract text in bold or italics or both. Consider the following HTML snippet.

<div>
  <b> 
    <i>
      HelloWorld
    </i>
  </b>
</div>

If I use the command sp.find_all(['i', 'b']), understandably, I get two results, one corresponding to bold and the other to italics, i.e.

['<b><i>HelloWorld</i></b>', '<i>HelloWorld</i>']
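
Here is a minimal reproduction of that behaviour (parsing the snippet with bs4's built-in html.parser is my choice; the result is the same as above):

from bs4 import BeautifulSoup

html = "<div><b><i>HelloWorld</i></b></div>"
sp = BeautifulSoup(html, "html.parser")

# Both the <b> tag and the nested <i> tag match on their own,
# so the same text is returned twice.
print([str(t) for t in sp.find_all(['i', 'b'])])
# ['<b><i>HelloWorld</i></b>', '<i>HelloWorld</i>']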

My question is: is there a way to extract the text just once and also get the tag names? My desired output is something like -

tag : text - HelloWorld, tagnames : [b,i]

Please note that comparing the text and weeding out non-unique occurrences of the text is not a feasible option, since I might have 'HelloWorld' repeated many times in the text, which I would want to extract.
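
For example (an invented snippet, purely to illustrate the point), if 'HelloWorld' also appears somewhere else on the page, deduplicating by text cannot tell the nested duplicate apart from a genuine second occurrence:

from bs4 import BeautifulSoup

# Invented example: the same text occurs in two unrelated places.
html = "<div><b><i>HelloWorld</i></b><p><i>HelloWorld</i></p></div>"
sp = BeautifulSoup(html, "html.parser")

print([t.get_text() for t in sp.find_all(['i', 'b'])])
# ['HelloWorld', 'HelloWorld', 'HelloWorld']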

Thanks!

Solution

The most natural way of finding nodes that have <b> or <i> (or both) among their ancestors would be XPath:

//node()[ancestor::i or ancestor::b]

Instead of node() you could use text() to find text nodes, or * to find elements, depending on the situation. This would not select any duplicates and it does not care in what order <i> and <b> are nested.

The issue with this idea is that BeautifulSoup does not support XPath. For this reason, I would use lxml instead of BeautifulSoup for web scraping.
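
A minimal sketch of that approach with lxml (the XPath predicate follows the answer, using text() as suggested above; the surrounding code and the output formatting are illustrative assumptions, not part of the original answer):

from lxml import html

doc = html.fromstring("<div><b><i>HelloWorld</i></b></div>")

# Each text node with an <i> or <b> ancestor is returned exactly once,
# regardless of how the tags are nested.
for text_node in doc.xpath("//text()[ancestor::i or ancestor::b]"):
    element = text_node.getparent()  # the element this text node is attached to
    ancestors = [element] + list(element.iterancestors())
    tags = [e.tag for e in ancestors if e.tag in ("b", "i")]
    print(f"text - {text_node.strip()}, tagnames : {tags}")
# text - HelloWorld, tagnames : ['i', 'b']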
