Python Webscraping beautifulsoup avoid repetition in find_all()

Posted: 2021/9/24 19:00:17 | Tags: python, html, web-scraping, beautifulsoup

Question

I am working on web scraping in Python using beautifulsoup. I am trying to extract text in bold or italics or both. Consider the following HTML snippet.

<b><i>HelloWorld</i></b>

If I use the command sp.find_all(['i', 'b']), understandably, I get two results, one corresponding to bold and the other to italics. i.e.

['HelloWorld', 'HelloWorld']

My question is: is there a way to extract the text only once and also get the tags? My desired output is something like -

tag : text - HelloWorld, tagnames : [b,i]

Please note that comparing the text and weeding out non-unique occurrences of the text is not a feasible option, since I might have 'HelloWorld' repeated many times in the text, which I would want to extract.

Thanks!

Answer

The most natural way of finding nodes that have <i>, <b>, or both among their ancestors would be XPath:

//node()[ancestor::i or ancestor::b]

Instead of node() you could use text() to find text nodes, or * to find elements, depending on the situation. This would not select any duplicates, and it does not care in what order <i> and <b> are nested.
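
As a rough illustration of the XPath idea with lxml, the sketch below uses the text() variant and then walks up from each selected text node to collect the matching tag names. The function name styled_text_nodes and the sample markup are made up for this example and are not part of the original answer. The tail handling reflects how lxml attaches trailing text to the preceding element rather than to its real parent.

```python
from itertools import chain

from lxml import html


def styled_text_nodes(markup, tags=("b", "i")):
    """Return (text, sorted tag names) for each text node nested inside any of `tags`.

    Hypothetical helper for illustration only.
    """
    root = html.fromstring(markup)
    # Build e.g. "ancestor::b or ancestor::i" from the requested tag names.
    condition = " or ".join(f"ancestor::{tag}" for tag in tags)
    results = []
    for text in root.xpath(f".//text()[{condition}]"):
        element = text.getparent()
        if text.is_tail:
            # Tail text hangs off the preceding sibling in lxml,
            # so its real parent is one level further up.
            element = element.getparent()
        # Collect the enclosing element and all of its ancestors that match.
        names = {
            el.tag
            for el in chain([element], element.iterancestors())
            if el.tag in tags
        }
        results.append((str(text), sorted(names)))
    return results


sample = "<p>Plain <b><i>HelloWorld</i></b> and <i>HelloWorld</i> again</p>"
for text, names in styled_text_nodes(sample):
    print(f"text - {text}, tagnames : {names}")
```

Each text node is reported once, so a repeated 'HelloWorld' elsewhere in the page would still show up as its own entry with its own tag list.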

The issue with this idea is that BeautifulSoup does not support XPath. For this reason, I would use lxml instead of BeautifulSoup for web scraping.
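
If switching libraries is not an option, something similar may be possible inside BeautifulSoup by iterating over text nodes instead of tags and inspecting each node's parents. This is only a sketch of a workaround, not something the answer above proposes; the helper name styled_strings and the sample markup are assumptions, and it relies on find_all(string=True) and the .parents generator from bs4.

```python
from bs4 import BeautifulSoup


def styled_strings(markup, tags=("b", "i")):
    """Return (text, sorted tag names) for each string nested inside any of `tags`.

    Hypothetical helper for illustration only.
    """
    soup = BeautifulSoup(markup, "html.parser")
    results = []
    # string=True matches text nodes, so each piece of text is visited once
    # no matter how many matching tags wrap it.
    for node in soup.find_all(string=True):
        names = sorted({parent.name for parent in node.parents if parent.name in tags})
        # Skip text outside the requested tags and whitespace-only strings.
        if names and node.strip():
            results.append((str(node), names))
    return results


sample = "<p>Plain <b><i>HelloWorld</i></b> and <i>HelloWorld</i> again</p>"
for text, names in styled_strings(sample):
    print(f"text - {text}, tagnames : {names}")
```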
