Python Webscraping beautifulsoup avoid repetition in find_all()
I am working on web scraping in Python using BeautifulSoup. I am trying to extract text that is bold, italic, or both. Consider the following HTML snippet.
<div>
<b>
<i>
HelloWorld
</i>
</b>
</div>
If I use the command sp.find_all(['i', 'b']), understandably, I get two results, one corresponding to bold and the other to italics, i.e.
['<b><i>HelloWorld</i></b>', '<i>HelloWorld</i>']
My question is: is there a way to extract the text only once and get all of the tags that apply to it? My desired output is something like:
tag : text - HelloWorld, tagnames : [b,i]
Please note that comparing the text and weeding out non-unique occurrences of the text is not a feasible option, since I might have 'HelloWorld' repeated many times in the text, which I would want to extract.
Thanks!
The most natural way of finding nodes that have <b> or <i> (or both) among their ancestors would be XPath:
//node()[ancestor::i or ancestor::b]
Instead of node() you could use text() to find text nodes, or * to find elements, depending on the situation. This selects each matching node only once, and it does not care in what order <i> and <b> are nested.
The issue with this idea is that BeautifulSoup does not support XPath. For this reason, I would use lxml instead of BeautifulSoup for web scraping.
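As a rough illustration of the lxml route, the sketch below runs the text() variant of the XPath above and then walks each text node's ancestors to collect the <b>/<i> tag names. The sample markup, the function name styled_text and the STYLE_TAGS constant are made up for this example, and it assumes lxml is installed; treat it as one possible approach, not a definitive implementation.

```python
from lxml import html  # third-party: pip install lxml

# Hypothetical sample markup, extended from the question's snippet.
SNIPPET = (
    "<div><b><i>HelloWorld</i></b> plain <i>HelloWorld</i> "
    "<b>bold <i>both</i> tail</b></div>"
)

STYLE_TAGS = ("b", "i")  # assumption: only these two tags are of interest


def styled_text(markup):
    """Return (text, tag names) for every text node inside <b> or <i>."""
    tree = html.fromstring(markup)
    results = []
    # text() selects text nodes, so nested markup yields each string only once.
    for node in tree.xpath("//text()[ancestor::i or ancestor::b]"):
        text = node.strip()
        if not text:
            continue  # ignore whitespace-only nodes between tags
        # For tail text, lxml's getparent() returns the preceding sibling
        # element, so step up once more to reach the enclosing element.
        parent = node.getparent()
        if node.is_tail:
            parent = parent.getparent()
        chain = [parent, *parent.iterancestors()]  # innermost first
        tags = [el.tag for el in reversed(chain) if el.tag in STYLE_TAGS]
        results.append((text, tags))
    return results


for text, tags in styled_text(SNIPPET):
    print(f"text - {text}, tagnames : {tags}")
```

Because the query selects text nodes and not elements, the two HelloWorld occurrences in the sample are reported separately, the nested one with ['b', 'i'] and the standalone one with ['i'], without any comparison of the text itself.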