Grabbing different elements with BeautifulSoup: avoid duplicating nested elements

Views: 264 Posted: 2016/8/5 19:09:02 Tags: python beautifulsoup html5lib

Problem description

I want to grab different content (classes) from a locally saved website (the Python documentation) using BeautifulSoup4, so I use the code below. Here index.html is a saved copy of https://docs.python.org/3/library/stdtypes.html

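from bs4 import BeautifulSoup

# Parse the saved page; the parser is named explicitly and the file is assumed to be UTF-8
with open("index.html", encoding="utf-8") as html:
    soup = BeautifulSoup(html, "html.parser")

classes = soup.find_all('dl', attrs={'class': ['class', 'method', 'function', 'describe', 'attribute', 'data', 'classmethod', 'staticmethod']})

# Mode 'w' already truncates the file, so the no-op `f.truncate` line is dropped
with open('test.html', 'w', encoding="utf-8") as f:
    print(classes, file=f)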

The file handle is only used to write out the results and has no effect on the problem itself.

My problem is that the results are nested. For example, the method __eq__(exporter) is found twice: (1) inside its class and (2) as a standalone method.

So I want to remove every result that sits inside another result, so that all results are on the same hierarchical level. How can I do this? Or is it even possible to "ignore" that nested content in the first step? I hope you understand what I mean.

Recommended answer

You cannot tell find_all() to ignore nested dl elements; all you can do is skip any match that appears in the .descendants of an element you have already collected:

matches = []
for dl in soup.find_all('dl', attrs={'class': ['class', 'method', 'function', 'describe', 'attribute', 'data', 'classmethod', 'staticmethod']}):
    # find_all() returns elements in document order, so parents come before their children
    if any(dl in m.descendants for m in matches):
        # child of already found element
        continue
    matches.append(dl)
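
To see the top-level filter in action, here is a minimal sketch on a small hand-written document. The HTML string, the id attributes and the names HTML, WANTED, top_level and top_level_ids are all made up for illustration and are not taken from the real Python docs page. It also checks ancestry by identity through .parents rather than with "in" on .descendants, on the assumption that tag comparison in bs4 is by value, so two distinct but identical-looking tags could otherwise be confused.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the docs markup: a class entry that contains a
# method entry, followed by an unrelated top-level function entry.
HTML = """
<dl class="class" id="outer">
  <dt>class memoryview</dt>
  <dd>
    <dl class="method" id="inner"><dt>__eq__(exporter)</dt><dd>Compare.</dd></dl>
  </dd>
</dl>
<dl class="function" id="standalone"><dt>len(s)</dt><dd>Length.</dd></dl>
"""

WANTED = ["class", "method", "function"]  # shortened class list for the demo

soup = BeautifulSoup(HTML, "html.parser")

top_level = []
for dl in soup.find_all("dl", attrs={"class": WANTED}):
    # A parent is always visited before its children, so it is enough to
    # check whether any ancestor of dl has already been kept.
    if any(parent is kept for parent in dl.parents for kept in top_level):
        continue
    top_level.append(dl)

top_level_ids = [dl["id"] for dl in top_level]
print(top_level_ids)  # ['outer', 'standalone']
```

The nested method entry is still present inside the kept class entry; it is just no longer reported a second time.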

If you want to keep the nested elements and drop their parents instead, use:

matches = []
for dl in soup.find_all('dl', attrs={'class': ['class', 'method', 'function', 'describe', 'attribute', 'data', 'classmethod', 'staticmethod']}):
    # drop any previously kept element that contains the current one
    matches = [m for m in matches if dl not in m.descendants]
    matches.append(dl)
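
The opposite filter can be tried on the same kind of toy input. This is only a sketch: the HTML string, the ids and the names HTML, WANTED, innermost and innermost_ids are invented for the example, and the ancestry test again uses identity via .parents as a cautious variation on the loop above.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the docs markup (not the real page).
HTML = """
<dl class="class" id="outer">
  <dt>class memoryview</dt>
  <dd>
    <dl class="method" id="inner"><dt>__eq__(exporter)</dt><dd>Compare.</dd></dl>
  </dd>
</dl>
<dl class="function" id="standalone"><dt>len(s)</dt><dd>Length.</dd></dl>
"""

WANTED = ["class", "method", "function"]  # shortened class list for the demo

soup = BeautifulSoup(HTML, "html.parser")

innermost = []
for dl in soup.find_all("dl", attrs={"class": WANTED}):
    ancestors = list(dl.parents)
    # Discard every element kept so far that is an ancestor of the current one,
    # then keep the current one; only matches without nested matches survive.
    innermost = [m for m in innermost if not any(m is a for a in ancestors)]
    innermost.append(dl)

innermost_ids = [dl["id"] for dl in innermost]
print(innermost_ids)  # ['inner', 'standalone']
```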

If you want to pull the tree apart and remove the matched elements from it, use:

matches = soup.find_all('dl', attrs={'class': ['class', 'method', 'function', 'describe', 'attribute', 'data', 'classmethod', 'staticmethod']})
for element in matches:
    element.extract()  # remove from tree (and from any parent `dl` match)
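
A small sketch of what extract() does to nested matches, again on invented input (the HTML string, the ids and the names HTML, WANTED, outer, inner, outer_text and inner_text are assumptions for the demo). Note that this approach modifies the parsed tree itself, so the soup no longer contains the matched elements afterwards.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the docs markup (not the real page).
HTML = """
<dl class="class" id="outer">
  <dt>class memoryview</dt>
  <dd>
    <dl class="method" id="inner"><dt>__eq__(exporter)</dt><dd>Compare.</dd></dl>
  </dd>
</dl>
"""

WANTED = ["class", "method"]  # shortened class list for the demo

soup = BeautifulSoup(HTML, "html.parser")

matches = soup.find_all("dl", attrs={"class": WANTED})
for element in matches:
    # Detach the element from whatever currently contains it. The outer dl is
    # detached from the soup first, then the inner dl is detached from the outer.
    element.extract()

outer, inner = matches
outer_text = outer.get_text()
inner_text = inner.get_text()

print("__eq__" in outer_text)  # False: the method is no longer inside the class
print("__eq__" in inner_text)  # True: the method survives as its own element
```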

However, you may want to adjust how you extract the text instead.
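
The answer does not say what that adjustment would look like. One possible reading, sketched here under that assumption, is to keep all matches but read only each entry's own dt line, so nested content is never repeated. The HTML string and the names HTML, WANTED and signatures are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the docs markup (not the real page).
HTML = """
<dl class="class">
  <dt>class memoryview</dt>
  <dd>
    <dl class="method"><dt>__eq__(exporter)</dt><dd>Compare.</dd></dl>
  </dd>
</dl>
<dl class="function"><dt>len(s)</dt><dd>Length.</dd></dl>
"""

WANTED = ["class", "method", "function"]  # shortened class list for the demo

soup = BeautifulSoup(HTML, "html.parser")

signatures = []
for dl in soup.find_all("dl", attrs={"class": WANTED}):
    # recursive=False restricts the search to direct children, so the dt of a
    # nested dl is not picked up when processing its parent.
    dt = dl.find("dt", recursive=False)
    if dt is not None:
        signatures.append(dt.get_text(strip=True))

print(signatures)  # ['class memoryview', '__eq__(exporter)', 'len(s)']
```

Each entry then appears exactly once, at the cost of leaving out the description text in the dd elements.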
