Fetch complete list of items using BeautifulSoup, Python 3.6
Question
I am learning BeautifulSoup and have chosen the page https://www.bundesbank.de/dynamic/action/en/statistics/time-series-databases/time-series-databases/743796/743796?treeAnchor=BANKEN&statisticType=BBK_ITS to scrape the list of items for the topic "Banks and other financial corporations".
I need the items below, together with their child items, in a hierarchical format as shown in the attached image:
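- Banks
- Investment companies
- Insurance corporations and pension funds up to Q2 2016
- Insurance corporations as of Q3 2016
- Pension funds as of Q3 2016
- Payments statistics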
I tried the code below and then got stuck:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import csv
url = 'https://www.bundesbank.de/dynamic/action/en/statistics/time-series-databases/time-series-databases/743796/743796?treeAnchor=BANKEN&statisticType=BBK_ITS'
result = requests.get(url)
soup = BeautifulSoup(result.text, 'html.parser')
s = soup.find("div", class_="statisticTree")
I would also like to export the results to a CSV file.
Is it possible to export the parent-child relationships as shown in the image?
Recommended answer
You can do this recursively with the help of a function that returns a node's link text and a list of its children:
from pprint import pprint

import requests
from bs4 import BeautifulSoup

url = 'https://www.bundesbank.de/en/statistics/time-series-databases/time-series-databases/743796/openAll?treeAnchor=BANKEN&statisticType=BBK_ITS'
result = requests.get(url)
soup = BeautifulSoup(result.text, 'html.parser')


def get_child_nodes(parent_node):
    # The node's name is the text of its first link.
    node_name = parent_node.a.get_text(strip=True)
    result = {"name": node_name, "children": []}

    # Look only at the <ul> directly under this node.
    children_list = parent_node.find('ul', recursive=False)
    if not children_list:
        return result

    # Recurse into each direct <li> child of that list.
    for child_node in children_list('li', recursive=False):
        result["children"].append(get_child_nodes(child_node))

    return result


pprint(get_child_nodes(soup.find("div", class_="statisticTree")))
Note that it is important to search for the list and its items non-recursively (recursive=False is set), so that each call only matches direct children instead of grabbing grandchildren from further down the tree.
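To make the effect of recursive=False concrete, here is a minimal sketch on a hand-written HTML fragment. The markup is a hypothetical miniature and not the real Bundesbank page structure; the helper name labels is also just for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature tree; NOT the actual markup of the Bundesbank page.
html = """
<ul id="root">
  <li><a>Banks</a>
    <ul>
      <li><a>Minimum reserves</a></li>
      <li><a>Payments statistics</a></li>
    </ul>
  </li>
  <li><a>Investment companies</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
root = soup.find("ul", id="root")


def labels(items):
    """Return the text of the first link inside each <li>."""
    return [li.a.get_text(strip=True) for li in items]


# Default search: every <li> at any depth below root.
all_descendants = labels(root.find_all("li"))

# recursive=False: only the <li> elements that are direct children of root.
direct_children = labels(root.find_all("li", recursive=False))

print(all_descendants)
print(direct_children)
```

With the default search, the nested items show up alongside the top-level ones, which would flatten the hierarchy. Running the full script above against the live page, rather than this toy fragment, produces the nested structure that follows.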
This prints:
{'children': [{'children': [{'children': [{'children': [{'children': [],
                                                         'name': 'Reserve '
                                                                 'maintenance '
                                                                 'in the euro '
                                                                 'area'},
                                                        {'children': [],
                                                         'name': 'Reserve '
                                                                 'maintenance '
                                                                 'in Germany'}],
                                           'name': 'Minimum reserves'},
              ...
              {'children': [{'children': [], 'name': 'Bank accounts'},
                            {'children': [], 'name': 'Payment card functions'},
                            {'children': [], 'name': 'Accepting devices'},
                            {'children': [],
                             'name': 'Number of payment transactions'},
                            {'children': [],
                             'name': 'Value of payment transactions'},
                            {'children': [],
                             'name': 'Number of transactions per type of '
                                     'terminal'},
                            {'children': [],
                             'name': 'Value of transactions per type of '
                                     'terminal'},
                            {'children': [],
                             'name': 'Number of OTC transactions'},
                            {'children': [],
                             'name': 'Value of OTC transactions'},
                            {'children': [], 'name': 'Issuance of banknotes'}],
               'name': 'Payments statistics'}],
 'name': 'Banks'}
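The question also asked about CSV export, which the answer does not cover. One possible approach, sketched here under the assumption that the tree has the name/children shape printed above, is to flatten it into (parent, child) rows. The function names iter_parent_child and write_tree_csv, the column headers, and the hand-built sample are all illustrative and not part of the original answer.

```python
import csv
import io


def iter_parent_child(node, parent=""):
    """Yield (parent, child) name pairs from a nested name/children dict."""
    yield parent, node["name"]
    for child in node["children"]:
        yield from iter_parent_child(child, node["name"])


def write_tree_csv(tree, fileobj):
    """Write one CSV row per node: its parent's name, then its own name."""
    writer = csv.writer(fileobj)
    writer.writerow(["parent", "child"])
    writer.writerows(iter_parent_child(tree))


# Hand-built sample in the same shape as the printed output (not scraped data).
sample = {
    "name": "Banks",
    "children": [
        {
            "name": "Payments statistics",
            "children": [
                {"name": "Bank accounts", "children": []},
                {"name": "Payment card functions", "children": []},
            ],
        },
    ],
}

# Write to an in-memory buffer so the sketch runs without touching disk.
buffer = io.StringIO()
write_tree_csv(sample, buffer)
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
print(buffer.getvalue())
```

To write a real file, the same function could be given a handle opened with open("tree.csv", "w", newline="") (the filename is a placeholder) and the result of get_child_nodes(...) in place of sample. The root node gets an empty parent column; if names repeat across branches, a full path or an ID column would be needed to keep rows unambiguous.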