Deep parse with beautifulsoup

Question
I am trying to parse https://www.drugbank.ca/drugs. The idea is to extract all the drug names and some additional information for each drug. As you can see, each webpage presents a table of drug names, and clicking a drug name opens that drug's page. Let's say I will keep the following code to handle the pagination:
import requests
from bs4 import BeautifulSoup

def drug_data():
    url = 'https://www.drugbank.ca/drugs/'

    while url:
        print(url)
        r = requests.get(url)
        soup = BeautifulSoup(r.text, "lxml")

        #data = soup.select('name-head a')
        #for link in data:
        #    href = 'https://www.drugbank.ca/drugs/' + link.get('href')
        #    pages_data(href)

        # next page url
        url = soup.findAll('a', {'class': 'page-link', 'rel': 'next'})
        print(url)
        if url:
            url = 'https://www.drugbank.ca' + url[0].get('href')
        else:
            break

drug_data()
The issue is that on each page, and for each drug in that page's table, I need to capture: Name, Accession Number, Structured Indications, Generic Prescription Products.
I used the classical requests/BeautifulSoup approach but can't go deeper into the per-drug subpages. Some help, please.
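The fields listed above appear on each drug's subpage as definition-list pairs, so zipping the `<dt>` labels with the `<dd>` values yields one record per drug. A minimal sketch of that step, run against a hardcoded fragment since the real DrugBank markup may differ:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating a drug subpage's definition list;
# the real DrugBank markup may differ.
html = """
<dl>
  <dt>Name</dt><dd>Lepirudin</dd>
  <dt>Accession Number</dt><dd>DB00001</dd>
</dl>
"""

soup = BeautifulSoup(html, "html.parser")

# zip <dt> labels with <dd> values to build a label -> value record
record = {dt.get_text(strip=True): dd.get_text(strip=True)
          for dt, dd in zip(soup.find_all('dt'), soup.find_all('dd'))}
print(record)
```

Using `html.parser` here just avoids the `lxml` dependency for the standalone snippet; the parser choice does not change the result.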
Answer
Create a function with requests and BeautifulSoup to get the data from each subpage:
import requests
from bs4 import BeautifulSoup

def get_details(url):
    print('details:', url)

    # get subpage
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "lxml")

    # get data on subpage
    dts = soup.findAll('dt')
    dds = soup.findAll('dd')

    # display details
    for dt, dd in zip(dts, dds):
        print(dt.text)
        print(dd.text)
        print('---')

    print('---------------------------')

def drug_data():
    url = 'https://www.drugbank.ca/drugs/'

    while url:
        print(url)
        r = requests.get(url)
        soup = BeautifulSoup(r.text, "lxml")

        # get links to subpages
        links = soup.select('strong a')
        for link in links:
            # execute function to get subpage
            get_details('https://www.drugbank.ca' + link['href'])

        # next page url
        url = soup.findAll('a', {'class': 'page-link', 'rel': 'next'})
        print(url)
        if url:
            url = 'https://www.drugbank.ca' + url[0].get('href')
        else:
            break

drug_data()
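If the goal is a dataset rather than console output, `get_details` could return its label/value pairs as a dict and `drug_data` could write the collected rows to CSV. A stdlib-only sketch of the writing step, with hypothetical rows and assumed column names (an in-memory buffer stands in for an open file):

```python
import csv
import io

# Hypothetical records, shaped like the dt/dd pairs the answer prints;
# the column names are assumptions, not taken from the site.
rows = [
    {'Name': 'Lepirudin', 'Accession Number': 'DB00001'},
    {'Name': 'Cetuximab', 'Accession Number': 'DB00002'},
]

buf = io.StringIO()  # stand-in for open('drugs.csv', 'w', newline='')
writer = csv.DictWriter(buf, fieldnames=['Name', 'Accession Number'])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```

When scraping many subpages like this, it is also worth adding a short `time.sleep` between requests so the site is not hammered.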