Deep parse with beautifulsoup


Problem description


I am trying to parse https://www.drugbank.ca/drugs. The idea is to extract all the drug names and some additional information for each drug. As you can see, each webpage shows a table of drug names, and when we click a drug name we can access that drug's information. Let's say I keep the following code to handle the pagination:

import requests
from bs4 import BeautifulSoup

def drug_data():
    url = 'https://www.drugbank.ca/drugs/'

    while url:
        print(url)
        r = requests.get(url)
        soup = BeautifulSoup(r.text, "lxml")

        #data = soup.select('name-head a')
        #for link in data:
        #    href = 'https://www.drugbank.ca/drugs/' + link.get('href')
        #    pages_data(href)

        # next page url
        url = soup.findAll('a', {'class': 'page-link', 'rel': 'next'})
        print(url)
        if url:
            url = 'https://www.drugbank.ca' + url[0].get('href')
        else:
            break

drug_data()
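As a side note, instead of concatenating `'https://www.drugbank.ca'` by hand, the relative `href` from the pagination link can be resolved against the current page URL with `urllib.parse.urljoin` from the standard library. A minimal sketch (the hard-coded prefix above works just as well; the `href` value here is an assumption for illustration):

```python
from urllib.parse import urljoin

base = 'https://www.drugbank.ca/drugs'
# a relative href as it might appear in the "next page" link
href = '/drugs?page=2'

# urljoin resolves the relative path against the base URL's host
next_url = urljoin(base, href)
print(next_url)  # https://www.drugbank.ca/drugs?page=2
```

This keeps the code correct even if the site ever returns absolute URLs in `href`, since `urljoin` leaves absolute URLs untouched.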


The issue is that on each page, and for each drug in that page's table, I need to capture: Name, Accession Number, Structured Indications, Generic Prescription Products.


I used the classical requests/BeautifulSoup approach but can't go deeper.

Any help would be appreciated.

Answer


Create a function with requests and BeautifulSoup to get the data from each subpage:

import requests
from bs4 import BeautifulSoup

def get_details(url):
    print('details:', url)

    # get subpage
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "lxml")

    # get data on subpage
    dts = soup.findAll('dt')
    dds = soup.findAll('dd')

    # display details
    for dt, dd in zip(dts, dds):
        print(dt.text)
        print(dd.text)
        print('---')

    print('---------------------------')

def drug_data():
    url = 'https://www.drugbank.ca/drugs/'

    while url:
        print(url)
        r = requests.get(url)
        soup = BeautifulSoup(r.text, "lxml")

        # get links to subpages
        links = soup.select('strong a')
        for link in links:
            # execute the function on the subpage
            get_details('https://www.drugbank.ca' + link['href'])

        # next page url
        url = soup.findAll('a', {'class': 'page-link', 'rel': 'next'})
        print(url)
        if url:
            url = 'https://www.drugbank.ca' + url[0].get('href')
        else:
            break

drug_data()
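Since only Name, Accession Number, Structured Indications and Generic Prescription Products are needed, the `dt`/`dd` pairs can be collected into a dict and filtered instead of printing everything. The helper below is a sketch that works on text already extracted from the `dt`/`dd` tags; the sample values are hypothetical, and the exact `dt` labels on the site may differ from the field names in the question:

```python
def filter_fields(dts, dds, wanted):
    """Pair up <dt> labels with <dd> values and keep only the wanted fields."""
    details = dict(zip(dts, dds))
    return {name: details[name] for name in wanted if name in details}

# example with text already extracted via dt.text / dd.text (values are made up)
dts = ['Name', 'Accession Number', 'Structured Indications', 'Groups']
dds = ['Lepirudin', 'DB00001', 'Indicated for ...', 'approved']
wanted = ['Name', 'Accession Number', 'Structured Indications',
          'Generic Prescription Products']

print(filter_fields(dts, dds, wanted))
# {'Name': 'Lepirudin', 'Accession Number': 'DB00001', 'Structured Indications': 'Indicated for ...'}
```

Fields absent from a given subpage (here `Generic Prescription Products`) are simply skipped, so the function is safe to call on drugs with incomplete entries.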

