Deep parse with beautifulsoup

Question
I am trying to parse https://www.drugbank.ca/drugs. The idea is to extract all the drug names and some additional information for each drug. As you can see, each webpage presents a table of drug names, and clicking a drug name opens that drug's page. Let's say I will keep the following code to handle the pagination:
import requests
from bs4 import BeautifulSoup

def drug_data():
    url = 'https://www.drugbank.ca/drugs/'

    while url:
        print(url)
        r = requests.get(url)
        soup = BeautifulSoup(r.text, "lxml")

        #data = soup.select('name-head a')
        #for link in data:
        #    href = 'https://www.drugbank.ca/drugs/' + link.get('href')
        #    pages_data(href)

        # next page url
        url = soup.findAll('a', {'class': 'page-link', 'rel': 'next'})
        print(url)
        if url:
            url = 'https://www.drugbank.ca' + url[0].get('href')
        else:
            break

drug_data()
The issue is that on each page, and for each drug in that page's table, I need to capture: Name, Accession Number, Structured Indications, Generic Prescription Products.
I used the classical requests/BeautifulSoup approach but can't go deeper into the per-drug subpages. Some help, please.
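The fields listed above appear on each drug's subpage as definition-list pairs, so zipping the `<dt>` labels with the `<dd>` values yields one record per drug. A minimal sketch of that step, run against a hardcoded fragment since the real DrugBank markup may differ:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating a drug subpage's definition list;
# the real DrugBank markup may differ.
html = """
<dl>
  <dt>Name</dt><dd>Lepirudin</dd>
  <dt>Accession Number</dt><dd>DB00001</dd>
</dl>
"""

soup = BeautifulSoup(html, "html.parser")

# zip <dt> labels with <dd> values to build a label -> value record
record = {dt.get_text(strip=True): dd.get_text(strip=True)
          for dt, dd in zip(soup.find_all('dt'), soup.find_all('dd'))}
print(record)
```

Using `html.parser` here just avoids the `lxml` dependency for the standalone snippet; the parser choice does not change the result.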
Answer
Create a function with requests and BeautifulSoup to get the data from each subpage:
import requests
from bs4 import BeautifulSoup

def get_details(url):
    print('details:', url)

    # get subpage
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "lxml")

    # get data on subpage
    dts = soup.findAll('dt')
    dds = soup.findAll('dd')

    # display details
    for dt, dd in zip(dts, dds):
        print(dt.text)
        print(dd.text)
        print('---')

    print('---------------------------')

def drug_data():
    url = 'https://www.drugbank.ca/drugs/'

    while url:
        print(url)
        r = requests.get(url)
        soup = BeautifulSoup(r.text, "lxml")

        # get links to subpages
        links = soup.select('strong a')
        for link in links:
            # execute function to get subpage
            get_details('https://www.drugbank.ca' + link['href'])

        # next page url
        url = soup.findAll('a', {'class': 'page-link', 'rel': 'next'})
        print(url)
        if url:
            url = 'https://www.drugbank.ca' + url[0].get('href')
        else:
            break

drug_data()
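If the goal is a dataset rather than console output, `get_details` could return its label/value pairs as a dict and `drug_data` could write the collected rows to CSV. A stdlib-only sketch of the writing step, with hypothetical rows and assumed column names (an in-memory buffer stands in for an open file):

```python
import csv
import io

# Hypothetical records, shaped like the dt/dd pairs the answer prints;
# the column names are assumptions, not taken from the site.
rows = [
    {'Name': 'Lepirudin', 'Accession Number': 'DB00001'},
    {'Name': 'Cetuximab', 'Accession Number': 'DB00002'},
]

buf = io.StringIO()  # stand-in for open('drugs.csv', 'w', newline='')
writer = csv.DictWriter(buf, fieldnames=['Name', 'Accession Number'])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```

When scraping many subpages like this, it is also worth adding a short `time.sleep` between requests so the site is not hammered.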