Pagination with BeautifulSoup


Question

I am trying to get some data from the following website: https://www.drugbank.ca/drugs

For every drug in the table, I need to go into its page and get the name and some other specific features, such as categories and structured indication (please click on a drug name to see the features I will use).

I wrote the following code, but the problem is that I can't make it handle pagination (as you can see, there are more than 2000 pages!):

import requests
from bs4 import BeautifulSoup


def drug_data():
    url = 'https://www.drugbank.ca/drugs/'
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "lxml")
    for link in soup.select('name-head a'):
        href = 'https://www.drugbank.ca/drugs/' + link.get('href')
        pages_data(href)


def pages_data(item_url):
    r = requests.get(item_url)
    soup = BeautifulSoup(r.text, "lxml")
    g_data = soup.select('div.content-container')

    for item in g_data:
        print item.contents[1].text
        print item.contents[3].findAll('td')[1].text
        try:
            print item.contents[5].findAll('td', {'class': 'col-md-2 col-sm-4'})[0].text
        except:
            pass
        print item_url

drug_data()

How can I scrape all of the data and handle pagination properly?

Answer

This page uses almost the same URL for every page, so you can use a for loop to generate them:

def drug_data(page_number):
    url = 'https://www.drugbank.ca/drugs/?page=' + str(page_number)
    ... rest ...

# --- later ---

for x in range(1, 2001):
    drug_data(x)
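
For completeness, here is a minimal sketch of what the ... rest ... part could look like, reusing the name-head a selector from the question (that selector, and the way the href is joined, are assumptions taken from the question's code, not verified against DrugBank's markup):

import requests
from bs4 import BeautifulSoup

def drug_data(page_number):
    # Each listing page uses the same URL; only the ?page= value changes.
    url = 'https://www.drugbank.ca/drugs/?page=' + str(page_number)
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "lxml")
    # 'name-head a' is the selector from the question; adjust it if the
    # listing table uses different markup.
    for link in soup.select('name-head a'):
        href = 'https://www.drugbank.ca/drugs/' + link.get('href')
        print(href)  # here you would call your per-drug scraper, e.g. pages_data(href)

for x in range(1, 2001):
    drug_data(x)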

Or use while and try/except in case there are more than 2000 pages:

# --- later ---
page = 0

while True:
    try:
        page += 1
        drug_data(page)
    except Exception as ex:
        print(ex)
        print("probably last page:", page)
        break # exit `while` loop
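
One caveat: requests.get() does not raise an exception for an HTTP error status on its own, so if the site answers a too-large page number with a 404 (or just an empty page), the try/except above may never fire. Calling raise_for_status() on the response makes HTTP errors explicit; a minimal sketch, assuming the drug_data(page_number) signature from above:

import requests

def drug_data(page_number):
    url = 'https://www.drugbank.ca/drugs/?page=' + str(page_number)
    r = requests.get(url)
    r.raise_for_status()  # turn 4xx/5xx responses into exceptions
    # ... parse the page as before ...

page = 0

while True:
    try:
        page += 1
        drug_data(page)
    except Exception as ex:
        print(ex)
        print("probably last page:", page)
        break  # exit `while` loop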

You can also find the URL of the next page in the HTML:

<a rel="next" class="page-link" href="/drugs?approved=1&amp;c=name&amp;d=up&amp;page=2">›</a>

so you can use BeautifulSoup to get this link and use it.

The code below displays the current URL, finds the link to the next page (using class="page-link" and rel="next"), and loads it:

import requests
from bs4 import BeautifulSoup

def drug_data():
    url = 'https://www.drugbank.ca/drugs/'

    while url:
        print(url)
        r = requests.get(url)
        soup = BeautifulSoup(r.text, "lxml")

        #data = soup.select('name-head a')
        #for link in data:
        #    href = 'https://www.drugbank.ca/drugs/' + link.get('href')
        #    pages_data(href)

        # next page url
        url = soup.findAll('a', {'class': 'page-link', 'rel': 'next'})
        print(url)
        if url:
            url = 'https://www.drugbank.ca' + url[0].get('href')
        else:
            break

drug_data()
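
Putting both halves together, a sketch of the full crawl could uncomment the per-drug part. The name-head a selector and the detail-page parsing below are taken from the question (ported to Python 3 print()) and are assumptions about DrugBank's markup:

import requests
from bs4 import BeautifulSoup

def pages_data(item_url):
    # Per-drug scraper based on the question's code.
    r = requests.get(item_url)
    soup = BeautifulSoup(r.text, "lxml")
    for item in soup.select('div.content-container'):
        print(item.contents[1].text)
        print(item.contents[3].findAll('td')[1].text)
        print(item_url)

def drug_data():
    url = 'https://www.drugbank.ca/drugs/'

    while url:
        print(url)
        r = requests.get(url)
        soup = BeautifulSoup(r.text, "lxml")

        # scrape every drug linked from the current listing page
        for link in soup.select('name-head a'):
            pages_data('https://www.drugbank.ca/drugs/' + link.get('href'))

        # follow the rel="next" link, or stop on the last page
        next_link = soup.findAll('a', {'class': 'page-link', 'rel': 'next'})
        if next_link:
            url = 'https://www.drugbank.ca' + next_link[0].get('href')
        else:
            url = None

drug_data()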


BTW: never use except: pass, because you can get an error you didn't expect and you will not know why your code doesn't work. Better to display the error:

except Exception as ex:
    print('Error:', ex)
