Pagination with BeautifulSoup
Question
I am trying to get some data from the following website: https://www.drugbank.ca/drugs
For every drug in the table, I need to go into its page and get the name and some other specific features, like categories and structured indication (please click on a drug name to see the features I will use).
I wrote the following code, but the issue is that I can't make my code handle pagination (as you can see, there are more than 2,000 pages!).
import requests
from bs4 import BeautifulSoup

def drug_data():
    url = 'https://www.drugbank.ca/drugs/'
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "lxml")

    for link in soup.select('name-head a'):
        href = 'https://www.drugbank.ca/drugs/' + link.get('href')
        pages_data(href)

def pages_data(item_url):
    r = requests.get(item_url)
    soup = BeautifulSoup(r.text, "lxml")

    g_data = soup.select('div.content-container')
    for item in g_data:
        print(item.contents[1].text)
        print(item.contents[3].findAll('td')[1].text)
        try:
            print(item.contents[5].findAll('td', {'class': 'col-md-2 col-sm-4'})[0].text)
        except:
            pass
        print(item_url)

drug_data()
How can I scrape all of the data and handle pagination properly?
Answer
This page uses almost the same URL for every page, so you can use a for loop to generate them:
def drug_data(page_number):
    url = 'https://www.drugbank.ca/drugs/?page=' + str(page_number)
    # ... rest ...

# --- later ---

for x in range(1, 2001):
    drug_data(x)
Or use while and try/except to get more than 2,000 pages:
# --- later ---

page = 0
while True:
    try:
        page += 1
        drug_data(page)
    except Exception as ex:
        print(ex)
        print("probably last page:", page)
        break  # exit `while` loop
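Note that this while/try approach only stops if drug_data actually raises an exception past the last page. requests.get does not raise on a 404 response by itself, so a sketch of drug_data (hypothetical, the function body is an assumption) would need raise_for_status(); if the site instead returns an empty page with status 200, you would have to check for an empty result set rather than rely on an exception:

```python
import requests

def drug_data(page_number):
    # hypothetical sketch: make drug_data raise past the last page,
    # so the surrounding while/try loop can detect it and stop
    url = 'https://www.drugbank.ca/drugs/?page=' + str(page_number)
    r = requests.get(url)
    r.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    # ... parse r.text here ...
```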
You can also find the URL of the next page in the HTML:
<a rel="next" class="page-link" href="/drugs?approved=1&c=name&d=up&page=2">›</a>
so you can use BeautifulSoup to get this link and use it.
The code below displays the current URL, finds the link to the next page (using class="page-link" rel="next") and loads it:
import requests
from bs4 import BeautifulSoup

def drug_data():
    url = 'https://www.drugbank.ca/drugs/'

    while url:
        print(url)
        r = requests.get(url)
        soup = BeautifulSoup(r.text, "lxml")

        #data = soup.select('name-head a')
        #for link in data:
        #    href = 'https://www.drugbank.ca/drugs/' + link.get('href')
        #    pages_data(href)

        # next page url
        url = soup.findAll('a', {'class': 'page-link', 'rel': 'next'})
        print(url)
        if url:
            url = 'https://www.drugbank.ca' + url[0].get('href')
        else:
            break

drug_data()
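An alternative sketch of just the link-extraction step, parsed offline from the HTML snippet shown above: select_one with a CSS selector instead of findAll, and urljoin instead of hand-concatenating the domain (the variable names are illustrative):

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

html = '<a rel="next" class="page-link" href="/drugs?approved=1&c=name&d=up&page=2">›</a>'
soup = BeautifulSoup(html, "lxml")

# select_one returns the first element matching the CSS selector, or None
link = soup.select_one('a.page-link[rel="next"]')
if link:
    # urljoin resolves the relative href against the current page's URL
    next_url = urljoin('https://www.drugbank.ca/drugs', link.get('href'))
    print(next_url)  # https://www.drugbank.ca/drugs?approved=1&c=name&d=up&page=2
```

Returning None when no next link exists gives the loop a natural stopping condition, the same role the empty findAll result plays in the code above.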
BTW: never use except: pass, because you can get an error you didn't expect, and you will not know why your code doesn't work. It is better to display the error:
except Exception as ex:
print('Error:', ex)
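To see the difference, a small offline example (the list row here just stands in for item.contents from the question's code):

```python
row = ['a', 'b']

# with a bare except, the out-of-range index is silently swallowed
try:
    print(row[5])
except:
    pass  # no output, no clue what went wrong

# with a named exception and a message, the real cause is visible
try:
    print(row[5])
except Exception as ex:
    print('Error:', ex)  # prints: Error: list index out of range
```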