BeautifulSoup looping through urls
Question
I'm trying to harvest some chess games and got the basics done courtesy of some help here. The main function looks something like:
import requests
from bs4 import BeautifulSoup

def get_game_ids(userurl):
    r = requests.get(userurl)
    soup = BeautifulSoup(r.content, 'html.parser')
    gameids = []
    # game links look like /livechess/game?id=123456
    for link in soup.select('a[href^="/livechess/game?id="]'):
        gameid = link['href'].split("?id=")[1]
        gameids.append(int(gameid))
    return gameids
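As a side note, the id-extraction step can be checked offline against a static snippet of the archive markup. The sketch below assumes the same link format as above; `extract_game_ids` and the `sample` HTML are hypothetical, and the attribute value in the selector is quoted, which newer BeautifulSoup/soupsieve releases require:

```python
from bs4 import BeautifulSoup

def extract_game_ids(html):
    """Pull the numeric game ids out of archive-page HTML (hypothetical helper)."""
    soup = BeautifulSoup(html, 'html.parser')
    return [int(a['href'].split('?id=')[1])
            for a in soup.select('a[href^="/livechess/game?id="]')]

# A minimal stand-in for one archive page containing two game links.
sample = '''
<table>
  <tr><td><a href="/livechess/game?id=123456">vs SomeUser</a></td></tr>
  <tr><td><a href="/livechess/game?id=789012">vs OtherUser</a></td></tr>
</table>
'''
print(extract_game_ids(sample))  # → [123456, 789012]
```

Keeping the parsing separate from the HTTP request makes it easy to verify the selector without hitting the site.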
Basically what happens is that I go to the url for a specific user, such as http://www.chess.com/home/game_archive?sortby=&show=live&member=Hikaru, grab the html and scrape the gameids. This works fine for one page.

However, some users have played lots of games, and since only 50 games are displayed per page, their games are listed on multiple pages, e.g. http://www.chess.com/home/game_archive?sortby=&show=live&member=Hikaru&page=2 (or 3/4/5 etc).

That's where I'm stuck. How can I loop through the pages and get the ids?
Answer
Follow the pagination by making an endless loop and following the "Next" link until it is not found.
Working code:
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

base_url = 'http://www.chess.com/'
game_ids = []

next_page = 'http://www.chess.com/home/game_archive?sortby=&show=live&member=Hikaru'
while True:
    soup = BeautifulSoup(requests.get(next_page).content, 'html.parser')

    # collect the game ids
    for link in soup.select('a[href^="/livechess/game?id="]'):
        gameid = link['href'].split("?id=")[1]
        game_ids.append(int(gameid))

    # the "Next" link is relative, so resolve it against the base url
    try:
        next_page = urljoin(base_url, soup.select('ul.pagination li.next-on a')[0].get('href'))
    except IndexError:
        break  # exit the loop if the "Next" link is not found

print(game_ids)
For the URL you've provided (Hikaru, a GM), it would print a list of 224 game ids gathered from all pages.
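Alternatively, since the question notes that the archive accepts an explicit &page=N parameter, you can iterate page numbers directly and stop as soon as a page yields no game links. This is a sketch, not part of the original answer; `page_url` and `get_game_ids` are hypothetical names, and it assumes the site simply returns a page with no game links once N runs past the last page:

```python
import requests
from bs4 import BeautifulSoup

ARCHIVE_URL = ('http://www.chess.com/home/game_archive'
               '?sortby=&show=live&member={member}&page={page}')

def page_url(member, page):
    # Build the archive url for one result page (page numbering starts at 1).
    return ARCHIVE_URL.format(member=member, page=page)

def get_game_ids(member, session=None):
    """Walk &page=1, 2, ... until a page yields no game links."""
    session = session or requests.Session()
    game_ids = []
    page = 1
    while True:
        soup = BeautifulSoup(session.get(page_url(member, page)).content,
                             'html.parser')
        links = soup.select('a[href^="/livechess/game?id="]')
        if not links:  # ran past the last page
            break
        game_ids.extend(int(a['href'].split('?id=')[1]) for a in links)
        page += 1
    return game_ids
```

Following the "Next" link, as in the answer above, is more robust to layout changes in the page count; the explicit-page variant avoids depending on the `ul.pagination li.next-on` markup.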