BeautifulSoup looping through urls


Problem description


I'm trying to harvest some chess games and got the basics done courtesy of some help here. The main function looks something like:

import requests
from bs4 import BeautifulSoup

def get_game_ids(userurl):
    r = requests.get(userurl)
    soup = BeautifulSoup(r.content, 'html.parser')
    gameids = []
    # game links look like /livechess/game?id=123456789
    for link in soup.select('a[href^="/livechess/game?id="]'):
        gameid = link['href'].split("?id=")[1]
        gameids.append(int(gameid))
    return gameids
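
For reference, a minimal call to the helper above (the get_game_ids wrapper is just a way to make the snippet self-contained; only the first archive page is fetched):

userurl = 'http://www.chess.com/home/game_archive?sortby=&show=live&member=Hikaru'
print(get_game_ids(userurl))  # at most 50 ids -- one page only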


Basically what happens is that I go to the URL for a specific user, such as http://www.chess.com/home/game_archive?sortby=&show=live&member=Hikaru, grab the HTML, and scrape the game ids. This works fine for one page. However, some users have played lots of games, and since only 50 games are displayed per page, their games are listed on multiple pages, e.g. http://www.chess.com/home/game_archive?sortby=&show=live&member=Hikaru&page=2 (or 3/4/5, etc.). That's where I'm stuck. How can I loop through the pages and get the ids?

Recommended answer


Follow the pagination by making an endless loop and following the "Next" link until it is not found.

In other words, start from the first archive page and keep following the "Next" link in the pagination block until a page no longer contains one.

Working code:

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

base_url = 'http://www.chess.com/'
game_ids = []

next_page = 'http://www.chess.com/home/game_archive?sortby=&show=live&member=Hikaru'
while True:
    soup = BeautifulSoup(requests.get(next_page).content, 'html.parser')

    # collect the game ids on the current page
    for link in soup.select('a[href^="/livechess/game?id="]'):
        gameid = link['href'].split("?id=")[1]
        game_ids.append(int(gameid))

    # follow the "Next" link in the pagination block, if there is one
    try:
        next_page = urljoin(base_url, soup.select('ul.pagination li.next-on a')[0].get('href'))
    except IndexError:
        break  # exit the loop if no "Next" link is found

print(game_ids)


For the URL you've provided (Hikaru GM), it would print a list of 224 game ids collected from all pages.
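
An alternative, since the archive also exposes the page number as a page query parameter (as noted in the question), is to increment page directly and stop at the first page that yields no game links. This is only a sketch, under the assumption that an out-of-range page simply returns no matching anchors; the "Next"-link approach above is the one verified against the site:

import requests
from bs4 import BeautifulSoup

member_url = 'http://www.chess.com/home/game_archive?sortby=&show=live&member=Hikaru'
game_ids = []

page = 1
while True:
    # requests merges the page parameter into the existing query string
    soup = BeautifulSoup(requests.get(member_url, params={'page': page}).content, 'html.parser')
    links = soup.select('a[href^="/livechess/game?id="]')
    if not links:
        break  # an empty page means we ran past the last one
    game_ids.extend(int(link['href'].split("?id=")[1]) for link in links)
    page += 1

print(game_ids)

Whichever loop you use, it is friendlier to the server to pause briefly between requests, e.g. time.sleep(1) after each fetch.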

