Scrape with BeautifulSoup from site that uses AJAX pagination using Python


Problem description

I'm fairly new to coding and Python, so I apologize if this is a silly question. I'd like a script that goes through all 19,000 search result pages and scrapes each page for all of the urls. I've got all of the scraping working but can't figure out how to deal with the fact that the page uses AJAX to paginate. Usually I'd just make a loop with the url to capture each search result, but that's not possible here. Here's the page: http://www.heritage.org/research/all-research.aspx?nomobile&categories=report

This is the script I have so far:

import io
import urllib2
from bs4 import BeautifulSoup

with io.open('heritageURLs.txt', 'a', encoding='utf8') as logfile:
    page = urllib2.urlopen("http://www.heritage.org/research/all-research.aspx?nomobile&categories=report")
    soup = BeautifulSoup(page)
    # grab every result link (class "item-title") on the first page
    snippet = soup.find_all('a', attrs={'class': 'item-title'})
    for a in snippet:
        logfile.write("http://www.heritage.org" + a.get('href') + "\n")

print "Done collecting urls"

Obviously, it scrapes the first page of results and nothing more.

And I have looked at a few related questions but none seem to use Python or at least not in a way that I can understand. Thank you in advance for your help.

Recommended answer

For the sake of completeness: while you could try inspecting the POST request and finding a way to reach the next page directly, as I suggested in my comment, if an alternative is acceptable, Selenium makes it quite easy to achieve what you want.
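
If you do want to explore the POST-request route, here is a rough, unverified sketch using the requests library. Since this is an ASP.NET page, pagination most likely happens through a __doPostBack call, so the idea is to resubmit the page's hidden form fields with the paging event target. The hidden field names below are the standard ASP.NET ones, but the __EVENTTARGET value is a placeholder: the real control id for the "next page" link has to be read from the page source or the browser's network tab.

import requests
from bs4 import BeautifulSoup

url = 'http://www.heritage.org/research/all-research.aspx?nomobile&categories=report'
session = requests.Session()
soup = BeautifulSoup(session.get(url).text)

# collect the page's hidden form fields (__VIEWSTATE, __EVENTVALIDATION, ...)
data = {i['name']: i.get('value', '')
        for i in soup.find_all('input', type='hidden') if i.has_attr('name')}

# hypothetical event target -- replace with the actual paging control id
data['__EVENTTARGET'] = 'PLACEHOLDER_NEXT_PAGE_CONTROL'
data['__EVENTARGUMENT'] = ''

# resubmit the form to get the next page of results
next_page = session.post(url, data=data)
soup = BeautifulSoup(next_page.text)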

Here is a simple solution using Selenium for your question:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from time import sleep

# use the Firefox web browser
driver = webdriver.Firefox()

# or uncomment this (and comment out the line above) to use PhantomJS
#driver = webdriver.PhantomJS()

url = 'http://www.heritage.org/research/all-research.aspx?nomobile&categories=report'
driver.get(url)

# set initial page count
pages = 1
with open('heritageURLs.txt', 'w') as f:
    while True:
        try:
            # sleep here to allow time for the page to load
            sleep(5)
            # grab the Next button if it exists (find_elements returns an
            # empty list instead of raising when there is no match)
            next_btns = driver.find_elements_by_class_name('next')
            # find all item-title links and write their hrefs to the file
            links = driver.find_elements_by_class_name('item-title')
            print "Page: {} -- {} urls to write...".format(pages, len(links))
            for link in links:
                f.write(link.get_attribute('href') + '\n')
            # stop if no Next button is found, i.e. the last page
            if not next_btns:
                print "crawling completed."
                break
            # otherwise click the Next button and repeat crawling the urls
            pages += 1
            next_btns[0].send_keys(Keys.RETURN)
        # you should specify the exception here
        except:
            print "Error found, crawling stopped"
            break

driver.quit()
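
As a side note, if the fixed sleep(5) turns out to be unreliable, an explicit wait is a common alternative. A minimal sketch, reusing the item-title class from the code above:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for the result links to appear instead of sleeping
WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CLASS_NAME, 'item-title')))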

Hope this helps.
