Scrape with BeautifulSoup from site that uses AJAX pagination using Python


Problem description

I'm fairly new to coding and Python, so I apologize if this is a silly question. I'd like a script that goes through all 19,000 search results pages and scrapes each page for all of the URLs. I've got all of the scraping working but can't figure out how to deal with the fact that the page uses AJAX to paginate. Usually I'd just make a loop over the URL to capture each search result, but that's not possible here. This is the page: http://www.heritage.org/research/all-research.aspx?nomobile&categories=report

Here's my script so far:

import io
import urllib2
from bs4 import BeautifulSoup

with io.open('heritageURLs.txt', 'a', encoding='utf8') as logfile:
    page = urllib2.urlopen("http://www.heritage.org/research/all-research.aspx?nomobile&categories=report")
    soup = BeautifulSoup(page)
    snippet = soup.find_all('a', attrs={'class': 'item-title'})
    for a in snippet:
        logfile.write("http://www.heritage.org" + a.get('href') + "\n")

print "Done collecting urls"

Obviously, it scrapes the first page of results and nothing more.

And I have looked at a few related questions but none seem to use Python or at least not in a way that I can understand. Thank you in advance for your help.

Answer

For the sake of completeness: you could try reproducing the POST request the pager sends and work out how to reach the next page that way, as I suggested in my comment, but if an alternative is acceptable, Selenium makes it quite easy to achieve what you want.
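If you do want to pursue the POST-request route, here is a rough sketch using the requests library. The hidden ASP.NET form fields and the __EVENTTARGET value below are assumptions about how such pagers usually work, not something confirmed against this particular site, so inspect the actual request in your browser's network tab and substitute the real names and values.

import requests
from bs4 import BeautifulSoup

url = 'http://www.heritage.org/research/all-research.aspx?nomobile&categories=report'
session = requests.Session()

# load the first page and collect any hidden ASP.NET state fields it carries
soup = BeautifulSoup(session.get(url).text)
payload = {}
for name in ('__VIEWSTATE', '__VIEWSTATEGENERATOR', '__EVENTVALIDATION'):
    field = soup.find('input', {'name': name})
    if field is not None:
        payload[name] = field.get('value', '')

# hypothetical control id for the "next page" postback -- replace with the
# value seen in the real request
payload['__EVENTTARGET'] = 'ctl00$SomePager$NextPage'

# post back for the next page of results and scrape it the same way as before
next_soup = BeautifulSoup(session.post(url, data=payload).text)
for a in next_soup.find_all('a', attrs={'class': 'item-title'}):
    print "http://www.heritage.org" + a.get('href')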

Here is a simple solution to your question using Selenium:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from time import sleep

# using the Firefox web browser
driver = webdriver.Firefox()

# or comment out Firefox above and uncomment the line below to use PhantomJS
#driver = webdriver.PhantomJS()

url = 'http://www.heritage.org/research/all-research.aspx?nomobile&categories=report'
driver.get(url)

# set initial page count
pages = 1
with open('heritageURLs.txt', 'w') as f:
    while True:
        try:
            # sleep here to allow time for page load
            sleep(5)
            # find all item-title links and write their hrefs to file
            links = driver.find_elements_by_class_name('item-title')
            print "Page: {} -- {} urls to write...".format(pages, len(links))
            for link in links:
                f.write(link.get_attribute('href') + '\n')
            # grab the Next button if it exists; find_elements returns an
            # empty list (rather than raising) when no Next button is present
            btn_next = driver.find_elements_by_class_name('next')
            # stop if no more Next button is found, i.e. the last page
            if not btn_next:
                print "crawling completed."
                break
            # otherwise click the Next button and repeat crawling the urls
            pages += 1
            btn_next[0].send_keys(Keys.RETURN)
        # ideally narrow this to the specific exceptions you expect,
        # e.g. selenium.common.exceptions.WebDriverException
        except Exception as e:
            print "Error found, crawling stopped: {}".format(e)
            break
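If the fixed sleep(5) turns out to be flaky, Selenium's explicit waits are a more reliable alternative. A minimal sketch, reusing the driver object from the script above:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for at least one result link to appear before scraping
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'item-title'))
)
links = driver.find_elements_by_class_name('item-title')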

Hope this helps.
