How to scrape multiple pages from one site


Question

I want to scrape multiple pages from one site. The URLs follow this pattern:

https://www.example.com/S1-3-1.html
https://www.example.com/S1-3-2.html
https://www.example.com/S1-3-3.html
https://www.example.com/S1-3-4.html
https://www.example.com/S1-3-5.html

I tried three methods to scrape all of these pages in one run, but each method scrapes only the first page. My code is below. I would greatly appreciate it if anyone could check it and tell me what the problem is.

    # =============== method 1 ===============
    import requests  
    for i in range(5):      # Number of pages plus one 
        url = "https://www.example.com/S1-3-{}.html".format(i)
        r = requests.get(url)
    from bs4 import BeautifulSoup  
    soup = BeautifulSoup(r.text, 'html.parser')  
    results = soup.find_all('div', attrs={'class':'product-item item-template-0 alternative'})
    # =============== method 2 ===============
    import urllib2,sys
    from bs4 import BeautifulSoup
    for numb in ('1', '5'):
        address = ('https://www.example.com/S1-3-' + numb + '.html')
    html = urllib2.urlopen(address).read()
    soup = BeautifulSoup(html,'html.parser')
    results = soup.find_all('div', attrs={'class':'product-item item-template-0 alternative'})
    # =============== method 3 ===============
    import requests 
    from bs4 import BeautifulSoup  
    url = 'https://www.example.com/S1-3-1.html'
    for round in range(5):
        res = requests.get(url)
        soup = BeautifulSoup(res.text,'html.parser')
        results = soup.find_all('div', attrs={'class':'product-item item-template-0 alternative'})
        paging = soup.select('div.paging a')
        next_url = 'https://www.example.com/'+paging[-1]['href'] # paging[-1]['href'] is next page button on the page 
        url = next_url

I looked through some existing answers and rechecked my code, but it does not seem to be a loop problem. The screenshots show only first-page results. This has been annoying me for several days.

[Screenshot: only first page results]
[Screenshot: results picture 2]

Recommended answer

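Your indentation is wrong. In method 1, the BeautifulSoup parsing lines sit outside the for loop, so they run only once, on the last response fetched, instead of once per page. Note also that range(5) yields 0 through 4, while the pages are numbered 1 through 5.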

Try this (method 1):

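    from bs4 import BeautifulSoup
    import requests

    for i in range(1, 6):      # pages 1 through 5; the stop value is the number of pages plus one
        url = "https://www.example.com/S1-3-{}.html".format(i)
        r = requests.get(url)
        # Parse inside the loop so that every page is processed, not just the last one fetched
        soup = BeautifulSoup(r.text, 'html.parser')
        results = soup.find_all('div', attrs={'class':'product-item item-template-0 alternative'})
        # Note: results is reassigned on each pass, so use or store it here before the next iteration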
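
Because `results` is reassigned on every pass of the loop, only the last page's matches remain once the loop ends. One possible way to keep the matches from every page is to collect them in a list, as in the sketch below. The names `scrape_pages`, `fetch_html`, `parse_items`, `URL_TEMPLATE` and `ITEM_CLASS` are hypothetical helpers introduced for illustration and are not part of the original question or answer. The `timeout` value and the `raise_for_status()` call are also assumptions about how you might want to handle slow or failing requests.

```python
URL_TEMPLATE = "https://www.example.com/S1-3-{}.html"
ITEM_CLASS = "product-item item-template-0 alternative"


def fetch_html(url):
    """Download one page and return its HTML text."""
    # Imported here so the loop logic below can be tried without the network.
    import requests

    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    return response.text


def parse_items(html):
    """Return the product <div> elements found in one page of HTML."""
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    return soup.find_all("div", attrs={"class": ITEM_CLASS})


def scrape_pages(first_page, last_page, fetch=fetch_html, parse=parse_items):
    """Fetch and parse pages first_page..last_page (inclusive), keeping every match."""
    all_results = []
    for page in range(first_page, last_page + 1):
        url = URL_TEMPLATE.format(page)
        # Fetching and parsing both happen inside the loop, once per page.
        all_results.extend(parse(fetch(url)))
    return all_results
```

Under these assumptions, `items = scrape_pages(1, 5)` would gather the matching elements from all five pages into one list. Passing `fetch` and `parse` as parameters is only a convenience for trying the loop with stand-in functions. The core of the fix is still that the request and the parsing sit inside the same loop body.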
