How to scrape multiple pages with an unchanging URL - Python 3


Question


I recently got into web scraping and have tried scraping various pages. For now, I am trying to scrape the following site - http://www.pizzahut.com.cn/StoreList

So far I've used selenium to scrape the longitude and latitude. However, my code right now only extracts the first page. I know dynamic web scraping can execute JavaScript and load different pages, but I had a hard time finding the right solution. I was wondering if there's a way to access the other 49 pages or so, because when I click next page the URL does not change, so I cannot just iterate over a different URL each time.

Following is my code so far:

import requests
from bs4 import BeautifulSoup

page = requests.get('http://www.pizzahut.com.cn/StoreList')

soup = BeautifulSoup(page.text, 'html.parser')

for row in soup.find_all('div',class_='re_RNew'):
    name = row.find('p',class_='re_NameNew').string
    info = row.find('input').get('value')
    location = info.split('|')
    location_data = location[0].split(',')
    longitude = location_data[0]
    latitude = location_data[1]
    print(longitude, latitude)

Thank you so much for helping out. Much appreciated.

Solution

Steps to get the data:

Open the developer tools in your browser (for Google Chrome it's Ctrl+Shift+I). Then go to the XHR tab, which is located inside the Network tab.

After doing that, click on the next page button. You'll see a new request appear: a POST to StoreList/Index.

Click on that request. In the General block, you'll see the two things we need: the Request URL (http://www.pizzahut.com.cn/StoreList/Index) and the Request Method (POST).

Scrolling down, in the Form Data tab, you can see the three variables: pageIndex, pageSize, and keyword.

Here, you can see that changing the value of pageIndex will give all the pages required.

Now that we've got all the required data, we can write a POST request for the URL http://www.pizzahut.com.cn/StoreList/Index using the above data.

Code:

I'll show you the code to scrape the first 2 pages; you can scrape any number of pages by changing the range().

for page_no in range(1, 3):
    # Form data observed in the XHR request; 'keyword' is the search box's
    # placeholder text ("enter a restaurant address or restaurant name").
    data = {
        'pageIndex': page_no,
        'pageSize': 10,
        'keyword': '输入餐厅地址或餐厅名称'
    }
    page = requests.post('http://www.pizzahut.com.cn/StoreList/Index', data=data)
    soup = BeautifulSoup(page.text, 'html.parser')

    print('PAGE', page_no)
    for row in soup.find_all('div', class_='re_RNew'):
        name = row.find('p', class_='re_NameNew').string
        # The hidden input's value holds 'longitude,latitude|...'
        info = row.find('input').get('value')
        location = info.split('|')
        location_data = location[0].split(',')
        longitude = location_data[0]
        latitude = location_data[1]
        print(longitude, latitude)

Output:

PAGE 1
31.085877 121.399176
31.271117 121.587577
31.098122 121.413396
31.331458 121.440183
31.094581 121.503654
31.270737000 121.481178000
31.138214 121.386943
30.915685 121.482079
31.279029 121.529255
31.168283 121.283322
PAGE 2
31.388674 121.35918
31.231706 121.472644
31.094857 121.219961
31.228564 121.516609
31.235717 121.478692
31.288498 121.521882
31.155139 121.428885
31.235249 121.474639
30.728829 121.341429
31.260372 121.343066
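The coordinate-parsing step inside the loop (split on `|`, then on `,`) can be factored into a small helper. A minimal sketch; the sample value string below is hypothetical, modeled only on the format the code above splits on:

```python
def parse_location(value):
    """Split a store's hidden-input value ('longitude,latitude|rest...')
    into a (longitude, latitude) pair of floats."""
    coords = value.split('|')[0]  # keep only the part before the first '|'
    longitude, latitude = coords.split(',')
    return float(longitude), float(latitude)

# Hypothetical sample value, shaped like the strings the scraper splits on.
print(parse_location('31.085877,121.399176|store-info'))  # → (31.085877, 121.399176)
```

Converting to float (unlike the string-based answer code) also lets you validate the coordinates or feed them straight into distance calculations.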


Note: You can change the results per page by changing the value of pageSize (currently it's 10).
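Rather than hard-coding the page count in range(), you can also keep requesting pages until one comes back empty. A minimal sketch, with the HTTP-and-parse step abstracted into a fetch_rows callable (a hypothetical helper, so the loop can be shown without a live server):

```python
def scrape_all(fetch_rows, max_pages=100):
    """Collect rows page by page, stopping at the first empty page.

    fetch_rows(page_no) should return the list of parsed rows for that page,
    e.g. a wrapper around the requests.post + BeautifulSoup code above.
    """
    results = []
    for page_no in range(1, max_pages + 1):
        rows = fetch_rows(page_no)
        if not rows:  # an empty page means we've run out of results
            break
        results.extend(rows)
    return results

# Demo with a fake fetcher: three pages of data, then an empty page.
fake_pages = {1: ['a', 'b'], 2: ['c'], 3: ['d']}
print(scrape_all(lambda n: fake_pages.get(n, [])))  # → ['a', 'b', 'c', 'd']
```

The max_pages cap guards against looping forever if the server keeps echoing the last page instead of returning an empty one.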
