Selecting a store location when webscraping

Problem description

I am scraping a grocery website (https://www.paknsaveonline.co.nz) to do some meal planning before I shop. The price of products varies with the location of the store. I want to extract prices from my local store (Albany).

I am new to web scraping, but I am assuming my code must:

  1. change the default store to my local store (Albany, using this url: https://www.paknsaveonline.co.nz/CommonApi/Store/ChangeStore?storeId=65defcf2-bc15-490e-a84f-1f13b769cd22)
  2. maintain a single requests "session", to ensure I scrape all of my products from the same store site.

My scraping code successfully scrapes the price of broccoli, but the price does not align with the price from my local store. At the time of posting my scraped price for broccoli is $1.99, but when I manually check the price at the Albany store, the price is $0.99. I assume my code to switch to the correct store isn't working as intended.

Can anyone point out what I am doing wrong and suggest a solution?

Environment details:

  • requests==2.23.0
  • beautifulsoup4==4.6.3
  • Python 3.7.10

Code below, with an associated link to Google Colab file.

import requests
from bs4 import BeautifulSoup as bs
import re

dollars_pattern = '>([0-9][0-9]?)'
cents_pattern = '>([0-9][0-9])'
url = 'https://www.paknsaveonline.co.nz/CommonApi/Store/ChangeStore?storeId=65defcf2-bc15-490e-a84f-1f13b769cd22'
header = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36", "X-Requested-With": "XMLHttpRequest"}

with requests.session() as s:
  #I assume this url changes the store (200 response)
  s.get(url)
  #use the same session to return broccoli price
  r = s.get('https://www.paknsaveonline.co.nz/product/5039956_ea_000pns?name=broccoli')
  soup = bs(r.content,'html.parser')
  cents =  str(soup.find_all('span', {'class': "fs-price-lockup__cents"}))
  dollars =  str(soup.find_all('span', {'class': "fs-price-lockup__dollars"}))
  centsprice =re.findall(cents_pattern, cents)
  dollarsprice = re.findall(dollars_pattern, dollars)
  print(dollarsprice, centsprice)

Google Colab file

Solution

Looking at the actual requests the browser makes, you first need to get some cookies from the base URL; only then can you change the store for that session. You can't change the store by calling the ChangeStore URL directly. So call the base URL first, then the change-store URL, and then the base URL again to get the $0.99 price.
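The order of requests described above can be sketched as a small helper. This is only an illustration, assuming the site behaves as described in this answer; `fetch_with_store` is a hypothetical name, and `session` is assumed to be a `requests.Session` (or anything with a compatible `get` method).

```python
def fetch_with_store(session, page_url, change_store_url):
    """Return the HTML of page_url after switching store within one session.

    Hypothetical helper: `session` is assumed to behave like requests.Session.
    """
    # 1. Visit the page first so the server can set its initial cookies.
    session.get(page_url).raise_for_status()
    # 2. Ask the server to switch the store for this session.
    session.get(change_store_url).raise_for_status()
    # 3. Request the page again; it should now reflect the selected store.
    response = session.get(page_url)
    response.raise_for_status()
    return response.text
```

With `requests` installed, it could be called as `fetch_with_store(requests.Session(), baseurl, url)`, using the two URLs from the question. The `raise_for_status()` calls are an addition that makes a failed request visible instead of silently parsing an error page.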

import requests
from bs4 import BeautifulSoup as bs
import re

dollars_pattern = '>([0-9][0-9]?)'
cents_pattern = '>([0-9][0-9])'


url = 'https://www.paknsaveonline.co.nz/CommonApi/Store/ChangeStore?storeId=65defcf2-bc15-490e-a84f-1f13b769cd22'
baseurl="https://www.paknsaveonline.co.nz/product/5039956_ea_000pns?name=broccoli"
header = {
  "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
  "X-Requested-With": "XMLHttpRequest"
}

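with requests.session() as s:
  # visit the product page first so the session receives the site's cookies
  s.get(baseurl)
  # now change the store for this session (200 response)
  s.get(url)
  # use the same session to return the broccoli price for the selected store
  r = s.get(baseurl)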
  soup = bs(r.content,'html.parser')
  cents =  str(soup.find_all('span', {'class': "fs-price-lockup__cents"}))
  dollars =  str(soup.find_all('span', {'class': "fs-price-lockup__dollars"}))
  centsprice =re.findall(cents_pattern, cents)
  dollarsprice = re.findall(dollars_pattern, dollars)
  print(dollarsprice, centsprice)
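The code above prints the dollars and cents as two separate lists of strings. If a single number is more convenient for meal planning, the two parts could be combined as in the sketch below. `combine_price` is a hypothetical helper, and the sample strings are assumed to look like what `str(soup.find_all(...))` returns for this page; the real markup may differ.

```python
import re
from decimal import Decimal

# Same patterns as in the code above.
DOLLARS_PATTERN = r'>([0-9][0-9]?)'
CENTS_PATTERN = r'>([0-9][0-9])'

def combine_price(dollars_html, cents_html):
    """Return the first price found as a Decimal, or None if a part is missing."""
    dollars = re.findall(DOLLARS_PATTERN, dollars_html)
    cents = re.findall(CENTS_PATTERN, cents_html)
    if not dollars or not cents:
        return None
    return Decimal(f"{dollars[0]}.{cents[0]}")

# Assumed sample input, modelled on the class names used in the question.
sample_dollars = '[<span class="fs-price-lockup__dollars">0</span>]'
sample_cents = '[<span class="fs-price-lockup__cents">99</span>]'
print(combine_price(sample_dollars, sample_cents))  # prints 0.99
```

`Decimal` is used rather than `float` so that prices add up exactly when totalling a shopping list.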

Let me know if you have any questions :)
