How to web-scrape all shoes on a Nike page using Python

Problem description

I am trying to scrape all the shoes on https://www.nike.com/w/mens-shoes-nik1zy7ok. How do I scrape all of them, including the ones that load as you scroll down the page?

The exact information I want is inside the div elements with the class "product-card__body", for example:

<div class="product-card__body " data-el-type="Card"><figure><a class="product-card__link-overlay" href="https://www.nike.com/t/air-force-1-07-mens-shoe-TjqcX1/CJ0952-001">Nike Air Force 1 '07</a><a class="product-card__img-link-overlay" href="https://www.nike.com/t/air-force-1-07-mens-shoe-TjqcX1/CJ0952-001" aria-describedby="Nike Air Force 1 '07" data-el-type="Hero"><div class="image-loader css-zrrhrw product-card__hero-image is--loaded"><picture><source srcset="https://static.nike.com/a/images/c_limit,w_592,f_auto/t_product_v1/s12ff321cn2nykxhva9j/air-force-1-07-mens-shoe-TjqcX1.jpg" media="(min-width: 1024px)"><source srcset="https://static.nike.com/a/images/c_limit,w_592,f_auto/t_product_v1/s12ff321cn2nykxhva9j/air-force-1-07-mens-shoe-TjqcX1.jpg" media="(max-width: 1023px) and (-webkit-min-device-pixel-ratio: 2), (min-resolution: 192dpi)"><source srcset="https://static.nike.com/a/images/c_limit,w_318,f_auto/t_product_v1/s12ff321cn2nykxhva9j/air-force-1-07-mens-shoe-TjqcX1.jpg" media="(max-width: 1023px)"><img src="https://static.nike.com/a/images/c_limit,w_318,f_auto/t_product_v1/s12ff321cn2nykxhva9j/air-force-1-07-mens-shoe-TjqcX1.jpg" alt="Nike Air Force 1 '07 Men's Shoe"></picture></div></a><div class="product-card__info"><div class="product_msg_info"><div class="product-card__titles"><div class="product-card__title " id="Nike Air Force 1 '07">Nike Air Force 1 '07</div><div class="product-card__subtitle ">Men's Shoe</div></div></div><div class="product-card__count-wrapper show--all"><div class="product-card__count-item"><button type="button" aria-expanded="false" class="product-card__colorway-btn"><div aria-label="Available in 3 Colors" aria-describedby="Nike Air Force 1 '07" class="product-card__product-count "><span>3 Colors</span></div></button></div></div><div class="product-card__price-wrapper "><div class="product-card__price"><div><div class="product-price css-11s12ax is--current-price" 
data-test="product-price">$90</div></div></div></div></div></figure></div>

Here is the code I am using:

    html_data = requests.get("https://www.nike.com/w/mens-shoes-nik1zy7ok").text
    shoes = json.loads(re.search(r'window.INITIAL_REDUX_STATE=(\{.*?\});', html_data).group(1))

Right now it only retrieves the shoes that load initially on the page. How do I get the rest of the shoes as well and append them to the shoes variable?

Recommended answer

By examining the API calls made by the website, you can find a cryptic URL starting with https://api.nike.com/. This URL is also stored in the INITIAL_REDUX_STATE that you already used to get the first few products. So I simply extend your approach:

import requests
import json
import re

# your product page
uri = 'https://www.nike.com/w/mens-shoes-nik1zy7ok'

base_url = 'https://api.nike.com'
session = requests.Session()

def get_lazy_products(stub, products):
    """Get the lazily loaded products."""
    response = session.get(base_url + stub).json()
    # stub of the next page; falsy once the last page has been reached
    next_products = response['pages']['next']
    # extend the caller's list in place with this page's products
    products += response['objects']
    if next_products:
        get_lazy_products(next_products, products)
    return products

# find INITIAL_REDUX_STATE
html_data = session.get(uri).text
redux = json.loads(re.search(r'window.INITIAL_REDUX_STATE=(\{.*?\});', html_data).group(1))

# find the initial products and the api entry point for the recursive loading of additional products
wall = redux['Wall']
initial_products = re.sub('anchor=[0-9]+', 'anchor=0', wall['pageData']['next'])

# find all the products
products = get_lazy_products(initial_products, [])

# Optional: filter by id to get a list with unique products
cloudProductIds = set()
unique_products = []
for product in products:
    try:
        if product['id'] not in cloudProductIds:
            cloudProductIds.add(product['id'])
            unique_products.append(product)
    except KeyError:
        print(product)
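
The recursive helper above makes one nested call per page, so a very long listing could approach Python's recursion limit. The following is a minimal iterative sketch of the same idea. It assumes, as in the answer, that each response carries its items under 'objects' and the next stub under 'pages' and then 'next'. The name collect_pages, the fetch parameter, and the guard against revisiting a stub are hypothetical additions, introduced so the loop can be tried without network access.

```python
from typing import Callable, Dict, List, Optional


def collect_pages(first_stub: str, fetch: Callable[[str], Dict]) -> List[Dict]:
    """Follow 'pages' -> 'next' stubs in a loop and gather every 'objects' list.

    `fetch` is a hypothetical callable that takes a stub and returns the
    decoded JSON response as a dict.
    """
    products: List[Dict] = []
    seen = set()  # guard (an addition): stop if a stub repeats
    stub: Optional[str] = first_stub
    while stub and stub not in seen:
        seen.add(stub)
        response = fetch(stub)
        products.extend(response.get('objects') or [])
        stub = (response.get('pages') or {}).get('next')
    return products


if __name__ == '__main__':
    # Made-up pages standing in for API responses.
    fake_pages = {
        '/demo?anchor=0': {'objects': [{'id': 'a'}, {'id': 'b'}],
                           'pages': {'next': '/demo?anchor=2'}},
        '/demo?anchor=2': {'objects': [{'id': 'c'}],
                           'pages': {'next': ''}},
    }
    print(collect_pages('/demo?anchor=0', fake_pages.__getitem__))
```

With the names from the answer, a call might look like collect_pages(initial_products, lambda stub: session.get(base_url + stub).json()).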

The API also returns the total number of products, though this number seems to vary depending on the count parameter in the API's URL.
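
To experiment with the count parameter, the same regular-expression substitution the answer uses for anchor can be applied to count. This is only a sketch: the helper name set_count and the example stub are hypothetical, and whether the API accepts other count values has not been verified here.

```python
import re


def set_count(stub: str, count: int) -> str:
    """Replace the value of 'count=<digits>' in a URL stub.

    Mirrors the answer's re.sub call for 'anchor'. If the stub contains
    no literal 'count=<digits>', it is returned unchanged.
    """
    return re.sub(r'\bcount=[0-9]+', f'count={count}', stub)


if __name__ == '__main__':
    # Made-up stub; the real one comes from wall['pageData']['next'].
    print(set_count('/demo?anchor=0&count=24', 48))
```

The modified stub could then be passed to get_lazy_products in place of initial_products. Later pages use the next stub returned by the API, which may carry its own count value.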

Do you need help parsing or aggregating the results?
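
As a starting point for parsing or aggregating, here is a small sketch that keeps only chosen keys from each product and counts products per value of one key. Only the 'id' key is confirmed by the code above. The other keys in the sample ('title', 'subtitle') are hypothetical, borrowed from the class names in the question's HTML, and the real API objects may be structured differently.

```python
from collections import Counter
from typing import Dict, Iterable, List, Sequence


def project(products: Iterable[Dict], fields: Sequence[str]) -> List[Dict]:
    """Keep only the chosen keys of each product; missing keys become None."""
    return [{field: product.get(field) for field in fields} for product in products]


def count_by(products: Iterable[Dict], field: str) -> Counter:
    """Count how many products share each value of `field`."""
    return Counter(product.get(field) for product in products)


# Hypothetical records; replace with unique_products from the answer.
sample = [
    {'id': '1', 'title': "Nike Air Force 1 '07", 'subtitle': "Men's Shoe"},
    {'id': '2', 'title': 'Example Runner', 'subtitle': "Men's Shoe"},
]
rows = project(sample, ['id', 'title'])
per_subtitle = count_by(sample, 'subtitle')
```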
