Python web scraping - Loop through all categories and subcategories


Question

I am trying to retrieve all categories and subcategories within a retail website. I am able to use BeautifulSoup to pull every single product in a category once I am in it. However, I am struggling with the loop over categories. I'm using this as a test website: https://www.uniqlo.com/us/en/women

How do I loop through each category as well as the subcategories on the left side of the website? The problem is that you would have to click on the category before the website displays all the subcategories. I would like to extract all products within the category/subcategory into a csv file. This is what I have so far:

import csv
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

myurl = 'https://www.uniqlo.com/us/en/women/'
uClient = uReq(myurl)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
product_list = []

# Find all li with a class starting with 'grid-tile'
containers = page_soup.findAll("li", {"class": lambda L: L and L.startswith('grid-tile')})

for container in containers:
    product_container = container.findAll("div", {"class": "product-swatches"})
    product_names = product_container[0].findAll("li")

    for product_item in product_names:
        try:
            # The product name is the alt text of the swatch image
            product_name = product_item.a.img.get("alt")
            product_mod_name = product_name.split(',')[0].lstrip()
        except AttributeError:
            product_mod_name = ''
        print(product_mod_name)
        product_list.append([product_mod_name])

with open('products.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    for row in product_list:
        writer.writerow(row)

Answer

You can try this script. It will go through the different categories and subcategories of products and parse their titles and prices. Several products share the same name and differ only in color, so don't count them as duplicates. I've written the script in a very compact manner, so stretch it out as you see fit:

import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.uniqlo.com/us/en/women')
soup = BeautifulSoup(res.text, "lxml")

# Follow each top-level category link in the left-hand menu
for items in soup.select("#category-level-1 .refinement-link"):
    page = requests.get(items['href'])
    broth = BeautifulSoup(page.text, "lxml")

    # Each category page lists its own subcategory links
    for links in broth.select("#category-level-2 .refinement-link"):
        req = requests.get(links['href'])
        sauce = BeautifulSoup(req.text, "lxml")

        # Parse the title and price of every product tile on the subcategory page
        for data in sauce.select(".product-tile-info"):
            title = data.select(".name-link")[0].text
            price = ' '.join([item.text for item in data.select(".product-pricing span")])
            print(title.strip(), price.strip())

The results look like:

WOMEN CASHMERE CREW NECK SWEATER $79.90
Women Extra Fine Merino Crew Neck Sweater $29.90 $19.90
WOMEN KAWS X PEANUTS LONG-SLEEVE HOODED SWEATSHIRT $19.90
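Since the question asked for a csv file rather than printed output, the `(title, price)` pairs the script collects can be written out with Python's `csv` module. A minimal sketch, using stand-in rows in place of the live scrape results:

```python
import csv

# Stand-in for the (title, price) pairs the scraper would collect.
rows = [
    ("WOMEN CASHMERE CREW NECK SWEATER", "$79.90"),
    ("Women Extra Fine Merino Crew Neck Sweater", "$29.90 $19.90"),
]

with open("products.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "price"])  # header row
    writer.writerows(rows)
```

In the real script you would collect the pairs into a list inside the innermost loop (`rows.append((title.strip(), price.strip()))`) and write the file once after all pages have been fetched, rather than reopening it per product.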

