Python web scraping - Loop through all categories and subcategories
Question
I am trying to retrieve all categories and subcategories within a retail website. Once I am inside a category, I can use BeautifulSoup to pull every product in it. However, I am struggling with the loop over categories. I'm using this as a test website: https://www.uniqlo.com/us/en/women
How do I loop through each category as well as the subcategories on the left side of the website? The problem is that you would have to click on the category before the website displays all the subcategories. I would like to extract all products within the category/subcategory into a csv file. This is what I have so far:
import csv
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

myurl = 'https://www.uniqlo.com/us/en/women/'
uClient = uReq(myurl)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")

filename = "products.csv"
product_list = []

# Find every <li> that has a class starting with "grid-tile"
containers = page_soup.findAll("li", {"class": lambda L: L and L.startswith('grid-tile')})

for container in containers:
    product_container = container.findAll("div", {"class": "product-swatches"})
    if not product_container:
        continue  # tile without a swatch block
    product_names = product_container[0].findAll("li")
    product_mod_name = ''
    for swatch in product_names:
        try:
            # Keep the part of the image alt text before the first comma
            product_name = swatch.a.img.get("alt")
            product_mod_name = product_name.split(',')[0].lstrip()
            print(product_mod_name)
        except AttributeError:
            # Swatch without a link, an image, or alt text
            continue
    product = [product_mod_name]
    print(product)
    product_list.append(product)

# Write one product name per row
with open(filename, 'w', newline='') as file:
    writer = csv.writer(file)
    for row in product_list:
        writer.writerow(row)
Recommended answer
You can try this script. It goes through the categories and subcategories of products and parses the title and price of each one. Several products share the same name and differ only in color, so don't count them as duplicates. I've written the script very compactly, so expand it as you see fit:
import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.uniqlo.com/us/en/women')
soup = BeautifulSoup(res.text, "lxml")

# Top-level categories in the left-hand navigation
for items in soup.select("#category-level-1 .refinement-link"):
    page = requests.get(items['href'])
    broth = BeautifulSoup(page.text, "lxml")
    # Subcategories, which are only listed on the category page itself
    for links in broth.select("#category-level-2 .refinement-link"):
        req = requests.get(links['href'])
        sauce = BeautifulSoup(req.text, "lxml")
        # Each product tile holds a name link and one or more price spans
        for data in sauce.select(".product-tile-info"):
            title = data.select(".name-link")[0].text
            price = ' '.join([item.text for item in data.select(".product-pricing span")])
            print(title.strip(), price.strip())
The output looks like this:
WOMEN CASHMERE CREW NECK SWEATER $79.90
Women Extra Fine Merino Crew Neck Sweater $29.90 $19.90
WOMEN KAWS X PEANUTS LONG-SLEEVE HOODED SWEATSHIRT $19.90
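The question asks for a csv file, while the script prints its results. One possible way to bridge that gap is sketched below, using only the standard library. The helper names make_row and write_products and the column names title, price_1 and price_2 are hypothetical and not part of the original answer. The sample values are taken from the output above, and the sketch assumes a product shows at most two prices.

```python
import csv
import io


def make_row(title, price):
    """Build one CSV row from raw title and price strings (hypothetical helper)."""
    # "$29.90 $19.90" -> ["$29.90", "$19.90"]; a single price leaves the second column empty
    prices = price.split()
    first, second = (prices + ["", ""])[:2]
    return [title.strip(), first, second]


def write_products(rows, out):
    """Write a header and the product rows to an open file-like object."""
    writer = csv.writer(out)
    writer.writerow(["title", "price_1", "price_2"])  # assumed column names
    writer.writerows(rows)


# Sample values copied from the output shown above
rows = [
    make_row("WOMEN CASHMERE CREW NECK SWEATER", "$79.90"),
    make_row("Women Extra Fine Merino Crew Neck Sweater", "$29.90 $19.90"),
    make_row("WOMEN KAWS X PEANUTS LONG-SLEEVE HOODED SWEATSHIRT", "$19.90"),
]

# An in-memory buffer stands in for a real file in this sketch
buffer = io.StringIO()
write_products(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To use this with the script, the print call in the innermost loop could be replaced by rows.append(make_row(title, price)), and the buffer by a file opened with open('products.csv', 'w', newline='').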