Trying to scrape multiple URLs, can only scrape 1 (any way to generate a multiple-URL list)?


Problem description

import csv
import requests
from bs4 import BeautifulSoup

urls = ["https://www.medplusmedicalsupply.com/exam-and-diagnostic?product_list_limit=25", "https://www.medplusmedicalsupply.com/exam-and-diagnostic?p=2&product_list_limit=25"]
for url in urls:
    html = requests.get(urls).text
    soup = BeautifulSoup(html, "html.parser")
    products = soup.findAll('div', {"class": "product details product-item-details"})
    all_product = []

for product in products:
    product_details = dict()
    product_details['name'] = product.find('a').text.strip('\n\r\t": ').strip('\n\r\t": ').strip('\n\r\t": ').strip('\n\r\t": ')
    product_details['brand'] = product.find('div', {'class': 'value'}).text.strip('\n\r\t": ').strip('\n\r\t": ').strip('\n\r\t": ')
    product_details['packaging'] = product.find('div', {'class': 'pack'}).text.strip('\n\r\t": ').strip('\n\r\t": ').strip('\n\r\t": ')
    product_details['availability'] = product.find('div', {'class': 'avail pack'}).text.strip('\n\r\t": ').strip('\n\r\t": ').strip('\n\r\t": ')
    product_details['price'] = product.find('span', {'class': 'price'}).text.strip('\n\r\t": ').strip('\n\r\t": ').strip('\n\r\t": ')
    product_details['packaging'] = product_details['packaging'][9:] # here we're cutting redundant part of string "Brand: \n\n"
    product_details['availability'] = product_details['availability'][16:] # here we're cutting redundant part of string "Availability: \n\n"
    all_product.append(product_details)

print(all_product)

with open('products.csv', 'w+') as csvFile:
    writer = csv.writer(csvFile)
    writer.writerow(['Name', 'Brand', 'Packaging', 'Availability', 'Price'])
    for product in all_product:
        writer.writerow([product['name'], product['brand'],product['packaging'], product['availability'], product['price']])

Here is the error raised when trying two URLs:

InvalidSchema: No connection adapters were found for '['https://www.medplusmedicalsupply.com/exam-and-diagnostic?product_list_limit=25', 'https://www.medplusmedicalsupply.com/exam-and-diagnostic?p=2&product_list_limit=25']'
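For reference, this error comes from the loop body calling `requests.get(urls)` with the whole list instead of the single `url` loop variable. requests stringifies the list, finds no `http://` connection adapter for the resulting string, and raises before any network I/O happens. A minimal illustration (using placeholder URLs, no network access needed):

```python
import requests

# Bug reproduction: passing the list itself instead of one URL string.
urls = ["https://example.com/page1", "https://example.com/page2"]

try:
    requests.get(urls)  # should be requests.get(url) inside the for-loop
except requests.exceptions.InvalidSchema as exc:
    # The list was stringified to "['https://...', ...]", which matches
    # no known URL scheme, so no connection adapter is found.
    print(type(exc).__name__)  # InvalidSchema
```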

I am also wondering whether there is a way to generate the page URLs automatically, instead of placing them in the urls variable by hand. The site I want to scrape has thousands of products spread across many pages. Thanks for any help!

Recommended answer

Your code is almost there, but if you have to visit multiple URLs and save all the data, you should:

  • save the parsed products from every page,
  • loop over each page's products and collect the details into one list,
  • write the combined list out once.

My full code:

import csv
import requests
from bs4 import BeautifulSoup

urls = ["https://www.medplusmedicalsupply.com/exam-and-diagnostic?product_list_limit=25",
        "https://www.medplusmedicalsupply.com/exam-and-diagnostic?p=2&product_list_limit=25"]

all_product = []

for url in urls:
    html = requests.get(url).text
    soup = BeautifulSoup(html, "html.parser")
    products = soup.findAll('div', {"class": "product details product-item-details"})
    all_product.append(products)

resultset = []

for products in all_product:
    for product in products:
        product_details = dict()
        product_details['name'] = product.find('a').text.strip('\n\r\t": ')
        product_details['brand'] = product.find('div', {'class': 'value'}).text.strip('\n\r\t": ')
        product_details['packaging'] = product.find('div', {'class': 'pack'}).text.strip('\n\r\t": ')
        product_details['availability'] = product.find('div', {'class': 'avail pack'}).text.strip('\n\r\t": ')
        product_details['price'] = product.find('span', {'class': 'price'}).text.strip('\n\r\t": ')
        product_details['packaging'] = product_details['packaging'][9:] # drop the leading "Packaging" label text
        product_details['availability'] = product_details['availability'][16:] # drop the leading "Availability: \n\n" label text
        resultset.append(product_details)


with open('products.csv', 'w+', newline='') as csvFile:
    writer = csv.writer(csvFile)
    writer.writerow(['Name', 'Brand', 'Packaging', 'Availability', 'Price'])
    for product in resultset:
        writer.writerow([product['name'], product['brand'],product['packaging'], product['availability'], product['price']])
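As for generating the URL list automatically rather than typing each page by hand: judging from the two example URLs, the site paginates with a `p` query parameter (page 1 omits it). A hedged sketch, assuming that pattern holds and that you know the last page number:

```python
BASE = "https://www.medplusmedicalsupply.com/exam-and-diagnostic"

def page_urls(last_page, per_page=25):
    """Build one URL per page; page 1 carries no 'p' parameter."""
    urls = [f"{BASE}?product_list_limit={per_page}"]
    for p in range(2, last_page + 1):
        urls.append(f"{BASE}?p={p}&product_list_limit={per_page}")
    return urls

# Feed the generated list straight into the scraping loop above.
print(page_urls(3))
```

If the last page number is unknown, you can instead loop with an ever-increasing `p` and stop the first time `soup.findAll(...)` comes back empty.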
