How to get all the data from a webpage that uses a lazy-loading method?


Problem description

I've written a script in Python using Selenium to scrape the names and prices of different products from the Redmart website. My scraper clicks on a link, goes to its target page and parses the data there. However, the issue I'm facing with this crawler is that it scrapes very few items from a page because the page loads its content lazily. How can I get all the data from each page by controlling the lazy-loading process? I tried the execute_script method but I did it wrongly. Here is the script I'm trying with:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://redmart.com/bakery")
wait = WebDriverWait(driver, 10)

counter = 0    
while True:

    try:
        wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "li.image-facets-pill")))
        driver.find_elements_by_css_selector('img.image-facets-pill-image')[counter].click()      
        counter += 1    
    except IndexError:
        break 

    # driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    for elems in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "li.productPreview"))):
        name = elems.find_element_by_css_selector('h4[title] a').text
        price = elems.find_element_by_css_selector('span[class^="ProductPrice__"]').text
        print(name, price)

    driver.back()
driver.quit() 

Recommended answer

I guess you could use Selenium for this, but if speed is your concern (@Andersson crafted the Selenium code for you in another question on Stack Overflow), then you should instead replicate the API calls that the site uses and extract the data from the JSON, just like the site does.

If you use the Chrome inspector you'll see that, for each of the categories in your outer while loop (the try block in your original code), the site calls an API that returns the site's overall categories. All this data can be retrieved like so:

import requests

categories_api = 'https://api.redmart.com/v1.5.8/catalog/search?extent=0&depth=1'
r = requests.get(categories_api).json()

For the next API calls you need to grab the URIs for the bakery categories. This can be done like so:

bakery_item = [e for e in r['categories'] if e['title'] == 'Bakery']
children = bakery_item[0]['children']
uris = [c['uri'] for c in children]
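
If that list comprehension comes back empty, the category title may be spelled differently; a quick sanity check (assuming the response shape used above, with a 'categories' list whose entries carry a 'title' key) is to print the available titles first:

# Print the top-level category titles to confirm 'Bakery' appears as expected
print([c['title'] for c in r['categories']])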

uris will now be a list of strings (['bakery-bread', 'breakfast-treats-212', 'sliced-bread-212', 'wraps-pita-indian-breads', 'rolls-buns-212', 'baked-goods-desserts', 'loaves-artisanal-breads-212', 'frozen-part-bake', 'long-life-bread-toast', 'speciality-212']) that you pass on to another API, also found with the Chrome inspector, which the site uses to load content.

This API has the following form (by default it returns a smaller pageSize, but I bumped it to 500 to be reasonably sure you get all the data in one request):

items_API = 'https://api.redmart.com/v1.5.8/catalog/search?pageSize=500&sort=1024&category={}'

for uri in uris:
    r = requests.get(items_API.format(uri)).json()
    products = r['products']
    for product in products:
        name = product['title']
        # if promo_price is 0.0, fall back to the normal price
        price = product['pricing']['promo_price']
        if price == 0.0:
            price = product['pricing']['price']
        print("Name: {}. Price: {}".format(name, price))
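
Putting it together, a self-contained sketch of the whole flow might look like the following (same endpoints and JSON fields as above; the requests.Session and the timeout are my own additions, not something the original answer specifies):

import requests

session = requests.Session()

# 1. Fetch the top-level category tree
categories_api = 'https://api.redmart.com/v1.5.8/catalog/search?extent=0&depth=1'
catalog = session.get(categories_api, timeout=30).json()

# 2. Collect the URIs of the Bakery sub-categories
bakery = [e for e in catalog['categories'] if e['title'] == 'Bakery'][0]
uris = [c['uri'] for c in bakery['children']]

# 3. Fetch up to 500 products per sub-category and print name and price
items_api = 'https://api.redmart.com/v1.5.8/catalog/search?pageSize=500&sort=1024&category={}'
for uri in uris:
    data = session.get(items_api.format(uri), timeout=30).json()
    for product in data['products']:
        price = product['pricing']['promo_price']
        if price == 0.0:  # no promotion, use the regular price
            price = product['pricing']['price']
        print("Name: {}. Price: {}".format(product['title'], price))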

If you still want to stick with Selenium, you could insert something like this to handle the lazy loading. Questions about scrolling have been answered several times before, so yours is actually a duplicate. In the future you should show what you tried (your own effort on the execute_script part) and include the traceback.

import time

check_height = driver.execute_script("return document.body.scrollHeight;")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)
    height = driver.execute_script("return document.body.scrollHeight;")
    if height == check_height:
        break
    check_height = height
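
As a sketch of how that snippet might slot into your original crawler (the scroll_to_bottom helper name and the 5-second pause are my own choices, and this assumes the same selectors and Selenium version your code already uses), you would scroll each category page to the bottom before collecting the product tiles:

import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


def scroll_to_bottom(driver, pause=5):
    # Scroll until the page height stops growing, i.e. no more lazily loaded content
    check_height = driver.execute_script("return document.body.scrollHeight;")
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)
        height = driver.execute_script("return document.body.scrollHeight;")
        if height == check_height:
            break
        check_height = height

driver = webdriver.Chrome()
driver.get("https://redmart.com/bakery")
wait = WebDriverWait(driver, 10)

counter = 0
while True:
    try:
        wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "li.image-facets-pill")))
        driver.find_elements_by_css_selector('img.image-facets-pill-image')[counter].click()
        counter += 1
    except IndexError:
        break

    # Force all lazily loaded products onto the page before parsing it
    scroll_to_bottom(driver)

    for elems in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "li.productPreview"))):
        name = elems.find_element_by_css_selector('h4[title] a').text
        price = elems.find_element_by_css_selector('span[class^="ProductPrice__"]').text
        print(name, price)

    driver.back()
driver.quit()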
