Selenium scrolling and scraping with BeautifulSoup produces duplicate results


Problem Description

I have this script to download images from Instagram. The only issue I am having is that when Selenium starts scrolling down to the bottom of the webpage, BeautifulSoup starts grabbing the same img src links again on each pass of the loop.

Although it will continue to scroll down and download pictures, after all that is done I end up with 2 or 3 duplicates. So my question is: is there a way of preventing this duplication from happening?

import requests
from bs4 import BeautifulSoup
import selenium.webdriver as webdriver
import time  # required for the time.sleep() call in the scroll loop


url = ('https://www.instagram.com/kitties')
driver = webdriver.Firefox()
driver.get(url)

scroll_delay = 0.5
last_height = driver.execute_script("return document.body.scrollHeight")
counter = 0

print('[+] Downloading:\n')

def screens(get_name):
    # Saves the image at the module-level img_url under a numbered filename
    with open("/home/cha0zz/Desktop/photos/img_{}.jpg".format(get_name), 'wb') as f:
        r = requests.get(img_url)
        f.write(r.content)

while True:

    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(scroll_delay)
    new_height = driver.execute_script("return document.body.scrollHeight")

    # Re-parse the full page source; images already scraped on the previous
    # pass may still be present, so they get downloaded again
    soup = BeautifulSoup(driver.page_source, 'lxml')
    imgs = soup.find_all('img', class_='_2di5p')
    for img in imgs:
        img_url = img["src"]
        print('=> [+] img_{}'.format(counter))
        screens(counter)
        counter = counter + 1

    # Stop once scrolling no longer increases the page height
    if new_height == last_height:
        break
    last_height = new_height


Update: So I placed this part of the code outside of the while True loop and let Selenium load the whole page first, in order to hopefully have bs4 scrape all the images. It only works up to number 30 and then stops.

soup = BeautifulSoup(driver.page_source, 'lxml')
imgs = soup.find_all('img', class_='_2di5p')
for img in imgs:
    #tn = datetime.now().strftime('%H:%M:%S')
    img_url = img["src"]
    print('=> [+] img_{}'.format(counter))
    screens(counter)
    counter = counter + 1

Recommended Answer

The reason it only loads 30 in your second version of your script is because the rest of the elements are removed from the page DOM and are no longer part of the source that BeautifulSoup sees. The solution is to keep doing what you were doing the first time, but to remove any duplicate elements before you iterate through the list and call screens(). You can do this using sets as below, though I'm not sure if this is the absolute most efficient way to do it:

import requests
import selenium.webdriver as webdriver
import time

driver = webdriver.Firefox()

url = ('https://www.instagram.com/cats/?hl=en')
driver.get(url)

scroll_delay = 3
last_height = driver.execute_script("return document.body.scrollHeight")
counter = 0

print('[+] Downloading:\n')

def screens(get_name):
    with open("test_images/img_{}.jpg".format(get_name), 'wb') as f:
        r = requests.get(img_url)
        f.write(r.content)

old_imgs = set()  # elements already processed on previous passes

while True:

    # Find the image elements currently in the live DOM via Selenium
    imgs = driver.find_elements_by_class_name('_2di5p')

    # Set difference keeps only the elements not seen on the previous pass
    imgs_dedupe = set(imgs) - set(old_imgs)

    for img in imgs_dedupe:
        img_url = img.get_attribute("src")
        print('=> [+] img_{}'.format(counter))
        screens(counter)
        counter = counter + 1

    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(scroll_delay)
    new_height = driver.execute_script("return document.body.scrollHeight")

    old_imgs = imgs  # remember this pass's elements for the next set difference

    if new_height == last_height:
        break
    last_height = new_height

driver.quit()

As you can see, I used a different page to test it, one with 420 images of cats. The result was 420 images, the number of posts on that account, with no duplicates among them.
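
A variation on the same idea, in case it is useful: instead of comparing the WebElement objects themselves, you could track the src URLs you have already downloaded in a set. This is a minimal sketch under the same assumptions as the code above (the _2di5p class name from the question, and the same Selenium-era find_elements_by_class_name API), not a tested implementation:

import requests
import selenium.webdriver as webdriver
import time

driver = webdriver.Firefox()
driver.get('https://www.instagram.com/cats/?hl=en')

scroll_delay = 3
last_height = driver.execute_script("return document.body.scrollHeight")
seen_urls = set()  # every src downloaded so far
counter = 0

while True:
    for img in driver.find_elements_by_class_name('_2di5p'):
        img_url = img.get_attribute("src")
        if img_url in seen_urls:
            continue  # already downloaded on an earlier pass
        seen_urls.add(img_url)
        with open("test_images/img_{}.jpg".format(counter), 'wb') as f:
            f.write(requests.get(img_url).content)
        counter += 1

    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(scroll_delay)
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

driver.quit()

Keying the set on the URL means an image is skipped even if Instagram recreates its DOM node between passes, a case where an element-based set difference would count it as new.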
