使用BeautifulSoup进行硒滚动和刮除会产生重复的结果 [英] Selenium scrolling and scraping with BeautifulSoup produces duplicate results
问题描述
我有这个脚本可以从Instagram下载图像.我唯一遇到的问题是,当Selenium开始向下滚动到网页底部时,在请求被循环之后,BeautifulSoup开始抓取相同的img src
链接.
I have this script to download images from Instagram. The only issue I am having is that when Selenium starts scrolling down to the bottom of the webpage, BeautifulSoup starts grabbing the same img src
links after requests is being looped.
尽管它将继续向下滚动并下载图片,但在完成所有操作后,我最终将得到2或3个重复项.所以我的问题是,有没有办法防止这种重复发生?
Although it will continue to scroll down and download pictures, after all that is done, I end up having 2 or 3 duplicates. So my question is is there a way of preventing this duplication from happening?
import requests
from bs4 import BeautifulSoup
import selenium.webdriver as webdriver
url = ('https://www.instagram.com/kitties')
driver = webdriver.Firefox()
driver.get(url)
scroll_delay = 0.5
last_height = driver.execute_script("return document.body.scrollHeight")
counter = 0
print('[+] Downloading:\n')
def screens(get_name):
with open("/home/cha0zz/Desktop/photos/img_{}.jpg".format(get_name), 'wb') as f:
r = requests.get(img_url)
f.write(r.content)
while True:
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(scroll_delay)
new_height = driver.execute_script("return document.body.scrollHeight")
soup = BeautifulSoup(driver.page_source, 'lxml')
imgs = soup.find_all('img', class_='_2di5p')
for img in imgs:
img_url = img["src"]
print('=> [+] img_{}'.format(counter))
screens(counter)
counter = counter + 1
if new_height == last_height:
break
last_height = new_height
更新:
因此,我将这部分代码放在while True
之外,并让selenium首先加载整个页面,以期希望bs4抓取所有图像.它只能工作到30号,然后停止.
Update:
So I placed this part of the code outside of while True
and let selenium load the whole page first in order to hopefully have bs4 scrape all the images. It works to number 30 only and then stops.
soup = BeautifulSoup(driver.page_source, 'lxml')
imgs = soup.find_all('img', class_='_2di5p')
for img in imgs:
#tn = datetime.now().strftime('%H:%M:%S')
img_url = img["src"]
print('=> [+] img_{}'.format(counter))
screens(counter)
counter = counter + 1
推荐答案
之所以仅在脚本的第二个版本中加载30,是因为其余元素已从页面DOM中删除,并且不再是页面DOM的一部分. BeautifulSoup看到的来源.解决方案是继续做您第一次的工作,但是在遍历列表并调用screens()
之前,请删除所有重复的元素.您可以使用集合,如下所示,虽然我不确定这是否是绝对最有效的方法:
The reason it only loads 30 in your second version of your script is because the rest of the elements are removed from the page DOM and are no longer part of the source that BeautifulSoup sees. The solution is to keep doing what you were doing the first time, but to remove any duplicate elements before you iterate through the list and call screens()
. You can do this using sets as below, though I'm not sure if this is the absolute most efficient way to do it:
import requests
import selenium.webdriver as webdriver
import time
driver = webdriver.Firefox()
url = ('https://www.instagram.com/cats/?hl=en')
driver.get(url)
scroll_delay = 3
last_height = driver.execute_script("return document.body.scrollHeight")
counter = 0
print('[+] Downloading:\n')
def screens(get_name):
with open("test_images/img_{}.jpg".format(get_name), 'wb') as f:
r = requests.get(img_url)
f.write(r.content)
old_imgs = set()
while True:
imgs = driver.find_elements_by_class_name('_2di5p')
imgs_dedupe = set(imgs) - set(old_imgs)
for img in imgs_dedupe:
img_url = img.get_attribute("src")
print('=> [+] img_{}'.format(counter))
screens(counter)
counter = counter + 1
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(scroll_delay)
new_height = driver.execute_script("return document.body.scrollHeight")
old_imgs = imgs
if new_height == last_height:
break
last_height = new_height
driver.quit()
如您所见,我使用了另一页进行测试,其中包含420张猫的图像.结果是420张图片,即该帐户上的帖子数量,其中没有重复.
As you can see, I used a different page to test it, one with 420 images of cats. The result was 420 images, the number of posts on that account, with no duplicates among them.
这篇关于使用BeautifulSoup进行硒滚动和刮除会产生重复的结果的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!