Loop through webpages and download all images

Question

I have a nice URL structure to loop through:

https://marco.ccr.buffalo.edu/images?page=0&score=Clear
https://marco.ccr.buffalo.edu/images?page=1&score=Clear
https://marco.ccr.buffalo.edu/images?page=2&score=Clear
...

I want to loop through each of these pages and download the 21 images (JPEG or PNG). I've seen several Beautiful Soup examples, but I'm still struggling to get something that will download multiple images and loop through the URLs. I think I can use urllib to loop through each URL like this, but I'm not sure where the image saving comes in. Any help would be appreciated, and thanks in advance!

import urllib.request

# This only fetches each listing page's HTML; the images still need to be parsed out and saved.
for i in range(0, 10):
    urllib.request.urlretrieve('https://marco.ccr.buffalo.edu/images?page=' + str(i) + '&score=Clear')

I was trying to follow this post but I was unsuccessful: How to extract and download all images from a website using beautifulSoup?
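For reference, the generic pattern from that post boils down to: fetch the page, find every img tag, resolve each src against the page URL, and write the bytes to disk. Here is a minimal sketch of that pattern (an illustration, not code from the linked post; it assumes requests and BeautifulSoup are installed and that the images are reachable via plain src attributes):

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def download_all_images(page_url, out_dir='.'):
    # Fetch the page and parse out every <img> tag.
    html = requests.get(page_url).text
    page = BeautifulSoup(html, 'html.parser')
    for n, img in enumerate(page.find_all('img'), 1):
        src = img.get('src')
        if not src:
            continue
        # Resolve site-relative paths like /images/foo.jpg against the page URL.
        img_url = urljoin(page_url, src)
        ext = src.rsplit('.', 1)[-1] if '.' in src else 'jpg'  # naive extension guess
        with open(f'{out_dir}/img_{n}.{ext}', 'wb') as f:
            f.write(requests.get(img_url).content)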

Answer

You can use requests:

from bs4 import BeautifulSoup as soup
import requests, contextlib, re, os

@contextlib.contextmanager
def get_images(url: str):
  # Parse the listing page and collect [src, extension] pairs for every
  # anchor that links to an image detail page (/image/<id>).
  d = soup(requests.get(url).text, 'html.parser')
  yield [[i.find('img')['src'], re.findall(r'(?<=\.)\w+$', i.find('img')['alt'])[0]]
         for i in d.find_all('a') if re.findall(r'/image/\d+', i.get('href', ''))]

n = 3  # number of pages to scrape (end value)
os.makedirs('MARCO_images', exist_ok=True)  # folder can be named anything, as long as the same name is used when saving below
for i in range(n):
  with get_images(f'https://marco.ccr.buffalo.edu/images?page={i}&score=Clear') as links:
    print(links)
    for c, [link, ext] in enumerate(links, 1):
      with open(f'MARCO_images/MARCO_img_{i}{c}.{ext}', 'wb') as f:
        f.write(requests.get(f'https://marco.ccr.buffalo.edu{link}').content)
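A note on the download loop: requests.get(...).content pulls each image fully into memory and never checks the HTTP status, so a failed request would silently save an error page as an image. A slightly more defensive variant (a sketch, not part of the original answer; save_image and BASE are names introduced here) streams the body in chunks and fails loudly on HTTP errors:

import requests

BASE = 'https://marco.ccr.buffalo.edu'

def save_image(session, link, path):
    # Stream the response so large images are written in chunks
    # instead of being held in memory all at once.
    with session.get(f'{BASE}{link}', stream=True) as r:
        r.raise_for_status()  # fail on 4xx/5xx instead of saving an error page
        with open(path, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)

# Usage with the links produced by get_images above, e.g.:
# with requests.Session() as s:
#     for c, (link, ext) in enumerate(links, 1):
#         save_image(s, link, f'MARCO_images/MARCO_img_{i}{c}.{ext}')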


Checking the contents of the MARCO_images directory now yields:

print(os.listdir('/Users/ajax/MARCO_images'))

Output:

['MARCO_img_1.jpg', 'MARCO_img_10.jpg', 'MARCO_img_11.jpg', 'MARCO_img_12.jpg', 'MARCO_img_13.jpg', 'MARCO_img_14.jpg', 'MARCO_img_15.jpg', 'MARCO_img_16.jpg', 'MARCO_img_17.jpg', 'MARCO_img_18.jpg', 'MARCO_img_19.jpg', 'MARCO_img_2.jpg', 'MARCO_img_20.jpg', 'MARCO_img_21.jpg', 'MARCO_img_3.jpg', 'MARCO_img_4.jpg', 'MARCO_img_5.jpg', 'MARCO_img_6.jpg', 'MARCO_img_7.jpg', 'MARCO_img_8.jpg', 'MARCO_img_9.jpg']
