Python3 urllib image retrieval


Question

I'm writing a small Python script to grab images via Google Images. I've managed to get to the point where I have the URLs of the images I want in a handy list. Now I just need to download them.

For each image URL, I do this:

    print("Retrieving:{0}".format(sFinalImageURL))
    sExt = sFinalImageURL.split('.')[-1]
    #u = urllib.request.urlopen(sFinalImageURL)
    try:
        u = urllib.request.urlopen(sFinalImageURL)
    except:
        print("error: cannot retrieve image")
        continue
    raw_data = u.read()
    print("read {0} bytes".format(len(raw_data)))
    u.close()
    global sImagesFolder
    try:
        f = open("{0}/{1}_{2}.{3}".format(sImagesFolder,sImage,i,sExt),'wb')
        f.write(raw_data)
        f.close()
    except:
        print("couldn't write to {0}/{1}_{2}.{3}".format(sImagesFolder,sImage,i,sExt))
    print()

Here is the problem: trying to open some of the URLs gives me a 403, even though I can open the same URLs directly in my browser. So there's something in the HTTP request headers that the image server doesn't like. Any ideas?

Here is some of the output:

Retrieving:http://upload.wikimedia.org/wikipedia/commons/thumb/4/43/Timba%2B1.jpg/220px-Timba%2B1.jpg
error: cannot retrieve image

Retrieving:http://upload.wikimedia.org/wikipedia/commons/thumb/2/26/YellowLabradorLooking_new.jpg/260px-YellowLabradorLooking_new.jpg
error: cannot retrieve image

Retrieving:http://1.bp.blogspot.com/-7SsJ1n3RdoA/Tf07NOgD5nI/AAAAAAAAABo/tl8qLLIU01Y/s1600/english-shepherd-dog-0003.jpg
read 11123 bytes

Retrieving:http://completedogfood.net/wp-content/uploads/2010/07/complete-dog-food.bmp
read 419630 bytes

Recommended answer

It seems that Wikipedia rejects requests that do not look like they come from a real browser.
The problem can be solved by specifying the User-Agent string of a real browser, because Python's urllib sends something like Python-urllib/3.2 by default.
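To check what urllib announces itself as, one can look at the default headers that an opener attaches to every request. This is only a small sketch; the version suffix depends on the interpreter it runs on, so it will not necessarily read 3.2.

```python
import urllib.request

# An OpenerDirector keeps the headers urllib adds to every request
# in its addheaders list, as (name, value) pairs.
opener = urllib.request.OpenerDirector()
default_headers = dict(opener.addheaders)

# The value follows the running interpreter, e.g. "Python-urllib/3.2" on Python 3.2.
print(default_headers["User-agent"])
```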

Here's an example that works (with the User-Agent string of the browser that I use):

import urllib.request

url = 'http://upload.wikimedia.org/wikipedia/commons/thumb/4/43/Timba%2B1.jpg/220px-Timba%2B1.jpg'
user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.19 (KHTML, like Gecko) Ubuntu/12.04 Chromium/18.0.1025.168 Chrome/18.0.1025.168 Safari/535.19'
# Wrap the URL in a Request so the browser-like User-Agent header is sent with it.
u = urllib.request.urlopen(urllib.request.Request(url, headers={'User-Agent': user_agent}))
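Applied to the download loop from the question, the same idea might look roughly like the sketch below. The helper names `build_request`, `extension_from_url` and `fetch_image`, the `timeout` value and the `"jpg"` fallback are assumptions made for illustration and are not part of the original post. Catching `urllib.error.HTTPError` separately also makes the 403 visible, where the bare `except` in the question hides it.

```python
import os
import urllib.error
import urllib.parse
import urllib.request

# Assumption: any browser-like string should do; this one is copied from the answer.
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.19 (KHTML, like Gecko) "
    "Ubuntu/12.04 Chromium/18.0.1025.168 Chrome/18.0.1025.168 Safari/535.19"
)


def build_request(url, user_agent=USER_AGENT):
    """Wrap the URL in a Request that carries a browser-like User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def extension_from_url(url, default="jpg"):
    """Take the extension from the URL path only, ignoring any query string."""
    path = urllib.parse.urlparse(url).path
    ext = os.path.splitext(path)[1].lstrip(".")
    return ext or default


def fetch_image(url, timeout=10):
    """Return the image bytes, or None if an HTTP or URL error occurs."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        # HTTPError carries the status code, e.g. 403.
        print("error: HTTP {0} for {1}".format(err.code, url))
    except urllib.error.URLError as err:
        print("error: cannot reach {0}: {1}".format(url, err.reason))
    return None
```

Inside the loop, `raw_data = fetch_image(sFinalImageURL)` followed by `continue` when the result is `None` would stand in for the `urlopen`/`read`/`close` sequence, and `extension_from_url(sFinalImageURL)` for the `split('.')` call.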
