Python requests scrape image returns src in format "data:image/"

Question

I'm trying to scrape the first image from a Google Images search result, because I don't want to do it manually for 100 keywords.

Using this code:

from bs4 import BeautifulSoup
import requests
import json


query="koko"
url = "https://www.google.com/search?q=" + str(query) + "&source=lnms&tbm=isch"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"}


html = requests.get(url, headers=headers).text

soup = BeautifulSoup(html, 'html.parser')
images = soup.findAll("img")

images[0] is <img alt="Koko, the gorilla who knew sign language, dies at 46 - Chicago Tribune" class="rg_i Q4LuWd" data-deferred="1" data-iid="0" height="157" jsname="Q4LuWd" src="data:image/gif;base64,R0lGODlhAQABAIAAAP//////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" width="200"/>

The src returned has this format, which I believe is base64. I don't want that; I want a normal image link.
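
For reference, a src that starts with data: is an inline data URI (the image bytes are embedded in the attribute itself), not a link to an image file. The sketch below shows one way to tell the two kinds of src apart. The helper name describe_src and the sample values are hypothetical and do not come from the page being scraped.

```python
from typing import Tuple


def describe_src(src: str) -> Tuple[str, str]:
    """Classify an img src as an inline data URI or an ordinary link.

    Returns ("inline", media_type) for data URIs, otherwise ("link", src).
    """
    if src.startswith("data:"):
        # Layout of a data URI: data:<media type>[;base64],<payload>
        header, _, _payload = src[len("data:"):].partition(",")
        media_type = header.split(";")[0]
        return ("inline", media_type)
    return ("link", src)


# Made-up sample values, for illustration only.
print(describe_src("data:image/gif;base64,AAAA"))
print(describe_src("https://example.com/a.png"))
```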

If I disable JavaScript in Chrome, navigate to https://www.google.com/search?q=koko&source=lnms&tbm=isch, and view the page source, the src of each returned img is in the normal format that I need.

I can't get requests to return the same HTML that Chrome shows with JavaScript disabled.

I tried changing my User-Agent, including matching the one my Chrome sends, but the result didn't change.

Recommended answer

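To get all images, set the content-type header in place of the browser User-Agent. Most likely what matters is that the browser User-Agent is no longer sent, so Google returns its basic non-JavaScript page, whose img tags carry normal links: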

from bs4 import BeautifulSoup
import requests


query = "koko"
url = "https://www.google.com/search?q=" + str(query) + "&source=lnms&tbm=isch"

HEADERS = {"content-type": "image/png"}

html = requests.get(url, headers=HEADERS).text

soup = BeautifulSoup(html, "html.parser")

for img in soup.find_all("img"):
    print(img["src"])

Output:

/images/branding/searchlogo/1x/googlelogo_desk_heirloom_color_150x55dp.gif
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSYjINUgXtYyrUB4fKyaVxXCAkSyc_Q5b0QaeohUxmjdiIQwS_9CPXgWCXrUGQ&s
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcR1UnMwOo_8tpFkm04yby_I0HdMbfh6-GnhVWnKhOF1qnSP4ogODEn3AAo7V0M&s
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSHwKA_l2i20z0yeGMr_imQcB-tffAfL0xcQAKmbFn1-NtVrHn8AtTv9aql2Q&s
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcR6UNEOYT2BwMVrjXo8WW6CS0rUHC0QLIqA-GdO1CLGk7mxw8lhWgMyI-uW4A&s
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcT0dQsIKidzCcvdpvL0FDIfZ4Q3WL8GUKCCbwnK4V7FJ6nCGDVNbFmhnD7eOJ8&s
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSeeYoW5maZW69VamkrN_vzjQoxIQl-RFrcZK58rCry1ZDpyIT6FVaG1IFsKw&s
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcS3wy4vKh6ey8SAZHRxe-sKa1LEiBBdk6cbjELSGkoQn1YINb_YZSRanpOzR38&s
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSnw0tBokCloEzt0QDpnTVvJYJr1ZDngx7Znz6nLCbjZbq2Vn3g57iEUKordQ&s
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTZsq7Dy3-bT8miOPD_GE8_1X3isDl67A1ucNauliVlV4dIWgqleLY1OFyLjw&s
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQsjLjEmJ2kFrdoiU1O0CE_d2bazVxl4IPaHJy2Ea_PhI-B0_4jXcDcuLo2PQ&s
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRlvVOp05edZGkjz6q3QN8vqPsC-h-lIRlFyU16wYefNRG3zVlFQ2XeJRH3mMU&s
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcT1QKRBEW1WOZs-bS15vTjzYutHLYNIis6Ji60bcJ_mXvA1tYjYYrD-Nk9cWMc&s
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQiqRS7ry4rNx8VNA4F6TUmm_ZaTtcp4iXokZF_WT-M7zEkF9YG7PpWKpPhSg&s
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSNtY7c7Qg-w9wXmKfhSHrop5b4tb2wCQoK5pLj_RA1eCPXAn4TNNtEVA8RG_U&s
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTH8zHuDssfuFW0PUpqNnQoG0yTkebQ194uy7auEzzodGuSAYqsF8flYTW3VAE&s
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSDowATaKwsMkiN1aQj9e6J2VfMUm6742KW3ifxqddk4UHWSX-WOWDeTDSi_w&s
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTaCZKWiYg2tEUNerLa1zcmUD25-ZVC0RCDY1E1iby3PnHIJOY7cFhTZd8Em8M&s
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTfg8euHcq0wcUrtIHxleulXlTzbuehiZBb1DgJTEs3GdiG5l5bTdRt0Ug-Qg&s
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRPbAOCCA3diC-W5CtqbmpegeWPw-ReQPxBDaHN2YPH6OIqWC16dj5uNbhXhw&s
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSnrICqNqL_KG42rZ2_B7nKdZr-INrqsdZqfzeAbFrJYsBez0GDvKtIrwJjP5U&s
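
Since the question asks for only the first image of each of 100 keywords, the two pieces still missing are a safely encoded search URL per keyword and a way to skip entries that are not thumbnails, such as the relative logo path at the top of the output. A minimal sketch of both follows. The helper names build_search_url and first_thumbnail are hypothetical, and the sample list is made up to resemble the output above.

```python
from typing import Iterable, Optional
from urllib.parse import urlencode


def build_search_url(keyword: str) -> str:
    """Build the image-search URL used above, URL-encoding the keyword."""
    params = {"q": keyword, "source": "lnms", "tbm": "isch"}
    return "https://www.google.com/search?" + urlencode(params)


def first_thumbnail(srcs: Iterable[str]) -> Optional[str]:
    """Return the first src that is an absolute http(s) URL, or None.

    This skips relative paths (the logo) and inline data: URIs.
    """
    for src in srcs:
        if src.startswith(("http://", "https://")):
            return src
    return None


# Made-up values shaped like the output above, for illustration only.
sample = [
    "/images/branding/searchlogo/1x/googlelogo_desk_heirloom_color_150x55dp.gif",
    "data:image/gif;base64,AAAA",
    "https://encrypted-tbn0.gstatic.com/images?q=tbn:EXAMPLE&s",
]
print(build_search_url("koko gorilla"))
print(first_thumbnail(sample))
```

With the answer's code, the src values could be passed in as first_thumbnail(img.get("src", "") for img in soup.find_all("img")), calling build_search_url once per keyword.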
