Scraping Google Images using Selenium in Python

Problem description

I have been trying to scrape Google Images using the following code:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import os
import time
import requests
import re
import urllib2
from threading import Thread
import json
# Assuming I have a folder named Pictures1, the images are downloaded there.
def threaded_func(url, i):
    raw_img = urllib2.urlopen(url).read()
    cntr = len([i for i in os.listdir("Pictures1") if image_type in i]) + 1
    f = open("Pictures1/" + image_type + "_" + str(total), 'wb')
    f.write(raw_img)
    f.close()
driver = webdriver.Firefox()
driver.get("https://images.google.com/")
elem = driver.find_element_by_xpath('/html/body/div/div[3]/div[3]/form/div[2]/div[2]/div[1]/div[1]/div[3]/div/div/div[2]/div/input[1]')
elem.clear()
elem.send_keys("parrot")
elem.send_keys(Keys.RETURN)
image_type = "parrot_defG"
images = []
total = 0
time.sleep(10)
for a in driver.find_elements_by_class_name('rg_meta'):
    link = json.loads(a.text)["ou"]
    thread = Thread(target=threaded_func, args=(link, total))
    thread.start()
    thread.join()
    total += 1

I opened the Google image results page with Selenium and noticed that every result has a div with class 'rg_meta' whose content is a JSON object.

I tried to read that content using .text. The 'ou' key of the JSON holds the source URL of the image I want to download. I am trying to collect every div with class 'rg_meta' and download the images, but the code fails with the error "No JSON object could be decoded" and I have no idea what to do.
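One likely cause, assuming the 'rg_meta' divs are not displayed on the page: Selenium's .text returns only visibly rendered text, so it yields an empty string for a hidden element, and an empty string is not valid JSON. The sketch below reproduces that failure mode without a browser. It is written for Python 3, where the error is json.JSONDecodeError (a subclass of ValueError); the helper name extract_ou and the shortened payload are illustrative, not part of the original code.

```python
import json

# Shortened copy of the JSON found inside one rg_meta div (from the question).
payload = (
    '{"ity":"jpg","oh":600,'
    '"ou":"http://media.web.britannica.com/eb-media/89/89689-004-4C85E0F0.jpg",'
    '"ow":380,"s":"grain weevil"}'
)


def extract_ou(raw):
    """Return the "ou" (original image URL) field, or None if raw is unusable."""
    try:
        data = json.loads(raw)
    except ValueError:  # json.JSONDecodeError subclasses ValueError
        return None
    # Guard against valid JSON that is not an object (for example a list).
    return data.get("ou") if isinstance(data, dict) else None


print(extract_ou(payload))  # the image URL from the payload
print(extract_ou(""))       # None: what an empty .text would produce
```

Returning None instead of raising lets a scraping loop skip unusable divs rather than stop at the first one.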

EDIT: This is what I am talking about:

    <div class="rg_meta">{"cl":3,"id":"FqCGaup9noXlMM:","isu":"kids.britannica.com","itg":false,"ity":"jpg","oh":600,"ou":"http://media.web.britannica.com/eb-media/89/89689-004-4C85E0F0.jpg","ow":380,"pt":"grain weevil -- Kids Encyclopedia | Children\u0026#39;s Homework Help ...","rid":"EusB0pk_sLg7vM","ru":"http://kids.britannica.com/comptons/art-143712/grain-or-granary-weevil","s":"grain weevil","sc":1,"st":"Kids Britannica","th":282,"tu":"https://encrypted-tbn2.gstatic.com/images?q\u003dtbn:ANd9GcQPbgXbRVzOicvPfBRtAkLOpJwy_wDQEC6a2q0BuTsUx-s0-h4b","tw":179}</div>

Look at the "ou" key of the JSON. Please help me extract it.

Forgive me for my ignorance.

This is how I solved it, with the following update:

    for a in driver.find_elements_by_xpath('//div[@class="rg_meta"]'):
        atext = a.get_attribute('innerHTML')
        link =json.loads(atext)["ou"]
        print link
        thread = Thread(target = threaded_func, args = (link,total))
        thread.start()
        thread.join()
        total+=1
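An alternative that avoids querying each element is to parse the page's HTML once, for example the string returned by driver.page_source. The following is a minimal sketch in Python 3 using only the standard library; it assumes the page source contains the 'rg_meta' divs in the form shown in the question. The class name RgMetaParser and the example.com URLs are placeholders for illustration.

```python
import json
from html.parser import HTMLParser


class RgMetaParser(HTMLParser):
    """Collect the "ou" field from every <div class="rg_meta"> in an HTML string."""

    def __init__(self):
        super().__init__()
        self.urls = []
        self._buffer = None  # list of text chunks while inside an rg_meta div

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "rg_meta" in classes:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "div" and self._buffer is not None:
            try:
                self.urls.append(json.loads("".join(self._buffer))["ou"])
            except (ValueError, KeyError, TypeError):
                pass  # skip divs whose content is not the expected JSON object
            self._buffer = None


# Stand-in for driver.page_source; the URLs are placeholders.
sample = (
    '<div class="rg_meta">{"ou":"http://example.com/a.jpg","ow":380}</div>'
    '<div class="other">ignored</div>'
    '<div class="rg_meta">{"ou":"http://example.com/b.jpg","ow":200}</div>'
)
parser = RgMetaParser()
parser.feed(sample)
parser.close()
print(parser.urls)
```

Because the parser works on raw markup, it is not affected by whether the divs are displayed.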

Solution

Replacing:

driver.find_elements_by_class_name('rg_meta') with driver.find_elements_by_xpath('//div[@class="rg_meta"]')

and a.text with a.get_attribute('innerHTML')

will resolve your issue. Selenium's .text returns only the text that is visibly rendered, so it comes back empty for a div that is not displayed (which appears to be the case for 'rg_meta'), and json.loads then has nothing to parse. Note that the XPath has to select the div elements themselves: Selenium rejects an expression ending in /text() because it selects text nodes rather than elements, and find_element_by_xpath (singular) returns a single element that cannot be looped over.

The resultant code:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
import urllib2
from threading import Thread
import json
# Assuming a folder named Pictures1 exists; the images are downloaded there.
def threaded_func(url, i):
    raw_img = urllib2.urlopen(url).read()
    # Name each file after the image type and its running index.
    f = open("Pictures1/" + image_type + "_" + str(i), 'wb')
    f.write(raw_img)
    f.close()
driver = webdriver.Firefox()
driver.get("https://images.google.com/")
elem = driver.find_element_by_xpath('/html/body/div/div[3]/div[3]/form/div[2]/div[2]/div[1]/div[1]/div[3]/div/div/div[2]/div/input[1]')
elem.clear()
elem.send_keys("parrot")
elem.send_keys(Keys.RETURN)
image_type = "parrot_defG"
total = 0
time.sleep(10)
for a in driver.find_elements_by_xpath('//div[@class="rg_meta"]'):
    # innerHTML can be read even when the div is not displayed; .text cannot.
    link = json.loads(a.get_attribute('innerHTML'))["ou"]
    print link
    thread = Thread(target=threaded_func, args=(link, total))
    thread.start()
    thread.join()
    total += 1

Printing link results in:

http://media.web.britannica.com/eb-media/89/89689-004-4C85E0F0.jpg
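In the code above, thread.join() is called right after thread.start(), so each download finishes before the next one begins and the threads add no concurrency. If parallel downloads are wanted, a thread pool is one option. This is a minimal sketch in Python 3, not the original author's method; download_all, fetch_bytes, and fake_fetch are hypothetical names, and the demonstration uses a stand-in fetch function so it runs without network access.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch_bytes(url):
    """Download a URL and return its body as bytes."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def download_all(urls, folder, prefix, fetch=fetch_bytes, workers=4):
    """Save each URL as <folder>/<prefix>_<index> and return the saved paths."""
    os.makedirs(folder, exist_ok=True)

    def save(indexed_url):
        index, url = indexed_url
        path = os.path.join(folder, "{}_{}".format(prefix, index))
        with open(path, "wb") as handle:
            handle.write(fetch(url))
        return path

    # pool.map runs save() on several threads and keeps the input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(save, enumerate(urls)))


def fake_fetch(url):
    """Offline stand-in for fetch_bytes: returns the URL itself as bytes."""
    return url.encode("utf-8")


out_dir = os.path.join(tempfile.mkdtemp(), "Pictures1")
saved = download_all(
    ["http://example.com/a.jpg", "http://example.com/b.jpg"],
    out_dir,
    "parrot_defG",
    fetch=fake_fetch,
)
print(saved)
```

Passing the index into the worker, rather than reading a shared counter, keeps the file names stable when several downloads run at once.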
