Node Jsdom Scrape Google's Reverse Image Search


Question

I want to programmatically find a list of URLs for similar images, given an image URL. I can't find any free image search APIs, so I'm trying to do this by scraping Google's Search by Image.

If I have an image URL, say http://i.imgur.com/oLmwq.png, then navigating to https://www.google.com/searchbyimage?&image_url=http://i.imgur.com/oLmwq.png gives related images and info.
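Because the image URL is itself passed as a query-string value, it is probably safer to percent-encode it when building the search URL. A minimal sketch in plain JavaScript (rather than CoffeeScript, so it runs in Node without a compile step): the helper name `buildSearchByImageUrl` is hypothetical, the endpoint and the `image_url` parameter are simply the ones from the URL above, and dropping the stray `&` after `?` is assumed to be harmless.

```javascript
// Hypothetical helper: builds the Search by Image URL for a given image URL.
// The endpoint and the image_url parameter are taken from the URL in the question.
function buildSearchByImageUrl(imageUrl) {
  const base = 'https://www.google.com/searchbyimage';
  // encodeURIComponent escapes ':', '/', '?', '&' and '=' so the image URL
  // cannot be confused with the outer query string.
  return base + '?image_url=' + encodeURIComponent(imageUrl);
}

console.log(buildSearchByImageUrl('http://i.imgur.com/oLmwq.png'));
```

Encoding matters most when the image URL carries its own query string, since an unescaped `&` would otherwise be read as the start of a new parameter.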

How do I get jsdom.env to produce the HTML that a browser gets from the above URL?

Here is what I tried (CoffeeScript):

jsdom = require 'jsdom'
url = 'https://www.google.com/searchbyimage?&image_url=http://i.imgur.com/oLmwq.png'
jsdom.env
    html: url
    scripts: [ "http://code.jquery.com/jquery.js" ]
    features:
        FetchExternalResources: ['script']
        ProcessExternalResources: ['script']
    done: (errors, window) ->
        console.log window.$('body').html()

The HTML this prints doesn't match what the browser shows. Is this an issue with Jsdom's HTTP headers?

Recommended answer

The issue is Jsdom's User-Agent HTTP header. Once that is set to a browser-like value, everything (almost) works:

jsdom = require 'jsdom'
url = 'https://www.google.com/searchbyimage?&image_url=http://i.imgur.com/oLmwq.png'
jsdom.env
    html: url
    headers:
        'User-Agent': 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11'
    scripts: [ "http://code.jquery.com/jquery.js" ]
    features:
        FetchExternalResources: ['script']
        ProcessExternalResources: ['script']

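    done: (errors, window) ->
        $ = window.$
        # Each similar image sits inside a link under #iur.
        $('#iur img').parent().each (index, elem) ->
            href = $(elem).attr 'href'
            # Take the value of the first query parameter of the link.
            # Named imageUrl so the outer `url` variable is not overwritten.
            imageUrl = href.split('?')[1].split('&')[0].split('=')[1]
            console.log imageUrl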
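
The chained `split` calls in the callback take the value of the first query-string parameter of each link, without decoding it. A slightly more defensive sketch of that one step, in plain JavaScript: the helper name `firstQueryValue` and the sample `href` are made up for illustration, and the links Google actually returns may look different.

```javascript
// Hypothetical helper: returns the decoded value of the first query-string
// parameter of a link, or null if the link has no query string.
function firstQueryValue(href) {
  // A base is supplied so that relative links such as "/imgres?..." parse too.
  const parsed = new URL(href, 'https://www.google.com');
  // searchParams yields [name, value] pairs in order, already percent-decoded.
  for (const [, value] of parsed.searchParams) {
    return value;
  }
  return null;
}

// Made-up href, only to show the shape of the input.
const sample = '/imgres?imgurl=http%3A%2F%2Fexample.com%2Fa.png&h=100';
console.log(firstQueryValue(sample));
```

Unlike the `split` chain, this returns `null` instead of throwing when a link has no query string, and it keeps any `=` characters that appear inside the value.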

This gives us a nice list of visually similar images. The only remaining problem is that Jsdom throws an error after returning the results:

timers.js:103
            if (!process.listeners('uncaughtException').length) throw e;
                                                                      ^
TypeError: Cannot call method 'call' of undefined
    at new <anonymous> (/project-root/node_modules/jsdom/lib/jsdom/browser/index.js:54:13)
    at _.Zl (https://www.google.com/xjs/_/js/s/c,sb,cr,cdos,jsa,ssb,sf,tbpr,tbui,rsn,qi,ob,mb,lc,hv,cfm,klc,kat,aut,esp,bihu,amcl,kp,lu,m,rtis,shb,sfa,hsm,pcc,csi/rt=j/ver=3w99aWPP0po.en_US./d=1/sv=1/rs=AItRSTPrAylXrfkOPyRRY-YioThBMqxW2A:1238:93)
    at _.jm (https://www.google.com/xjs/_/js/s/c,sb,cr,cdos,jsa,ssb,sf,tbpr,tbui,rsn,qi,ob,mb,lc,hv,cfm,klc,kat,aut,esp,bihu,amcl,kp,lu,m,rtis,shb,sfa,hsm,pcc,csi/rt=j/ver=3w99aWPP0po.en_US./d=1/sv=1/rs=AItRSTPrAylXrfkOPyRRY-YioThBMqxW2A:1239:399)
    at _.km (https://www.google.com/xjs/_/js/s/c,sb,cr,cdos,jsa,ssb,sf,tbpr,tbui,rsn,qi,ob,mb,lc,hv,cfm,klc,kat,aut,esp,bihu,amcl,kp,lu,m,rtis,shb,sfa,hsm,pcc,csi/rt=j/ver=3w99aWPP0po.en_US./d=1/sv=1/rs=AItRSTPrAylXrfkOPyRRY-YioThBMqxW2A:1241:146)
    at Object._onTimeout (https://www.google.com/xjs/_/js/s/c,sb,cr,cdos,jsa,ssb,sf,tbpr,tbui,rsn,qi,ob,mb,lc,hv,cfm,klc,kat,aut,esp,bihu,amcl,kp,lu,m,rtis,shb,sfa,hsm,pcc,csi/rt=j/ver=3w99aWPP0po.en_US./d=1/sv=1/rs=AItRSTPrAylXrfkOPyRRY-YioThBMqxW2A:1248:727)
    at Timer.list.ontimeout (timers.js:101:19)
