Google Search Web Scraping with Python

Question

I've been learning a lot of Python lately to use on some projects at work.

Currently I need to do some web scraping of Google search results. I found several sites that demonstrated how to search with the Google AJAX API; however, after trying it, it appears to no longer be supported. Any suggestions?

I've been searching for quite a while but can't seem to find any solution that currently works.

Recommended answer

You can always scrape Google results directly. To do this, request the URL https://google.com/search?q=<Query>, which returns the top 10 search results.
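As a minimal sketch of building that URL, the helper below URL-encodes the query with the standard library so that spaces and special characters survive. The name `build_search_url` is hypothetical and used only for illustration.

```python
from urllib.parse import urlencode

# Base search endpoint taken from the answer above.
SEARCH_URL = "https://google.com/search"


def build_search_url(query: str) -> str:
    """Return the search URL for `query` with the query string URL-encoded."""
    return f"{SEARCH_URL}?{urlencode({'q': query})}"


print(build_search_url("python web scraping"))
```

Encoding the query this way avoids hand-escaping characters such as `&` or spaces in the search terms.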

Then you can use lxml, for example, to parse the page. Depending on what you prefer, you can query the resulting node tree either with a CSS selector (.r a) or with an XPath selector (//h3[@class="r"]/a).

In some cases the resulting URL redirects through Google. Such a URL usually carries a query parameter q that contains the actual target URL.
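The redirect handling described above can be sketched on its own with the standard library. The helper name `unwrap_redirect` and the sample links are made up for illustration; the sketch assumes redirect links start with `/url?`, as in the answer's own code.

```python
from urllib.parse import parse_qs, urlparse


def unwrap_redirect(href: str) -> str:
    """Return the target of a "/url?q=..." link, or href unchanged."""
    if not href.startswith("/url?"):
        return href
    params = parse_qs(urlparse(href).query)
    # parse_qs maps each parameter to a list of values; take the first "q".
    return params.get("q", [href])[0]


# Illustrative inputs only, not real search results.
print(unwrap_redirect("/url?q=https://stackoverflow.com/&sa=U"))
print(unwrap_redirect("https://example.com/page"))
```

Falling back to the original link when `q` is missing keeps the helper from raising on links it does not recognise.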

Example code using lxml and requests:

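from urllib.parse import urlparse, parse_qs

from lxml.html import fromstring
from requests import get

# Fetch the result page for the query "StackOverflow".
raw = get("https://www.google.com/search?q=StackOverflow").text
page = fromstring(raw)

# Note: cssselect() needs the separate "cssselect" package installed.
for result in page.cssselect(".r a"):
    url = result.get("href")
    if url is None:
        continue
    if url.startswith("/url?"):
        # Redirect link: the real target is in the "q" query parameter.
        # parse_qs returns a list per parameter, so take the first value.
        url = parse_qs(urlparse(url).query).get("q", [url])[0]
    print(url)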

A note on Google banning your IP: in my experience, Google only bans you if you start spamming it with search requests. If Google thinks you are a bot, it responds with a 503.
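One common way to react to such a 503 is to slow down and retry with growing delays. The sketch below is an assumption about how that might look, not something the answer prescribes: `fetch_with_backoff`, its parameters, and the injected `fetch` callable are all hypothetical names chosen so the example runs without network access.

```python
import time
from typing import Callable, Optional, Tuple


def fetch_with_backoff(
    fetch: Callable[[], Tuple[int, str]],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call fetch() until it stops returning 503 or attempts run out."""
    for attempt in range(max_attempts):
        status, body = fetch()
        if status != 503:
            return body
        # Wait longer after each 503: base_delay, then 2x, 4x, ...
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return None


# Demo with a fake fetch that fails twice, then succeeds (no network, no waiting).
responses = iter([(503, ""), (503, ""), (200, "<html>ok</html>")])
print(fetch_with_backoff(lambda: next(responses), sleep=lambda seconds: None))
```

With requests, `fetch` could be a small function that calls `get(...)` and returns `(response.status_code, response.text)`. Backing off does not guarantee the block is lifted; it only avoids adding more requests while Google is refusing them.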
