Web scraping Google News with Python


Problem Description

I am creating a web scraper for different news outlets. For the NYTimes and the Guardian it was easy since they have their own APIs.

Now, I want to scrape results from the newspaper GulfTimes.com. They do not provide an advanced search on their website, so I resorted to Google News. However, the Google News API has been deprecated. What I want is to retrieve the number of results from an advanced search, e.g. keyword = "Egypt", begin_date = "10/02/2011" and end_date = "10/05/2011".

This is feasible in the Google News UI just by setting the source to "Gulf Times" with the corresponding query and dates and counting the number of results manually, but when I try to do this using Python I get a 403 error, which is understandable.

Any idea how I would do this? Or is there another service besides Google News that would allow me to do this? Keep in mind that I would issue almost 500 requests at once.

import json
import urllib2
import cookielib
import re
from bs4 import BeautifulSoup


def run():
    Query = "Egypt"
    Month = "3"
    FromDay = "2"
    ToDay = "4"
    Year = "13"
    url = 'https://www.google.com/search?pz=1&cf=all&ned=us&hl=en&tbm=nws&gl=us&as_q=' + Query + '&as_occt=any&as_drrb=b&as_mindate=' + Month + '%2F' + FromDay + '%2F' + Year + '&as_maxdate=' + Month + '%2F' + ToDay + '%2F' + Year + '&tbs=cdr%3A1%2Ccd_min%3A3%2F1%2F13%2Ccd_max%3A3%2F2%2F13&as_nsrc=Gulf%20Times&authuser=0'
    cj = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    request = urllib2.Request(url)
    response = opener.open(request)  # this call fails with HTTP Error 403
    htmlFile = BeautifulSoup(response)
    print htmlFile


run()

Recommended Answer

You can use the awesome requests library:

import requests

URL = 'https://www.google.com/search?pz=1&cf=all&ned=us&hl=en&tbm=nws&gl=us&as_q={query}&as_occt=any&as_drrb=b&as_mindate={month}%2F{from_day}%2F{year}&as_maxdate={month}%2F{to_day}%2F{year}&tbs=cdr%3A1%2Ccd_min%3A3%2F1%2F13%2Ccd_max%3A3%2F2%2F13&as_nsrc=Gulf%20Times&authuser=0'


def run(**params):
    response = requests.get(URL.format(**params))
    print response.content, response.status_code


run(query="Egypt", month=3, from_day=2, to_day=2, year=13)

And you'll get status_code=200.
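
If you also need the result count the question asks for, one option is to parse the returned HTML with BeautifulSoup. Below is a minimal sketch reusing the URL template above; note that the resultStats element id is only an assumption about Google's markup at the time and may have changed, so inspect the actual page source before relying on it.

import re

import requests
from bs4 import BeautifulSoup


# assumes the URL template defined in the snippet above
def count_results(**params):
    response = requests.get(URL.format(**params))
    soup = BeautifulSoup(response.content, 'html.parser')
    # 'resultStats' is assumed to be the id of the "About N results" element
    stats = soup.find(id='resultStats')
    if stats is None:
        return 0
    # pull the first number out of e.g. "About 42 results"
    match = re.search(r'[\d,]+', stats.get_text())
    return int(match.group(0).replace(',', '')) if match else 0


print(count_results(query="Egypt", month=3, from_day=2, to_day=4, year=13))

Given the nearly 500 requests mentioned in the question, it would also be wise to add a delay between calls so Google does not start blocking them.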

And, by the way, take a look at the Scrapy project. Nothing makes web scraping simpler than this tool.
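
For illustration, a minimal spider using a reasonably recent Scrapy version might look like the sketch below; the spider name, the hard-coded start URL and the #resultStats selector are assumptions made for the example, not part of the original answer.

import scrapy


class GoogleNewsCountSpider(scrapy.Spider):
    # Hypothetical spider: fetch one Google News search page and
    # yield whatever the result-stats element contains.
    name = 'google_news_count'
    start_urls = [
        'https://www.google.com/search?tbm=nws&as_q=Egypt&as_nsrc=Gulf%20Times',
    ]

    def parse(self, response):
        # '#resultStats' is an assumption about Google's markup
        stats = response.css('#resultStats::text').extract_first()
        yield {'result_stats': stats}

Saved as google_news_count.py, it can be run with scrapy runspider google_news_count.py -o counts.json; Scrapy then gives you concurrency limits, retries and download delays out of the box, which matters when issuing hundreds of requests.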
