Web scraping Google News with Python
Problem description
I am creating a web scraper for different news outlets. For the NYTimes and the Guardian it was easy, since they have their own APIs.
Now I want to scrape results from the newspaper GulfTimes.com. They do not provide an advanced search on their website, so I resorted to Google News. However, the Google News API has been deprecated. What I want is to retrieve the number of results for an advanced search such as keyword = "Egypt", begin_date = "10/02/2011", end_date = "10/05/2011".
This is feasible in the Google News UI just by setting the source to "Gulf Times" with the corresponding query and dates, and manually counting the number of results. But when I try to do this with Python, I get a 403 error, which is understandable.
Any idea how I could do this? Or is there another service besides Google News that would allow me to do it? Keep in mind that I would issue almost 500 requests at once.
import urllib2
import cookielib
from bs4 import BeautifulSoup

def run():
    Query = "Egypt"
    Month = "3"
    FromDay = "2"
    ToDay = "4"
    Year = "13"
    # Google News advanced search, restricted to the source "Gulf Times"
    url = ('https://www.google.com/search?pz=1&cf=all&ned=us&hl=en&tbm=nws&gl=us'
           '&as_q=' + Query + '&as_occt=any&as_drrb=b'
           '&as_mindate=' + Month + '%2F' + FromDay + '%2F' + Year +
           '&as_maxdate=' + Month + '%2F' + ToDay + '%2F' + Year +
           '&tbs=cdr%3A1%2Ccd_min%3A3%2F1%2F13%2Ccd_max%3A3%2F2%2F13'
           '&as_nsrc=Gulf%20Times&authuser=0')
    cj = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    request = urllib2.Request(url)
    response = opener.open(request)  # this is where Google answers with 403
    htmlFile = BeautifulSoup(response)
    print htmlFile

run()
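(Aside: the hand-encoded `%2F` date separators in a URL built by string concatenation are easy to get wrong. For reference, the same query string can be assembled with the standard library's `urlencode`, which handles the escaping; a minimal Python 3 sketch, with parameter names mirroring the URL above:)

```python
from urllib.parse import urlencode

def build_news_url(query, month, from_day, to_day, year, source="Gulf Times"):
    """Assemble the Google News advanced-search URL; urlencode escapes
    the date slashes and the space in the source name automatically."""
    params = {
        "pz": 1, "cf": "all", "ned": "us", "hl": "en", "tbm": "nws", "gl": "us",
        "as_q": query,
        "as_occt": "any",
        "as_drrb": "b",
        "as_mindate": "%s/%s/%s" % (month, from_day, year),
        "as_maxdate": "%s/%s/%s" % (month, to_day, year),
        "as_nsrc": source,
        "authuser": 0,
    }
    return "https://www.google.com/search?" + urlencode(params)

url = build_news_url("Egypt", 3, 2, 4, 13)
```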
Recommended answer
You can use the excellent requests library:
import requests

URL = ('https://www.google.com/search?pz=1&cf=all&ned=us&hl=en&tbm=nws&gl=us'
       '&as_q={query}&as_occt=any&as_drrb=b'
       '&as_mindate={month}%2F{from_day}%2F{year}'
       '&as_maxdate={month}%2F{to_day}%2F{year}'
       '&tbs=cdr%3A1%2Ccd_min%3A3%2F1%2F13%2Ccd_max%3A3%2F2%2F13'
       '&as_nsrc=Gulf%20Times&authuser=0')

def run(**params):
    response = requests.get(URL.format(**params))
    print response.content, response.status_code

run(query="Egypt", month=3, from_day=2, to_day=2, year=13)
And you'll get status_code=200.
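Since the actual goal is the number of results, the count still has to be pulled out of the returned HTML. A minimal sketch with BeautifulSoup follows; the `resultStats` id is an assumption about Google's markup at the time (it changes often), so inspect the live page before relying on the selector:

```python
import re
from bs4 import BeautifulSoup

def count_results(html):
    """Extract the 'About N results' figure; returns None if the element is absent."""
    soup = BeautifulSoup(html, "html.parser")
    stats = soup.find(id="resultStats")  # hypothetical id; check the real page
    if stats is None:
        return None
    match = re.search(r"([\d,]+)", stats.get_text())
    return int(match.group(1).replace(",", "")) if match else None

# Works the same on response.content from the snippet above:
sample = '<div id="resultStats">About 1,230 results</div>'
print(count_results(sample))  # 1230
```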
And by the way, take a look at the Scrapy project. Nothing makes web scraping simpler than this tool.
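One more caution: the question mentions issuing almost 500 requests at once, and that burst pattern is exactly what gets a client blocked. Spacing the requests out helps (Scrapy has a DOWNLOAD_DELAY setting for precisely this). A minimal sketch with a pluggable fetch function, shown here with a stub so it stays self-contained; in real use `fetch` would be `requests.get`:

```python
import time

def fetch_all(urls, fetch, delay=2.0):
    """Fetch URLs one at a time, sleeping between requests so hundreds of
    queries do not hit the server simultaneously."""
    results = []
    for i, url in enumerate(urls):
        if i:
            time.sleep(delay)  # fixed pause; adding random jitter is gentler still
        results.append(fetch(url))
    return results

# Stub fetcher keeps the example offline; swap in requests.get for real use.
pages = fetch_all(["u1", "u2", "u3"], fetch=lambda u: "html for " + u, delay=0.0)
```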