Wrong number of results in Google Scrape with Python


Question

I am trying to learn web scraping and have run into a strange issue. My task is to search Google for news on a topic within a certain date range and count the number of results.

My simple code is:

import requests,  bs4

payload = {'as_epq': 'James Clark', 'tbs':'cdr:1,cd_min:1/01/2015,cd_max:1/01/2015','tbm':'nws'}    
r = requests.get("https://www.google.com/search", params=payload)

soup = bs4.BeautifulSoup(r.text)
elems = soup.select('#resultStats')
print(elems[0].getText())

The result I get is:

About 8,600 results

So apparently everything works, apart from the fact that the result is wrong. If I open the URL in Firefox (I can obtain the complete URL with r.url)

https://www.google.com/search?tbm=nws&as_epq=James+Clark&tbs=cdr%3A1%2Ccd_min%3A1%2F01%2F2015%2Ccd_max%3A1%2F01%2F2015

I see that there are actually only 2 results, and if I manually download the HTML file, open the page source and search for id="resultStats", I find that the number of results is indeed 2!

Can anybody help me understand why searching for the same id in the saved HTML file and in the soup object leads to two different numbers?

UPDATE: It seems that the problem is the custom date range, which does not get processed correctly when the request is made with requests.get. If I use the same URL with selenium, I get the correct answer:

from selenium import webdriver

url = r.url  # full URL built by the requests call above
driver = webdriver.Firefox()
driver.get(url)
content = driver.page_source
soup = bs4.BeautifulSoup(content)
elems = soup.select('#resultStats')
print(elems[0].getText())

The output is:

2 results (0.09 seconds) 

The problem is that this approach seems more cumbersome, because I need to open the page in Firefox...

Recommended answer

A couple of things are causing this issue. First, Google wants the day and month parts of the date as two digits (01/01/2015 rather than 1/01/2015). Second, it expects a user-agent string from a popular browser. The following code should work:

import requests,  bs4

headers = {
    "User-Agent":
        "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"
}
payload = {'as_epq': 'James Clark', 'tbs':'cdr:1,cd_min:01/01/2015,cd_max:01/01/2015', 'tbm':'nws'}
r = requests.get("https://www.google.com/search", params=payload, headers=headers)

# The 'html5lib' parser is a separate package (pip install html5lib)
soup = bs4.BeautifulSoup(r.content, 'html5lib')
print(soup.find(id='resultStats').text)
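
Since the fix depends on zero-padded dates, and the original task was to count results, here is a minimal sketch of two helpers that could wrap the code above. The names build_tbs and parse_result_count are hypothetical, not part of requests or bs4. The month/day/year ordering of the date is an assumption. The parser only handles text shaped like the two outputs shown in the question.

```python
import re
from datetime import date


def build_tbs(start, end):
    """Return a custom-date-range 'tbs' value with zero-padded month and day.

    Hypothetical helper; assumes the dates are written month/day/year.
    """
    fmt = "%m/%d/%Y"  # %m and %d always produce two digits
    return "cdr:1,cd_min:{},cd_max:{}".format(
        start.strftime(fmt), end.strftime(fmt)
    )


def parse_result_count(stats_text):
    """Extract the integer count from text such as 'About 8,600 results'.

    Hypothetical helper; only covers the formats shown in the question.
    """
    match = re.search(r"(\d[\d,]*)\s+results?", stats_text)
    if match is None:
        raise ValueError("no result count found in {!r}".format(stats_text))
    # Drop thousands separators before converting to int.
    return int(match.group(1).replace(",", ""))


payload = {
    "as_epq": "James Clark",
    "tbs": build_tbs(date(2015, 1, 1), date(2015, 1, 1)),
    "tbm": "nws",
}
print(payload["tbs"])  # cdr:1,cd_min:01/01/2015,cd_max:01/01/2015
print(parse_result_count("2 results (0.09 seconds)"))  # 2
```

The payload dictionary can be passed to requests.get as params, as in the answer, and parse_result_count can be applied to the text of the resultStats element.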
