Wrong number of results in Google Scrape with Python


Question

I am trying to learn web scraping and I am facing a strange issue. My task is to search Google for news on a topic within a certain date range and count the number of results.

My simple code is:

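import requests
import bs4

# Exact phrase, custom date range (cdr), news search (nws)
payload = {'as_epq': 'James Clark', 'tbs': 'cdr:1,cd_min:1/01/2015,cd_max:1/01/2015', 'tbm': 'nws'}
r = requests.get("https://www.google.com/search", params=payload)

# Name the parser explicitly to avoid the "no parser specified" warning
soup = bs4.BeautifulSoup(r.text, 'html.parser')
elems = soup.select('#resultStats')
print(elems[0].getText())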

The result I get is:

About 8,600 results

So apparently everything works, except that the result is wrong. If I open the URL in Firefox (I can obtain the complete URL with r.url)

https://www.google.com/search?tbm=nws&as_epq=James+Clark&tbs=cdr%3A1%2Ccd_min%3A1%2F01%2F2015%2Ccd_max%3A1%2F01%2F2015

I see that there are actually only 2 results, and if I manually download the HTML file, open the page source and search for id="resultStats", I find that the number of results is indeed 2!

Can anybody help me understand why searching for the same id in the saved HTML file and in the soup object leads to two different numbers?

UPDATE: It seems the problem is that the custom date range is not processed correctly when the request is made with requests.get. If I use the same URL with Selenium, I get the correct answer:

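import bs4
from selenium import webdriver

# url is the full search URL from the earlier request (r.url)
driver = webdriver.Firefox()
driver.get(url)
content = driver.page_source
soup = bs4.BeautifulSoup(content, 'html.parser')
elems = soup.select('#resultStats')
print(elems[0].getText())
driver.quit()  # close the browser window when done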

The answer is:

2 results (0.09 seconds) 

The problem is that this approach seems more cumbersome, because I need to open the page in Firefox...

Recommended answer

There are a couple of things causing this issue. First, Google expects the day and month parts of the date as two digits each. Second, it expects a User-Agent string from a popular browser. The following code should work:

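import requests
import bs4

# Present the request as coming from a common desktop browser
headers = {
    "User-Agent":
        "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"
}
# Day and month are zero-padded to two digits (01/01/2015, not 1/01/2015)
payload = {'as_epq': 'James Clark', 'tbs': 'cdr:1,cd_min:01/01/2015,cd_max:01/01/2015', 'tbm': 'nws'}
r = requests.get("https://www.google.com/search", params=payload, headers=headers)

# The 'html5lib' parser must be installed separately (pip install html5lib)
soup = bs4.BeautifulSoup(r.content, 'html5lib')
print(soup.find(id='resultStats').text)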
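
To avoid hand-typing the zero-padded dates the answer calls for, the payload could be built from date objects, and the count could be pulled out of the stats text as an integer. The following is only a minimal sketch: the helper names build_news_payload and parse_result_count are hypothetical, the month/day/year order is an assumption (the answer's 01/01/2015 example does not disambiguate it), and the parser assumes comma thousands separators as in "About 8,600 results".

```python
import re
from datetime import date


def build_news_payload(phrase, start, end):
    """Build search params for an exact phrase in a custom date range.

    Hypothetical helper. Assumes month/day/year order; strftime's %m and %d
    always zero-pad to two digits, which is what the answer says is needed.
    """
    fmt = "%m/%d/%Y"
    tbs = "cdr:1,cd_min:{},cd_max:{}".format(start.strftime(fmt), end.strftime(fmt))
    return {"as_epq": phrase, "tbs": tbs, "tbm": "nws"}


def parse_result_count(stats_text):
    """Return the first number in text like 'About 8,600 results', or None.

    Hypothetical helper. Assumes commas are used as thousands separators.
    """
    match = re.search(r"\d[\d,]*", stats_text)
    if match is None:
        return None
    return int(match.group().replace(",", ""))


payload = build_news_payload("James Clark", date(2015, 1, 1), date(2015, 1, 1))
print(payload["tbs"])
print(parse_result_count("About 8,600 results"))
```

The returned dictionary can be passed as params to requests.get together with the headers from the answer, and parse_result_count can be applied to the text of the resultStats element.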
