Scrape the snippet text from Google search page


Problem description

When we search for a question in Google, it often produces an answer in a snippet like the following:

My objective is to scrape this text ("August 4, 1961", circled in red in the screenshot) in my Python code.

Before trying to scrape the text, I stored the web response in a text file using the following code:

import requests
from bs4 import BeautifulSoup

# Fetch the results page and save the prettified HTML for inspection.
page = requests.get("https://www.google.com/search?q=when+barak+obama+born")
soup = BeautifulSoup(page.content, 'html.parser')
out_file = open("web_response.txt", "w", encoding='utf-8')
out_file.write(soup.prettify())
out_file.close()

In the Inspect Element panel, I noticed that the snippet is inside a div with class Z0LcW XcVN5d (circled in green in the screenshot). However, the response in my txt file contains no such text, let alone the class name.

I've also tried this solution where the author scraped items with id rhs_block. But my response contains no such id.

I've searched for occurrences of "August 4, 1961" in my response txt file and tried to work out whether any of them could be the snippet. But none of the occurrences seemed to be the one I was looking for.
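
A quick way to confirm this programmatically (a sketch only, run against the file saved above) is to check the raw response for the class name and for the date string:

# Sanity check on the saved response (sketch): is the snippet class or the
# answer text present in the static HTML at all?
with open("web_response.txt", encoding="utf-8") as f:
    raw = f.read()
print("Z0LcW" in raw)            # the snippet class seen in DevTools
print("August 4, 1961" in raw)   # the date may appear in other parts of the page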

My plan was to get the div id or class name of the snippet and find its content like this:

# IT'S A PSEUDO CODE
containers = soup.find_all(class_='something')   # or soup.find_all(id='something')
for tag in containers:
    print(f"tag text : {tag.text}")

Is there any way to do this?

NOTE: I'm also okay with using libraries other than BeautifulSoup and requests, as long as they can produce the result.

Solution

Selenium will produce the result you need. It's convenient because you can add explicit waits and see what is actually happening on your screen.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


# Path to the snap-installed Chromium driver; adjust it for your system.
driver = webdriver.Chrome(executable_path='/snap/bin/chromium.chromedriver')

driver.get('https://google.com/')
assert "Google" in driver.title
wait = WebDriverWait(driver, 20)

# Wait for the search box, type the query, and submit it.
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, ".gLFyf.gsfi")))
input_field = driver.find_element(By.CSS_SELECTOR, ".gLFyf.gsfi")
input_field.send_keys("how many people in the world")
input_field.send_keys(Keys.RETURN)

# Wait for the answer box to render, then read its text.
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".Z0LcW.XcVN5d")))
result = driver.find_element(By.CSS_SELECTOR, ".Z0LcW.XcVN5d").text
print(result)
driver.close()
driver.quit()

The result will probably surprise you :)

You'll need to install Selenium and ChromeDriver. On Windows, put the ChromeDriver executable on your PATH; on Linux, pass the path to it explicitly, as in the executable_path argument above. My example is for Linux.
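
If you'd rather keep the BeautifulSoup parsing from the question, you can hand Selenium's rendered HTML to it instead of calling find_element again. A minimal sketch, assuming the driver from the example above is still open on the results page and that Google still uses the .Z0LcW.XcVN5d class for the answer box:

# A sketch only: parse the JavaScript-rendered page source with BeautifulSoup.
from bs4 import BeautifulSoup

soup = BeautifulSoup(driver.page_source, 'html.parser')  # HTML after JavaScript ran
for tag in soup.select('div.Z0LcW.XcVN5d'):              # same class seen in DevTools
    print(f"tag text : {tag.text}")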
