Scrape Google Search Result Description Using BeautifulSoup

Problem description

I want to scrape the descriptions of Google search results using BeautifulSoup, but I am not able to select the tag that contains the description.

Ancestors:

html
body#gsr.srp.vasq.wf-b
div#main
div#cnt.big
div.mw
div#rcnt
div.col
div#center_col
div#res.med
div#search
div
div#rso
div.g
div.rc
div.IsZvec
div
span.aCOpRe

Children:

em

Python code:

from bs4 import BeautifulSoup
import requests
import re

search = input("Enter the search term:")
param = {"q": search}

# requests builds the query string from params, so the URL needs no "?q=".
r = requests.get("https://google.com/search", params=param)

soup = BeautifulSoup(r.content, "lxml")

title = soup.find_all("div", class_="BNeawe vvjwJb AP7Wnd")

for t in title:
    print(t.get_text())

# The description lookup this question is about.
description = soup.find_all("span", class_="aCOpRe")

for d in description:
    print(d.get_text())

print("\n")

# Raw string so that \? reaches the regex engine unchanged.
for link in soup.find_all("a", href=re.compile(r"(?<=/url\?q=)(htt.*://.*)")):
    print(re.split(":(?=http)", link["href"].replace("/url?q=", "")))

Recommended answer

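The proper CSS selector for the snippets (descriptions) of Google Search results is .aCOpRe span:not(.f). It matches the span elements nested inside .aCOpRe and skips those that carry the class f. The request should also send a browser-like User-Agent header, as in the example below; without one, Google tends to return different markup in which these class names do not appear.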
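To see what the selector does without sending a request, here is a minimal sketch that runs it against a small inline HTML fragment. The fragment is hypothetical: it only mimics the hierarchy listed in the question, and the text inside it is made up for illustration. It uses the built-in "html.parser" so that the sketch does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates the structure described in the question;
# real Google markup differs and changes over time.
html = """
<div class="g">
  <div class="rc">
    <div class="IsZvec">
      <div>
        <span class="aCOpRe"><span class="f">Jan 1, 2021 - </span><span>Coffee is a <em>brewed drink</em>.</span></span>
      </div>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Descendant spans of .aCOpRe, excluding any span with class "f".
snippets = [s.get_text() for s in soup.select(".aCOpRe span:not(.f)")]

for text in snippets:
    print(f"Snippet: {text}")
```

In this fragment only the second inner span is returned, because the first one carries the class f and is filtered out by :not(.f).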

Here is the full example, as run in an online IDE.

from bs4 import BeautifulSoup
import requests
import re

param = {"q": "coffee"}
# Browser-like User-Agent header sent with the request.
headers = {
    "User-Agent":
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.5 Safari/605.1.15"
}

r = requests.get("https://google.com/search", params=param, headers=headers)

soup = BeautifulSoup(r.content, "lxml")

# Result titles
title = soup.select(".DKV0Md span")

for t in title:
    print(f"Title: {t.get_text()}\n")

# Result snippets (descriptions)
snippets = soup.select(".aCOpRe span:not(.f)")

for d in snippets:
    print(f"Snippet: {d.get_text()}\n")

# Raw string so that \? reaches the regex engine unchanged.
for link in soup.find_all("a", href=re.compile(r"(?<=/url\?q=)(htt.*://.*)")):
    print(re.split(":(?=http)", link["href"].replace("/url?q=", "")))

Output

Title: Coffee - Wikipedia

Title: Coffee: Benefits, nutrition, and risks - Medical News Today

...

Snippet: Coffee is a brewed drink prepared from roasted coffee beans, the seeds of berries from certain Coffea species. When coffee berries turn from green to bright red in color – indicating ripeness – they are picked, processed, and dried.

Snippet: When people think of coffee, they usually think of its ability to provide an energy boost. ... This article looks at the health benefits of drinking coffee, the evidence ...

...
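The last loop in the example above pulls the destination out of hrefs of the form /url?q=... with a regular expression and string replacement. As an alternative sketch, the same step can be done with the standard library's urllib.parse, which also drops the trailing tracking parameters and decodes percent-escapes. The helper name extract_target and the sample hrefs are hypothetical and serve only as illustration.

```python
from urllib.parse import urlparse, parse_qs


def extract_target(href):
    """Return the destination URL of a '/url?q=...' href, or None otherwise."""
    parsed = urlparse(href)
    if parsed.path != "/url":
        return None
    targets = parse_qs(parsed.query).get("q")
    return targets[0] if targets else None


# Hypothetical href values shaped like the ones the regex above matches.
hrefs = [
    "/url?q=https://en.wikipedia.org/wiki/Coffee&sa=U&ved=abc",
    "/search?q=coffee&tbm=isch",
    "/url?q=https://example.com/a%3Fb%3D1&sa=U",
]

links = [target for target in map(extract_target, hrefs) if target]

for link in links:
    print(link)
```

With BeautifulSoup, the hrefs list would come from something like [a["href"] for a in soup.find_all("a", href=True)].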

Alternatively, you can extract data from Google Search via SerpApi.

curl example

curl -s 'https://serpapi.com/search?q=coffee&location=Sweden&google_domain=google.se&gl=se&hl=sv&num=100'

Python example

from serpapi import GoogleSearch
import os

params = {
    "engine": "google",
    "q": "coffee",
    "location": "Sweden",
    "google_domain": "google.se",
    "gl": "se",
    "hl": "sv",
    "num": 100,
    "api_key": os.getenv("API_KEY")
}

client = GoogleSearch(params)
data = client.get_dict()

print("Organic results")

for result in data['organic_results']:
    print(f"""
Title: {result['title']}
Link: {result['link']}
Position: {result['position']}
Snippet: {result['snippet']}
""")

Output

Organic results

Title: Coffee - Wikipedia
Link: https://en.wikipedia.org/wiki/Coffee
Position: 1
Snippet: Coffee is a brewed drink prepared from roasted coffee beans, the seeds of berries from certain Coffea species. When coffee berries turn from green to bright red ...


Title: Drop Coffee
Link: https://www.dropcoffee.com/
Position: 2
Snippet: Drop Coffee is an award winning roastery in Stockholm, representing Sweden four times in the World Coffee Roasting Championship, placing second, third and ...

...
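The loop above indexes each result with result['snippet'], which raises KeyError if a result happens to lack that field. A defensive sketch using dict.get is shown here; the data dictionary is a hypothetical stand-in shaped like the output above, not a real API response, and the helper name summarize is made up for illustration.

```python
# Hypothetical response fragment; the second result deliberately has no snippet.
data = {
    "organic_results": [
        {
            "position": 1,
            "title": "Coffee - Wikipedia",
            "link": "https://en.wikipedia.org/wiki/Coffee",
            "snippet": "Coffee is a brewed drink prepared from roasted coffee beans ...",
        },
        {
            "position": 2,
            "title": "Drop Coffee",
            "link": "https://www.dropcoffee.com/",
        },
    ]
}


def summarize(result):
    """Pick the fields of interest, falling back to defaults when one is missing."""
    return {
        "position": result.get("position"),
        "title": result.get("title", ""),
        "link": result.get("link", ""),
        "snippet": result.get("snippet", ""),
    }


rows = [summarize(r) for r in data.get("organic_results", [])]

for row in rows:
    print(f"{row['position']}. {row['title']} ({row['link']})")
    if row["snippet"]:
        print(f"   {row['snippet']}")
```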

Disclaimer: I work at SerpApi.
