Scrape Google search results titles and URLs using Python

Views: 29 | Published: 2021/9/24 18:45:11 | Tags: python html web-scraping beautifulsoup python-beautifultable


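Problem description

I'm working on a project using Python (3.7) in which I need to scrape the titles and URLs of the first few Google results. I tried using BeautifulSoup, but it doesn't work.

Here is what I tried: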

    import requests
    from my_fake_useragent import UserAgent
    from bs4 import BeautifulSoup

    ua = UserAgent()

    google_url = "https://www.google.com/search?q=python" + "&num=" + str(5)
    response = requests.get(google_url, {"User-Agent": ua.random})
    soup = BeautifulSoup(response.text, "html.parser")

    result_div = soup.find_all('div', attrs={'class': 'g'})

    links = []
    titles = []
    descriptions = []
    for r in result_div:
        # Checks if each element is present, else, raise exception
        try:
            link = r.find('a', href=True)
            title = r.find('h3', attrs={'class': 'r'}).get_text()
            description = r.find('span', attrs={'class': 'st'}).get_text()

            # Check to make sure everything is present before appending
            if link != '' and title != '' and description != '':
                links.append(link['href'])
                titles.append(title)
                descriptions.append(description)
        # Next loop if one element is not present
        except:
            continue

    print(titles)

But it returns nothing.
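One detail in the snippet above may contribute to the empty result: the second positional argument of `requests.get` is `params`, so the dictionary is sent as a query-string parameter rather than as an HTTP header, and the request goes out with the default User-Agent. The sketch below uses only the standard library (`urllib`, swapped in here so that it runs without extra packages) to make the difference between the query string and the headers visible. `build_search_request` and the agent string are hypothetical names, and nothing is sent over the network.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_search_request(query, num, user_agent):
    """Build (but do not send) a search request with an explicit User-Agent header."""
    # Query parameters belong in the URL...
    url = "https://www.google.com/search?" + urlencode({"q": query, "num": num})
    # ...while the User-Agent belongs in the headers.
    return Request(url, headers={"User-Agent": user_agent})


req = build_search_request("python", 5, "example-agent")
print(req.full_url)                  # the URL carries only q and num
print(req.get_header("User-agent"))  # urllib stores header names capitalized
```

With requests, the equivalent is to pass the dictionary by keyword, as `headers={"User-Agent": ...}`.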

When I try to fetch the HTML like this:

    url = 'https://google.com/search?q=python'
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'lxml')
    print(soup.prettify())

Here is what it returns (a sample of the returned HTML):


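    <div id="main">
      <div class="ZINbbc xpd O9g5cc uUPGi">
        <div>
          <div class="jfp3ef">
            <a href="/url?q=https://www.python.org/&amp;sa=U&amp;ved=2ahUKEwiCrK7AvsXiAhWxq1kKHTknCuoQFjAAegQIBxAB&amp;usg=AOvVaw0nCy-teBd7nOrThY5YGQ4o">
              <div class="BNeawe vvjwJb AP7Wnd">Python.org</div>
              <div class="BNeawe UPmit AP7Wnd">https://www.python.org</div>
            </a>
          </div>
          <div class="NJM3tb"></div>
          <div class="jfp3ef">
            <div>
              <div class="BNeawe s3v9rd AP7Wnd">
                <div>
                  <div>
                    <div class="Ap5OSd">
                      <div class="BNeawe s3v9rd AP7Wnd">The official home of the Python Programming Language.</div>
                    </div>
                    <div class="v9i61e">
                      <div class="BNeawe s3v9rd AP7Wnd">
                        <span class="BNeawe">
                          <a href="/url?q=https://www.python.org/downloads/&amp;sa=U&amp;ved=2ahUKEwiCrK7AvsXiAhWxq1kKHTknCuoQjBAwAXoECAcQAw&amp;usg=AOvVaw0TKe6ApGOQcWuHcXIkvAT0"><span class="XLloXe AP7Wnd">Download Python</span></a>
                        </span>
                      </div>
                    </div>
                    <div class="v9i61e">
                      <div class="BNeawe s3v9rd AP7Wnd">
                        <span class="BNeawe">
                          <a href="/url?q=https://www.python.org/about/gettingstarted/&amp;sa=U&amp;ved=2ahUKEwiCrK7AvsXiAhWxq1kKHTknCuoQjBAwAnoECAcQBQ&amp;usg=AOvVaw03o9Qt-KFSbwECm8-wmUZS"><span class="XLloXe AP7Wnd">Python For Beginners</span></a>
                        </span>
                      </div>
                    </div>
                    <div class="v9i61e">
                      <div class="BNeawe s3v9rd AP7Wnd">
                        <span class="BNeawe">
                          <a href="/url?q=https://www.python.org/doc/&amp;sa=U&amp;ved=2ahUKEwiCrK7AvsXiAhWxq1kKHTknCuoQjBAwA3oECAcQBw&amp;usg=AOvVaw3Yz3mO8HXGJoaf35qhyb3V"><span class="XLloXe AP7Wnd">Documentation</span></a>
                        </span>
                      </div>
                    </div>
                    <div class="v9i61e">
                      <div class="BNeawe s3v9rd AP7Wnd">
                        <span class="BNeawe">
                          <a href="/url?q=https://docs.python.org/&amp;sa=U&amp;ved=2ahUKEwiCrK7AvsXiAhWxq1kKHTknCuoQjBAwBHoECAcQCQ&amp;usg=AOvVaw0nY6NyZm0wErJJ1RIgTiPm"><span class="XLloXe AP7Wnd">Python Docs</span></a>
                        </span>
                      </div>
                    </div>
                    <div class="v9i61e">
                      <div class="BNeawe s3v9rd AP7Wnd">
                        <span class="BNeawe">
                          <a href="/url?q=https://www.python.org/psf/&amp;sa=U&amp;ved=2ahUKEwiCrK7AvsXiAhWxq1kKHTknCuoQjBAwBXoECAcQCw&amp;usg=AOvVaw3HoEDHmdRBcufXuwakPCAz"><span class="XLloXe AP7Wnd">Python Software Foundation</span></a>
                        </span>
                      </div>
                    </div>
                    <div>
                      <div class="BNeawe s3v9rd AP7Wnd">
                        <span class="BNeawe">
                          <a href="/url?q=https://www.python.org/downloads/release/python-373/&amp;sa=U&amp;ved=2ahUKEwiCrK7AvsXiAhWxq1kKHTknCuoQjBAwBnoECAcQDQ&amp;usg=AOvVaw3HsJpvpsCvYikd_mP7ndN3"><span class="XLloXe AP7Wnd">Python 3.7.3</span></a>
                        </span>
                      </div>
                    </div>
                  </div>
                </div>
              </div>
            </div>
          </div>
        </div>
      </div>
    </div>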
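The sample above has no `div` with class `g`, which would explain why `find_all('div', attrs={'class': 'g'})` comes back empty for this response. As a minimal sketch, assuming the markup looks like the sample (the class name `vvjwJb` is copied from it and is likely to change), the title and the real address can be pulled from each `/url?q=...` redirect link using only the standard library. `ResultParser` is a hypothetical helper, not part of any library.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlparse


class ResultParser(HTMLParser):
    """Collect (title, url) pairs from anchors shaped like the sample above."""

    TITLE_CLASS = "vvjwJb"  # assumption: taken from the sample markup

    def __init__(self):
        super().__init__()
        self.results = []
        self._url = None        # target of the <a> we are currently inside
        self._in_title = False  # True while inside the title <div>
        self._title = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and (attrs.get("href") or "").startswith("/url?"):
            # The real address is carried in the q= parameter of the redirect link.
            target = parse_qs(urlparse(attrs["href"]).query).get("q")
            self._url = target[0] if target else None
        elif tag == "div" and self._url and self.TITLE_CLASS in (attrs.get("class") or "").split():
            self._in_title = True

    def handle_data(self, data):
        if self._in_title:
            self._title.append(data)

    def handle_endtag(self, tag):
        if tag == "div" and self._in_title:
            self._in_title = False
        elif tag == "a" and self._url:
            title = "".join(self._title).strip()
            if title:  # anchors without a title div (e.g. sitelinks) are skipped
                self.results.append((title, self._url))
            self._url, self._title = None, []


sample = """
<div class="jfp3ef">
  <a href="/url?q=https://www.python.org/&amp;sa=U">
    <div class="BNeawe vvjwJb AP7Wnd">Python.org</div>
    <div class="BNeawe UPmit AP7Wnd">https://www.python.org</div>
  </a>
</div>
"""
parser = ResultParser()
parser.feed(sample)
print(parser.results)  # [('Python.org', 'https://www.python.org/')]
```

The same selection could be written with BeautifulSoup; the standard-library parser is used here only to keep the sketch dependency-free.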

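    from selenium import webdriver
    from bs4 import BeautifulSoup
    import time
    from bs4.element import Tag

    # Selenium 3 style call; newer Selenium releases expect the driver path via a Service object
    driver = webdriver.Chrome('/usr/bin/chromedriver')

    google_url = "https://www.google.com/search?q=python" + "&num=" + str(5)
    driver.get(google_url)
    time.sleep(3)  # give the page time to render

    soup = BeautifulSoup(driver.page_source, 'lxml')
    result_div = soup.find_all('div', attrs={'class': 'g'})

    links = []
    titles = []
    descriptions = []
    for r in result_div:
        # Checks if each element is present, else, raise exception
        try:
            link = r.find('a', href=True)

            title = r.find('h3')
            if isinstance(title, Tag):
                title = title.get_text()

            description = r.find('span', attrs={'class': 'st'})
            if isinstance(description, Tag):
                description = description.get_text()

            # Check to make sure everything is present before appending
            if link != '' and title != '' and description != '':
                links.append(link['href'])
                titles.append(title)
                descriptions.append(description)
        # Next loop if one element is not present
        except Exception as e:
            print(e)
            continue

    print(titles)
    print(links)
    print(descriptions)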

    ['Welcome to Python.org', 'Download Python | Python.org', 'Python Tutorial - W3Schools', 'Introduction to Python - W3Schools', 'Python Programming Language - GeeksforGeeks', 'Python: 7 Important Reasons Why You Should Use Python - Medium', 'Python: 7 Important Reasons Why You Should Use Python - Medium', 'Python Tutorial - Tutorialspoint', 'Python Download and Installation Instructions', 'Python vs C++ - Find Out The 9 Important Differences - eduCBA', None, 'Description']
    ['https://www.python.org/', 'https://www.python.org/downloads/', 'https://www.w3schools.com/python/', 'https://www.w3schools.com/python/python_intro.asp', 'https://www.geeksforgeeks.org/python-programming-language/', 'https://medium.com/@mindfiresolutions.usa/python-7-important-reasons-why-you-should-use-python-5801a98a0d0b', 'https://medium.com/@mindfiresolutions.usa/python-7-important-reasons-why-you-should-use-python-5801a98a0d0b', 'https://www.tutorialspoint.com/python/', 'https://www.ics.uci.edu/~pattis/common/handouts/pythoneclipsejava/python.html', 'https://www.educba.com/python-vs-c-plus-plus/', '/search?num=5&q=Python&stick=H4sIAAAAAAAAAONgFuLQz9U3MK0yjFeCs7SEs5Ot9JPzc3Pz86yKM1NSyxMri1cxsqVZOQZ4Fi9iZQuoLMnIzwMAlVPV1j0AAAA&sa=X&ved=2ahUKEwigvcqKx8XiAhUOSX0KHdtmBgoQzTooADAQegQIChAC', 'mailto:?body=Python%20https%3A%2F%2Fwww.google.com%2Fsearch%3Fkgmid%3D%2Fm%2F05z1_%26hl%3Den-IN%26kgs%3De1764a9f31831e11%26q%3DPython%26shndl%3D0%26source%3Dsh%2Fx%2Fkp%26entrypoint%3Dsh%2Fx%2Fkp']
    ['The official home of the Python Programming Language.', 'Looking for Python 2.7? See below for specific releases. Contribute to the PSF by Purchasing a PyCharm License. All proceeds benefit the PSF. Donate Now\xa0...', 'Python can be used on a server to create web applications. ... Our "Show Python" tool makes it easy to learn Python, it shows both the code and the result.', 'What is Python? Python is a popular programming language. It was created by Guido van Rossum, and released in 1991. It is used for: web development\xa0...', 'Python is a widely used general-purpose, high level programming language. It was initially designed by Guido van Rossum in 1991 and developed by Python\xa0...', None, None, None, None, None, None, None]
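The output above still contains entries that are not ordinary results: a relative `/search?...` link, a `mailto:` link, and rows whose title or description is `None`. The `!= ''` check in the loop does not reject `None`, which is one reason such rows get through. The following is a minimal sketch of a post-filter, assuming the three parallel lists printed above. `keep_complete_results` is a hypothetical helper, the sample data is abbreviated from the output, and requiring only a title and an absolute http(s) link (but not a description) is a judgment call.

```python
from urllib.parse import urlparse


def keep_complete_results(titles, links, descriptions):
    """Keep rows that have a title and an absolute http(s) link."""
    kept = []
    for title, link, description in zip(titles, links, descriptions):
        if not title or not link:
            continue  # None or empty string
        if urlparse(link).scheme not in ("http", "https"):
            continue  # relative /search?... links, mailto:, and similar
        kept.append((title, link, description))
    return kept


# Abbreviated sample taken from the output above.
titles = ["Welcome to Python.org", None, "Description"]
links = ["https://www.python.org/", "/search?num=5&q=Python", "mailto:?body=Python"]
descriptions = ["The official home of the Python Programming Language.", None, None]

for row in keep_complete_results(titles, links, descriptions):
    print(row)
```

Only the first row survives here, because the other two lack either a title or an http(s) link.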




