Scrape Google search results titles and URLs using Python

Problem description

I'm working on a project using Python (3.7) in which I need to scrape the titles and URLs of the first few Google results. I tried BeautifulSoup, but it doesn't work:

Here's what I have tried:

import requests
from my_fake_useragent import UserAgent
from bs4 import BeautifulSoup

ua = UserAgent()

google_url = "https://www.google.com/search?q=python" + "&num=" + str(5)
response = requests.get(google_url, {"User-Agent": ua.random})
soup = BeautifulSoup(response.text, "html.parser")

result_div = soup.find_all('div', attrs={'class': 'g'})

links = []
titles = []
descriptions = []
for r in result_div:
    # Checks if each element is present, else, raise exception
    try:
        link = r.find('a', href=True)
        title = r.find('h3', attrs={'class': 'r'}).get_text()
        description = r.find('span', attrs={'class': 'st'}).get_text()

        # Check to make sure everything is present before appending
        if link != '' and title != '' and description != '':
            links.append(link['href'])
            titles.append(title)
            descriptions.append(description)
    # Next loop if one element is not present
    except:
        continue

print(titles)

But it doesn't return anything.

When I try to fetch the HTML like this:

url = 'https://google.com/search?q=python'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
print(soup.prettify())

Here's what it returns (a sample of the returned HTML):

<div id="main">
   <div class="ZINbbc xpd O9g5cc uUPGi">
    <div>
     <div class="jfp3ef">
      <a href="/url?q=https://www.python.org/&amp;sa=U&amp;ved=2ahUKEwiCrK7AvsXiAhWxq1kKHTknCuoQFjAAegQIBxAB&amp;usg=AOvVaw0nCy-teBd7nOrThY5YGQ4o">
       <div class="BNeawe vvjwJb AP7Wnd">
        Python.org
       </div>
       <div class="BNeawe UPmit AP7Wnd">
        https://www.python.org
       </div>
      </a>
     </div>
     <div class="NJM3tb">
     </div>
     <div class="jfp3ef">
      <div>
       <div class="BNeawe s3v9rd AP7Wnd">
        <div>
         <div>
          <div class="Ap5OSd">
           <div class="BNeawe s3v9rd AP7Wnd">
            The official home of the Python Programming Language.
           </div>
          </div>
          <div class="v9i61e">
           <div class="BNeawe s3v9rd AP7Wnd">
            <span class="BNeawe">
             <a href="/url?q=https://www.python.org/downloads/&amp;sa=U&amp;ved=2ahUKEwiCrK7AvsXiAhWxq1kKHTknCuoQjBAwAXoECAcQAw&amp;usg=AOvVaw0TKe6ApGOQcWuHcXIkvAT0">
              <span class="XLloXe AP7Wnd">
               Download Python
              </span>
             </a>
            </span>
           </div>
          </div>
          <div class="v9i61e">
           <div class="BNeawe s3v9rd AP7Wnd">
            <span class="BNeawe">
             <a href="/url?q=https://www.python.org/about/gettingstarted/&amp;sa=U&amp;ved=2ahUKEwiCrK7AvsXiAhWxq1kKHTknCuoQjBAwAnoECAcQBQ&amp;usg=AOvVaw03o9Qt-KFSbwECm8-wmUZS">
              <span class="XLloXe AP7Wnd">
               Python For Beginners
              </span>
             </a>
            </span>
           </div>
          </div>
          <div class="v9i61e">
           <div class="BNeawe s3v9rd AP7Wnd">
            <span class="BNeawe">
             <a href="/url?q=https://www.python.org/doc/&amp;sa=U&amp;ved=2ahUKEwiCrK7AvsXiAhWxq1kKHTknCuoQjBAwA3oECAcQBw&amp;usg=AOvVaw3Yz3mO8HXGJoaf35qhyb3V">
              <span class="XLloXe AP7Wnd">
               Documentation
              </span>
             </a>
            </span>
           </div>
          </div>
          <div class="v9i61e">
           <div class="BNeawe s3v9rd AP7Wnd">
            <span class="BNeawe">
             <a href="/url?q=https://docs.python.org/&amp;sa=U&amp;ved=2ahUKEwiCrK7AvsXiAhWxq1kKHTknCuoQjBAwBHoECAcQCQ&amp;usg=AOvVaw0nY6NyZm0wErJJ1RIgTiPm">
              <span class="XLloXe AP7Wnd">
               Python Docs
              </span>
             </a>
            </span>
           </div>
          </div>
          <div class="v9i61e">
           <div class="BNeawe s3v9rd AP7Wnd">
            <span class="BNeawe">
             <a href="/url?q=https://www.python.org/psf/&amp;sa=U&amp;ved=2ahUKEwiCrK7AvsXiAhWxq1kKHTknCuoQjBAwBXoECAcQCw&amp;usg=AOvVaw3HoEDHmdRBcufXuwakPCAz">
              <span class="XLloXe AP7Wnd">
               Python Software Foundation
              </span>
             </a>
            </span>
           </div>
          </div>
          <div>
           <div class="BNeawe s3v9rd AP7Wnd">
            <span class="BNeawe">
             <a href="/url?q=https://www.python.org/downloads/release/python-373/&amp;sa=U&amp;ved=2ahUKEwiCrK7AvsXiAhWxq1kKHTknCuoQjBAwBnoECAcQDQ&amp;usg=AOvVaw3HsJpvpsCvYikd_mP7ndN3">
              <span class="XLloXe AP7Wnd">
               Python 3.7.3
              </span>
             </a>
            </span>
           </div>
          </div>
         </div>
        </div>
       </div>
      </div>
     </div>
    </div>
   </div>
</div>

Solution

You should try the Selenium automation library. It lets you scrape pages whose content is rendered dynamically (via JavaScript or AJAX), because it drives a real browser.

from selenium import webdriver
from bs4 import BeautifulSoup
import time
from bs4.element import Tag

driver = webdriver.Chrome('/usr/bin/chromedriver')
google_url = "https://www.google.com/search?q=python" + "&num=" + str(5)
driver.get(google_url)
time.sleep(3)

soup = BeautifulSoup(driver.page_source,'lxml')
result_div = soup.find_all('div', attrs={'class': 'g'})


links = []
titles = []
descriptions = []
for r in result_div:
    # A missing element comes back as None; any error is printed and the result skipped
    try:
        link = r.find('a', href=True)

        title = r.find('h3')
        if isinstance(title, Tag):
            title = title.get_text()

        description = r.find('span', attrs={'class': 'st'})
        if isinstance(description, Tag):
            description = description.get_text()

        # Skip empty strings; note that None (element not found) still passes this check
        if link != '' and title != '' and description != '':
            links.append(link['href'])
            titles.append(title)
            descriptions.append(description)
    # Move on to the next result if anything raised
    except Exception as e:
        print(e)
        continue

print(titles)
print(links)
print(descriptions)

Output:

['Welcome to Python.org', 'Download Python | Python.org', 'Python Tutorial - W3Schools', 'Introduction to Python - W3Schools', 'Python Programming Language - GeeksforGeeks', 'Python: 7 Important Reasons Why You Should Use Python - Medium', 'Python: 7 Important Reasons Why You Should Use Python - Medium', 'Python Tutorial - Tutorialspoint', 'Python Download and Installation Instructions', 'Python vs C++ - Find Out The 9 Important Differences - eduCBA', None, 'Description']
['https://www.python.org/', 'https://www.python.org/downloads/', 'https://www.w3schools.com/python/', 'https://www.w3schools.com/python/python_intro.asp', 'https://www.geeksforgeeks.org/python-programming-language/', 'https://medium.com/@mindfiresolutions.usa/python-7-important-reasons-why-you-should-use-python-5801a98a0d0b', 'https://medium.com/@mindfiresolutions.usa/python-7-important-reasons-why-you-should-use-python-5801a98a0d0b', 'https://www.tutorialspoint.com/python/', 'https://www.ics.uci.edu/~pattis/common/handouts/pythoneclipsejava/python.html', 'https://www.educba.com/python-vs-c-plus-plus/', '/search?num=5&q=Python&stick=H4sIAAAAAAAAAONgFuLQz9U3MK0yjFeCs7SEs5Ot9JPzc3Pz86yKM1NSyxMri1cxsqVZOQZ4Fi9iZQuoLMnIzwMAlVPV1j0AAAA&sa=X&ved=2ahUKEwigvcqKx8XiAhUOSX0KHdtmBgoQzTooADAQegQIChAC', 'mailto:?body=Python%20https%3A%2F%2Fwww.google.com%2Fsearch%3Fkgmid%3D%2Fm%2F05z1_%26hl%3Den-IN%26kgs%3De1764a9f31831e11%26q%3DPython%26shndl%3D0%26source%3Dsh%2Fx%2Fkp%26entrypoint%3Dsh%2Fx%2Fkp']
['The official home of the Python Programming Language.', 'Looking for Python 2.7? See below for specific releases. Contribute to the PSF by Purchasing a PyCharm License. All proceeds benefit the PSF. Donate Now\xa0...', 'Python can be used on a server to create web applications. ... Our "Show Python" tool makes it easy to learn Python, it shows both the code and the result.', 'What is Python? Python is a popular programming language. It was created by Guido van Rossum, and released in 1991. It is used for: web development\xa0...', 'Python is a widely used general-purpose, high level programming language. It was initially designed by Guido van Rossum in 1991 and developed by Python\xa0...', None, None, None, None, None, None, None]
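The lists above still contain None titles and descriptions, plus links that are not search results (a relative /search?... URL and a mailto: link). This happens because `None != ''` is true in Python, so the check in the loop never filters out missing elements. Below is a minimal sketch of a stricter filter, assuming the three lists are index-aligned as in the code above. The helper name `keep_real_results` is hypothetical and not part of the original answer.

```python
from urllib.parse import urlsplit


def keep_real_results(titles, links, descriptions):
    """Keep entries that have a title and an absolute http(s) link.

    Hypothetical helper: assumes the three lists are index-aligned.
    Descriptions may stay None, since some results have no snippet.
    """
    kept = []
    for title, link, description in zip(titles, links, descriptions):
        if not title or not link:
            continue  # element was missing (None) or empty
        if urlsplit(link).scheme not in ("http", "https"):
            continue  # drops relative "/search?..." and "mailto:" links
        kept.append((title, link, description))
    return kept


# Small made-up sample shaped like the output above
sample = keep_real_results(
    ["Welcome to Python.org", None, "Description"],
    ["https://www.python.org/", "/search?num=5&q=Python", "mailto:?body=Python"],
    ["The official home of the Python Programming Language.", None, None],
)
print(sample)
```

Only the first entry of the sample survives, since the other two lack either a title or an http(s) link.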

Here '/usr/bin/chromedriver' is the path to the Selenium web driver (ChromeDriver).
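Newer Selenium 4 releases expect the driver path to be passed through a `Service` object rather than as a positional argument to `webdriver.Chrome`. Below is a minimal sketch, assuming Selenium 4 is installed and ChromeDriver lives at the same path as above. The function name `fetch_page_source` is hypothetical.

```python
def fetch_page_source(url, driver_path="/usr/bin/chromedriver"):
    """Load a page in Chrome and return its rendered HTML (hypothetical helper)."""
    # Imported lazily so this module can be loaded without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    # Selenium 4 style: the driver path goes through a Service object.
    driver = webdriver.Chrome(service=Service(executable_path=driver_path))
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always release the browser process
```

The returned HTML can be handed to BeautifulSoup exactly as `driver.page_source` is in the code above. Quitting the driver in a `finally` block avoids leaving stray browser processes behind if the page load fails.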

Download the Selenium web driver for the Chrome browser:

http://chromedriver.chromium.org/downloads

Install the web driver for the Chrome browser:

https://christopher.su/2015/selenium-chromedriver-ubuntu/

Selenium tutorial:

https://selenium-python.readthedocs.io/
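One more observation on the HTML sample in the question: in the markup that requests receives, result links are wrapped as /url?q=<target>&sa=... and no div with class g appears, which is consistent with find_all('div', attrs={'class': 'g'}) returning nothing. Below is a minimal sketch, using only the standard library, of recovering the target URL from such an href. The function name `unwrap_google_href` is hypothetical, and the /url?q= layout is taken from that sample and may change.

```python
from html import unescape
from urllib.parse import parse_qs, urlsplit


def unwrap_google_href(href):
    """Return the target URL from a '/url?q=...' redirect link, else None."""
    parts = urlsplit(unescape(href))  # unescape turns '&amp;' back into '&'
    if parts.path != "/url":
        return None  # not a redirect-style result link
    targets = parse_qs(parts.query).get("q")
    return targets[0] if targets else None


# href copied (shortened) from the sample HTML in the question
href = "/url?q=https://www.python.org/downloads/&amp;sa=U&amp;ved=2ahUKEwiCrK7AvsXiAhWxq1kKHTknCuoQjBAwAXoECAcQAw"
print(unwrap_google_href(href))
```

Hrefs that BeautifulSoup has already parsed arrive unescaped, so the `unescape` call is then a harmless no-op.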
