Web scraping program cannot find element which I can see in the browser


Question

I am trying to get the titles of the streams on https://www.twitch.tv/directory/game/Dota%202, using Requests and BeautifulSoup. I know that my search criteria are correct, yet my program does not find the elements I need.

Here is the relevant part of the HTML as shown in the browser, as text:

<div class="tw-media-card-meta__title">
  <div class="tw-c-text-alt">
    <a class="tw-full-width tw-interactive tw-link tw-link--button tw-link--hover-underline-none tw-link--inherit" data-a-target="preview-card-title-link" href="/weplayesport_en">
      <div class="tw-align-items-start tw-flex">
        <h3 class="tw-ellipsis tw-font-size-5" title="NAVI vs HellRaisers | BO5 | ODPixel &amp; S4 | WeSave! Charity Play">NAVI vs HellRaisers | BO5 | ODPixel &amp; S4 | WeSave! Charity Play</h3>
      </div>
    </a>
  </div>
</div>

Here is my code:

import requests
from bs4 import BeautifulSoup

req = requests.get("https://www.twitch.tv/directory/game/Dota%202")

soup = BeautifulSoup(req.content, "lxml")

title_elems = soup.find_all("h3", attrs={"title": True})

print(title_elems)

When I run it, title_elems is just the empty list ([]).

Why is my program not finding the elements?

Answer

The element you're interested in is generated dynamically, after the initial page load: your browser executed JavaScript, made further network requests, and so on, in order to build the page. Requests is just an HTTP library, so it does none of those things. It only downloads the initial HTML, which does not yet contain those h3 elements.
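The difference can be illustrated without touching the network. The sketch below runs the same search from the question against two hand-written HTML strings: one standing in for what an HTTP client receives, one for the DOM after scripts have run. Both strings, the shortened title, and the name stream_titles are assumptions made up for this illustration, not Twitch's actual markup.

```python
from bs4 import BeautifulSoup


def stream_titles(html):
    """Return the title attribute of every <h3 title=...> in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return [h3["title"] for h3 in soup.find_all("h3", attrs={"title": True})]


# Hypothetical stand-in for the response an HTTP client gets:
# an empty shell that a script is expected to fill in later.
INITIAL_HTML = (
    '<html><body><div id="root"></div>'
    '<script src="app.js"></script></body></html>'
)

# Hypothetical stand-in for the DOM after the browser has run that script.
RENDERED_HTML = (
    '<html><body><div id="root">'
    '<h3 class="tw-ellipsis tw-font-size-5" title="NAVI vs HellRaisers | BO5">'
    "NAVI vs HellRaisers | BO5</h3>"
    "</div></body></html>"
)

print(stream_titles(INITIAL_HTML))   # []
print(stream_titles(RENDERED_HTML))  # ['NAVI vs HellRaisers | BO5']
```

To run the same check on the real page, pass requests.get(url).text to stream_titles. The browser's "View page source" shows the initial HTML, while the element inspector shows the live DOM, which is why the element is visible there.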

You could use a tool like Selenium, or perhaps even analyze the network traffic for the data you need and make the requests directly.
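A minimal sketch of the Selenium route, assuming Selenium 4 and a local Chrome installation. The helper names open_headless_chrome and titles_after_render, the timeout values, and the polling approach are choices made for this illustration. Whether the page still renders h3 elements with a title attribute is also an assumption, since the site's markup may have changed. The idea is to let a real browser build the page, then hand its current DOM to the same BeautifulSoup search used in the question.

```python
import time

from bs4 import BeautifulSoup


def open_headless_chrome():
    """Start a headless Chrome session (assumes Selenium 4 and Chrome are installed)."""
    from selenium import webdriver  # imported lazily; only needed for a real browser

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    return webdriver.Chrome(options=options)


def titles_after_render(driver, url, timeout=15.0, poll=0.5):
    """Load url, then poll the rendered DOM until <h3 title=...> elements appear.

    Returns the list of title attributes, or an empty list if none show up
    before the timeout expires.
    """
    driver.get(url)
    deadline = time.monotonic() + timeout
    while True:
        # page_source reflects the DOM as currently built by the browser.
        soup = BeautifulSoup(driver.page_source, "html.parser")
        titles = [h3["title"] for h3 in soup.find_all("h3", attrs={"title": True})]
        if titles or time.monotonic() >= deadline:
            return titles
        time.sleep(poll)


# Usage (starts a real browser, so it is left commented out here):
# driver = open_headless_chrome()
# try:
#     print(titles_after_render(driver, "https://www.twitch.tv/directory/game/Dota%202"))
# finally:
#     driver.quit()
```

Polling is used because the titles arrive some time after the page's load event; reading page_source immediately after driver.get could still return the empty shell.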
