Can't find a div that exists with 'inspect element' while scraping a website


Question

I have a Python script that downloads an HTML page. I'm looking for the element with this attribute:

<data-a-target="clip-thumbnail-link"  

The element is there when I inspect the page in the browser, but it doesn't show up in the output of the print statement in my script:

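from bs4 import BeautifulSoup
from urllib.request import urlopen  # Python 3; in Python 2 this was "from urllib import urlopen"

BASE_URL = "https://www.twitch.tv/lethalfrag/clips"

def get_category_links(section_url):
    # Download the raw HTML exactly as the server sends it (no JavaScript is run).
    html = urlopen(section_url).read()
    soup = BeautifulSoup(html, "lxml")
    print(soup)

get_category_links(BASE_URL)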

Recommended answer

If you search the page source (rather than the inspector's view of the live DOM) for the inspected element, you'll see that it is missing. This tells us that JavaScript modifies the page after it loads. Neither urllib nor requests can run JavaScript, so you'll need a tool that drives a real browser, such as Selenium.
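The "is it in the page source?" check can also be done in code. The following is a minimal sketch using only the standard library; the helper names (`AttrFinder`, `html_has_attr`) and the two sample HTML strings are made up for illustration and are not taken from the Twitch page.

```python
from html.parser import HTMLParser


class AttrFinder(HTMLParser):
    """Records whether any start tag carries a given attribute value."""

    def __init__(self, name, value):
        super().__init__()
        self.name = name
        self.value = value
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        if (self.name, self.value) in attrs:
            self.found = True


def html_has_attr(html, name, value):
    """Return True if the HTML text contains a tag with name="value"."""
    finder = AttrFinder(name, value)
    finder.feed(html)
    return finder.found


# Hypothetical samples: what a server might send versus what the browser
# shows after its scripts have run.
raw_html = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'
rendered_html = (
    '<html><body><div id="root">'
    '<a data-a-target="clip-thumbnail-link" href="/clip"></a>'
    '</div></body></html>'
)

print(html_has_attr(raw_html, "data-a-target", "clip-thumbnail-link"))       # False
print(html_has_attr(rendered_html, "data-a-target", "clip-thumbnail-link"))  # True
```

If the same check on the downloaded HTML returns False while the element is visible in the inspector, the element is most likely created by JavaScript.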

For installation instructions and a demo, see https://pypi.python.org/pypi/selenium

You need to use an explicit wait so that the element you are looking for has time to appear before you read the page.

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

driver = webdriver.Chrome()
driver.get('https://www.twitch.tv/lethalfrag/clips')
try:
    # Wait up to 10 seconds for the JavaScript-rendered clip container to appear.
    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME, 'clips-cards')))
except TimeoutException:
    print('Page timed out after 10 secs.')
# Parse the DOM as rendered by the browser, then close the browser.
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()
link = soup.find('a', {'data-a-target': 'clip-thumbnail-link'})
# find() returns None when nothing matches, so check before reading 'href'.
print(link['href'] if link else 'Clip link not found.')

Output:

https://clips.twitch.tv/RealIgnorantHeronVoteYea
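For intuition about what an explicit wait does, here is a rough sketch of the idea in plain Python: poll a condition until it succeeds or a timeout expires. This is not Selenium's implementation. The function `wait_until` and the fake `element_present` condition are hypothetical names used only for illustration.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Hypothetical stand-in for a page whose element only appears on the third check.
attempts = {"count": 0}


def element_present():
    attempts["count"] += 1
    return "clip-link" if attempts["count"] >= 3 else None


print(wait_until(element_present, timeout=2.0, poll_interval=0.01))  # clip-link
```

The point of waiting on a condition instead of sleeping for a fixed time is that the script continues as soon as the element exists and fails with a clear error if it never does.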
