Scraping problem: "Inspect Element" differs from "View Page Source"


Problem description

I am trying to scrape a web page that contains multiple tabs. When I click the desired tab and its contents appear, I run into two problems. 1- The page address does not change; it is the same for all tabs. 2- When I look at the page with the browser's "View Page Source" (Firefox and Chrome), the source also looks the same for all tabs, whereas when I use "Inspect Element" on one of the tabs I can see my target content in the HTML.

The problem is that I cannot reach the desired tab's contents with the typical Python web scraping code found all over the web, which is normally based on bs4.

Does anyone have an idea or some sample code that would help me handle this problem? The page I am looking at is at the following address: http://tsetmc.com/Loader.aspx?ParTree=151311&i=63917421733088077#

Recommended answer

Web scraping with BeautifulSoup alone does not work correctly when a page builds its content with JavaScript, and the page you are trying to scrape does exactly that. The difference between View Source and Inspect Element comes from the browser: View Source shows the HTML as the server sent it, while Inspect Element shows the DOM after the browser has run the page's scripts. To sum up, you have to simulate a browser to get the data you are looking for. This can be done with Selenium. You can search for web scraping with Selenium and Python.
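To illustrate why a parser working on the raw source never sees the tab contents, here is a minimal sketch using only Python's standard `html.parser`. The markup in `RAW_SOURCE` is hypothetical and is not taken from the page in the question; it only mimics a page whose container is filled in by a script.

```python
from html.parser import HTMLParser

# Hypothetical page source, as "View Page Source" or a plain HTTP client
# would return it. The tab container is empty; the script would fill it
# in only when a browser actually runs it.
RAW_SOURCE = """
<html><body>
  <div id="tab-content"></div>
  <script>
    document.getElementById("tab-content").innerHTML =
        "<table><tr><td>42</td></tr></table>";
  </script>
</body></html>
"""


class TagCounter(HTMLParser):
    """Count the start tags that exist as real elements in the markup."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


def count_tags(html):
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


static_counts = count_tags(RAW_SOURCE)
# The empty container is present in the source ...
print("div elements:", static_counts.get("div", 0))
# ... but the table cell only appears as text inside the script,
# so the parser never reports it as an element.
print("td elements:", static_counts.get("td", 0))
```

A parser never executes the script, so the table exists only as a string inside it. A browser (or Selenium driving one) runs the script first, which is why Inspect Element shows the table.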

Here is a simple example of web scraping with Selenium and Python:

import time

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By


url = 'http://tsetmc.com/Loader.aspx?ParTree=151311&i=63917421733088077#'

# Firefox driver for Selenium from: https://github.com/mozilla/geckodriver/releases
# Newer Selenium 4 releases no longer accept executable_path; there, use
# webdriver.Firefox(service=Service(r'your-path\geckodriver.exe')) with
# Service imported from selenium.webdriver.firefox.service.
driver = webdriver.Firefox(executable_path=r'your-path\geckodriver.exe')
driver.get(url)

# Wait up to 10 seconds for the condition below to become true
wait = WebDriverWait(driver, 10)

try:
    # Wait until the table body rendered by JavaScript is visible
    elements = wait.until(EC.visibility_of_all_elements_located((By.XPATH, "/html/body/div[4]/form/div[3]/div[2]/div[1]/div[2]/div[1]/table/tbody")))
    time.sleep(1)
    # Print the text of the rendered table
    print(elements[0].text)
finally:
    # Always close the browser, even if the wait times out
    driver.quit()

This code opens Firefox. Replace 'your-path\geckodriver.exe' with the location of geckodriver on your machine. Pay attention to the comment about geckodriver: Selenium needs it to drive Firefox.
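Once the browser has rendered the tab, the HTML it holds can be parsed like any static page. The sketch below shows one way to pull table rows out of rendered HTML with the standard library. The `RENDERED` string and its field names are made-up sample data, not output from the page in the question, and `TableRows` and `extract_rows` are hypothetical helper names.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_rows(rendered_html):
    parser = TableRows()
    parser.feed(rendered_html)
    parser.close()
    return parser.rows


# Made-up sample standing in for HTML taken from a rendered page.
RENDERED = (
    "<table><tbody>"
    "<tr><th>Field</th><th>Value</th></tr>"
    "<tr><td>Last price</td><td>1,250</td></tr>"
    "</tbody></table>"
)

rows = extract_rows(RENDERED)
for row in rows:
    print(row)
```

With Selenium, the input string would come from `driver.page_source`, read after the wait succeeds and before `driver.quit()` is called. The same string could equally be handed to bs4, as in the code the question mentions.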

You can search for more information about Selenium.
