Selenium Web Scraping With Beautiful Soup on Dynamic Content and Hidden Data Table


Problem Description

Really need help from this community!

I am scraping dynamic content in Python using Selenium and Beautiful Soup. The problem is that the pricing data table cannot be parsed into Python, even with the following code:

html=browser.execute_script('return document.body.innerHTML')
sel_soup=BeautifulSoup(html, 'html.parser')  

However, I later found that if I click the 'View All Prices' button on the web page before running the code above, I can parse that data table into Python.

My question is: how can I parse and access that hidden, dynamically generated td tag info in Python without using Selenium to click every 'View All Prices' button, because there are so many of them?

The URL of the website I am scraping is https://www.cruisecritic.com/cruiseto/cruiseitineraries.cfm?port=122, and the attached picture shows the HTML of the dynamic data table I need.

Really appreciate the help from this community!

Recommended Answer

You should target the element after it has loaded and take arguments[0], rather than the entire page via document:

html_of_interest=driver.execute_script('return arguments[0].innerHTML',element)
sel_soup=BeautifulSoup(html_of_interest, 'html.parser')
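The reason this matters is that BeautifulSoup only parses the string it is handed and executes no JavaScript, so content injected after a button click is simply absent from the pre-click innerHTML. A tiny self-contained illustration (the markup below is invented for demonstration, not taken from the real page):

```python
from bs4 import BeautifulSoup

# Invented stand-in markup: an empty table before the click,
# and the same table after the page's JS has injected a price cell.
before_click = "<table id='prices'></table>"
after_click = "<table id='prices'><tr><td>$499</td></tr></table>"

# BeautifulSoup sees only what is in the string it receives.
print(BeautifulSoup(before_click, 'html.parser').find('td'))              # None
print(BeautifulSoup(after_click, 'html.parser').find('td').get_text())    # $499
```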

There are 2 practical cases:

1. The element is not yet loaded in the DOM, and you need to wait for it:

from time import sleep

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

browser = webdriver.Chrome()
browser.get("url")
sleep(experimental)  # usually get() returns only after the page has loaded, but sometimes some JS keeps running after the load event

try:
    element = WebDriverWait(browser, delay).until(
        EC.presence_of_element_located((By.ID, 'your_id_of_interest')))
    print("element is ready, do the thing!")
    html_of_interest = browser.execute_script('return arguments[0].innerHTML', element)
    sel_soup = BeautifulSoup(html_of_interest, 'html.parser')
except TimeoutException:
    print("Something's wrong!")

2. The element is in a shadow root, and you need to expand the shadow root first. This is probably not your situation, but I will mention it here since it is relevant for future reference. For example:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()


def expand_shadow_element(element):
    shadow_root = driver.execute_script('return arguments[0].shadowRoot', element)
    return shadow_root


driver.get("chrome://settings")
root1 = driver.find_element(By.TAG_NAME, 'settings-ui')

html_of_interest = driver.execute_script('return arguments[0].innerHTML', root1)
sel_soup = BeautifulSoup(html_of_interest, 'html.parser')
sel_soup  # empty: the shadow root is not expanded yet

shadow_root1 = expand_shadow_element(root1)

html_of_interest = driver.execute_script('return arguments[0].innerHTML', shadow_root1)
sel_soup = BeautifulSoup(html_of_interest, 'html.parser')
sel_soup
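Applied back to the asker's pricing table: once the arguments[0] extraction has handed BeautifulSoup a string, grouping the td cells into rows is ordinary parsing. A minimal sketch with invented stand-in markup (the real class names and table layout on cruisecritic.com will differ):

```python
from bs4 import BeautifulSoup

# Invented stand-in for the pricing table's innerHTML; the real page differs.
html_of_interest = """
<table class="pricing">
  <tr><th>Cabin</th><th>Price</th></tr>
  <tr><td>Interior</td><td>$499</td></tr>
  <tr><td>Suite</td><td>$1,299</td></tr>
</table>
"""

sel_soup = BeautifulSoup(html_of_interest, 'html.parser')
rows = []
for tr in sel_soup.find_all('tr'):
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    if cells:  # the header row has only th cells, so it is skipped
        rows.append(cells)

print(rows)  # [['Interior', '$499'], ['Suite', '$1,299']]
```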

