Struggling to scrape a table using Selenium


Question

I would like to scrape a table from a web page, and I decided to use Selenium to do it.

My first attempt was:

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import pandas as pd

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get(url)
html_source = driver.page_source
driver.quit()
soup = BeautifulSoup(html_source, "html5lib")
table = soup.find('table', {'class': 'heavy-table ncpulse-fav-table ncpulse-sortable compressed-table'})
df = pd.read_html(str(table), flavor='html5lib', header=0, thousands='.', decimal=',')

But it raises the error:

'no tables found'

Then I tried the expected_conditions class because, from what I read on SO, maybe the "page source was pulled out even before the child elements have completely rendered". So I tried something like this:

driver.get(route)
element_present = expected_conditions.presence_of_element_located(
    (By.CLASS_NAME, 'heavy-table ncpulse-fav-table ncpulse-sortable compressed-table'))
WebDriverWait(driver, 20).until(element_present)
html_source = driver.page_source 
driver.quit()

But this time it raises:

selenium.common.exceptions.TimeoutException: Message

So my questions are: How can I obtain the desired output? What am I doing wrong with the expected_conditions class? What is the issue, or the front-end technology, that makes this table such a struggle to scrape?

Recommended answer

The <table> is an Angular-based element, so to extract its contents using Selenium and Python you have to induce WebDriverWait for visibility_of_element_located() instead of presence_of_element_located(). You can use either of the following Locator Strategies:


  • Using CSS_SELECTOR:

    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "table.heavy-table.ncpulse-fav-table.ncpulse-sortable.compressed-table"))).text)

  • Using XPATH:

    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//table[@class='heavy-table ncpulse-fav-table ncpulse-sortable compressed-table']"))).text)
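
A side note on the locator itself: By.CLASS_NAME expects a single class name, so passing the whole space-separated class attribute, as in the question, is unlikely to match anything regardless of which expected condition is used. The sketch below shows one way to turn such an attribute value into the compound CSS selector used above. The helper name `class_attr_to_css` is hypothetical and not part of Selenium.

```python
def class_attr_to_css(tag: str, class_attr: str) -> str:
    """Build a compound CSS selector from a tag and a space-separated class attribute."""
    classes = class_attr.split()  # split() also collapses repeated whitespace
    return tag + "".join("." + name for name in classes)


selector = class_attr_to_css(
    "table", "heavy-table ncpulse-fav-table ncpulse-sortable compressed-table"
)
print(selector)
```

The resulting string can be passed to By.CSS_SELECTOR in place of the hand-written selector.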
    



  • Note: You have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
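
With those imports in place, the wait can also hand back the table cell by cell rather than as one block of text, which tends to be easier to load into pandas. This is only a sketch under stated assumptions: `table_rows` and `to_records` are hypothetical helper names, `driver` is assumed to be an already created WebDriver that has loaded the page, and the table is assumed to have one header row followed by data rows.

```python
def to_records(rows):
    """Turn [header, row1, row2, ...] into a list of dicts keyed by header cell."""
    if not rows:
        return []
    header, *body = rows
    # zip() silently drops extra cells if a row is longer than the header
    return [dict(zip(header, row)) for row in body]


def table_rows(driver, css_selector, timeout=20):
    """Wait until the table is visible, then return its cell texts row by row."""
    # Imported here so to_records() stays usable without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    table = WebDriverWait(driver, timeout).until(
        EC.visibility_of_element_located((By.CSS_SELECTOR, css_selector))
    )
    rows = []
    for tr in table.find_elements(By.CSS_SELECTOR, "tr"):
        cells = [cell.text for cell in tr.find_elements(By.CSS_SELECTOR, "th, td")]
        if cells:  # skip rows without any cells
            rows.append(cells)
    return rows


# Example with literal data (no browser needed):
sample = [["AKTIE", "SENESTE"], ["Adyen", "1523,00"], ["Aegon", "2,16"]]
print(to_records(sample))
```

Something like `pd.DataFrame(to_records(table_rows(driver, selector)))` would then give a DataFrame of strings, with `selector` being the CSS selector for the table.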
    



  • Console Output:

    AKTIE +/- +/-% SENESTE ÅTD% BUD UDBUD VOLUMEN OMSÆTNING MARKEDSVÆRDI TID
    Abn Amro Bank N.V. -0,32 -4,08% 7,48 -53,90% - - 7,9 mio 59,0 mio - 21:09
    Adyen 81,00 5,62% 1523,00 108% - - 954 082 1,5 mia - 21:09
    Aegon -0,08 -3,49% 2,16 -45,47% - - 17,4 mio 37,5 mio - 21:05
    Ahold Del 0,25 0,98% 25,65 19,74% - - 8,0 mio 204,1 mio - 21:05
    Akzo Nobel 0,14 0,16% 85,86 -3,16% - - 1,1 mio 90,6 mio - 21:06
    Arcelormittal Sa 0,08 0,66% 11,53 -26,26% - - 11,9 mio 137,3 mio - 21:09
    Asm International 0,35 0,29% 119,10 21,23% - - 403 117 48,0 mio - 21:07
    Asml Holding 1,50 0,49% 308,45 17,56% - - 2,3 mio 712,7 mio - 21:05
    Asr Nederland -0,22 -0,73% 29,76 -4,97% - - 740 781 22,0 mio - 21:05
    Dsm Kon 2,25 1,66% 138,20 21,52% - - 680 867 94,1 mio - 21:09
    Galapagos -1,45 -1,22% 117,70 -36,89% - - 475 793 56,0 mio - 21:05
    Heineken 0,74 0,94% 79,10 -15,50% - - 1,1 mio 88,0 mio - 21:05
    Imcd 1,85 1,80% 104,85 36,23% - - 922 391 96,7 mio - 21:05
    Ing Groep N.V. -0,19 -2,80% 6,60 -38,24% - - 43,4 mio 286,2 mio - 21:08
    Just Eat Takeaway 0,08 0,09% 91,70 11,56% - - 1,1 mio 100,2 mio - 21:09
    Kpn Kon -0,03 -1,54% 2,11 -15,04% - - 21,4 mio 45,1 mio - 21:05
    Nn Group -0,35 -1,06% 32,80 3,82% - - 2,4 mio 79,6 mio - 21:05
    Philips Kon -0,08 -0,20% 39,42 -9,42% - - 5,2 mio 205,9 mio - 21:05
    Prosus -1,52 -1,89% 78,74 18,35% - - 15,0 mio 1,2 mia - 21:09
    Randstad Nv -0,98 -2,09% 45,93 -15,63% - - 698 496 32,1 mio - 21:05
    Relx 0,00 0,03% 19,64 -10,24% - - 1,9 mio 36,6 mio - 21:06
    Royal Dutch Shella -0,24 -2,07% 11,45 -54,58% - - 21,1 mio 241,2 mio - 21:07
    Unibail-Rodamco-We -3,79 -10,53% 32,20 -75,02% - - 6,6 mio 213,2 mio - 21:07
    Unilever -1,04 -2,00% 50,98 2,00% - - 8,2 mio 417,5 mio - 21:08
    Wolters Kluwer -0,04 -0,05% 72,88 14,19% - - 803 644 58,6 mio - 21:05
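
The values in this output use Danish number formatting: a decimal comma, a space as thousands separator, and what appear to be the abbreviations `mio` (million) and `mia` (billion). Below is a minimal sketch of converting a single numeric cell under those assumptions. `parse_cell` is a hypothetical helper, and a lone `-` is treated as a missing value.

```python
from decimal import Decimal
from typing import Optional

# Assumed meaning of the suffixes seen in the output above.
MULTIPLIERS = {"mio": Decimal(10) ** 6, "mia": Decimal(10) ** 9}


def parse_cell(text: str) -> Optional[Decimal]:
    """Convert a cell such as '7,9 mio', '954 082' or '-53,90%' to a Decimal."""
    text = text.strip()
    if text in ("", "-"):
        return None  # placeholder for a missing value
    multiplier = Decimal(1)
    parts = text.split()
    if parts[-1] in MULTIPLIERS:
        multiplier = MULTIPLIERS[parts.pop()]
    # Join the thousands groups, drop a trailing percent sign, use a decimal point.
    digits = "".join(parts).rstrip("%").replace(",", ".")
    return Decimal(digits) * multiplier


for cell in ["7,9 mio", "1,5 mia", "954 082", "-53,90%", "-"]:
    print(repr(cell), "->", parse_cell(cell))
```

The helper is meant for the numeric columns only; a value from the TID column such as `21:09` is not a number and would raise `decimal.InvalidOperation`.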
    


