Can't scrape three fields from a table with complicated layout

Problem description

I've created a script in Python with Selenium to parse three fields (franking credit, gross dividend and further information) from a table on a website. The last two fields are revealed only when the browser clicks a circular yellow button with a plus sign in it.

When the buttons are clicked, they turn red, which indicates that the information is now displayed.

My script can click all the buttons, but it can't scrape the three fields from that table.

I've attached an image to show what it really looks like.

I know that if I send a POST request with the relevant payload to https://www.sharedividends.com.au/wp-content/custom/ajaxfile.php?code=MLT, I can get all the tabular fields as JSON, but that is not how I want to solve this.

Website link

I've tried:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = "https://www.sharedividends.com.au/mlt-dividend-history/"

driver = webdriver.Chrome()

driver.get(url)

table = driver.find_element_by_css_selector("#divTable")
driver.execute_script("arguments[0].scrollIntoView();",table)

for items in driver.find_elements_by_css_selector("td.sorting_1"):
    driver.execute_script("arguments[0].scrollIntoView();",items)
    items.click()

for elems in driver.find_elements_by_css_selector("#divTable tbody tr"):
    franking_credit = elems.find_elements_by_css_selector("td")[5].text
    gross_divident = elems.find_elements_by_css_selector("td")[6].text
    further_info = elems.find_elements_by_css_selector("td")[7].text
    print(franking_credit,gross_divident,further_info)

driver.quit()

When I run the above script, it throws IndexError: list index out of range, pointing at the franking_credit = line.

This is what the table looks like. In the image below I've marked the three fields I'm interested in.

Image link

How can I parse the three fields from that table?

Recommended answer

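You are getting this error because, when the automation script runs, the selector #divTable tbody tr matches 20 rows instead of 10. The extra rows carry different attributes and do not contain the cells you are indexing. Restrict the selector to tr[role='row'], and read the cells that are not visible through textContent, since .text only returns text that is rendered. Try the following code.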

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = "https://www.sharedividends.com.au/mlt-dividend-history/"

driver = webdriver.Chrome()
driver.get(url)

# Wait until the table is in the DOM, then scroll it into view.
table = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "#divTable"))
)
driver.execute_script("arguments[0].scrollIntoView();", table)

# Click every expand button in the first column.
for item in driver.find_elements(By.CSS_SELECTOR, "td.sorting_1"):
    driver.execute_script("arguments[0].scrollIntoView();", item)
    item.click()

# Keep only rows with role='row'; the other rows in tbody lack these cells.
for row in driver.find_elements(By.CSS_SELECTOR, "#divTable tbody tr[role='row']"):
    cells = row.find_elements(By.CSS_SELECTOR, "td")
    franking_credit = cells[5].text
    # .text is empty for cells that are not rendered, so read textContent.
    gross_divident = cells[6].get_attribute("textContent")
    further_info = cells[7].get_attribute("textContent")
    print(franking_credit, gross_divident, further_info)

driver.quit()

Output on the console:

$ 0.0446 $ 0.1486 10.4C FRANKED @ 30%; DRP NIL DISCOUNT

$ 0.0107 $ 0.0357 2.5C FRANKED@30%; SP ECIAL; DRP SUSP

$ 0.0386 $ 0.1286 9C FRANKED @ 30%; DR P NIL DISCOUNT

$ 0.0437 $ 0.1457 10.2C FRANKED @ 30%; DRP NIL DISCOUNT

$ 0.0377 $ 0.1257 8.8C FRANKED @ 30%; DRP NIL DISCOUNT

$ 0.0429 $ 0.1429 10C FRANKED @ 30%; D RP NIL DISCOUNT

$ 0.0373 $ 0.1243 8.7C FRANKED @ 30%; DRP NIL DISCOUNT

$ 0.0424 $ 0.1414 9.9C FRANKED @ 30%; DRP NIL DISCOUNT

$ 0.0373 $ 0.1243 8.7C FRANKED @ 30%; DRP

$ 0.0441 $ 0.1471 10.3C FR@30%;0.4C SP ECIAL;DRP;NIL DIS
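
To see why the tr[role='row'] filter matters, here is a minimal offline sketch that uses only the standard library, with no browser involved. The SAMPLE markup is hypothetical: it assumes each data row has eight cells and that clicking a button adds a second row with a single cell, which is what the IndexError suggests. The real page's markup may differ, and RowCollector and extract_fields are illustrative names, not part of Selenium.

```python
from html.parser import HTMLParser

# Hypothetical markup: one data row plus the extra row assumed to appear
# after its button is clicked. The real page may be structured differently.
SAMPLE = """
<table id="divTable"><tbody>
<tr role="row" class="parent">
  <td class="sorting_1">x</td><td>x</td><td>x</td><td>x</td><td>x</td>
  <td>$ 0.0446</td><td style="display: none;">$ 0.1486</td>
  <td style="display: none;">10.4C FRANKED @ 30%; DRP NIL DISCOUNT</td>
</tr>
<tr class="child"><td colspan="6">expanded details</td></tr>
</tbody></table>
"""


class RowCollector(HTMLParser):
    """Collect (row attributes, list of cell texts) for every <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append((dict(attrs), []))
        elif tag == "td" and self.rows:
            self.rows[-1][1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        # Only text inside a <td> belongs to a cell.
        if self._in_cell:
            self.rows[-1][1][-1] += data


def extract_fields(html):
    """Return (franking credit, gross dividend, further info) per data row."""
    parser = RowCollector()
    parser.feed(html)
    results = []
    for attrs, cells in parser.rows:
        # Same idea as tr[role='row']: skip rows that are not data rows.
        if attrs.get("role") != "row":
            continue
        results.append(tuple(cell.strip() for cell in cells[5:8]))
    return results


if __name__ == "__main__":
    for fields in extract_fields(SAMPLE):
        print(*fields)
```

Without the role check, the second row in this sample has only one cell, so indexing cell 5 on it would raise the same IndexError as in the question.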
