Python Selenium: scrape the whole table


Question

The purpose of this code is to scrape a data table from a set of links and then turn it into a pandas DataFrame.

The problem is that this code scrapes only the first 7 rows, which are on the first page of the table, and I want to capture the whole table. When I tried to loop over the table pages, I got an error.

Here is the code:

from selenium import webdriver

urls = open(r"C:\Users\Sayed\Desktop\script\sample.txt").readlines()
for url in urls:
    driver = webdriver.Chrome(r"D:\Projects\Tutorial\Driver\chromedriver.exe")
    driver.get(url)
    for item in driver.find_element_by_xpath('//*[contains(@id,"showMoreHistory")]/a'):
        driver.execute_script("arguments[0].click();", item)

    for table in driver.find_elements_by_xpath('//*[contains(@id,"eventHistoryTable")]//tr'):
        data = [item.text for item in table.find_elements_by_xpath(".//*[self::td or self::th]")]
        print(data)

Here is the error:

Traceback (most recent call last):
  File "D:/Projects/Tutorial/ff.py", line 8, in <module>
    for item in driver.find_element_by_xpath('//*[contains(@id,"showMoreHistory")]/a'):
TypeError: 'WebElement' object is not iterable

Recommended answer

Check out the script below to get the whole table from that webpage. The error in your code arises because find_element_by_xpath returns a single WebElement, which cannot be iterated over. I've used a hardcoded delay in my script, which is not good practice; however, you can always define an Explicit Wait to make the code more robust:

import time
from selenium import webdriver

url = 'https://www.investing.com/economic-calendar/investing.com-eur-usd-index-1155'

driver = webdriver.Chrome()
driver.get(url)
item = driver.find_element_by_xpath('//*[contains(@id,"showMoreHistory")]/a')
driver.execute_script("arguments[0].click();", item)
time.sleep(2)
for table in driver.find_elements_by_xpath('//*[contains(@id,"eventHistoryTable")]//tr'):
    data = [item.text for item in table.find_elements_by_xpath(".//*[self::td or self::th]")]
    print(data)

driver.quit()
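
If you do want to loop over every matching link, as the question's code attempted, the plural `find_elements` call returns a list (possibly empty) and is safe to iterate. The following is a minimal sketch, assuming the Selenium 4 `find_elements(by, value)` signature. The helper name `click_all` is hypothetical, and the literal string `"xpath"` is used in place of `By.XPATH` (they hold the same value) so that the function needs no imports.

```python
def click_all(driver, xpath):
    """Click every element matching `xpath` via JavaScript.

    Returns the number of elements that were clicked.
    """
    # find_elements (plural) returns a list, so iterating over it is safe;
    # when nothing matches, the list is empty and the loop body never runs.
    elements = driver.find_elements("xpath", xpath)
    for element in elements:
        driver.execute_script("arguments[0].click();", element)
    return len(elements)
```

With a live driver this could be called as `click_all(driver, '//*[contains(@id,"showMoreHistory")]/a')`.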

To get all the data by clicking the "show more" button until it is exhausted, using an Explicit Wait instead of a fixed delay, you can try the script below:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = 'https://www.investing.com/economic-calendar/investing.com-eur-usd-index-1155'

driver = webdriver.Chrome()
driver.get(url)
wait = WebDriverWait(driver,10)

while True:
    try:
        item = wait.until(EC.visibility_of_element_located((By.XPATH,'//*[contains(@id,"showMoreHistory")]/a')))
        driver.execute_script("arguments[0].click();", item)
    except Exception:
        # The wait timed out: no "show more" link is visible any more
        break

for table in wait.until(EC.visibility_of_all_elements_located((By.XPATH,'//*[contains(@id,"eventHistoryTable")]//tr'))):
    data = [item.text for item in table.find_elements_by_xpath(".//*[self::td or self::th]")]
    print(data)

driver.quit()
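
The scripts above only print each row, while the question's stated goal was a pandas DataFrame. Here is a rough sketch of that last step, assuming the first non-empty row holds the column names and every later row has the same number of cells. The helper names `split_header` and `rows_to_dataframe` are hypothetical and not part of Selenium or pandas.

```python
def split_header(rows):
    """Drop empty rows and split the remainder into (header, records)."""
    # Rows that contain no td/th cells come back as empty lists; skip them.
    rows = [row for row in rows if row]
    if not rows:
        return [], []
    return rows[0], rows[1:]


def rows_to_dataframe(rows):
    """Build a DataFrame, assuming the first non-empty row is the header."""
    import pandas as pd  # imported here so split_header works without pandas

    header, records = split_header(rows)
    return pd.DataFrame(records, columns=header)
```

To use it, collect the rows in a list (for example `rows.append(data)` in place of `print(data)`) and then call `df = rows_to_dataframe(rows)`.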
