Get links of dropdown menu in each row of the table with python3 and selenium

Question

As a Python novice, I want to download old newspapers archived on a website (http://digesto.asamblea.gob.ni/consultas/coleccion/) using the script below.

However, I cannot get my script to go through each row of the table, select "PDF" in the dropdown menu, and save the corresponding link to a list (so that I can download the files).

My problem seems to be that the script cannot locate the PDF entry in each dropdown menu using the provided XPath.

This is the part of the source code that does not work:

table_id = driver.find_element(By.ID, 'gridTableDocCollection')
rows = table_id.find_elements(By.TAG_NAME, "tr") # get all table rows
for row in rows:
    elems = driver.find_elements_by_xpath('//ul[@class="dropdown-menu"]/a')
    for elem in elems:
        print(elem.get_attribute("href"))

When I use this code:

list_of_links = driver.find_element_by_xpath('//ul[@class="dropdown-menu"]/li')
print(list_of_links)

I get selenium.webdriver.firefox.webelement.FirefoxWebElement (session="e6799ba5-5f0b-8b4f-817a-721326940b91", element="66c956f0-d813-a840-b24b-a12f92e1189b") instead of a link. What am I doing wrong?

Can anyone please help me? I have read through Stack Overflow for hours but was never able to get anything working (see the commented-out part of the code).

Disclaimer: when using the script, you need to type the captcha by hand, without pressing Enter, for the script to continue.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# A small script to download issues of the Gaceta de Nicaragua (1843-1960) 19758 issues

import logging
from selenium.webdriver.remote.remote_connection import LOGGER
LOGGER.setLevel(logging.WARNING)

import os
import sys
import time
import shutil
from subprocess import call
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.action_chains import ActionChains

profile = webdriver.FirefoxProfile() # profile to prevent download manager
profile.set_preference("network.cookie.cookieBehavior", 0) # accept all cookies
profile.set_preference("network.cookie.lifetimePolicy", 0) # accept cookies
profile.set_preference("network.cookie.alwaysAcceptSessionCookies", 1) # always allow sess
profile.set_preference("browser.download.folderList", 2)
profile.set_preference("browser.download.manager.showWhenStarting", False)
profile.set_preference("browser.download.dir", 'Downloads/')
profile.set_preference("browser.helperApps.neverAsk.saveToDisk", 'image/jpeg;application/jpeg;image/jpg;application/jpg')

url = 'http://digesto.asamblea.gob.ni/consultas/coleccion/' # web page
print('Opening digesto.asamblea.gob.ni...')

driver = webdriver.Firefox(firefox_profile=profile)
driver.get(url) # open url

driver.find_element_by_xpath('//*[@id="cavanzada"]').click() # advanced menu

driver.find_element_by_xpath("//select[@id='slcCollection']/option[text()='Diario Oficial']").click()
driver.find_element_by_xpath("//select[@id='slcMedio']/option[text()='Gaceta Oficial']").click() # change journal name here

inputElement = driver.find_element_by_xpath('//*[@id="txtDatePublishFrom"]')
inputElement.send_keys('01/01/1844') # change start date

inputElement = driver.find_element_by_xpath('//*[@id="txtDatePublishTo"]')
inputElement.send_keys('31/12/1860') # change end date

time.sleep( 5 ) # wait for Human Captcha Insertion

inputElement.send_keys(Keys.ENTER) # search

time.sleep( 2 ) # wait to load

select_element = Select(driver.find_element_by_xpath('//*[@id="slcResPage"]')) # page count
select_element.select_by_value('50') # max 50

time.sleep( 1 ) # wait to load

list_of_links = driver.find_elements_by_xpath('//ul[@class="dropdown-menu"]/a')
print(list_of_links)

#a=[];
#a = driver.find_elements_by_link_text("PDF");
#driver.find_element_by_link_text("PDF").click()
#a = driver.find_element_by_xpath("//select[@class='dropdown-menu']/option[text()='PDF']").click()
#a = driver.find_element_by_xpath('//*[contains(text(), '"dropdown-menu"')] | //*[@#='"PDF"']'); #[contains(@#, "PDF")]
#a = driver.find_elements_by_xpath("//*[contains(text(), 'PDF')]")
#a = driver.find_elements_by_xpath('//div[@class="dropdown-menu"][contains(@#, "PDF")]')
#print(a, sep='\n')
#print(*a, sep='\n')

#driver.find_element(By.CssSelector("a[title='Acciones']")).find_element(By.xpath(".//span[text()='PDF']")).click();

#select_element = Select(driver.find_element_by_xpath('//*[@id="gridTableDocCollection"]/html/body/div[3]/div[1]/div/div/form/div[3]/div[2]/table/tbody/tr[1]/td[5]/div/ul/li[1]/a'))
#select_element.select_by_text('PDF')

table_id = driver.find_element(By.ID, 'gridTableDocCollection')
rows = table_id.find_elements(By.TAG_NAME, "tr") # get all table rows
for row in rows:
    elems = driver.find_elements_by_xpath('//ul[@class="dropdown-menu"]/a')
    for elem in elems:
        print(elem.get_attribute("href"))

Recommended answer

Think more about the steps you follow manually. Right now, you've started a loop through all of the rows, but haven't done anything with the "row" element. You'll want to click on the dropdown for the row, then choose the PDF option:

table_id = driver.find_element(By.ID, 'tableDocCollection')
rows = table_id.find_elements(By.CSS_SELECTOR, "tbody tr") # get all table rows
for row in rows:
    # click on the button to get the dropdowns to appear
    row.find_element(By.CSS_SELECTOR, 'button').click()
    # now find the one that's the pdf (here, using the fact that the onclick attribute of the link has the text "pdf")
    row.find_element(By.CSS_SELECTOR, 'li a[onclick*=pdf]').click()
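
If the goal is to gather something printable per row rather than to click, a row-scoped lookup can be sketched as below. This is only an illustration under assumptions: the helper name collect_pdf_onclicks is hypothetical, and it relies on the answer's observation that the PDF link carries "pdf" in its onclick attribute. It also hints at why the question's attempts printed nothing useful: printing an element shows the WebElement object rather than its attributes, and the XPath ending in /a probably matches nothing if the links sit inside li elements.

```python
# String value of selenium's By.CSS_SELECTOR, spelled out so the helper
# can be exercised with stand-in objects when no browser is available.
CSS_SELECTOR = "css selector"


def collect_pdf_onclicks(table):
    """Return the onclick text of every PDF entry, one table row at a time."""
    found = []
    for row in table.find_elements(CSS_SELECTOR, "tbody tr"):
        # Calling find_elements on `row` limits the search to that row;
        # calling it on `driver` would search the whole page every time.
        for link in row.find_elements(CSS_SELECTOR, "li a[onclick*=pdf]"):
            # get_attribute returns the attribute text; printing `link`
            # itself would only show the WebElement object.
            found.append(link.get_attribute("onclick"))
    return found
```

With a live session this would be called as collect_pdf_onclicks(driver.find_element(By.ID, 'tableDocCollection')). It only reads attributes; opening the PDF itself still requires the click.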

From here, you'll need to go to the new window and download the PDF. Try working that out, then submit a new question if you need help.
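
One possible way to handle the new window is sketched below. It is not part of the original answer: the helper name visit_new_window is hypothetical, and it assumes that the click opens exactly one new window and that the window has already appeared when the click returns (a real page may need an explicit wait first). It uses only the standard WebDriver members window_handles, current_window_handle, switch_to.window, current_url and close.

```python
def visit_new_window(driver, open_action):
    """Run open_action, read the URL of the window it opened, then return
    to the original window. Returns None if no new window appeared."""
    original = driver.current_window_handle
    before = set(driver.window_handles)

    open_action()  # e.g. the click on the PDF entry of one row

    new_handles = [h for h in driver.window_handles if h not in before]
    if not new_handles:
        return None

    driver.switch_to.window(new_handles[0])
    try:
        return driver.current_url
    finally:
        # Close the extra window and go back, so the next row can be handled.
        driver.close()
        driver.switch_to.window(original)
```

Inside the loop over rows, this could be called as visit_new_window(driver, pdf_link.click), where pdf_link is the element found with 'li a[onclick*=pdf]'. Whether the returned URL can be downloaded directly, without the browser's session cookies, is not something the source confirms.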
