Fetch url from html code with python and selenium
Question
In this question I was helped to address a dropdown menu in a table. However, I wish to fetch the url from the source code which is:
<a href="#" onclick="window.open('/consultas/util/pdf.php?type=rdd&rdd=nYgT5Rcvs2I%3D');return false;">PDF</a>
and store it in a list, instead of clicking on it as it is currently done. The link in the above code is `/consultas/util/pdf.php?type=rdd&rdd=nYgT5Rcvs2I%3D`. However, I would need to prepend `http://digesto.asamblea.gob.ni` to each fetched link to complete it.
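Prepending the domain by hand works, but as a standard-library sketch, `urllib.parse.urljoin` completes a relative link against a base URL and also leaves already-absolute links untouched:

```python
from urllib.parse import urljoin

base = 'http://digesto.asamblea.gob.ni'
relative = '/consultas/util/pdf.php?type=rdd&rdd=nYgT5Rcvs2I%3D'

# Joins the base domain and the site-relative path into a full URL
full = urljoin(base, relative)
print(full)  # http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=nYgT5Rcvs2I%3D

# An already-absolute link passes through unchanged
print(urljoin(base, 'http://example.com/x.pdf'))  # http://example.com/x.pdf
```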
How can I achieve that?
This is my current script, and this is the website: http://digesto.asamblea.gob.ni/consultas/coleccion/:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# A small script to download issues of the Gaceta de Nicaragua (1843-1960) 19758 issues
import logging
from selenium.webdriver.remote.remote_connection import LOGGER
LOGGER.setLevel(logging.WARNING)
import os
import sys
import time
import shutil
import urllib
from subprocess import call
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.action_chains import ActionChains
profile = webdriver.FirefoxProfile() # profile to prevent download manager
profile.set_preference("network.cookie.cookieBehavior", 0) # accept all cookies
profile.set_preference("network.cookie.lifetimePolicy", 0) # accept cookies
profile.set_preference("network.cookie.alwaysAcceptSessionCookies", 1) # always allow sess
profile.set_preference("browser.download.folderList", 2)
profile.set_preference("browser.link.open_newwindow", 1) # open tabs in same window
profile.set_preference("browser.download.manager.showWhenStarting", False)
profile.set_preference("browser.download.dir", 'Downloads/')
profile.set_preference("browser.helperApps.neverAsk.saveToDisk", 'image/jpeg;application/jpeg;image/jpg;application/jpg')
url = 'http://digesto.asamblea.gob.ni/consultas/coleccion/' # web page
print('Opening digesto.asamblea.gob.ni...')
driver = webdriver.Firefox(firefox_profile=profile)
driver.get(url) # open url
driver.find_element_by_xpath('//*[@id="cavanzada"]').click() # advanced menu
driver.find_element_by_xpath("//select[@id='slcCollection']/option[text()='Diario Oficial']").click()
driver.find_element_by_xpath("//select[@id='slcMedio']/option[text()='Gaceta Oficial']").click() # change journal name here
inputElement = driver.find_element_by_xpath('//*[@id="txtDatePublishFrom"]')
inputElement.send_keys('01/01/1844') # change start date
inputElement = driver.find_element_by_xpath('//*[@id="txtDatePublishTo"]')
inputElement.send_keys('31/12/1860') # change end date
time.sleep( 5 ) # wait for Human Captcha Insertion
inputElement.send_keys(Keys.ENTER) # search
time.sleep( 2 ) # wait to load
select_element = Select(driver.find_element_by_xpath('//*[@id="slcResPage"]')) # page count
select_element.select_by_value('50') # max 50
time.sleep( 1 ) # wait to load
table_id = driver.find_element(By.ID, 'tableDocCollection')
rows = table_id.find_elements_by_css_selector("tbody tr") # get all table rows
for row in rows:
    row.find_element_by_css_selector('button').click()
    row.find_element_by_css_selector('li a[onclick*=pdf]').click()  # .get_attribute("href")
list_of_links = driver.current_url
driver.close()  # or driver.quit() to close the window
print(list_of_links)
Disclaimer: when using the script you need to type the captcha by hand, without pressing Enter, for the script to continue.
Solution
Relative links starting with `/` are resolved against the top-level domain, e.g. `http://digesto.asamblea.gob.ni` in your case; on the other hand, links that don't start with `/` are resolved against the current page. Inside the loop where you're scraping the links, change the code to this:
list_of_links = []  # will hold the scraped links
tld = 'http://digesto.asamblea.gob.ni'
current_url = driver.current_url  # for any links not starting with /
for row in rows:
    row.find_element_by_css_selector('button').click()
    link = row.find_element_by_css_selector('li a[onclick*=pdf]').get_attribute("href")
    if link.startswith('/'):
        list_of_links.append(tld + link)
    else:
        list_of_links.append(current_url + link)
    # at this point the dropdown is visible and would interfere with the next loop cycle,
    # so click the button again to close the menu
    row.find_element_by_css_selector('button').click()
print(list_of_links)
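Note that in the HTML snippet from the question the `href` attribute is just `#`, and the real PDF path sits inside the `onclick` handler, so `get_attribute("href")` may return only `#` (or the page URL with `#` appended). If that happens, an alternative sketch is to read the `onclick` attribute instead and pull the path out with a regular expression; the `extract_pdf_path` helper below is hypothetical and not part of the original script:

```python
import re

def extract_pdf_path(onclick):
    """Pull the first single-quoted argument out of a window.open(...) call."""
    match = re.search(r"window\.open\('([^']+)'", onclick)
    return match.group(1) if match else None

# onclick value taken from the HTML in the question
onclick = "window.open('/consultas/util/pdf.php?type=rdd&rdd=nYgT5Rcvs2I%3D');return false;"
print(extract_pdf_path(onclick))  # /consultas/util/pdf.php?type=rdd&rdd=nYgT5Rcvs2I%3D
```

In the loop above you would then fetch the attribute with `.get_attribute("onclick")` instead of `.get_attribute("href")` and pass the result through the helper before prepending the domain.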