Web scraping for javascript __doPostBack containing an href in td
Question
I want to scrape the website
https://www.unspsc.org/search-code/default.aspx?CSS=51%&Type=desc&SS%27=
using Selenium, but I am only able to scrape the first page, not the other pages. Here is my Selenium code:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

chromeOptions = webdriver.ChromeOptions()
chromeOptions.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(executable_path='C:/Users/ptiwar34/Documents/chromedriver.exe',
                          chrome_options=chromeOptions,
                          desired_capabilities=chromeOptions.to_capabilities())
driver.get('https://www.unspsc.org/search-code/default.aspx?CSS=51%&Type=desc&SS%27=')
WebDriverWait(driver, 20).until(EC.staleness_of(driver.find_element_by_xpath("//td/a[text()='2']")))
driver.find_element_by_xpath("//td/a[text()='2']").click()
numLinks = len(WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//td/a[text()='2']"))))
print(numLinks)
for i in range(numLinks):
    print("Perform your scraping here on page {}".format(str(i+1)))
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//td/a[text()='2']/span//following::span[1]"))).click()
driver.quit()
Here is the HTML content:
<td><span>1</span></td>
<td><a href="javascript:__doPostBack('dnn$ctr1535$UNSPSCSearch$gvDetailsSearchView','Page$2')" style="color:#333333;">2</a></td>
This throws an error:
raise TimeoutException(message, screen, stacktrace)
TimeoutException
Solution

To scrape the website
https://www.unspsc.org/search-code/default.aspx?CSS=51%&Type=desc&SS%27=
using Selenium, you can use the following locator strategy:

Code Block:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("start-maximized")
driver = webdriver.Chrome(options=chrome_options, executable_path=r'C:\WebDrivers\chromedriver.exe')
driver.get("https://www.unspsc.org/search-code/default.aspx?CSS=51%&Type=desc&SS%27=%27")
while True:
    try:
        WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//table[contains(@id, 'UNSPSCSearch_gvDetailsSearchView')]//tr[last()]//table//span//following::a[1]"))).click()
        print("Clicked for next page")
    except TimeoutException:
        print("No more pages")
        break
driver.quit()
Console Output:
Clicked for next page
Clicked for next page
Clicked for next page
.
.
.
Explanation: If you observe the HTML DOM, the page numbers are within a <table> with a dynamic id attribute containing the text UNSPSCSearch_gvDetailsSearchView. Further, the page numbers sit in the last <tr>, which has a child <table>. Within that child table, the current page number is inside a <span>, which holds the key. So, to click() on the next page number, you just need to identify the following <a> tag with index [1]. Finally, as the element triggers javascript:__doPostBack(), you have to induce WebDriverWait for the desired element_to_be_clickable().
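The structure the explanation describes can be sanity-checked offline against the fragment posted in the question. A minimal sketch using only the standard library (ElementTree's XPath subset lacks the following:: axis used in the answer, so the sketch walks the sibling <td> cells instead, and the fragment is wrapped in a <tr> to make it parseable):

```python
import xml.etree.ElementTree as ET

# The pager fragment from the question, wrapped in a <tr> so it parses as one element.
fragment = (
    "<tr>"
    "<td><span>1</span></td>"
    "<td><a href=\"javascript:__doPostBack("
    "'dnn$ctr1535$UNSPSCSearch$gvDetailsSearchView','Page$2')\" "
    "style=\"color:#333333;\">2</a></td>"
    "</tr>"
)
row = ET.fromstring(fragment)
cells = list(row)  # the <td> elements in document order

# The current page is the cell holding a <span>; the next page is the first <a> after it.
current_index = next(i for i, td in enumerate(cells) if td.find("span") is not None)
next_link = cells[current_index + 1].find("a")
print(next_link.text)  # 2
print(next_link.get("href").startswith("javascript:__doPostBack"))  # True
```

This mirrors the locator's intent — anchor on the <span> holding the current page, then take the first following <a> — without needing a browser.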
You can find a detailed discussion in How do I wait for a JavaScript __doPostBack call through Selenium and WebDriver.
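Relatedly, the control name and page argument that __doPostBack submits are visible in plain text in the href, which can be handy for logging which page a click is about to request. A minimal sketch, assuming hrefs shaped like the fragment in the question (the parse_dopostback helper name is my own illustration, not part of the answer):

```python
import re

def parse_dopostback(href):
    """Extract the (target, argument) pair from a javascript:__doPostBack(...) href."""
    m = re.search(r"__doPostBack\s*\('([^']*)','([^']*)'\)", href)
    return m.groups() if m else None

# The next-page link from the question's HTML fragment:
href = "javascript:__doPostBack('dnn$ctr1535$UNSPSCSearch$gvDetailsSearchView','Page$2')"
target, argument = parse_dopostback(href)
print(target)    # dnn$ctr1535$UNSPSCSearch$gvDetailsSearchView
print(argument)  # Page$2
```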