How to scrape ID-less website elements with XPath-only regex patterns

Problem description

There are several similar questions related to the usage of regex in XPath searches -- however, some are not very illuminating to me, whereas others fail for my specific problem. Therefore, and for future users who might come across the same issue, I post the following question:

Using one call in Python/Selenium, I want to be able to scrape all elements below at once (for readability without code formatting):

/html/body/div[6]/div/div[1]/div/div[3]/div[2]/div[2]/div[**1**]/div/div[2]/div[1]
/html/body/div[6]/div/div[1]/div/div[3]/div[2]/div[2]/div[**2**]/div/div[2]/div[1]
/html/body/div[6]/div/div[1]/div/div[3]/div[2]/div[2]/div[**3**]/div/div[2]/div[1]
/html/body/div[6]/div/div[1]/div/div[3]/div[2]/div[2]/div[**4**]/div/div[2]/div[1]
/html/body/div[6]/div/div[1]/div/div[3]/div[2]/div[2]/div[**5**]/div/div[2]/div[1]
/html/body/div[6]/div/div[1]/div/div[3]/div[2]/div[2]/div[**6**]/div/div[2]/div[1]

Note that the number of matching elements varies between target websites (it can be more than 6, but is at least one) and that the associated elements do not have a specific ID assigned (which, as I understand it, excludes many solutions explained elsewhere on StackOverflow).

What I am looking for is something like:

website = driver.get(URL)
html = WebDriverWait(driver, 1).until(EC.presence_of_element_located((By.XPATH, "/html/body/div[6]/div/div[1]/div/div[3]/div[2]/div[2]/div[[0-9]{1}]/div/div[2]/div[1]", regex = True)))

What doesn't work is:

website = driver.get(URL)
html = WebDriverWait(driver, 1).until(EC.presence_of_element_located((By.XPATH, "/html/body/div[6]/div/div[1]/div/div[3]/div[2]/div[2]/div[matchers['[0-9]{1}']]/div/div[2]/div[1]")))
TimeoutException: Message: 
Screenshot: available via screen

How can I scrape all ID-less website elements whose XPath matches a regex pattern in Python + Selenium?

Solution

You don't want a regex for this; you want the predicate [position()<=6].
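
The predicate can be handed to Selenium directly. Below is a minimal sketch of how that could look, assuming the page structure from the question; the driver setup, the 10-second timeout, and the pre-defined URL variable are assumptions for the example, not part of the original answer.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # assumed driver setup
driver.get(URL)              # URL assumed to be defined elsewhere, as in the question

# [position()<=6] keeps the first six matching sibling <div> elements;
# dropping the positional predicate altogether ("div" instead of "div[...]")
# matches however many such siblings exist, which suits a variable count.
xpath = ("/html/body/div[6]/div/div[1]/div/div[3]/div[2]/div[2]"
         "/div[position()<=6]/div/div[2]/div[1]")

elements = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.XPATH, xpath))
)
print(len(elements))  # one list entry per matching element

Unlike presence_of_element_located, presence_of_all_elements_located returns a list of matches, so all elements are collected in a single call.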
