Getting all href from a code


Problem description

I'm making a web-crawler. To find the links on a page, I was using XPath in Selenium:

from selenium import webdriver

driver = webdriver.Firefox()
driver.get(side)  # 'side' holds the URL of the page to crawl
Listlinker = driver.find_elements_by_xpath("//a")

This worked fine. Testing the crawler, however, I found that not all links come under the a tag; href is sometimes used in area or div tags as well.

I am now stuck with:

driver = webdriver.Firefox()
driver.get(side)
# One query per tag type that might carry an href
Listlinkera = driver.find_elements_by_xpath("//a")
Listlinkerdiv = driver.find_elements_by_xpath("//div")
Listlinkerarea = driver.find_elements_by_xpath("//area")

which really puts the crawl in web-crawler.

I've tried the xpath "//@href", but that doesn't work. I've also tried several ways to get all href URLs in an efficient manner, using both Beautiful Soup and lxml, but so far to no avail. I'm sorry I don't have any code to show for my efforts with Beautiful Soup and lxml; as these proved useless, I deleted them, which isn't the smartest practice, I know. I have now started saving these unsuccessful attempts for my own sake, in case I ever want to try again and want to know what went wrong the first time.
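For what it's worth, a minimal sketch of how such an attempt might look with Beautiful Soup, assuming the rendered HTML is taken from driver.page_source (the deleted code is not shown in the question, so this is only an illustration):

from bs4 import BeautifulSoup

# Parse the rendered page source from Selenium (assumes the page has
# already been loaded with driver.get(side) as in the snippets above).
soup = BeautifulSoup(driver.page_source, "html.parser")

# Collect the href attribute of every element that has one,
# regardless of tag name (a, area, div, ...).
hrefs = [tag["href"] for tag in soup.find_all(href=True)]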

Any help I could get on this would be greatly appreciated.

Recommended answer

Try this:

ListlinkerHref = driver.find_elements_by_xpath("//*[@href]")
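This matches every element that carries an href attribute, whatever its tag name. To then get the URL strings themselves, a typical follow-up (not part of the original answer, just a usage sketch) is:

# Extract the href value from every matched element
urls = [element.get_attribute("href") for element in ListlinkerHref]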
