My script parses all the links again and again from an infinite scrolling webpage


Problem Description

I've written a script using Python in combination with Selenium to get all the company links from a webpage which doesn't display all of its links until it is scrolled to the bottom. However, when I run my script, I get the desired links, but lots of duplicates are scraped along with them. At this point, I can't figure out how to modify my script to get only unique links. Here is what I've tried so far:

from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Chrome()
driver.get('http://fortune.com/fortune500/list/')
while True:  # note: this loop never breaks, so driver.close() is unreachable
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(3)

    # re-reads every list item on each pass, so already-printed links repeat
    for items in driver.find_elements(By.XPATH, "//li[contains(concat(' ', @class, ' '), ' small-12 ')]"):
        item = items.find_elements(By.XPATH, './/a')[0]
        print(item.get_attribute("href"))

driver.close()

Recommended Answer

I don't know Python, but I do know what you are doing wrong. Hopefully you'll be able to figure out the code for yourself ;)

Every time you scroll down, 50 links are added to the page until there are 1000 links. Well, almost... it starts with 20 links, then adds 30, and then 50 each time until there are 1000.
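The amount of duplication that loading pattern causes can be illustrated with plain arithmetic (assuming, as described above, batches of 20, then 30, then 50 at a time up to 1000 links):

```python
# Simulate how many hrefs the original loop prints in total when it
# re-prints every link currently on the page after each scroll.
batch_sizes = [20, 30] + [50] * 19  # 20 + 30 + 19*50 = 1000 links in all

loaded = 0
total_printed = 0
for batch in batch_sizes:
    loaded += batch          # new links appended by this scroll
    total_printed += loaded  # the loop prints *all* loaded links again

print(loaded)         # 1000 unique links on the page
print(total_printed)  # far more lines printed than there are unique links
```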

The way your code prints now:

The 1st 20 links.

The 1st 20 again + the next 30.

The 1st 50 + the next 50.

And so on...

What you actually want to do is just scroll down the page until you have all the links on the page, and only then print them. Hope that helps.

Here is the updated Python code (I've checked it and it works):

from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Chrome()
driver.get('http://fortune.com/fortune500/list/')

# keep scrolling until the full list of 1000 links has loaded
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)
    listElements = driver.find_elements(By.XPATH, "//li[contains(concat(' ', @class, ' '), ' small-12 ')]//a")
    print(len(listElements))
    if len(listElements) == 1000:
        break

# only now print the links, so each appears exactly once
for item in listElements:
    print(item.get_attribute("href"))

driver.close()
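A different fix, not the one this answer takes, is to deduplicate the hrefs as they are collected, using a set that preserves first-seen order. A minimal Selenium-free sketch of the idea (the helper name is made up for illustration):

```python
# Hypothetical helper: keep only the first occurrence of each href,
# preserving the order in which the links were first seen.
def unique_links(hrefs):
    seen = set()
    out = []
    for href in hrefs:
        if href not in seen:
            seen.add(href)
            out.append(href)
    return out

# Links re-scraped across scroll passes collapse to one copy each.
print(unique_links(["/a", "/b", "/a", "/b", "/c"]))  # ['/a', '/b', '/c']
```

This keeps the scraping loop unchanged and pushes the duplicate handling into a single pass at the end, at the cost of holding every scraped href in memory.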

If you want it to work a bit faster, you could swap out the "time.sleep(5)" for Anderson's wait statement.
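The explicit wait referred to above (Selenium's WebDriverWait) essentially polls a condition until it holds or a timeout expires, instead of always sleeping a fixed interval. The underlying idea can be sketched in plain Python, with hypothetical names, independent of any browser:

```python
import time

def wait_until(condition, timeout=10.0, poll=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = condition()
        if result:
            return result  # stop as soon as the condition holds
        time.sleep(poll)
    raise TimeoutError("condition not met within %.1fs" % timeout)

# In the scraper above, the condition would be something like
# lambda: len(driver.find_elements(...)) == 1000 — here we fake it:
counter = {"n": 0}
def fake_condition():
    counter["n"] += 1
    return counter["n"] >= 3  # becomes true on the third poll

print(wait_until(fake_condition, timeout=5.0, poll=0.01))
```

This is why an explicit wait is faster than a fixed sleep: it returns the moment the links have loaded rather than always waiting the full interval.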

