Trying to use Python and Selenium to scroll and scrape a webpage iteratively


Problem Description



I recently asked a question (referenced here: Python Web Scraping (Beautiful Soup, Selenium and PhantomJS): Only scraping part of full page) that helped to identify a problem I had with scraping all the contents of a page that dynamically updates as one scrolls. However, I am still unable to wrangle my code to point to the correct element using selenium and scroll down the page iteratively. I also found that, when I manually scroll down the page in question, some of the content that was present when the page loaded disappears as the new content updates. For example, look at the image below...

I have targeted the container with the data I am trying to scrape below (highlighted in blue).

First off, I am having trouble selecting the right element to scroll down the page, as I have never had to do this before. I believe I have to use selenium to target the container and then use the `execute_script` function to scroll down the page, because this table is embedded within the body of the web page. However, I can't seem to get that to work:

    scroll = driver.find_element_by_class_name("ag-body-viewport")
    driver.execute_script("arguments[0].scrollIntoView();", scroll)

Second, once I have the ability to scroll, I will need to scroll down a little at a time and scrape iteratively. What I mean is that, if you look at the image, you will see a bunch of 'div' tags inside of the container. For example... when the page loads and I pass the html to BeautifulSoup, I can scrape the first 40 rows. If I scroll down, say, 40 rows, I will then pass rows 40-80 to BeautifulSoup, and rows 1-40 will no longer be available, as the data has dynamically updated...

Long story short, what I want is to be able to scrape all the content in the image provided, then use selenium to scroll down roughly 40 rows, scrape the next 40, then scroll down and scrape the next 40, and so on... Any tips on how to get selenium to scroll in this embedded container, and how to go about scrolling down iteratively in order to capture all the data in the container as it dynamically updates while you scroll? Any extra help will be much appreciated.

Solution

From what I see in the screenshot, it looks like you need to iteratively scroll the last row of the table into view - the last element with the ag-row class:

import time

while True:
    rows = driver.find_elements_by_css_selector("tr.ag-row")
    driver.execute_script("arguments[0].scrollIntoView();", rows[-1])

    time.sleep(1)

    # TODO: collect the rows

You would also need to figure out the loop exit condition.
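One way to sketch both pieces at once - collecting rows across scrolls and deciding when to stop - is to accumulate row text keyed by each row's `row-index` attribute (which ag-Grid typically renders on its `ag-row` divs; treat that as an assumption and verify it in your page's HTML), and exit once a few consecutive scrolls turn up nothing new. The helper name `collect_rows` and the stall-counting heuristic are illustrative, not part of the original answer, and the snippet uses the older `find_elements_by_css_selector` API matching the code above (newer Selenium versions use `find_elements(By.CSS_SELECTOR, ...)`):

```python
import time

def collect_rows(driver, pause=1.0, max_stalls=3):
    """Scroll an ag-Grid table and accumulate row text across scrolls.

    Assumes each rendered row is a 'div.ag-row' carrying a numeric
    'row-index' attribute, so rows recycled out of the DOM are not lost.
    """
    seen = {}        # row-index -> row text
    stalls = 0       # consecutive scrolls that found no new rows
    while stalls < max_stalls:
        rows = driver.find_elements_by_css_selector("div.ag-row")
        new = 0
        for row in rows:
            idx = row.get_attribute("row-index")
            if idx not in seen:
                seen[idx] = row.text
                new += 1
        if new == 0:
            stalls += 1          # nothing new: possibly at the bottom
        else:
            stalls = 0
        if rows:
            # scroll the last visible row into view to load the next batch
            driver.execute_script("arguments[0].scrollIntoView();", rows[-1])
        time.sleep(pause)        # give the grid time to render new rows
    return [seen[k] for k in sorted(seen, key=int)]
```

The stall counter doubles as the exit condition: once the grid stops producing unseen row indices for `max_stalls` scrolls in a row, the loop assumes it has reached the bottom and returns the rows in order.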
