Trying to use Python and Selenium to scroll and scrape a webpage iteratively


Problem description


I recently asked a question (referenced here: Python Web Scraping (Beautiful Soup, Selenium and PhantomJS): Only scraping part of full page) that helped to identify a problem I had with scraping all the contents of a page that dynamically updates as one scrolls. However, I am still unable to wrangle my code to point to the correct element using Selenium and scroll down the page iteratively. I also found that, when I manually scroll down the page in question, some of the content that was present when the page loaded disappears as the new content updates. For example, look at the image below...

I have targeted the container with the data I am trying to scrape below (highlighted in blue).

First off, I am having trouble selecting the right element to scroll down the page, as I have never had to do this before. I believe I would have to use Selenium to target the container and then use the execute_script function to scroll down the page, because this table is embedded within the body of the web page. However, I can't seem to get that to work.

    scroll = driver.find_element_by_class_name("ag-body-viewport")
    driver.execute_script("arguments[0].scrollIntoView();", scroll)
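One likely reason the snippet above does nothing (my reading, not stated in the original answer): `scrollIntoView()` scrolls the page so that the viewport *element* becomes visible, but it does not move that element's own internal scrollbar. A sketch of nudging the container's inner scrollbar instead, assuming the standard ag-Grid `ag-body-viewport` class name; the helper name and the 400-pixel step are mine:

```python
def scroll_inner_js(pixels):
    # JavaScript that moves an element's *internal* scrollbar down by
    # `pixels`, rather than scrolling the element into view on the page.
    return "arguments[0].scrollTop += %d;" % pixels

# With Selenium (assumes an ag-Grid table is already loaded in `driver`):
# viewport = driver.find_element_by_class_name("ag-body-viewport")
# driver.execute_script(scroll_inner_js(400), viewport)
```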

Second, once I have the ability to scroll, I will need to scroll down a little at a time and scrape iteratively. What I mean is that, if you look in the image you will see a bunch of 'div' tags inside of the

For example... when the page loads, I pass the html to BeautifulSoup and can scrape the first 40 rows. If I scroll down, say, 40 rows, I will then pass rows 40 - 80 to BeautifulSoup, and rows 1 - 40 will no longer be available, as the data has dynamically updated...

Long story short, what I want is to be able to scrape all the content in the image provided, then use Selenium to scroll down roughly 40 rows, scrape the next 40, then scroll down and scrape the next 40, and so on... Any tips on how to get Selenium to scroll in this embedded container, and on how one would go about scrolling down iteratively in order to capture all the data in the container as it dynamically updates while scrolling? Any extra help will be much appreciated.

Solution

From what I see on the screenshot, it looks like you need to iteratively scroll the last row of the table into view - the last element with the ag-row class:

import time

while True:
    # grab whatever rows are currently rendered in the grid
    rows = driver.find_elements_by_css_selector("tr.ag-row")

    # scroll the last rendered row into view, triggering the next batch
    driver.execute_script("arguments[0].scrollIntoView();", rows[-1])

    time.sleep(1)  # give the grid a moment to render the new rows

    # TODO: collect the rows

You would also need to figure out the loop exit condition.
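One way to handle both the "rows disappear as you scroll" behaviour and the loop exit condition is to deduplicate rows as you collect them and stop once scrolling stops surfacing anything unseen. Below is a browser-free sketch of that loop: the two Selenium interactions are injected as callables, the function name and the `max_idle` heuristic are mine (not from the original answer), and it assumes row texts are unique enough to key on:

```python
import time

def scrape_all_rows(get_row_texts, scroll_to_last, pause=0.0, max_idle=3):
    """Scroll-and-collect loop with a concrete exit condition.

    get_row_texts  -- returns the texts of the rows currently rendered; with
                      Selenium this could be:
                      lambda: [r.text for r in
                               driver.find_elements_by_css_selector("tr.ag-row")]
    scroll_to_last -- scrolls the last rendered row into view, as in the
                      answer's snippet.
    Stops after max_idle consecutive scrolls that reveal no unseen rows,
    i.e. once the grid has stopped producing new data.
    """
    collected, seen, idle = [], set(), 0
    while idle < max_idle:
        # keep only rows we have not recorded yet
        new = [t for t in get_row_texts() if t not in seen]
        if new:
            collected.extend(new)
            seen.update(new)
            idle = 0
        else:
            idle += 1
        scroll_to_last()
        time.sleep(pause)  # use ~1s with a real browser so the grid re-renders
    return collected
```

With a real driver you would pass `pause=1` and the two lambdas above; the dedup set is what keeps rows 1 - 40 even after the grid recycles them out of the DOM.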

