Web scraping with Selenium not capturing full text


Question

I'm trying to mine quite a bit of text from a list of links using Selenium/Python.

In this example, I scrape only one of the pages and that successfully grabs the full text:

from selenium import webdriver

page = 'https://xxxxxx.net/xxxxx/September%202020/2020-09-24'

driver = webdriver.Firefox()
driver.get(page)

elements = driver.find_element_by_class_name('text').text
elements

Then, when I try to loop through the whole list of links (all the by-day links on this page: https://overrustlelogs.net/Destinygg%20chatlog/September%202020), using the same method that worked for grabbing the text from a single page, it is not grabbing the full text:

for i in tqdm(chat_links):
    driver.get(i)
    # driver.implicitly_wait(200)
    elements = driver.find_element_by_class_name('text').text
    # Also tried:
    # elements = driver.find_element_by_xpath('/html/body/main/div[1]/div[1]').text
    temp = {'elements': elements}
    chat_text.append(temp)

driver.close()

chat_text

My thought is that maybe it doesn't have the chance to load the whole thing, but it works on the single page. Also, the driver.get method seems meant to load the whole given page.

Any ideas? Thanks, much appreciated.

Answer

The page is lazy loading: you need to scroll the page and add the data to a list as it appears.
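Stripped of Selenium, the core idea is "scroll to the bottom, compare the new page height with the old one, and stop once it no longer grows." A plain-Python sketch of that loop shape, using a hypothetical stand-in page object instead of a real browser:

```python
class FakeLazyPage:
    """Stand-in for a lazily loading page: each scroll may grow the height."""
    def __init__(self, heights):
        self._heights = iter(heights)
        self.height = next(self._heights)  # initial height

    def scroll_to_bottom(self):
        # Loading more content increases the height; otherwise it stays put.
        self.height = next(self._heights, self.height)

def scroll_until_stable(page):
    """Scroll repeatedly until the reported height stops changing."""
    height = page.height
    while True:
        page.scroll_to_bottom()
        if page.height == height:
            return height
        height = page.height

page = FakeLazyPage([100, 250, 400, 400])
print(scroll_until_stable(page))  # 400
```

The real Selenium version below follows the same shape, with `document.body.scrollHeight` playing the role of `page.height`.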

import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://overrustlelogs.net/Destinygg%20chatlog/September%202020/2020-09-30")

# Wait until the first chat line is rendered.
WebDriverWait(driver, 10).until(
    EC.visibility_of_element_located((By.CSS_SELECTOR, ".text>span")))

height = driver.execute_script("return document.body.scrollHeight")
data = []
while True:
    driver.execute_script("window.scrollTo(0,document.body.scrollHeight)")
    time.sleep(1)  # give the lazy loader a moment to append new lines
    for item in driver.find_elements_by_css_selector(".text>span"):
        if item.text not in data:
            data.append(item.text)

    # Stop once scrolling no longer increases the page height.
    lastheight = driver.execute_script("return document.body.scrollHeight")
    if height == lastheight:
        break
    height = lastheight

print(data)
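One small note on the membership check: `item.text not in data` scans the whole list on every line, which gets slow for long logs. A set gives O(1) lookups while a separate list keeps the original order. A plain-Python sketch, independent of Selenium:

```python
def dedup_preserve_order(lines):
    """Collect lines in first-seen order, skipping repeats, with O(1) lookups."""
    seen = set()
    out = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            out.append(line)
    return out

# Successive scroll passes re-read overlapping chat lines:
passes = ["a", "b", "b", "c", "a", "d"]
print(dedup_preserve_order(passes))  # ['a', 'b', 'c', 'd']
```

Note that this treats identical chat lines as duplicates; if the log can legitimately repeat the same line, dedup by element rather than by text.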
