Web scraping with Selenium not capturing full text
Question
I'm trying to mine quite a bit of text from a list of links using Selenium/Python.
In this example, I scrape only one of the pages, and that successfully grabs the full text:
from selenium import webdriver

page = 'https://xxxxxx.net/xxxxx/September%202020/2020-09-24'
driver = webdriver.Firefox()
driver.get(page)
elements = driver.find_element_by_class_name('text').text
elements
Then, when I try to loop through the whole list of links (all the by-day links on this page: https://overrustlelogs.net/Destinygg%20chatlog/September%202020), using the same method that worked for grabbing the text from a single page, it does not grab the full text:
for i in tqdm(chat_links):
    driver.get(i)
    #driver.implicitly_wait(200)
    elements = driver.find_element_by_class_name('text').text
    #elements = driver.find_element_by_xpath('/html/body/main/div[1]/div[1]').text
    #elements = elements.text
    temp = {'elements': elements}
    chat_text.append(temp)

driver.close()
chat_text
My thought is that maybe the pages don't get the chance to load fully, but then it works on the single page. Also, the driver.get method seems meant to load the entire given page.
Any ideas? Thanks, much appreciated.
Answer
The page lazy-loads its content: you need to keep scrolling the page and collect the data into a list as new chat lines appear.
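The pattern behind the fix is a scroll-until-stable loop: scroll to the bottom, wait, and stop once document.body.scrollHeight stops growing. A minimal sketch of just that loop (the browser interactions are abstracted into callables so the shape is clear; the helper name is illustrative, not part of Selenium):

```python
def scroll_until_stable(get_height, scroll_to_bottom, wait):
    """Keep scrolling until the reported page height stops growing.

    get_height, scroll_to_bottom and wait stand in for the
    driver.execute_script / time.sleep calls in the real code below.
    """
    height = get_height()
    while True:
        scroll_to_bottom()
        wait()
        new_height = get_height()
        if new_height == height:  # nothing new was lazy-loaded
            return height
        height = new_height

# Simulated page that grows twice, then stabilises.
heights = iter([100, 200, 300, 300])
final = scroll_until_stable(lambda: next(heights), lambda: None, lambda: None)
print(final)  # → 300
```

In the full Selenium version below, the data collection happens inside the same loop, between the scroll and the height check.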
import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://overrustlelogs.net/Destinygg%20chatlog/September%202020/2020-09-30")
# Wait until at least one chat line is rendered.
WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".text>span")))
height = driver.execute_script("return document.body.scrollHeight")
data = []
while True:
    # Scroll to the bottom so the next batch of lines lazy-loads.
    driver.execute_script("window.scrollTo(0,document.body.scrollHeight)")
    time.sleep(1)
    # Collect any lines we haven't seen yet.
    for item in driver.find_elements_by_css_selector(".text>span"):
        if item.text not in data:
            data.append(item.text)
    lastheight = driver.execute_script("return document.body.scrollHeight")
    # If the height stopped growing, everything has loaded.
    if height == lastheight:
        break
    height = lastheight
print(data)
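One caveat: `item.text in data` scans the whole list on every check, which gets slow as the log grows to thousands of lines. A small variation (pure Python; the helper name is illustrative) keeps a set alongside the list, so membership tests stay O(1) while the original order is preserved:

```python
def append_unique(items, seen_list, seen_set):
    """Append items to seen_list in order, skipping ones already seen.

    seen_set mirrors seen_list and makes the membership check O(1).
    """
    for text in items:
        if text not in seen_set:
            seen_set.add(text)
            seen_list.append(text)

data, seen = [], set()
append_unique(["hi", "hello", "hi"], data, seen)
append_unique(["hello", "bye"], data, seen)
print(data)  # → ['hi', 'hello', 'bye']
```

Inside the scroll loop, you would pass `[item.text for item in driver.find_elements_by_css_selector(".text>span")]` as `items` on each pass.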