Extracting text from a website using selenium


Question

I am trying to find a way to extract a book's summary from its Goodreads page. I have tried Beautiful Soup and Selenium, unfortunately to no avail.

Link: https://www.goodreads.com/book/show/67896.Tao_Te_Ching?from_search=true&from_srp=true&qid=D19iQu7KWI&rank=1

Code:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import requests
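link = 'https://www.goodreads.com/book/show/67896.Tao_Te_Ching?from_search=true&from_srp=true&qid=D19iQu7KWI&rank=1'
# The original snippet never created the driver; Chrome is assumed from the session info in the error below
driver = webdriver.Chrome()
driver.get(link)
Description = driver.find_element_by_xpath("//div[contains(text(),'TextContainer')]")
# first TextContainer contains the summary of the book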
book_page = requests.get(link)
soup = BeautifulSoup(book_page.text, "html.parser")
print(soup)
Container = soup.find('class', class_='leftContainer')
print(Container)

Error:

Container is empty, plus:

NoSuchElementException: no such element: Unable to locate element: {"method":"xpath","selector":"//div[contains(text(),'TextContainer')]"} (Session info: chrome=83.0.4103.116)

Recommended answer

You can get the description like this:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
...
driver.get("https://www.goodreads.com/book/show/67896.Tao_Te_Ching?from_search=true&from_srp=true&qid=D19iQu7KWI&rank=1")
description = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'div#description span[style="display:none"]'))
)
print(description.get_attribute('textContent'))

I used a CSS selector to get the specific hidden span that contains the full description. I also used an explicit wait to give the element time to load.
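For reference, the snippet above can be wrapped into a self-contained function. This is only a sketch under several assumptions: Selenium and a matching ChromeDriver are installed, the page still uses the `div#description` markup that the answer targets, and the names `fetch_description` and `normalize_whitespace` are hypothetical helpers introduced here for illustration. It also shows why `textContent` is read instead of `.text`: Selenium's `.text` returns only rendered text, so a `display:none` element yields an empty string.

```python
import re


def normalize_whitespace(text):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def fetch_description(url, timeout=10):
    """Load `url` in Chrome and return the text of the hidden description span.

    Assumes Selenium and a matching ChromeDriver are installed, and that the
    page still uses the markup targeted by the selector from the answer.
    """
    # Imported here so normalize_whitespace stays usable without Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Explicit wait: poll for up to `timeout` seconds until the span is in the DOM.
        span = WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located(
                (By.CSS_SELECTOR, 'div#description span[style="display:none"]')
            )
        )
        # The span is hidden, so .text would be empty; textContent ignores visibility.
        return normalize_whitespace(span.get_attribute("textContent"))
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()
```

Calling `fetch_description(link)` with the link from the question should return the summary as a single line of text, provided the page layout has not changed. If the wait times out, Selenium raises a `TimeoutException`, which usually means the selector no longer matches the page.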
