WebScrape A Field - Selenium/BeautifulSoup
Question
Reposting, as the question still seems to be outstanding -
A website has a few rows of titles. Some of these titles (the ones shown in blue), when clicked, expand and show a few more titles. Attached is an example.
My goal is to perform a scrape and pull all the titles, dates, and times. Also, if possible, the header for each row (for line 1 it is where it says "On-demand").
Current code - it has consistency issues and cannot gather all of the drop-down fields.
from selenium import webdriver
from bs4 import BeautifulSoup
import time

driver = webdriver.Chrome()
driver.get('https://cslide.ctimeetingtech.com/esmo2021/attendee/confcal/session/list')
time.sleep(4)

page_source = driver.page_source
soup = BeautifulSoup(page_source, 'html.parser')

new_titles = set()
productlist = driver.find_elements_by_xpath("//div[@class='card item-container session']")
for property in productlist:
    sessiontitle = property.find_element_by_xpath(".//h4[@class='session-title card-title']").text
    print(sessiontitle)
    ifDropdown = driver.find_elements_by_xpath(".//*[@class='item-expand-action expand']")
    if ifDropdown:
        ifDropdown[0].click()
        time.sleep(8)
        open_titles = driver.find_elements_by_class_name('card-title')
        for open_title in open_titles:
            title = open_title.text
            if title not in new_titles:
                print(title)
                time.sleep(4)
                new_titles.add(title)
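One common source of this kind of flakiness is that fixed time.sleep calls race against the page still rendering. Below is a minimal sketch, not a drop-in fix, that waits explicitly for the elements instead; it assumes the same class names used above, and that the expanded rows are rendered as "card presentation" divs as in the answer further down.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://cslide.ctimeetingtech.com/esmo2021/attendee/confcal/session/list')
wait = WebDriverWait(driver, 20)

# Wait until the session cards are present instead of sleeping for a fixed time.
cards = wait.until(EC.presence_of_all_elements_located(
    (By.XPATH, "//div[@class='card item-container session']")))

for card in cards:
    print(card.find_element(By.XPATH, ".//h4[@class='session-title card-title']").text)
    # Expand only the cards that have an expand link (the blue titles).
    expanders = card.find_elements(By.XPATH, ".//*[@class='item-expand-action expand']")
    if expanders:
        expanders[0].click()
        # Assumption: the expanded rows use the 'card presentation' class,
        # as in the accepted answer's selectors below.
        wait.until(EC.presence_of_all_elements_located(
            (By.CSS_SELECTOR, "div.card.presentation")))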
Answer
I have tried pulling the data that you need using beautifulsoup.
This prints all the data you need, including the data from the drop-downs.
import bs4 as bs
import requests


def scrape_sub_lists(s_url):
    resp = requests.get(s_url)
    soup = bs.BeautifulSoup(resp.text, 'html.parser')
    main_div = soup.find('div', class_='item-content')
    divs = main_div.findAll('div', class_='card presentation')

    print('\n***** Sublist Data *****\n')
    for i in divs:
        print(i.find('span', attrs={'title': 'Session Name'}).text)
        print(i.find('h4', class_='card-title').text.strip())
        print(i.find('div', class_='details property-auto-width').find('div', class_='property').text)
        print('\n\n')
    print('\n***** End of Sublist Data *****\n')


url = 'https://cslide.ctimeetingtech.com/esmo2021/attendee/confcal/session/list?p=1'
resp = requests.get(url)
soup = bs.BeautifulSoup(resp.text, 'html.parser')
divs = soup.findAll('div', class_='card item-container session')
print(len(divs))

for i in divs:
    head = i.find('span', attrs={'title': 'Location'})
    if head is None:
        head = i.find('span', attrs={'title': 'Session Type'})
    header = head.text.strip()

    title = i.find('h4', class_='session-title card-title')
    title_name = title.text.strip()

    date = i.find('div', class_='internal_date').find('div', class_='property').text
    time = i.find('div', class_='internal_time').find('div', class_='property').text
    print(f'{header}\n{title_name}\n{date}\n{time}\n\n')

    # Scraping the drop-down data
    a_exists = title.find('a', attrs={'class': 'item-expand-action expand'})
    if a_exists:
        scrape_sub_lists(a_exists['href'].strip())
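The listing URL already carries a p=1 query parameter. If the site paginates the session list through that parameter (an assumption, not verified here), the same parsing could be repeated per page; a minimal sketch:

import bs4 as bs
import requests

# Assumption: the list is paginated via ?p=N and a page with no session cards
# means we have gone past the last page.
page = 1
while True:
    resp = requests.get(f'https://cslide.ctimeetingtech.com/esmo2021/attendee/confcal/session/list?p={page}')
    page_soup = bs.BeautifulSoup(resp.text, 'html.parser')
    page_divs = page_soup.findAll('div', class_='card item-container session')
    if not page_divs:
        break
    print(f'page {page}: {len(page_divs)} sessions')  # parse page_divs as in the loop above
    page += 1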
See the sample output below. The contents between ***** Sublist Data ***** and ***** End of Sublist Data ***** contain the data from the drop-down of the item above it.
Sample Output
On-demand
Educational sessions on-demand
Thu, 16.09.2021
08:30 - 09:40

On-demand
Special Symposia on-demand
Thu, 16.09.2021
12:30 - 13:40

On-demand
Multidisciplinary sessions on-demand
Thu, 16.09.2021
16:30 - 17:40

Channel 3
Illumina - Diagnosing Non-Small Cell Lung Cancer using Comprehensive Genomic Profiling
Fri, 17.09.2021
08:45 - 10:15

***** Sublist Data *****

Industry Satellite Symposium
Illumina gives an update on their IVD road map
08:45 - 08:50

Industry Satellite Symposium
The impact of Comprehensive Genomic Profiling
08:50 - 09:01

Industry Satellite Symposium
A day in the life of a pathologist using Comprehensive Genomic Profiling
09:01 - 09:29

Industry Satellite Symposium
Dealing with complexity through Comprehensive Genomic Profiling
09:29 - 09:57

Industry Satellite Symposium
Q & A (Live)
09:57 - 10:15

***** End of Sublist Data *****
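If the rows are needed in a file rather than printed to the console, the fields from the main loop can be collected and written out with the csv module; a minimal sketch (the file name and column names here are arbitrary choices):

import csv

# rows would be appended inside the scraping loop instead of the print call,
# e.g. rows.append((header, title_name, date, time)); the single row below is
# just an illustration taken from the sample output above.
rows = [('On-demand', 'Educational sessions on-demand', 'Thu, 16.09.2021', '08:30 - 09:40')]

with open('sessions.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['header', 'title', 'date', 'time'])
    writer.writerows(rows)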