WebScrape A Field - Selenium/BeautifulSoup

Question

Reposting, as the question still seems to be outstanding.

A website has a few rows of titles. Some of these titles (the ones shown in blue), when clicked, expand and show a few more titles. Attached is an example.

My goal is to perform a scrape and pull all the titles, dates, and times, and, if possible, the header for each row (an example for line 1 is where it says "On-demand").

Current code (it has consistency issues and cannot gather all the drop-down fields):

from selenium import webdriver
from bs4 import BeautifulSoup
import time
driver = webdriver.Chrome()
driver.get('https://cslide.ctimeetingtech.com/esmo2021/attendee/confcal/session/list')
time.sleep(4)
page_source = driver.page_source
soup = BeautifulSoup(page_source,'html.parser')

new_titles = set()

productlist=driver.find_elements_by_xpath("//div[@class='card item-container session']")
for property in productlist:
    sessiontitle=property.find_element_by_xpath(".//h4[@class='session-title card-title']").text
    print(sessiontitle)
    ifDropdown=driver.find_elements_by_xpath(".//*[@class='item-expand-action expand']")
    if(ifDropdown):
        ifDropdown[0].click()
        time.sleep(8)
        open_titles = driver.find_elements_by_class_name('card-title')
        for open_title in open_titles:
            title = open_title.text
            if(title not in new_titles):
                print(title)
                time.sleep(4)
                new_titles.add(title)
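
As an aside, a likely cause of the inconsistency: ifDropdown is searched from driver rather than from the current row (property), so the same page-wide match gets clicked on every pass. Below is a minimal row-scoped sketch using the current Selenium 4 API (same class names and waits as above; untested against the live page):

from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Chrome()
driver.get('https://cslide.ctimeetingtech.com/esmo2021/attendee/confcal/session/list')
time.sleep(4)

new_titles = set()
for row in driver.find_elements(By.XPATH, "//div[@class='card item-container session']"):
    print(row.find_element(By.XPATH, ".//h4[@class='session-title card-title']").text)
    # Scope the dropdown lookup to this row, not the whole page
    dropdown = row.find_elements(By.XPATH, ".//*[@class='item-expand-action expand']")
    if dropdown:
        dropdown[0].click()
        time.sleep(2)
        # Re-read the titles inside this row after it has expanded
        for open_title in row.find_elements(By.CLASS_NAME, 'card-title'):
            title = open_title.text
            if title and title not in new_titles:
                print(title)
                new_titles.add(title)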

Answer

I have pulled the data you need using requests and BeautifulSoup, without Selenium.

This prints all the data you need, including the data from the drop-downs.

import bs4 as bs
import requests


def scrape_sub_lists(s_url):
    # Fetch the expanded (drop-down) page and print each presentation in it
    resp = requests.get(s_url)
    soup = bs.BeautifulSoup(resp.text, 'html.parser')
    main_div = soup.find('div', class_='item-content')
    divs = main_div.findAll('div', class_='card presentation')
    print('\n***** Sublist Data *****\n')
    for i in divs:
        print(i.find('span', attrs={'title': 'Session Name'}).text)
        print(i.find('h4', class_='card-title').text.strip())
        print(i.find('div', class_='details property-auto-width').find('div', class_='property').text)
        print('\n\n')
    print('\n***** End of Sublist Data *****\n')


url = 'https://cslide.ctimeetingtech.com/esmo2021/attendee/confcal/session/list?p=1'
resp = requests.get(url)
soup = bs.BeautifulSoup(resp.text, 'html.parser')

divs = soup.findAll('div', class_='card item-container session')
print(len(divs))


for i in divs:
    # Header: the location if present, otherwise the session type
    head = i.find('span', attrs={'title': 'Location'})
    if head is None:
        head = i.find('span', attrs={'title': 'Session Type'})
    header = head.text.strip()
    title = i.find('h4', class_='session-title card-title')
    title_name = title.text.strip()
    date = i.find('div', class_='internal_date').find('div', class_='property').text
    time = i.find('div', class_='internal_time').find('div', class_='property').text

    print(f'{header}\n{title_name}\n{date}\n{time}\n\n')

    # Expandable (blue) titles carry a link whose href leads to the drop-down content
    a_exists = title.find('a', attrs={'class': 'item-expand-action expand'})
    if a_exists:
        scrape_sub_lists(a_exists['href'].strip())
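
Note that the URL above only fetches the first page of the listing (p=1). If every page is wanted, one possible extension (assuming later pages use the same markup and that an empty result list marks the end of the listing; untested against the live site) is to loop over the p parameter:

import bs4 as bs
import requests

page = 1
while True:
    resp = requests.get(f'https://cslide.ctimeetingtech.com/esmo2021/attendee/confcal/session/list?p={page}')
    soup = bs.BeautifulSoup(resp.text, 'html.parser')
    divs = soup.findAll('div', class_='card item-container session')
    if not divs:  # assumption: an empty page signals the end of the listing
        break
    for i in divs:
        # Same per-row extraction as above; titles only, for brevity
        print(i.find('h4', class_='session-title card-title').text.strip())
    page += 1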

See the sample output below. The contents between ***** Sublist Data ***** and ***** End of Sublist Data ***** contain the data from the drop-down of the item above it.

Sample Output

On-demand
Educational sessions on-demand
Thu, 16.09.2021
08:30 - 09:40


On-demand
Special Symposia on-demand
Thu, 16.09.2021
12:30 - 13:40


On-demand
Multidisciplinary sessions on-demand
Thu, 16.09.2021
16:30 - 17:40


Channel 3
Illumina - Diagnosing Non-Small Cell Lung Cancer using Comprehensive Genomic Profiling
Fri, 17.09.2021
08:45 - 10:15

***** Sublist Data *****

Industry Satellite Symposium
Illumina gives an update on their IVD road map
08:45 - 08:50

Industry Satellite Symposium
The impact of Comprehensive Genomic Profiling
08:50 - 09:01

Industry Satellite Symposium
A day in the life of a pathologist using Comprehensive Genomic Profiling
09:01 - 09:29

Industry Satellite Symposium
Dealing with complexity through Comprehensive Genomic Profiling
09:29 - 09:57

Industry Satellite Symposium
Q & A (Live)
09:57 - 10:15

***** End of Sublist Data *****
