Selenium/Webscrape this field


Problem Description

My code runs fine and prints the title for every row except the rows that have a dropdown.

For example, row 4 has a dropdown that opens when clicked. I added a try block that should, in theory, trigger the dropdown and then pull the titles inside it.

But when I call click() and try to print, the rows with these dropdowns print nothing.

Expected output: print all titles, including the ones inside the dropdowns.

from selenium import webdriver
from bs4 import BeautifulSoup
import time
driver = webdriver.Chrome()
driver.get('https://cslide.ctimeetingtech.com/esmo2021/attendee/confcal/session/list')
time.sleep(4)
page_source = driver.page_source
soup = BeautifulSoup(page_source,'html.parser')

productlist=soup.find_all('div',class_='card item-container session')
for property in productlist:
    sessiontitle=property.find('h4',class_='session-title card-title').text
    print(sessiontitle)
    try:
        ifDropdown=driver.find_elements_by_class_name('item-expand-action expand')
        ifDropdown.click()
        time.sleep(4)
        newTitle=driver.find_element_by_class_name('card-title').text
        print(newTitle)
    except:
        newTitle='none'
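
For reference, a minimal sketch of how the click-then-reparse idea could be written with current Selenium selectors: a compound class like item-expand-action expand needs a CSS selector rather than find_elements_by_class_name, and find_elements returns a list that has to be iterated. This assumes the expand control toggles the card in place, which is not guaranteed on this site (the accepted answer below avoids clicking entirely); selector names are taken from the question.

from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import time

driver = webdriver.Chrome()
driver.get('https://cslide.ctimeetingtech.com/esmo2021/attendee/confcal/session/list')
time.sleep(4)

# Click every expand control first; '.item-expand-action.expand' is a compound
# class, so use a CSS selector and iterate over the list of matches.
for toggle in driver.find_elements(By.CSS_SELECTOR, '.item-expand-action.expand'):
    try:
        toggle.click()
        time.sleep(1)
    except Exception:
        pass  # element not clickable; skip and keep going

# Re-parse the page once, after everything is expanded.
soup = BeautifulSoup(driver.page_source, 'html.parser')
for card in soup.find_all('div', class_='card item-container session'):
    titles = [t.get_text(strip=True) for t in card.select('.card-title')]
    if not titles:
        continue
    # The first .card-title is the session itself; any others would be sub-items
    # revealed by the expanded dropdown (assumption).
    print(titles[0])
    for sub in titles[1:]:
        print('   ', sub)

driver.quit()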

Solution

The expand control on each card is a link whose href points to a page listing that session's sub-titles, so the dropdown content can be fetched directly with requests instead of driving a browser. Cards without such a link make my_filter hit the TypeError branch and are marked N/A.

import requests
from bs4 import BeautifulSoup
import pandas as pd


def get_soup(content):
    return BeautifulSoup(content, 'lxml')


def my_filter(req, content):
    # 'content' is the card's expand link, or None if the card has no dropdown.
    # Follow its href and collect every .card-title except the first, which
    # repeats the session title itself.
    try:
        r = req.get(content['href'])
        soup = get_soup(r.text)
        return [x.text for x in soup.select('.card-title')[1:]]
    except TypeError:
        # select_one returned None: this session has no dropdown.
        return 'N/A'


def main(url):
    with requests.Session() as req:
        for page in range(1, 2):
            print(f"Extracting Page# {page}\n")
            params = {
                "p": page
            }
            r = req.get(url, params=params)
            soup = get_soup(r.text)
            # Map each session title to the titles hidden behind its expand link.
            goal = {x.select_one('.session-title').text: my_filter(
                req, x.select_one('.item-expand-action')) for x in soup.select('.card')}
        df = pd.DataFrame(goal.items(), columns=['Title', 'Menu'])
        print(df)


main('https://cslide.ctimeetingtech.com/esmo2021/attendee/confcal/session/list')

Output:

                                                Title                                               Menu
0                      Educational sessions on-demand                                                N/A
1                          Special Symposia on-demand                                                N/A
2                Multidisciplinary sessions on-demand                                                N/A
3   Illumina - Diagnosing Non-Small Cell Lung Canc...  [Illumina gives an update on their IVD road ma...
4   MSD - Homologous Recombination Deficiency: BRC...  [Welcome and Introductions, Homologous Recombi...
5   Servier - The clinical value of IDH inhibition...  [Isocitric dehydrogenase: an actionable geneti...
6   AstraZeneca - Redefining Breast Cancer – Biolo...  [Welcome and Opening, Redefining Breast Cancer...
7   ITM Isotopen Technologien München AG - A Globa...  [Welcome & Introduction, Changes in the Incide...
8   MSD - The Role of Biomarkers in Patient Manage...  [Welcome and Introductions, The Role of Pd-L1 ...
9   AstraZeneca - Re-evaluating the role of gBRCA ...  [Welcome and introduction, What do we know abo...
10  Novartis - Unmet needs in oncogene-driven NSCL...  [Welcome and introduction, Unmet needs in onco...
11                                    Opening session                                                N/A
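
A small optional follow-up: if a flat, one-sub-title-per-row table is easier to work with downstream, the Menu lists can be exploded with pandas. This assumes main() is changed to end with return df instead of print(df), and the output file name is only an example.

# Assumes main() ends with `return df` rather than `print(df)`.
df = main('https://cslide.ctimeetingtech.com/esmo2021/attendee/confcal/session/list')
flat = df.explode('Menu').reset_index(drop=True)   # 'N/A' rows pass through unchanged
flat.to_csv('esmo2021_sessions.csv', index=False)  # example file name
print(flat.head())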
