Python Selenium webscraping of Tableau Public: how to assign favourites to workbook?

Question

I have written my first Selenium script to practise webscraping in Python. The idea is to scrape all workbooks, views and favourites from a Tableau Public profile. I managed to extract those three key variables, but I don't know how to assign favourites to their respective workbooks since not all workbooks have at least one favourite.

For example, "Skyler on Broadway" has no favourites, but if I were to match workbooks and favourites in a dictionary, it would pull in the next available value, namely 4.

f.text != "" only removes empty values at the end of the list.
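
The mismatch described above can be reproduced without a browser. The sketch below uses made-up sample data (the workbook names other than "Skyler on Broadway" and all counts are hypothetical) and assumes that a flat, page-wide search returns no favourites element at all for a workbook that has none, which is what the behaviour above suggests:

```python
# Hypothetical sample data standing in for the scraped element texts.
titles = ["Workbook A", "Skyler on Broadway", "Workbook C"]

# A page-wide search yields one entry fewer than there are workbooks,
# because the workbook without favourites contributes no element.
favs = ["7", "4"]

# zip pairs items purely by position, so the value that belongs to
# "Workbook C" is attached to "Skyler on Broadway" instead.
paired = dict(zip(titles, favs))
print(paired)
```

Pairing by position only works when every workbook contributes exactly one entry to every list, which is why the lookup has to happen per workbook rather than per page.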

What's the best way to approach this problem?

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time

driver = webdriver.Chrome(executable_path=r',mypath')

driver.get("https://public.tableau.com/profile/skybjohnson#!/")

#load entire website:

while True:

   try:
       show_more = WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.ID, "load-more-vizzes")))
       driver.find_element_by_id("load-more-vizzes")
       driver.execute_script("window.scrollTo(0,document.body.scrollHeight)")
       WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.ID, "load-more-vizzes")))

   except Exception as e:
       print(e)
       break

#get workbook titles
titles = driver.find_elements_by_class_name("workbook-title")

workbook_titles = [i.text for i in titles if i.text != ""]
print(workbook_titles)

#get number of views per workbook
views = driver.find_elements_by_class_name('workbook-view-count')

workbook_views = [int(v.text.split()[0]) for v in views if v.text != ""]
print(workbook_views)

#get number of favourites per workbook
favs = driver.find_elements_by_xpath('//SPAN[@ng-bind="controller.workbook.numberOfFavorites"]')

workbook_favs = [f.text for f in favs if f.text != ""]
print(workbook_favs)

Recommended answer

First, get all the vizzes, then read the title, views and favorites from the child elements of each viz. You also have to check whether the view count and favorites elements exist. The code below uses an improved scroll and gets the view count (0 if there are no views) and the favorites (0 if there are no favorites):

# Assumes the imports from the question's script and a `driver` created as shown there.
wait = WebDriverWait(driver, 10)
with driver:  # the browser is quit when this block exits
    driver.get("https://public.tableau.com/profile/skybjohnson#!/")

    # Keep scrolling the "load more" button into view until it is no longer displayed.
    wait.until(EC.presence_of_element_located((By.ID, "load-more-vizzes")))
    while driver.find_element(By.ID, "load-more-vizzes").is_displayed():
        driver.execute_script("arguments[0].scrollIntoView()", driver.find_element(By.ID, "load-more-vizzes"))

    # One element per workbook; title, views and favorites are looked up inside each one.
    vizzes = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".viz-container li.media-viz")))
    for viz in vizzes:
        if not viz.is_displayed():
            continue

        title = viz.find_element(By.CSS_SELECTOR, '[ng-bind="controller.workbook.title"]').text

        # find_elements returns an empty list when nothing matches, so fall back to 0.
        views_count_list = viz.find_elements(By.CSS_SELECTOR, '[ng-bind="controller.workbook.viewCount"]')
        views_count = views_count_list[0].text if views_count_list else 0

        number_of_favorites_list = viz.find_elements(By.CSS_SELECTOR, '[ng-bind="controller.workbook.numberOfFavorites"]')
        number_of_favorites = number_of_favorites_list[0].text if number_of_favorites_list else 0

        print(title, views_count, number_of_favorites)
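
The loop above prints the values as they come back, so a count is either a text string or the integer 0. If numeric values are needed, as in the question's `workbook_views` list, a small helper can normalise both cases. This is only a sketch: `parse_count` and `build_record` are hypothetical names, and the sample strings are assumptions about what the page text might look like, not output taken from the site.

```python
import re


def parse_count(text):
    """Return the first integer found in text, or 0 if there is none."""
    # Accepts strings such as "1,234 views", plain "4", "", None or the int 0.
    match = re.search(r"\d[\d,]*", str(text or ""))
    return int(match.group().replace(",", "")) if match else 0


def build_record(title, views_text, favs_text):
    """Bundle one workbook's values so they can never drift apart."""
    return {
        "title": title,
        "views": parse_count(views_text),
        "favourites": parse_count(favs_text),
    }


# Hypothetical values, shaped like what the loop above would produce.
records = [
    build_record("Workbook A", "1,234 views", "7"),
    build_record("Skyler on Broadway", "56 views", 0),
]
print(records)
```

Keeping one dictionary per workbook means a missing favourites element only affects that workbook's own record.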
