How to use Selenium in Databricks, access and move downloaded files to mounted storage, and keep Chrome and ChromeDriver versions in sync?

Views: 14 Published: 2022/4/11 14:59:36 Tags: python selenium pyspark databricks azure-databricks

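Question

I've seen a couple of posts on using Selenium in Databricks, using %sh to install the Chrome driver and Chrome. This works fine for me, but I had a lot of trouble when I needed to download a file. The file would download, but I could not find it in the Databricks file system. Even when I changed the download path, when instantiating Chrome, to a mounted folder on Azure Blob Storage, the file was not placed there after downloading. There is also the problem of keeping the Chrome browser and ChromeDriver versions in sync automatically, without changing version numbers by hand.

The following links show people with the same problem but no clear answer: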

https://forums.databricks.com/questions/19376/if-my-notebook-downloads-a-file-from-a-website-by.html

https://forums.databricks.com/questions/45388/selenium-in-databricks-with-add-experimental-optio.html

Is there a way to identify where the file gets downloaded in Azure Databricks when I do web automation using Selenium Python?

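Some people struggle to get Selenium running properly at all: https://forums.databricks.com/questions/14814/selenium-in-databricks.html

Not-in-PATH errors: https://webcache.googleusercontent.com/search?q=cache:NrvVKo4LLdIJ:https://stackoverflow.com/questions/57904372/cannot-get-selenium-webdriver-to-work-in-azure-databricks+&cd=5&hl=en&ct=clnk&gl=us

Is there a clear guide to using Selenium on Databricks and managing downloaded files? And how can I keep the Chrome browser and ChromeDriver versions in sync automatically?

Recommended answer

Here is a guide to installing Selenium, Chrome, and ChromeDriver. It also moves a file to mounted storage after it has been downloaded via Selenium. Each numbered step should be in its own cell.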

1. Install Selenium.

%pip install selenium

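2. Make the imports.

import pickle as pkl
from datetime import datetime  # used for the timestamps printed in init_chrome_browser
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service  # wraps the ChromeDriver path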

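3. Download the latest ChromeDriver to /tmp/ on the driver node's local file system. The curl command retrieves the latest ChromeDriver release number and stores it in the version variable. The $ only needs escaping if the command is embedded in an unquoted heredoc or string.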

%sh
# Look up the latest ChromeDriver release number, then download that build
version=$(curl -sS https://chromedriver.storage.googleapis.com/LATEST_RELEASE)
wget https://chromedriver.storage.googleapis.com/${version}/chromedriver_linux64.zip -O /tmp/chromedriver_linux64.zip

4. Unzip the file into a new folder under /tmp/ on the driver node's local file system. I tried a non-root path and it did not work.

%sh
unzip /tmp/chromedriver_linux64.zip -d /tmp/chromedriver/

5. Download and install the latest Chrome.

%sh
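# Register Google's signing key ("-" tells apt-key to read the key from stdin)
sudo curl -sS -o - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add -
# Add the Chrome repository to the apt sources
sudo echo "deb https://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google-chrome.list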
sudo apt-get -y update
sudo apt-get -y install google-chrome-stable
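
Since Chrome and ChromeDriver are each fetched as "latest" from separate sources, it may be worth confirming that they ended up on the same major version before launching a browser. The following is a minimal sketch of such a check, not part of the original answer: the helper names `major_version` and `versions_in_sync` are hypothetical, and the commented usage assumes the `google-chrome` binary is on the PATH and that ChromeDriver was unzipped to /tmp/chromedriver/.

```python
import re


def major_version(version_output: str) -> int:
    """Extract the major version from text such as 'Google Chrome 100.0.4896.75'."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_output)
    if match is None:
        raise ValueError(f"No version number found in: {version_output!r}")
    return int(match.group(1))


def versions_in_sync(chrome_output: str, driver_output: str) -> bool:
    """Return True when Chrome and ChromeDriver report the same major version."""
    return major_version(chrome_output) == major_version(driver_output)


# Hypothetical usage on the cluster (binary name and path are assumptions):
# import subprocess
# chrome = subprocess.run(["google-chrome", "--version"],
#                         capture_output=True, text=True).stdout
# driver = subprocess.run(["/tmp/chromedriver/chromedriver", "--version"],
#                         capture_output=True, text=True).stdout
# if not versions_in_sync(chrome, driver):
#     raise RuntimeError(f"Version mismatch: {chrome.strip()} vs {driver.strip()}")
```

Comparing only the major version is a deliberate simplification; the build and patch numbers of the two packages usually differ slightly even when they are compatible.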

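Steps 3-5 (downloading ChromeDriver, unzipping it, and installing Chrome) can be combined into a single command. You can also use the cell below to create a shell script and use it as an init script that configures your cluster. This is especially useful for job clusters that run on transient clusters, because init scripts apply to all worker nodes, not just the driver node. The script also installs Selenium, which lets you skip step 1. Paste it into one cell of a new notebook, run it, then point your init script to dbfs:/init/init_selenium.sh. From then on, whenever a cluster or transient cluster spins up, Chrome, ChromeDriver, and Selenium are installed on all worker nodes before the job starts running.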

%sh
# dbfs:/init/init_selenium.sh
mkdir -p /dbfs/init
# Quote the heredoc delimiter so $(...) and ${version} are written literally
# and only evaluated when the init script runs on each node
cat > /dbfs/init/init_selenium.sh <<'EOF'
#!/bin/sh
echo Install Chrome and Chrome driver
version=$(curl -sS https://chromedriver.storage.googleapis.com/LATEST_RELEASE)
wget https://chromedriver.storage.googleapis.com/${version}/chromedriver_linux64.zip -O /tmp/chromedriver_linux64.zip
unzip /tmp/chromedriver_linux64.zip -d /tmp/chromedriver/
sudo curl -sS -o - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add -
sudo echo "deb https://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google-chrome.list
sudo apt-get -y update
sudo apt-get -y install google-chrome-stable
pip install selenium
EOF
cat /dbfs/init/init_selenium.sh

6. Configure your storage account. This example uses Azure Blob Storage with ADLS Gen2.

# Service principal credentials used for OAuth against the storage account
service_principal_id = "YOUR_SP_ID"
service_principal_key = "YOUR_SP_KEY"
tenant_id = "YOUR_TENANT_ID"
directory = "https://login.microsoftonline.com/" + tenant_id + "/oauth2/token"
configs = {"fs.azure.account.auth.type": "OAuth",
           "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
           "fs.azure.account.oauth2.client.id": service_principal_id,
           "fs.azure.account.oauth2.client.secret": service_principal_key,
           "fs.azure.account.oauth2.client.endpoint": directory,
           "fs.azure.createRemoteFileSystemDuringInitialization": "true"}

7. Configure your mount location and mount it.

mount_point = "/mnt/container-data/"
mount_point_main = "/dbfs/mnt/container-data/"
container = "container-data"
storage_account = "adlsgen2"
storage = "abfss://"+ container +"@"+ storage_account + ".dfs.core.windows.net"
utils_folder = mount_point + "utils/selenium/"
raw_folder = mount_point + "raw/"

# Compare against each MountInfo.mountPoint, ignoring any trailing slash
if not any(m.mountPoint.rstrip("/") == mount_point.rstrip("/") for m in dbutils.fs.mounts()):
  dbutils.fs.mount(
    source = storage,
    mount_point = mount_point,
    extra_configs = configs)
  print(mount_point + " has been mounted.")
else:
  print(mount_point + " was already mounted.")
print(f"Utils folder: {utils_folder}")
print(f"Raw folder: {raw_folder}")

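8. Create a method for instantiating the Chrome browser. I needed to load a cookies file that I placed in my utils folder, which points to mnt/container-data/utils/selenium. Make sure the arguments are the same (--no-sandbox, --headless, --disable-dev-shm-usage).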

def init_chrome_browser(download_path, chrome_driver_path, cookies_path, url):
    """
    Instantiates a Chrome browser.

    Parameters
    ----------
    download_path : str
        The download path to place files downloaded from this browser session.
    chrome_driver_path : str
        The path of the ChromeDriver executable binary.
    cookies_path : str
        The path of the cookie file to load in (.pkl file).
    url : str
        The URL address of the page to initially load.

    Returns
    -------
    Browser
        Returns the instantiated browser object.
    """
    
    options = Options()
    prefs = {'download.default_directory' : download_path}
    options.add_experimental_option('prefs', prefs)
    options.add_argument('--no-sandbox')
    options.add_argument('--headless')
    options.add_argument('--disable-dev-shm-usage')
    options.add_argument('--start-maximized')
    options.add_argument('--window-size=2560,1440')
    print(f"{datetime.now()}    Launching Chrome...")
    browser = webdriver.Chrome(service=Service(chrome_driver_path), options=options)
    print(f"{datetime.now()}    Chrome launched.")
    browser.get(url)
    print(f"{datetime.now()}    Loading cookies...")
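    # Use a context manager so the cookie file is closed after loading
    with open(cookies_path, "rb") as cookie_file:
        cookies = pkl.load(cookie_file)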
    for cookie in cookies:
        browser.add_cookie(cookie)
    browser.get(url)
    print(f"{datetime.now()}    Cookies loaded.")
    print(f"{datetime.now()}    Browser ready to use.")
    return browser
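
The function above assumes that a pickled cookies file already exists. As a rough sketch of how such a file could be produced and read back (the helper names `save_cookies` and `load_cookies` are hypothetical and not from the original answer), the pair below relies only on the WebDriver methods `get_cookies()` and `add_cookie()`. It assumes the browser has already navigated to the target site, since cookies can only be added for the domain that is currently loaded.

```python
import pickle as pkl


def save_cookies(browser, cookies_path):
    """Pickle the current session's cookies (a list of dicts) to cookies_path."""
    with open(cookies_path, "wb") as cookie_file:
        pkl.dump(browser.get_cookies(), cookie_file)


def load_cookies(browser, cookies_path):
    """Add every pickled cookie to the browser and return how many were added."""
    with open(cookies_path, "rb") as cookie_file:
        cookies = pkl.load(cookie_file)
    for cookie in cookies:
        browser.add_cookie(cookie)
    return len(cookies)
```

One possible workflow is to log in once from an interactive session, call `save_cookies`, and upload the resulting file to the utils folder on the mounted storage.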

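9. Instantiate the browser. Set the download location to /tmp/downloads on the driver node's local file system. Make sure the cookies path has /dbfs in front, so that the full cookies path looks like /dbfs/mnt/...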

browser = init_chrome_browser(
    download_path="/tmp/downloads",
    chrome_driver_path="/tmp/chromedriver/chromedriver",
    cookies_path="/dbfs"+ utils_folder + "cookies.pkl",
    url="YOUR_URL"
)
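
The /dbfs prefix is needed because ordinary Python file APIs such as `open()` see DBFS only through the local /dbfs mount, while `dbutils` and Spark address the same location as /mnt/... or dbfs:/mnt/.... A small illustrative helper for that conversion is sketched below; `dbfs_to_local` is a hypothetical name, not a Databricks API.

```python
def dbfs_to_local(path: str) -> str:
    """Map a DBFS path ('/mnt/...' or 'dbfs:/mnt/...') to its '/dbfs/...' local form."""
    if path.startswith("dbfs:"):
        path = path[len("dbfs:"):]
    if path.startswith("/dbfs/"):
        return path  # already in local form
    return "/dbfs/" + path.lstrip("/")
```

With this helper, the cookies path above could be written as `dbfs_to_local(utils_folder + "cookies.pkl")`.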

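10. Do your navigating and any downloads you need.
11. OPTIONAL: Examine your download location. In this example, I downloaded a CSV file and search through the downloaded folder until I find that file format.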

import os
import os.path
for root, directories, filenames in os.walk('/tmp'):
    print(root)
    if any(".csv" in s for s in filenames):
        print(filenames)
        break
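
Downloads are asynchronous, so the file may not be complete when the next cell runs. While a download is in progress, Chrome writes to a temporary file ending in .crdownload. The sketch below polls the download folder until at least one file is present and no .crdownload files remain. The function name `wait_for_downloads` and its default timeout are assumptions for illustration, not part of the original answer.

```python
import os
import time


def wait_for_downloads(download_dir, timeout=60.0, poll_interval=0.5):
    """Block until download_dir holds finished files and no '.crdownload' partials.

    Returns the sorted list of finished file names, or raises TimeoutError.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        names = os.listdir(download_dir) if os.path.isdir(download_dir) else []
        in_progress = [n for n in names if n.endswith(".crdownload")]
        finished = [n for n in names if not n.endswith(".crdownload")]
        if finished and not in_progress:
            return finished if len(finished) < 2 else sorted(finished)
        time.sleep(poll_interval)
    raise TimeoutError(f"No completed download in {download_dir} after {timeout} seconds")
```

Calling `wait_for_downloads("/tmp/downloads")` right after triggering the download would then return the file names to copy in the next step.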

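12. Copy the file from /tmp on the driver node's local file system to your mounted storage (/mnt/container-data/raw/). You can also rename it during this operation. When using dbutils, the local file system is only accessible with the file: prefix.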

dbutils.fs.cp("file:/tmp/downloads/file1.csv", f"{raw_folder}file2.csv")
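
When several files are downloaded, the same copy can be repeated per file. The following sketch builds the (source, destination) pairs with the file: prefix described above. `build_copy_plan` is a hypothetical helper, and the commented loop assumes it runs in a Databricks notebook where `dbutils` and `raw_folder` are defined.

```python
import os


def build_copy_plan(download_dir, dest_folder, extension=".csv"):
    """Return (source, destination) pairs for every matching file in download_dir.

    Sources carry the 'file:' prefix so dbutils reads from the local file system.
    """
    plan = []
    for name in sorted(os.listdir(download_dir)):
        if name.endswith(extension):
            source = "file:" + os.path.join(download_dir, name)
            destination = dest_folder.rstrip("/") + "/" + name
            plan.append((source, destination))
    return plan


# Hypothetical usage in a notebook cell:
# for source, destination in build_copy_plan("/tmp/downloads", raw_folder):
#     dbutils.fs.cp(source, destination)
```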
