Read XML using PySpark in Jupyter notebook


Problem Description


I am trying to read an XML file: df = spark.read.format('com.databricks.spark.xml').load('/path/to/my.xml') and I am getting the following error:

java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml

I've tried to:

  • install spark-xml with

    $SPARK_HOME/bin/spark-shell --packages com.databricks:spark-xml_2.12:0.10.0
    
    

  • Run Spark with config:

    jar_path = f'{SPARK_HOME}/jars/spark-xml_2.12-0.10.0.jar'
    spark = SparkSession.builder.config(conf=conf) \
        .config("spark.jars", jar_path) \
        .config("spark.executor.extraClassPath", jar_path) \
        .config("spark.executor.extraLibrary", jar_path) \
        .config("spark.driver.extraClassPath", jar_path) \
        .appName('my_app').getOrCreate()

  • Set env variables: os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.12:0.10.0 pyspark'

  • Download the jar file and put it into SPARK_HOME/jars

Here: https://github.com/databricks/spark-xml there is an alternative solution for PySpark in the paragraph "Pyspark notes", but I can't figure out how to read a dataframe in order to pass it into the function ext_schema_of_xml_df.

So, what else should I do to read XML with PySpark in JupyterLab?

Solution

As you've surmised, the thing is to get the package loaded such that PySpark will use it in your context in Jupyter.

Start your notebook with your regular imports:

import pandas as pd
from pyspark.sql import SparkSession
import os

Before you instantiate your session, do:

os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.12:0.12.0 pyspark-shell'

Notes:

  • the first part of the package version has to match the version of Scala that your Spark was built with - you can find this out by running spark-submit --version from the command line, e.g.:

$ spark-submit --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.2
      /_/
                        
Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 1.8.0_292
Branch HEAD
Compiled by user centos on 2021-02-16T06:09:22Z
Revision 648457905c4ea7d00e3d88048c63f360045f0714
Url https://gitbox.apache.org/repos/asf/spark.git
Type --help for more information.

The second part of the package version just has to be what has been provided for the given version of Scala - you can find that here: https://github.com/databricks/spark-xml - so in my case, since I had Spark built with Scala 2.12, the package I needed was com.databricks:spark-xml_2.12:0.12.0

Now instantiate your session:

# Creates a session on a local master
sparkSesh = SparkSession.builder.appName("XML_Import") \
    .master("local[*]").getOrCreate()
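
Before going further, you can double-check that the --packages request actually reached the session. spark-submit translates --packages into the spark.jars.packages setting, and the JVM can report the Scala version it is running. A small sketch; note that _jvm is Spark's internal py4j gateway, so treat this as a convenience check rather than a public API:

# empty output means the env variable was set after the gateway was already up -
# restart the kernel and set PYSPARK_SUBMIT_ARGS before creating the session
print(sparkSesh.sparkContext.getConf().get('spark.jars.packages', ''))

# prints e.g. 'version 2.12.10' -> use the spark-xml_2.12 artifact
print(sparkSesh.sparkContext._jvm.scala.util.Properties.versionString())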

Find a simple .xml file whose structure you know - in my case I used the XML version of nmap output

thisXML = "simple.xml"

The reason for that is so that you can provide appropriate values for 'rootTag' and 'rowTag' below:

someXSDF = sparkSesh.read.format('xml') \
        .option('rootTag', 'nmaprun') \
        .option('rowTag', 'host') \
        .load(thisXML)
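
It's worth printing the inferred schema right after the read; spark-xml builds the schema from the elements under rowTag, so this is the quickest way to see whether the tags matched anything:

# an empty (or nearly empty) schema usually means the rowTag didn't match
someXSDF.printSchema()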

If the file is small enough, you can just do a .toPandas() to review it:

someXSDF.toPandas()[["address", "ports"]][:5]
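
If the frame is too large to pull to the driver with .toPandas(), a preview on the Spark side works just as well (same columns as above):

# show a handful of rows without collecting the whole dataframe
someXSDF.select("address", "ports").show(5, truncate=False)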

Then close the session.

sparkSesh.stop()

Closing Notes:

  • if you want to test this outside of Jupyter, just go to the command line and do

pyspark --packages com.databricks:spark-xml_2.12:0.12.0

you should see it load up properly in the PySpark shell

  • if the package version doesn't match up with the Scala version, you might get this error: "Exception: Java gateway process exited before sending its port number", which is a pretty funny way of explaining that a package version number is wrong
  • if you've loaded the wrong package for the version of Scala that was used to build your Spark, you'll likely get this error when you try to read the XML: py4j.protocol.Py4JJavaError: An error occurred while calling o43.load. : java.lang.NoClassDefFoundError: scala/Product$class
  • if the read seems to work but you get an empty dataframe, you probably specified the wrong root tag and/or row tag
  • if you need to support multiple read types (let's say you also needed to be able to read Avro files in the same notebook), you would list multiple packages with commas (no spaces) separating them, like so (there is a short Avro example after these notes):

os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.12:0.12.0,org.apache.spark:spark-avro_2.12:3.1.2 pyspark-shell'

  • My version info: Python 3.6.9, Spark 3.0.2
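
As a follow-up to the multi-package note above: with both packages on the PYSPARK_SUBMIT_ARGS list, a session created the same way can read Avro as well. A minimal sketch - the file path here is hypothetical:

# spark-avro registers the short name 'avro' as a data source
someAvroDF = sparkSesh.read.format('avro').load('/path/to/some_records.avro')
someAvroDF.printSchema()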
