Creating a dataframe from a specific schema: StructField starting with a capital letter

Views: 107  Published: 2021/4/13 20:25:59  Tags: python pyspark schema azure-databricks pyspark-dataframes

Question

Apologies for the lengthy post for a seemingly simple curiosity, but I wanted to give full context...

In Databricks, I am creating a "row" of data based on a specific schema definition, and then inserting that row into an empty dataframe (also based on the same schema).

The schema definition is as follows:

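from pyspark.sql.types import (
    ArrayType, DoubleType, LongType, StringType, StructField, StructType
)

myschema_xb = StructType(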
  [
    StructField("_xmlns", StringType(), True),
    StructField("_Version", DoubleType(), True),
    StructField("MyIds",
      ArrayType(
        StructType(
          [
            StructField("_ID", StringType(), True),
            StructField("_ID_Context", StringType(), True),
            StructField("_Type", LongType(), True),
          ]
        ),
        True
      ),
      True
    ),
  ]
)

The row entry is therefore:

from pyspark.sql import Row

myRow = Row(
    _xmlns="http://some.where.com",
    _Version=12.3,
    MyIds=[
        Row(
          _ID="XY",
          _ID_Context="Exxwhy",
          _Type=9
        ),
        Row(
          _ID="9152",
          _ID_Context="LNUMB",
          _Type=21
        ),
    ]
)

Lastly, the Databricks notebook code is:

mydf = spark.createDataFrame(sc.emptyRDD(), myschema_xb)
rows = [myRow]
rdf = spark.createDataFrame(rows, myschema_xb)
appended = mydf.union(rdf)

The call to rdf = spark.createDataFrame(rows, myschema_xb) raises this exception:

ValueError: Unexpected tuple 'h' with StructType.

Now the part I am curious about: if I change the element MyIds to myIds (i.e. lowercase the first letter), the code works, and my new dataframe (appended) has the single row of data.

What does this exception mean, and why does it go away when I change the case of my element?

(FYI, our Databricks runtime environment is Scala 2.11)

Thanks.

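Recommended answer

The problem most likely comes from how Row orders its fields when it is created with named arguments. From the documentation:

  Row can be used to create a row object by using named arguments, the fields will be sorted by names.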

In myschema_xb, the three columns are defined in the order [_xmlns, _Version, MyIds]. When you define myRow with the keys (_xmlns, _Version, MyIds), the actual Row object generated will be:

Row(MyIds=[Row(_ID='XY', _ID_Context='Exxwhy', _Type=9), Row(_ID='9152', _ID_Context='LNUMB', _Type=21)], _Version=12.3, _xmlns='http://some.where.com')

Here MyIds has moved to the first column, because uppercase letters sort before the underscore in ASCII order. The Row no longer matches the schema, and the error is raised. With the lowercase column name myIds, the keys in the Row object are sorted as ['_Version', '_xmlns', 'myIds']: myIds lands in the right column, but _Version and _xmlns are swapped. This does not raise an error, since simple data types can pass through the type conversion, but the resulting dataframe is incorrect.
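
The ordering effect can be reproduced without Spark. The sketch below is only an illustration in plain Python: sorted_row_values is a hypothetical stand-in for a Row built from named arguments (it is not part of PySpark), and it assumes the sort-by-name behaviour quoted from the documentation above. It pairs each schema position with the value that would end up there.

```python
# Schema field order, as declared in myschema_xb.
schema_order = ["_xmlns", "_Version", "MyIds"]


def sorted_row_values(named_values):
    """Hypothetical stand-in for a Row built from named arguments:
    returns the values ordered by sorted field name."""
    return [named_values[name] for name in sorted(named_values)]


# Inner rows are shown as plain tuples to keep the example Spark-free.
ids = [("XY", "Exxwhy", 9), ("9152", "LNUMB", 21)]

upper = {"_xmlns": "http://some.where.com", "_Version": 12.3, "MyIds": ids}
lower = {"_xmlns": "http://some.where.com", "_Version": 12.3, "myIds": ids}

# Field names after sorting: 'M' sorts before '_', which sorts before 'm'.
upper_names = sorted(upper)
lower_names = sorted(lower)

# Pair each schema position with the value that lands in it.
upper_pairing = list(zip(schema_order, sorted_row_values(upper)))
lower_pairing = list(
    zip(["_xmlns", "_Version", "myIds"], sorted_row_values(lower))
)

print(upper_names)
print(upper_pairing)
print(lower_names)
print(lower_pairing)
```

In the uppercase case the MyIds position is paired with the URL string. The 'h' in the error message is presumably the first character of that string, reached when the array of structs is processed element by element. In the lowercase case the array stays in its slot and only the two scalar fields are swapped, which would explain why no exception appears.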
To overcome this issue, set up Row-like classes with an explicit key order, so that the order of fields matches exactly the order in your schema:
from pyspark.sql import Row

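# Row classes with a fixed field order that matches myschema_xb
MyOuterRow = Row('_xmlns', '_Version', 'MyIds')
MyInnerRow = Row('_ID', '_ID_Context', '_Type')

# Values are passed positionally, in the same order as the field names above
myRow = MyOuterRow(
    "http://some.where.com",
    12.3,
    [
        MyInnerRow("XY", "Exxwhy", 9),
        MyInnerRow("9152", "LNUMB", 21)
    ]
)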
print(myRow)
#Row(_xmlns='http://some.where.com', _Version=12.3, MyIds=[Row(_ID='XY', _ID_Context='Exxwhy', _Type=9), Row(_ID='9152', _ID_Context='LNUMB', _Type=21)])

rdf = spark.createDataFrame([myRow], schema=myschema_xb)
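
When the data starts out as dictionaries, another option is to reorder the values yourself and pass plain tuples in schema order. The following is a minimal sketch under that assumption. The helper to_schema_order is hypothetical, not a PySpark function, and the field lists are typed by hand here (in a notebook they could be read from the names attribute of the StructType).

```python
def to_schema_order(record, field_names):
    """Return the values of `record` as a tuple ordered like `field_names`.

    Hypothetical helper for illustration; it is not part of PySpark.
    """
    missing = [name for name in field_names if name not in record]
    if missing:
        raise KeyError(f"record is missing fields: {missing}")
    return tuple(record[name] for name in field_names)


# Field orders copied by hand from myschema_xb (assumed to match the schema).
outer_fields = ["_xmlns", "_Version", "MyIds"]
inner_fields = ["_ID", "_ID_Context", "_Type"]

# Key order in the dictionaries does not matter; the helper fixes it.
ids = [
    to_schema_order({"_Type": 9, "_ID": "XY", "_ID_Context": "Exxwhy"}, inner_fields),
    to_schema_order({"_ID_Context": "LNUMB", "_Type": 21, "_ID": "9152"}, inner_fields),
]
row = to_schema_order(
    {"MyIds": ids, "_Version": 12.3, "_xmlns": "http://some.where.com"},
    outer_fields,
)
print(row)

# With a live SparkSession, the tuple could then be passed positionally:
# rdf = spark.createDataFrame([row], schema=myschema_xb)
```

Because the tuples carry no field names, nothing is left to be sorted, and each value is matched to the schema by position only.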

