Creating a DataFrame with a specific schema: StructField starting with a capital letter

Problem description

Apologies for the lengthy post about a seemingly simple curiosity, but I wanted to give full context...

In Databricks, I am creating a "row" of data based on a specific schema definition, and then inserting that row into an empty dataframe (also based on the same schema).

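The schema definition is as follows:

from pyspark.sql.types import (
    ArrayType, DoubleType, LongType, StringType, StructField, StructType
)

myschema_xb = StructType(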
  [
    StructField("_xmlns", StringType(), True),
    StructField("_Version", DoubleType(), True),
    StructField("MyIds",
      ArrayType(
        StructType(
          [
            StructField("_ID", StringType(), True),
            StructField("_ID_Context", StringType(), True),
            StructField("_Type", LongType(), True),
          ]
        ),
        True
      ),
      True
    ),
  ]
)

The row entry is therefore:

from pyspark.sql import Row

myRow = Row(
    _xmlns="http://some.where.com",
    _Version=12.3,
    MyIds=[
        Row(
          _ID="XY",
          _ID_Context="Exxwhy",
          _Type=9
        ),
        Row(
          _ID="9152",
          _ID_Context="LNUMB",
          _Type=21
        ),
    ]
)

Lastly, the Databricks notebook code is:

mydf = spark.createDataFrame(sc.emptyRDD(), myschema_xb)
rows = [myRow]
rdf = spark.createDataFrame(rows, myschema_xb)
appended = mydf.union(rdf)

The call rdf = spark.createDataFrame(rows, myschema_xb) raises the exception:

ValueError: Unexpected tuple 'h' with StructType.

The part I am curious about is this: if I change the element MyIds to myIds (i.e. lowercase the first letter), the code works, and my new dataframe (appended) contains the single row of data.

What does this exception mean, and why does it go away when I change the case of my element?

(FYI, our Databricks runtime environment is Scala 2.11.)

Thanks.

Recommended answer

The problem most likely comes from how Row orders its fields. The documentation states:

Row can be used to create a row object by using named arguments, the fields will be sorted by names.

In myschema_xb, the three columns are defined in the order [_xmlns, _Version, MyIds]. When you define myRow with the keys (_xmlns, _Version, MyIds), the Row object actually generated is:

Row(MyIds=[Row(_ID='XY', _ID_Context='Exxwhy', _Type=9), Row(_ID='9152', _ID_Context='LNUMB', _Type=21)], _Version=12.3, _xmlns='http://some.where.com')

Here MyIds has moved to the first column, which no longer matches the schema, and this produces the error. With the lowercase column name myIds, the keys in the Row object are sorted as ['_Version', '_xmlns', 'myIds']: myIds lands in the correct column, but _Version and _xmlns are swapped. This does not raise an error, since simple data types can pass through the type casting, but the resulting dataframe is incorrect.
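The ordering effect can be reproduced without Spark. The following is a minimal pure-Python sketch, not PySpark's actual implementation: the helper sorted_row_fields is hypothetical and only imitates the alphabetical sorting of keyword arguments described in the quoted documentation. It also suggests where the 'h' in the error message may come from. Once values are paired with schema fields by position, the URL string ends up in the MyIds slot, and iterating over that string as if it were an array yields 'h' as its first "element". As far as I know, Spark 3.0 and later keep keyword arguments in their declared order, so this would apply only to older runtimes.

```python
# Pure-Python sketch; no Spark required. It imitates the documented
# "fields will be sorted by names" behaviour of a keyword-built Row.
schema_order = ["_xmlns", "_Version", "MyIds"]


def sorted_row_fields(**kwargs):
    """Hypothetical helper: return (name, value) pairs sorted by name."""
    return sorted(kwargs.items())


fields = sorted_row_fields(
    _xmlns="http://some.where.com",
    _Version=12.3,
    MyIds=[("XY", "Exxwhy", 9), ("9152", "LNUMB", 21)],
)

# Uppercase 'M' sorts before '_' in ASCII, so MyIds jumps to the front.
row_order = [name for name, _ in fields]
print(row_order)  # ['MyIds', '_Version', '_xmlns']

# Pair schema fields with row values by position, ignoring the names.
paired = {
    schema_name: value
    for schema_name, (_, value) in zip(schema_order, fields)
}

# The array-of-struct column now holds the URL string; its first
# "element" is the character reported in the error message.
first_element = next(iter(paired["MyIds"]))
print(first_element)  # h

# Lowercase 'm' sorts after '_', so myIds stays last, but the two
# scalar columns still trade places relative to the schema.
lower_order = sorted(["_xmlns", "_Version", "myIds"])
print(lower_order)  # ['_Version', '_xmlns', 'myIds']
```

The inner rows are unaffected because _ID, _ID_Context, _Type is already in alphabetical order, which happens to match the inner schema.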

To work around this, set up a Row-like class with an explicit key order, so that the field order matches the order in your schema exactly:

from pyspark.sql import Row

# Row classes with an explicit field order (positional names are not sorted)
MyOuterRow = Row('_xmlns', '_Version', 'MyIds')
MyInnerRow = Row('_ID', '_ID_Context', '_Type')

# Values are passed positionally, in the same order as the schema fields
myRow = MyOuterRow(
    "http://some.where.com",
    12.3,
    [
        MyInnerRow("XY", "Exxwhy", 9),
        MyInnerRow("9152", "LNUMB", 21)
    ]
)
print(myRow)
#Row(_xmlns='http://some.where.com', _Version=12.3, MyIds=[Row(_ID='XY', _ID_Context='Exxwhy', _Type=9), Row(_ID='9152', _ID_Context='LNUMB', _Type=21)])

rdf = spark.createDataFrame([myRow], schema=myschema_xb)
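As an alternative sketch, the same ordering guarantee can be obtained by arranging plain values in schema order before they reach Spark. The helper values_in_schema_order below is hypothetical and not part of PySpark. It assumes the field names are known as plain lists (in a real notebook they could presumably be read from the schema, for example via its names attribute).

```python
def values_in_schema_order(field_names, record):
    """Hypothetical helper: return record's values as a tuple ordered like field_names."""
    missing = [name for name in field_names if name not in record]
    if missing:
        raise KeyError(f"record is missing fields: {missing}")
    return tuple(record[name] for name in field_names)


# Field orders copied from the schema in the question.
outer_fields = ["_xmlns", "_Version", "MyIds"]
inner_fields = ["_ID", "_ID_Context", "_Type"]

# The dict key order does not matter; the field-name lists decide.
record = {
    "MyIds": [
        {"_Type": 9, "_ID": "XY", "_ID_Context": "Exxwhy"},
        {"_Type": 21, "_ID": "9152", "_ID_Context": "LNUMB"},
    ],
    "_Version": 12.3,
    "_xmlns": "http://some.where.com",
}

# Order the nested structs first, then the outer row.
inner_rows = [values_in_schema_order(inner_fields, r) for r in record["MyIds"]]
row = values_in_schema_order(outer_fields, {**record, "MyIds": inner_rows})
print(row)
# ('http://some.where.com', 12.3, [('XY', 'Exxwhy', 9), ('9152', 'LNUMB', 21)])
```

The resulting tuple could then be passed as spark.createDataFrame([row], schema=myschema_xb). That call is not exercised in this sketch and should be checked against your runtime.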
