Create DataFrame from list of tuples using pyspark


Problem description


I am working with data extracted from SFDC using the simple-salesforce package. I am using Python 3 for scripting and Spark 1.5.2.

I created an RDD containing the following data:

[('Id', 'a0w1a0000003xB1A'), ('PackSize', 1.0), ('Name', 'A')]
[('Id', 'a0w1a0000003xAAI'), ('PackSize', 1.0), ('Name', 'B')]
[('Id', 'a0w1a00000xB3AAI'), ('PackSize', 30.0), ('Name', 'C')]
...

This data is in an RDD called v_rdd.
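For context, data shaped like this might be pulled with simple-salesforce along the following lines. This is only a sketch: the credentials, the Product2 object, and the PackSize__c custom field are placeholders, not details from the original post.

from simple_salesforce import Salesforce

# Sketch only: credentials, object and field names are placeholders
sf = Salesforce(username='user@example.com', password='***',
                security_token='***')
result = sf.query("SELECT Id, PackSize__c, Name FROM Product2")
# Reshape each returned record into a list of (name, value) tuples
records = [[('Id', r['Id']), ('PackSize', r['PackSize__c']), ('Name', r['Name'])]
           for r in result['records']]
v_rdd = sc.parallelize(records)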

My schema looks like this:

StructType(List(StructField(Id,StringType,true),StructField(PackSize,StringType,true),StructField(Name,StringType,true)))
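(That printed form corresponds to a schema built with pyspark.sql.types, roughly like this:)

from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("Id", StringType(), True),
    StructField("PackSize", StringType(), True),
    StructField("Name", StringType(), True)
])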

I am trying to create a DataFrame out of this RDD:

sqlDataFrame = sqlContext.createDataFrame(v_rdd, schema)

I print my DataFrame:

sqlDataFrame.show()

And get the following:

+--------------------+--------------------+--------------------+
|                  Id|            PackSize|                Name|
+--------------------+--------------------+--------------------+
|[Ljava.lang.Objec...|[Ljava.lang.Objec...|[Ljava.lang.Objec...|
|[Ljava.lang.Objec...|[Ljava.lang.Objec...|[Ljava.lang.Objec...|
|[Ljava.lang.Objec...|[Ljava.lang.Objec...|[Ljava.lang.Objec...|
+--------------------+--------------------+--------------------+

I am expecting to see actual data, like this:

+----------------+--------+----+
|              Id|PackSize|Name|
+----------------+--------+----+
|a0w1a0000003xB1A|     1.0|   A|
|a0w1a0000003xAAI|     1.0|   B|
|a0w1a00000xB3AAI|    30.0|   C|
+----------------+--------+----+

Can you please help me identify what I am doing wrong here?

My Python script is long, and I am not sure it would be convenient for people to sift through it, so I have posted only the parts I am having issues with.

Thanks a ton in advance!

Solution

Hey, could you provide a working example next time? That would make this easier.

The way your RDD is structured is an odd fit for creating a DataFrame. This is how you create a DataFrame according to the Spark documentation:

>>> l = [('Alice', 1)]
>>> sqlContext.createDataFrame(l).collect()
[Row(_1=u'Alice', _2=1)]
>>> sqlContext.createDataFrame(l, ['name', 'age']).collect()
[Row(name=u'Alice', age=1)]
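Your records, by contrast, are lists of (name, value) tuples, so Spark matches each schema field against a whole tuple rather than a bare value, which is what surfaces as the [Ljava.lang.Objec... entries above. A quick illustration of the mismatch:

record = [('Id', 'a0w1a0000003xB1A'), ('PackSize', 1.0), ('Name', 'A')]
print(record[0])     # ('Id', 'a0w1a0000003xB1A') -- what the "Id" field receives
print(record[0][1])  # 'a0w1a0000003xB1A'         -- the bare value it should receive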

So for your example, you can create your desired output like this:

# Imports needed for the schema definition
from pyspark.sql.types import StructType, StructField, StringType

# Your data at the moment
data = sc.parallelize([
    [('Id', 'a0w1a0000003xB1A'), ('PackSize', 1.0), ('Name', 'A')],
    [('Id', 'a0w1a0000003xAAI'), ('PackSize', 1.0), ('Name', 'B')],
    [('Id', 'a0w1a00000xB3AAI'), ('PackSize', 30.0), ('Name', 'C')]
])

# Convert each record to a plain tuple of values; PackSize is cast to
# a string so it matches the StringType declared in the schema below
data_converted = data.map(lambda x: (x[0][1], str(x[1][1]), x[2][1]))

# Define schema
schema = StructType([
    StructField("Id", StringType(), True),
    StructField("PackSize", StringType(), True),
    StructField("Name", StringType(), True)
])

# Create dataframe
DF = sqlContext.createDataFrame(data_converted, schema)

# Output
DF.show()
+----------------+--------+----+
|              Id|PackSize|Name|
+----------------+--------+----+
|a0w1a0000003xB1A|     1.0|   A|
|a0w1a0000003xAAI|     1.0|   B|
|a0w1a00000xB3AAI|    30.0|   C|
+----------------+--------+----+
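Since each record already carries its field names, an alternative sketch (not from the original answer) is to turn each record into a Row via a dict and let Spark infer the schema:

from pyspark.sql import Row

# dict() turns the (name, value) pairs into keyword arguments for Row;
# note that Row sorts keyword fields alphabetically (Id, Name, PackSize),
# and PackSize will be inferred as a double rather than a string
rows = data.map(lambda x: Row(**dict(x)))
DF2 = sqlContext.createDataFrame(rows)
DF2.show()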

Hope this helps
